Monday, August 3, 2009

Algorithmic Aspects of Shadhinota (English to Bengali Dictionary)



The most crucial part of building a dictionary is an efficient search technique. There are several ways to store the words: an SQL database, XML data, or simple text files. I preferred text files for my data. I want to build a simple desktop utility for general users, and I don't expect a general user to install database software just to use my dictionary. I am also not very good with XML. So one option remained open to me: text files.

But the main problem with using text files is keeping the search efficient. I used a simple concept for finding the desired word. Here we go...

At first I used 26 directories, one for each of the 26 letters of the English alphabet. Every directory holds some text files, say a.txt, ab.txt, ac.txt, and so on. Here ab.txt holds absent, abstract, and every other word with the prefix “ab”. When I enter a word in my dictionary, the search first goes to the directory for the word's first letter and then to the corresponding text file. Now the second problem arises: one word can appear more than once in a text file. For example:

The pattern of the text file may be like this:
Word Meanings

Country Homeland,Motherland,Home..........

…....
…....
…....

Home House, Place of Accommodation ....

My dictionary searches for the word and, if it finds the word in the file, prints the words that follow it, stopping when it reaches a newline. But if the same word appears more than once, the whole process gets messed up. For example, a search for Home would also match the Home listed among the meanings of Country. So I need to distinguish the searched words from the others; they need to be unique. I simply added a numeric character before every searched word. Now the text file will look like this:

Word Meanings
2Country Homeland,Motherland,Home..........
…....
…....
…....
2Home House, Place of Accommodation ....
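
To make the marker idea concrete, here is a minimal sketch in Python. It is only an illustration, not the actual Shadhinota code: the function name find_meanings is made up, the marker is assumed to be the digit "2" from the sample above, and each searched word is assumed to be a single token followed by a space and then its meanings.

```python
from typing import Iterable, Optional

MARKER = "2"  # assumption: the digit prefix shown in the sample file


def find_meanings(lines: Iterable[str], word: str,
                  marker: str = MARKER) -> Optional[str]:
    """Return the meanings of `word`, or None if it is not a marked entry."""
    target = marker + word.lower()
    for line in lines:
        # Split off the first token; only marked entry words can match,
        # so the same word inside another entry's meanings is ignored.
        head, _, rest = line.strip().partition(" ")
        if head.lower() == target:
            return rest.strip()  # everything up to the end of the line
    return None
```

With a real file this could be called as find_meanings(open(path, encoding="utf-8"), "Home"). Because the comparison is made against the marked first token only, the Home inside the meanings of Country never matches.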

Now every searched word has a unique identity. There could be approximately 26*25 files across the 26 directories. This forms a tree-like hierarchy, and a word is found by traversing that file hierarchy.
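
The traversal from a word to its file can be sketched roughly as below. Again this is only a sketch under assumptions: the root folder name "words", the lowercase file names, and the helper name file_for_word are all made up for illustration, and a one-letter word is assumed to live in the single-letter file such as a.txt.

```python
from pathlib import Path
import string


def file_for_word(word: str, root: Path = Path("words")) -> Path:
    """Return the text file that should hold `word` (hypothetical layout)."""
    key = word.strip().lower()
    if not key or key[0] not in string.ascii_lowercase:
        raise ValueError(f"not an English word: {word!r}")
    # Use the two-letter prefix when there is one; a one-letter word
    # such as "a" falls back to the single-letter file (a.txt).
    if len(key) >= 2 and key[1] in string.ascii_lowercase:
        prefix = key[:2]
    else:
        prefix = key[0]
    # First level: directory for the first letter. Second level: prefix file.
    return root / key[0] / f"{prefix}.txt"


print(file_for_word("absent"))
```

So every lookup opens exactly one small file instead of scanning the whole word list, which is the point of the hierarchy.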
