Priest of the Order of the Butterfly
Joined: 2003/2/25
Posts: 538
From: France
Quote:Hmm, how does this work? I mean, are there like open source dictionaries available for download? I have never reflected over this, I have simply used whatever Chrome/Firefox/MS Word has put on the table.
Well, here is a quick history of this project. I was searching for an open-source spellchecker and found only plain crap and bloated software all around; no plain C unless you looked at old projects. So instead of porting a well-known spellchecker (the one from OpenOffice, for example), I planned to write my own. The first step was learning how the others did it: this is industrial spying ;) I read a lot of papers on the matter and selected the most promising/efficient algorithms. I ended up with these: FNV_Hash, Double Metaphone and Levenshtein Distance (and also a binary tree, but that is not strictly necessary for spellchecking). The next step was to shake all this into a single piece of code and see if it could spellcheck properly.
Then I needed to find a dictionary, to check real words instead of test words.
When I say old spellcheckers suck, I truly believe they do, and we will see why. Those dictionaries use two files: an affix rules file and a word base file. Every word matches an affix rule that describes all the different ways to write it. For example, the word is stored flat, say 'house', and the associated affix rule tells the spellchecker that 'house' can also be written 'houses' (plural). And so on...
Then I did a test. I took the French affix + base files from the OpenOffice dictionary and generated a file containing all the affix rules applied to every word. (For example: verb manger / present tense -> mange, manges, mange, mangeons, mangez, mangent... etc.)
I ended up with a single file of around 3 million words (IIRC; it was surprisingly big): regular, correct words, but also thousands of incorrect and plain wrong ones. And when I say plain wrong, I mean **really** plain wrong.
I closed the file and came away thinking this: current spellcheckers using affix + word base files, even if they do well most of the time, are plain crap: there are wrong words inside, so you can't trust them.
Old spellcheckers were designed ten or fifteen years ago, when memory constraints mattered far more than nowadays. The affix + base system was designed to save a lot of memory. For example, French verbs are Latin-based, meaning that every infinitive (like the verb manger) needs to be conjugated for every tense: storing all those forms flat would have taken too much memory!
Nowadays the picture is a bit different. We are less constrained by memory, and storing a 300K-word dictionary in RAM is not a problem anymore (unless you run low-end hardware like the Efika, where memory is really scarce). The flat French dictionary I currently use is roughly 3.5MB with a bit over 330K words.
My library does mostly what other spellcheckers do: check whether a word is correct (FNV_Hash); if not, search for words that sound the same (Double Metaphone), then extract the closest-looking ones (Levenshtein Distance). No magic, really.
All this in a 30KB library. I haven't looked at other spellcheckers' sizes...
If you have plenty of RAM (which is the case for any system running OpenOffice...), get rid of the space-saving legacy code. Pure crap.
Quote:Could I download a Swedish and English Firefox dictionary from here (which appears to be free for all to use)...
https://addons.mozilla.org/sv-SE/firefox/language-tools/
...and use it in OWB? That would be sweet!
No. Those dictionaries are based on the *spell projects. They use the affix + word base system (.aff and .dic files), which is completely unsuitable for spellchecker.library. You need a plain flat text file. As I said before, my attempt to flatten those dictionaries gave **poor** results: the final flat dictionary file was full of incorrect words and was HUGE (3 million words for French, with tons of crap in the middle; a 30MB file or so...).
So far, the nicest suitable English dictionary I have found: http://free.pages.at/rnbmusiccom/fulldictionary00.zip around 230K words.
I haven't looked for Swedish...
Quote:
Dictionary could be partially loaded at the price of speed, or even not be present in RAM at all (for an even greater price in speed, I guess)
But how much of a speed penalty are we talking about? Is it even noticeable? Maybe low RAM consumption is more crucial in a "light-weight" system such as I picture MorphOS to be?