Development : : spellchecker.library V1.0
Posted By: Tcheko. on 2011/4/12 15:03:30
Antoine Dubourg a.k.a. Tcheko has released his latest production: spellchecker.library 1.0

This new library offers a spellchecking feature based on a dictionary of words. It uses common and efficient spellchecking algorithms: Double Metaphone and Levenshtein distance (used in major office packages like OpenOffice, AbiWord...). This first release is aimed mainly at developers, as no application currently uses spellchecker.library. 

Currently, the library offers '(s)low' performance: spellchecking is done with a simple lookup table. Future versions will include faster spellchecking with the help of btree.library, at the price of 'huge' memory consumption. Tested with a French dictionary of more than 330,000 words, spellchecker.library took more than 80MB of memory, which rules out its use on low-end hardware like the Efika. Some benchmarking was done to evaluate the spellchecking speed: up to 1,000 words per second was achieved on a Mac Mini@1.42 with the btree version of the library.

Still, future versions of the library will be designed to handle low-memory cases, to accommodate every kind of hardware. The library has also been wrapped and tested with OWB, which offers an internal API for spellchecking.

Download : SpellChecker-1.0.lha (14KB)
 
  • Order of the Butterfly
    mobydick
    Joined: 2004/2/26
    Posts: 179
    From: Mordor, capita...
    Link corrupted (.library instead of .lha)
    Pegasos II/G4@1GHz, 1 GB RAM, MorphOS 3.9
    Efika MX Smartbook, Ubuntu 12.04
    peguser.narod.ru
  • »2011/4/12 17:44
    Profile Visit Website
  • Yokemate of Keyboards
    takemehomegrandma
    Joined: 2003/2/24
    Posts: 2720
    From:
    What dictionaries does it use?
    MorphOS is Amiga done right! :-)
    MorphOS NG will be AROS done right! :-)
  • »2011/4/12 18:57
    Profile
  • MorphOS Developer
    itix
    Joined: 2003/2/24
    Posts: 1520
    From: Finland
    Cool. I wonder, is it necessary to load the complete dictionary at once?
  • »2011/4/12 19:07
    Profile
  • Priest of the Order of the Butterfly
    Tcheko
    Joined: 2003/2/25
    Posts: 538
    From: France
    Quote:

    What dictionaries does it use?

    Any.

    The dictionary must follow these two rules:

    - a single word per line, delimited by CR and/or LF
    - words sorted alphabetically

    I searched a bit and found many free dictionaries online that meet those requirements.
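
    For example, the first lines of a suitable dictionary file would simply look like this:

    Code:

        abacus
        abandon
        abandoned
        abbey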
    Whatever path you take in life, know that you will get blisters on your feet.
    -------
    I need to practice my Kung Fu.
  • »2011/4/12 19:42
    Profile Visit Website
  • Priest of the Order of the Butterfly
    Tcheko
    Joined: 2003/2/25
    Posts: 538
    From: France
    Quote:

    Cool. I wonder, is it necessary to load the complete dictionary at once?


    Accessing the dictionary is only needed when a word can't be matched hash to hash.

    The dictionary could be partially loaded, at the price of speed, or even not kept in RAM at all (for an even greater price in speed, I guess). I'll implement this solution as well, for really scarce memory cases.

    The current design loads the whole file. Though, I tried to keep the overhead to a minimum: word size + 1 + 12 bytes per word...
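
    Just to picture that figure, here is a guess at a per-word layout that would account for the 12 extra bytes (the real library may store things differently):

    Code:

        #include <stdint.h>

        /* Hypothetical per-word record matching "word size + 1 + 12 bytes".
           The 12-byte guess: the precomputed FNV hash plus both Double
           Metaphone codes, so none of them has to be recomputed later. */
        struct WordEntry
        {
            uint32_t hash;           /* 4 bytes: FNV hash of the word        */
            char     metaphone1[4];  /* 4 bytes: primary metaphone code      */
            char     metaphone2[4];  /* 4 bytes: secondary metaphone code    */
            char     word[1];        /* the word itself, NUL-terminated (+1) */
        };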
    Whatever path you take in life, know that you will get blisters on your feet.
    -------
    I need to practice my Kung Fu.
  • »2011/4/12 20:00
    Profile Visit Website
  • Yokemate of Keyboards
    takemehomegrandma
    Joined: 2003/2/24
    Posts: 2720
    From:
    Hmm, how does this work? I mean, are there like open source dictionaries available for download? I have never reflected over this, I have simply used whatever Chrome/Firefox/MS Word has put on the table. ;-)

    Could I download a Swedish and English Firefox dictionary from here (which appears to be free for all to use)...

    https://addons.mozilla.org/sv-SE/firefox/language-tools/

    ...and use it in OWB? That would be sweet! :-)

    Quote:

    The dictionary could be partially loaded, at the price of speed, or even not kept in RAM at all (for an even greater price in speed, I guess)


    But how much of a speed penalty are we talking about? Is it even noticeable? Maybe low RAM consumption is more crucial in a "lightweight" system, as I picture MorphOS to be?

    Perhaps some "on-the-fly" spellchecking (checking as you type) could be done at low priority, somewhat 'in the background', to not disturb the typing flow/experience, and a "spellcheck the entire document" option could be made available as a complement, where you can spellcheck your entire text *after* you have finished typing it (then you can sit back, relax and wait for it to finish, and it can slow down the computer as much as it wants in order to finish as fast as possible)?

    I really like this news BTW! Good project! :-)
    MorphOS is Amiga done right! :-)
    MorphOS NG will be AROS done right! :-)
  • »2011/4/13 0:29
    Profile
  • Priest of the Order of the Butterfly
    Tcheko
    Joined: 2003/2/25
    Posts: 538
    From: France
    Quote:

    Hmm, how does this work? I mean, are there like open source dictionaries available for download? I have never reflected over this, I have simply used whatever Chrome/Firefox/MS Word has put on the table. ;-)



    Well, here is a quick history of this project. I was searching for an open source spellchecker and just found plain crap and overbloated software all around. No plain C unless you looked at old projects. So instead of porting a well-known spellchecker (the one from OpenOffice, for example), I planned to write my own. Then the main focus was learning how the others did it: this is industrial spying ;) I read a lot of papers on the matter and selected the most promising/efficient algorithms. I ended up with these: FNV hash, Double Metaphone, Levenshtein distance (and also a binary tree, but that is not strictly necessary for spellchecking). The next step was to shake all this into a single piece of code and see if it could spellcheck properly.

    I then needed to find a dictionary to check real words instead of test words.

    When I say old spellcheckers suck, I truly believe they do, and we will see why. Other dictionaries use two files: an affix rules file and a word base file. Every word matches an affix rule that lists all the different ways to write it. For example, the word is stored flat, say 'house', and the associated affix rule tells the spellchecker that house can also be written 'houses' (plural). And so on...

    Then I did a test. I took the French affix + base dictionary files from OpenOffice and created a file containing all the affix rules applied to every word. (For example: verb manger / present -> mange, manges, mange, mangeons, mangez, mangent... etc.)

    I ended up with a single file of around 3 million words (IIRC, it was surprisingly big): regular and correct words, but also thousands of incorrect and plain wrong words. When I say plain wrong, I mean **really** plain wrong.

    I closed the file and came to see it this way: current spellcheckers using an affix + word base, even if they do well most of the time, are plain crap: there are wrong words inside, so you can't trust them.

    Old spellcheckers were designed ten or fifteen years ago, when memory constraints were far more important than they are nowadays. The affix + base system was designed to save a lot of memory. For example, if you look at the French language, verbs are Latin-based, meaning that every infinitive (like the verb manger) has to be conjugated for every tense: this would have taken too much memory!

    Nowadays, the picture is a bit different. We are less constrained by memory, and storing a 300K-word dictionary in RAM is not a problem anymore (unless you run some low-end hardware like the Efika, where memory is really scarce). The flat French dictionary I currently use takes roughly 3.5MB with a bit over 330K words.

    My library does mostly what other spellchecker software does: check whether a word is correct (FNV hash); if not, search for same-sounding words (Double Metaphone), then extract the closest-looking words (Levenshtein distance). No magic, really.
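
    For reference, the 'closest looking words' step is the classic dynamic-programming edit distance. A minimal stand-alone version (a textbook illustration, not the library's actual code) looks like this:

    Code:

        #include <stdlib.h>
        #include <string.h>

        /* Textbook Levenshtein distance (illustration only): the number of
           insertions, deletions and substitutions needed to turn a into b. */
        static size_t levenshtein(const char *a, const char *b)
        {
            size_t la = strlen(a), lb = strlen(b), i, j, result;
            size_t *prev = malloc((lb + 1) * sizeof *prev);
            size_t *curr = malloc((lb + 1) * sizeof *curr);

            for (j = 0; j <= lb; j++)
                prev[j] = j;                          /* distance from "" */

            for (i = 1; i <= la; i++)
            {
                curr[0] = i;
                for (j = 1; j <= lb; j++)
                {
                    size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                    size_t best = prev[j] + 1;            /* deletion     */
                    if (curr[j - 1] + 1 < best)
                        best = curr[j - 1] + 1;           /* insertion    */
                    if (prev[j - 1] + cost < best)
                        best = prev[j - 1] + cost;        /* substitution */
                    curr[j] = best;
                }
                memcpy(prev, curr, (lb + 1) * sizeof *prev);
            }

            result = prev[lb];
            free(prev);
            free(curr);
            return result;
        }

    For instance, levenshtein("house", "houses") returns 1, while levenshtein("house", "hosue") returns 2 (two substitutions for the swapped letters).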

    All this in a 30KB library. I didn't look at other spellcheckers' sizes...

    If you have plenty of RAM (which is the case for any system running OpenOffice...), get rid of the legacy space-saving code. Pure crap.

    Quote:


    Could I download a Swedish and English Firefox dictionary from here (which appears to be free for all to use)...

    https://addons.mozilla.org/sv-SE/firefox/language-tools/

    ...and use it in OWB? That would be sweet! :-)



    No. Those dictionaries are based on the *spell project. They use the affix + word base system (.aff and .dic files), which is completely unsuitable for spellchecker.library. You need a plain flat text file. As I said before, my attempt to flatten those dictionaries gave **poor** results: the final flat dictionary file was full of incorrect words and was HUGE (3 million words for French, with tons of crap in the middle, a 30MB file or so...).

    So far, the nicest suitable English dictionary I have found is http://free.pages.at/rnbmusiccom/fulldictionary00.zip, around 230K words.

    I didn't look for Swedish...

    Quote:

    The dictionary could be partially loaded, at the price of speed, or even not kept in RAM at all (for an even greater price in speed, I guess)


    But how much of a speed penalty are we talking about? Is it even noticeable? Maybe low RAM consumption is more crucial in a "lightweight" system, as I picture MorphOS to be?



    Speed versus Memory. Choose.

    With the simple lookup table (330K words, around 8MB of RAM needed), an unmatched (wrongly spelled) word needs around 40ms to return suggestions. That's around 25 incorrectly spelled words per second.

    With the btree (330K words, around 80MB of RAM needed), an unmatched (wrongly spelled) word needs around 4ms to return suggestions. That's around 250 incorrectly spelled words per second.

    Memory consumption is ten times that of the simple lookup algorithm, but the speed is ten times greater. Choose your hammer :)

    In conclusion, speed has a price: memory.

    Quote:


    Perhaps some "on-the-fly" spellchecking (checking as you type) could be done at low priority, somewhat 'in the background', to not disturb the typing flow/experience, and a "spellcheck the entire document" option could be made available as a complement, where you can spellcheck your entire text *after* you have finished typing it (then you can sit back, relax and wait for it to finish, and it can slow down the computer as much as it wants in order to finish as fast as possible)?



    If you have a bunch of memory, just use the btree. Saving CPU cycles is **always** good.
    If you have scarce memory, there is no other choice but to compute 'on the fly'.

    Quote:

    I really like this news BTW! Good project! :-)


    I was really annoyed by this missing feature in MorphOS. We now have a tool for spellchecking, but still no application using it. And yes, good project ;)

    If you find a suitable Swedish flat dictionary, just share a link. :)
    Whatever path you take in life, know that you will get blisters on your feet.
    -------
    I need to practice my Kung Fu.
  • »2011/4/13 8:19
    Profile Visit Website
    • MorphOS Developer
      CISC
      Joined: 2005/8/27
      Posts: 619
      From: the land with ...
      Quote:

      If you have a bunch of memory, just use the btree. Saving CPU cycles is **always** good.
      If you have scarce memory, there is no other choice but to compute 'on the fly'.


      May I suggest implementing a variation of Tightly Packed Tries for the btree? (You won't need to mmap this relatively small amount of data; TPTs are typically used for far greater datasets than yours.) This should be more than fast enough for your use and should use much less memory.
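
      For readers who haven't met the term: a tightly packed trie serialises the whole trie into one contiguous buffer and replaces child pointers with small byte offsets, which is where the memory saving comes from. A very rough sketch of the idea (made-up layout and field names, not the paper's or CISC's actual encoding):

      Code:

          #include <stdint.h>
          #include <string.h>

          /* Sketch only: one node of a trie packed into a flat byte buffer.
             Each node is followed in the buffer by child_count edge records
             of 5 bytes each: 1 label byte + a 4-byte offset to the child
             node. Real TPT encodings are tighter (variable-length offsets). */
          struct PackedNode
          {
              uint8_t terminal;     /* 1 if a dictionary word ends here */
              uint8_t child_count;  /* number of outgoing edges         */
          };

          /* Walk the packed buffer; returns 1 if 'word' is in the trie. */
          static int packed_contains(const uint8_t *buf, const char *word)
          {
              uint32_t pos = 0;                          /* root at offset 0 */
              for (; *word; word++)
              {
                  const struct PackedNode *n =
                      (const struct PackedNode *)(buf + pos);
                  const uint8_t *edge = buf + pos + sizeof *n;
                  uint8_t i;
                  for (i = 0; i < n->child_count; i++, edge += 5)
                      if (edge[0] == (uint8_t)*word)
                      {
                          memcpy(&pos, edge + 1, 4);     /* unaligned read   */
                          break;
                      }
                  if (i == n->child_count)
                      return 0;                          /* no matching edge */
              }
              return ((const struct PackedNode *)(buf + pos))->terminal;
          }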


      - CISC
    • »2011/4/13 11:19
      Profile
    • Priest of the Order of the Butterfly
      Tcheko
      Joined: 2003/2/25
      Posts: 538
      From: France
      Quote:

      Quote:

      If you have a bunch of memory, just use the btree. Saving CPU cycles is **always** good.
      If you have scarce memory, there is no other choice but to compute 'on the fly'.


      May I suggest implementing a variation of Tightly Packed Tries for the btree? (You won't need to mmap this relatively small amount of data; TPTs are typically used for far greater datasets than yours.) This should be more than fast enough for your use and should use much less memory.


      - CISC


      Some details about the current btree usage:

      There are three binary trees used:

      - the first tree uses the hashed string as its key and a pointer to the string as its data
      - the second tree uses the primary metaphone as its key (metaphones are 4 bytes long) and a pointer to the string as its data
      - the third tree uses the secondary metaphone as its key and a pointer to the string as its data

      Fast spellchecking is done like this (sketched below):

      1 - check whether the hash of the word exists in the first tree. If yes, you can assume the word is correctly spelled. Still, you can double-check with the data pointer to avoid a hash collision.
      2 - if the hash isn't found, build the metaphone of the word (4 bytes) and just call EnumTreeNodes, which returns all the matching nodes; do it for the primary and then the secondary.
      3 - apply the Levenshtein distance to every returned string with distance 1; if there is no match, extend the distance to 2, etc.
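
      A rough sketch of that flow, with hypothetical helper names (fnv_hash, double_metaphone, tree_find, tree_enum_matches, levenshtein) standing in for the real btree.library calls and library internals:

      Code:

          #include <stdint.h>
          #include <string.h>

          /* All of the following declarations are hypothetical stand-ins,
             not the real btree.library or spellchecker.library API. */
          extern uint32_t    fnv_hash(const char *word);
          extern void        double_metaphone(const char *word,
                                              char primary[4],
                                              char secondary[4]);
          extern const char *tree_find(void *tree, const void *key, size_t len);
          extern size_t      tree_enum_matches(void *tree, const void *key,
                                               size_t len,
                                               const char **out, size_t max);
          extern size_t      levenshtein(const char *a, const char *b);
          extern void       *hash_tree, *meta1_tree, *meta2_tree;

          /* Returns the number of suggestions written to out[];
             0 with *correct == 1 means the word is already well spelled. */
          static size_t suggest(const char *word, const char **out,
                                size_t max, int *correct)
          {
              uint32_t hash = fnv_hash(word);
              const char *hit = tree_find(hash_tree, &hash, sizeof hash);
              char m1[4], m2[4];
              const char *cand[256];
              size_t n = 0, found = 0, dist, i;

              /* 1 - hash found: compare the strings to rule out a collision */
              *correct = (hit != NULL && strcmp(hit, word) == 0);
              if (*correct)
                  return 0;

              /* 2 - collect same-sounding words from both metaphone trees */
              double_metaphone(word, m1, m2);
              n += tree_enum_matches(meta1_tree, m1, sizeof m1,
                                     cand + n, 256 - n);
              n += tree_enum_matches(meta2_tree, m2, sizeof m2,
                                     cand + n, 256 - n);

              /* 3 - keep candidates by growing edit distance: 1, then 2 */
              for (dist = 1; found == 0 && dist <= 2; dist++)
                  for (i = 0; i < n && found < max; i++)
                      if (levenshtein(word, cand[i]) == dist)
                          out[found++] = cand[i];
              return found;
          }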

      I quickly looked at the TPT paper. It could only apply to hash keys, it seems.

      Maybe the hash key could then benefit from the TPT data structure. What are your thoughts on the matter?
      Whatever path you take in life, know that you will get blisters on your feet.
      -------
      I need to practice my Kung Fu.
    • »2011/4/13 12:21
      Profile Visit Website
    • MorphOS Developer
      CISC
      Joined: 2005/8/27
      Posts: 619
      From: the land with ...
      Quote:

      1 - check whether the hash of the word exists in the first tree. If yes, you can assume the word is correctly spelled. Still, you can double-check with the data pointer to avoid a hash collision.


      You should definitely still check the string. What kind of hash do you use?

      Quote:

      I quickly looked at the TPT paper. It could only apply to hash keys, it seems.

      Maybe the hash key could then benefit from the TPT data structure. What are your thoughts on the matter?


      I'm not sure I understood what the metaphone trees consist of; surely a metaphone has several results, so what does the pointer point to?

      Anyway, I think it should be possible to map the metaphones onto the same data structure, but I'm merely making guesses at this point...


      - CISC
    • »2011/4/13 12:59
      Profile
    • Priest of the Order of the Butterfly
      Tcheko
      Joined: 2003/2/25
      Posts: 538
      From: France
      Quote:

      Quote:

      1 - check whether the hash of the word exists in the first tree. If yes, you can assume the word is correctly spelled. Still, you can double-check with the data pointer to avoid a hash collision.


      You should definitely still check the string. What kind of hash do you use?



      FNV hash. So far, I haven't had any hash collisions. I should indeed check the string to be sure, though.

      Quote:

      Quote:

      I quickly looked at the TPT paper. It could only apply to hash keys, it seems.

      Maybe the hash key could then benefit from the TPT data structure. What are your thoughts on the matter?


      I'm not sure I understood what the metaphone trees consist of; surely a metaphone has several results, so what does the pointer point to?

      Anyway, I think it should be possible to map the metaphones onto the same data structure, but I'm merely making guesses at this point...

      - CISC


      The metaphone function returns a phonetic approximation of a word in 4 characters.

      I implemented a quick and dirty trie for the hashes with linked lists; it eats quite a lot of RAM in fact: 14MB. Certainly not the best implementation, but still less than the btree (80MB / 3 = 26MB per tree...). I used exec linked lists (MinNode, MinList), which is not the best solution since I only do one-way list traversal. I guess I could halve this memory consumption with my own linked list implementation. I'll have to look at bitfields too.
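
      A guess at what a leaner, one-way-linked node could look like (not the library's actual structures):

      Code:

          #include <stdint.h>

          /* Hypothetical singly linked trie node: one pointer to the first
             child and one to the next sibling. Lookups only ever walk
             forward, so the back links of exec's MinNode (and the MinList
             header per node) can be dropped, cutting the link overhead. */
          struct TrieNode
          {
              struct TrieNode *child;    /* first child, or NULL          */
              struct TrieNode *sibling;  /* next node on the same level   */
              uint8_t          label;    /* key byte (e.g. a hash nibble) */
              uint8_t          terminal; /* 1 if a complete key ends here */
          };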

      Thanks for this technical (and helpful) feedback.
      Whatever path you take in life, know that you will get blisters on your feet.
      -------
      I need to practice my Kung Fu.
    • »2011/4/13 15:32
      Profile Visit Website
    • MorphOS Developer
      Fab
      Joined: 2003/6/16
      Posts: 1331
      From:
      Quote:



      ...and use it in OWB? That would be sweet! :-)




      And to see the library in real world usage:
      http://fabportnawak.free.fr/temp/marked_errors.png
      http://fabportnawak.free.fr/temp/suggestions.png
    • »2011/4/14 8:51
      Profile Visit Website
    • Priest of the Order of the Butterfly
      Tcheko
      Joined: 2003/2/25
      Posts: 538
      From: France
      Quote:

      Quote:



      ...and use it in OWB? That would be sweet! :-)




      And to see the library in real world usage:
      http://fabportnawak.free.fr/temp/marked_errors.png
      http://fabportnawak.free.fr/temp/suggestions.png


      Just tested it myself. Really promising. :)

      The new hash search algorithm suggested by CISC (TPT) was implemented this morning. 300K hashes (that's equal to a 300K-word dictionary) take only 6MB of RAM instead of 25 (btree). The retrieval speed also seems far better than with the btree, which means even faster spellchecking. Thanks CISC!
      Whatever path you take in life, know that you will get blisters on your feet.
      -------
      I need to practice my Kung Fu.
    • »2011/4/14 16:26
      Profile Visit Website
    • Yokemate of Keyboards
      takemehomegrandma
      Joined: 2003/2/24
      Posts: 2720
      From:
      Quote:

      Quote:



      ...and use it in OWB? That would be sweet! :-)




      And to see the library in real world usage:
      http://fabportnawak.free.fr/temp/marked_errors.png
      http://fabportnawak.free.fr/temp/suggestions.png


      Wow! Very cool! :-)
      MorphOS is Amiga done right! :-)
      MorphOS NG will be AROS done right! :-)
    • »2011/4/15 2:13
      Profile
    • MorphOS Developer
      CISC
      Joined: 2005/8/27
      Posts: 619
      From: the land with ...
      Quote:

      The new hash search algorithm suggested by CISC (TPT) was implemented this morning. 300K hashes (that's equal to a 300K-word dictionary) take only 6MB of RAM instead of 25 (btree). The retrieval speed also seems far better than with the btree, which means even faster spellchecking. Thanks CISC!


      Nice, choosing the right algorithm is usually the hardest part of the job. ;)


      - CISC
    • »2011/4/15 6:39
      Profile
    • Yokemate of Keyboards
      takemehomegrandma
      Joined: 2003/2/24
      Posts: 2720
      From:
      I think this should really be an integral part of MorphOS, and that every MUI text input box and whatever should make use of it! (And a fast way to change language/dictionary by right-clicking and choosing a different one from a context menu, like you do in the browsers on Windows.)

      :-)
      MorphOS is Amiga done right! :-)
      MorphOS NG will be AROS done right! :-)
    • »2011/4/15 9:23
      Profile
    • MorphOS Developer
      zukow
      Joined: 2005/2/9
      Posts: 645
      From: Poland
      I think spellchecking should be connected to the keyboard layout: I'm typing something in Polish, so I want the Polish spellcheck; I switch to English, so the spellchecker should now use English words.
    • »2011/4/15 9:36
      Profile Visit Website
    • Yokemate of Keyboards
      takemehomegrandma
      Joined: 2003/2/24
      Posts: 2720
      From:
      Quote:

      I think spellchecking should be connected to the keyboard layout: I'm typing something in Polish, so I want the Polish spellcheck; I switch to English, so the spellchecker should now use English words.


      Hmm, personally I have a Swedish keyboard, and I always use the Swedish key map, no matter if I write in English or Swedish...
      MorphOS is Amiga done right! :-)
      MorphOS NG will be AROS done right! :-)
    • »2011/4/15 9:49
      Profile
    • MorphOS Developer
      zukow
      Joined: 2005/2/9
      Posts: 645
      From: Poland
      Quote:

      Quote:

      Hmm, personally I have a Swedish keyboard, and I always use the Swedish key map, no matter if I write in English or Swedish...


      Heh yeah.. I've always thought people use keymaps according to their physical keyboard, not the language they're writing :)


      Yep, but Central/East European countries don't have their own keyboards, and some of the national characters (Polish has ł ó ż ź ę ą ś ć) overlay other Western European chars.

      If we wanted fast system-wide switching of the dictionary (not in a particular app), we should add another screenbar plugin or select it in (MorphOS) prefs.

      I understand that everyone has their own pros/cons, so I think the solution where the dictionary is connected to the keyboard layout could be implemented as an option.
      The global dictionary would be connected to a language in the locale settings, and applications should implement switching of the dictionary on their own. For lazy people like me :) the dictionary could be connected (at the prefs level or at Tokai's keyboard plugin level) to a keyboard layout.
    • »2011/4/16 8:46
      Profile Visit Website
    • Order of the Butterfly
      amyren
      Joined: 2010/5/15
      Posts: 219
      From: Norway
      Quote:

      Hmm, how does this work? I mean, are there like open source dictionaries available for download? I have never reflected over this, I have simply used whatever Chrome/Firefox/MS Word has put on the table. ;-)

      Could I download a Swedish and English Firefox dictionary from here (which appears to be free for all to use)...

      https://addons.mozilla.org/sv-SE/firefox/language-tools/

      ...and use it in OWB? That would be sweet! :-)




      Yes, you can use the Firefox dictionaries.
      I just downloaded the Norwegian (two flavours), Swedish, Danish, English (British) and English (US) ones and tested them in OWB.
      They all work fine.
      But you need to convert them into the right format before use.
      The .xpi files you download are really .zip files, so just rename them to .zip and extract the .dic file.
      Then find a good text editor that has a find & replace function with wildcards.
      Or you can use Excel like I did: use the import filter to sort the text, use "/" as a separator, then copy column A and paste it into a text file.
      Check the file, and delete the first few lines before saving it as a base.dic file.
      There might be some issues with the Nordic special characters, so be sure to check the text encoding when importing the file into Excel. I had no trouble with the Norwegian or the Swedish files, but for the Danish one I had to import it as UTF-8 to make it work.
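
      If you prefer a small program over the spreadsheet route, a tiny converter like the sketch below (not an official tool) does the same job; just sort its output afterwards:

      Code:

          #include <stdio.h>
          #include <string.h>

          /* Minimal MySpell/Hunspell .dic -> flat word list converter:
             skip the first line (the word count), cut every other line at
             the '/' that starts the affix flags, and print one bare word
             per line. Usage: convert < input.dic > base.txt (then sort). */
          int main(void)
          {
              char line[512];
              int first = 1;

              while (fgets(line, sizeof line, stdin))
              {
                  if (first) { first = 0; continue; }   /* word count line   */
                  line[strcspn(line, "/\r\n")] = '\0';  /* strip flags + EOL */
                  if (line[0] != '\0')
                      puts(line);
              }
              return 0;
          }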
    • »2011/7/29 20:19
      Profile