Thursday, July 14, 2016

Updating to a new version of morfologik-stemming

My current PhD work is to build a framework for grammatical analysis of Portuguese texts. I am using LanguageTool and Cogroo to check for style, grammar, and spelling errors. LanguageTool is a stable, community-developed tool, so it has plenty of documentation and forum posts. Cogroo is a more recent tool built by a small group, so its documentation is sometimes incomplete, and its last update was in 2013.

While trying to integrate LanguageTool and Cogroo, I ran into compatibility issues related to the morfologik-stemming version. Both tools depend on morfologik-stemming, but LanguageTool uses the latest version while Cogroo uses an old one. This mismatch caused a lot of compatibility issues that I was unable to solve by the usual means, so I decided to update Cogroo to the latest version of morfologik-stemming.

Updating Cogroo to the latest version of morfologik-stemming took more than changing pom.xml and renaming methods. Since version 2.x, morfologik-stemming uses a different dictionary format, and Cogroo ships only an old compiled version of the morfologik-stemming dictionary. So I had to recompile the Portuguese dictionaries using the jspell.br tool, which I also had to update to the latest morfologik-stemming dictionary format.
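
For readers making the same change, the pom.xml side might look roughly like the fragment below. This is only a sketch: org.carrot2 is the group under which morfologik-stemming is published on Maven Central, but the version number shown is an assumption for illustration, and the exact version used for Cogroo is not stated in this post.

```xml
<!-- Sketch of a pom.xml dependency entry for morfologik-stemming 2.x. -->
<!-- The version below is an assumed example; use the 2.x release that -->
<!-- matches the one LanguageTool depends on in your setup.            -->
<dependency>
  <groupId>org.carrot2</groupId>
  <artifactId>morfologik-stemming</artifactId>
  <version>2.1.0</version>
</dependency>
```

Bumping the version alone will not be enough, since dictionaries compiled for the old format still have to be rebuilt.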

The new version of morfologik-stemming no longer uses tab2morph. So instead of the two steps described in the LanguageTool tutorial, you now need only one:

java -jar morfologik-tools-*-standalone.jar fsa_compile -i input.txt -o output.dict

I also had other difficulties with jspell itself, but I worked around them by skipping some scripts and running some commands by hand.