Thursday, July 14, 2016

Updating to a new version of morfologik-stemming

My current PhD work is to build a framework to perform grammar analysis of Portuguese texts. LanguageTool and Cogroo are the tools I am using to check style, grammar, and spelling errors. LanguageTool is a stable tool developed by a community, so there is plenty of documentation and forum discussion about it. Cogroo is a more recent tool produced by a small group, so its documentation is sometimes incomplete and its last update was in 2013.

In my attempt to integrate LanguageTool and Cogroo, I faced compatibility issues related to the version of morfologik-stemming. Both tools depend on morfologik-stemming, but LanguageTool uses the latest version while Cogroo uses a very old one. This generated a lot of compatibility issues that I was unable to solve by the usual means, so I decided to update Cogroo to the latest version of morfologik-stemming.

To update Cogroo to the latest version of morfologik-stemming, it was not enough to change pom.xml and method names. Since version 2.x, morfologik-stemming uses a different dictionary format, and Cogroo only provided dictionaries compiled in the old format. So I needed to recompile the Portuguese dictionaries using the jspell.br tool, which I also had to update to the latest morfologik-stemming dictionary format.

The new version of morfologik-stemming does not use tab2morph anymore. So, instead of performing the two steps that the LanguageTool tutorial points out, you now only need to perform this single step:

java -jar morfologik-tools-*-standalone.jar fsa_compile -i input.txt -o output.dict
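For the record, input.txt is just a plain list of dictionary entries, one per line. Here is a minimal Python sketch of how such a file could be prepared from a tab-separated word list; the inflected<TAB>lemma<TAB>tags layout, the '+' separator, and the lack of lemma encoding are my assumptions, so check them against your dictionary's metadata:

# Sketch under assumptions: input lines are "inflected<TAB>lemma<TAB>tags"
# and the dictionary uses a plain '+' separator with unencoded lemmas;
# the real separator/encoding must match your dictionary's metadata.
def prepare_fsa_input(tab_path, out_path, separator='+'):
    entries = []
    with open(tab_path, encoding='utf-8') as src:
        for line in src:
            line = line.rstrip('\n')
            if not line:
                continue
            inflected, lemma, tags = line.split('\t')
            entries.append(separator.join((inflected, lemma, tags)))
    # fsa_compile expects its input sorted by byte value, so sort that way
    entries.sort(key=lambda entry: entry.encode('utf-8'))
    with open(out_path, 'w', encoding='utf-8') as dst:
        dst.write('\n'.join(entries) + '\n')

prepare_fsa_input('dictionary.tab', 'input.txt')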

I also had other difficulties with jspell itself, but I handled them by skipping some scripts and running some commands by hand.

Monday, June 6, 2016

NLTK and StanfordParser

Recently I faced an issue when working with NLTK and StanfordParser. According to the most recent documentation, which can be found here, it is enough to add the Stanford Parser jars to your CLASSPATH. However, this is what happened to me:

Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory
at edu.stanford.nlp.parser.common.ParserGrammar.<clinit>(ParserGrammar.java:46)
Caused by: java.lang.ClassNotFoundException: org.slf4j.LoggerFactory
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 1 more

Traceback (most recent call last):
  File "test_stanford.py", line 23, in
    sentences = dp.parse_sents(("Hello, My name is Evelin.", "What is your name?"))
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/parse/stanford.py", line 129, in parse_sents
    cmd, '\n'.join(' '.join(sentence) for sentence in sentences), verbose))
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/parse/stanford.py", line 225, in _execute
    stdout=PIPE, stderr=PIPE)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/internals.py", line 135, in java
    raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed :...

So I investigated this issue through many Stack Overflow posts, the NLTK documentation, and the StanfordParser documentation. However, most of the information I collected was about earlier versions of NLTK and the Stanford tools. After some debugging, I found out that one important file was missing from the java command built by the NLTK library: slf4j-api.jar.

Now, let's start from the beginning. First, let's see what the code for using StanfordParser with NLTK looks like. You'll need the following imports:

from nltk.parse.stanford import StanfordDependencyParser

import os

The os library is used to set the environment variables STANFORD_PARSER and CLASSPATH, as in the following lines:

os.environ['STANFORD_PARSER'] = '/path/to/your/stanford-parser/unzip/directory'

os.environ['CLASSPATH'] = '/path/to/your/stanford-parser/unzip/directory/'

After that, you can instantiate a StanfordDependencyParser like this:

dp = StanfordDependencyParser(model_path='/path/to/your/englishPCFG.ser.gz')

The file englishPCFG.ser.gz is inside stanford-parser-3.6.0-models.jar, so you can extract the model file from that jar.
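If you don't want to unzip the jar by hand, a jar is just a zip archive, so a few lines of Python can pull the model out. The internal path below is where the model usually lives in the 3.6.0 models jar; verify it against your version:

import zipfile

# A jar is a zip archive, so zipfile can extract the serialized model.
# The internal path is the usual one for stanford-parser-3.6.0-models.jar;
# check jar.namelist() if your version stores it elsewhere.
models_jar = '/path/to/your/stanford-parser-3.6.0-models.jar'
model_entry = 'edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz'

with zipfile.ZipFile(models_jar) as jar:
    jar.extract(model_entry, path='models')  # lands under ./models/edu/...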

Finally, you are able to parse your sentences using the following line of code:

sentences = dp.parse_sents(("Hello, My name is Evelin.", "What is your name?"))

Then I got the error that I pasted at the beginning of this post. So, what did I do? I debugged two files of the NLTK API: /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/internals.py and /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/parse/stanford.py. I saw that the java command built by the NLTK API only includes stanford-parser-3.6.0-models.jar and stanford-parser.jar in its classpath, but slf4j-api.jar is also needed to run StanfordParser. I tried setting CLASSPATH, but that did not work for me, so I actually changed the stanford.py code. I added the following lines:

_MAIN_JAR = r'slf4j-api\.jar'  # right after the _JAR variable is set up

# this code goes right before line: self._classpath = (stanford_jar, model_jar)
main_jar = max(
    find_jar_iter(
        self._MAIN_JAR, path_to_models_jar,
        env_vars=('STANFORD_MODELS', 'STANFORD_CORENLP'),
        searchpath=(), url=_stanford_url,
        verbose=verbose, is_regex=True
    ),
    key=lambda model_name: re.match(self._MAIN_JAR, model_name)
)

# and I changed ...
self._classpath = (stanford_jar, model_jar)
# to ...
self._classpath = (stanford_jar, model_jar, main_jar)

I know that it is not an elegant solution, but it worked fine, so for now I think I will keep it. Any suggestions are welcome.
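By the way, a less invasive variant occurred to me afterwards: instead of editing stanford.py, one could widen the classpath on the parser instance at runtime, since NLTK builds the java command from self._classpath. This is only a sketch relying on a private attribute, so it may break in other NLTK versions:

# Untested sketch: append slf4j-api.jar to the parser's classpath at
# runtime instead of patching NLTK's stanford.py. _classpath is a
# private attribute, so this may break with other NLTK versions.
dp = StanfordDependencyParser(model_path='/path/to/your/englishPCFG.ser.gz')
dp._classpath = dp._classpath + ('/path/to/your/slf4j-api.jar',)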