Thursday, March 22, 2018
Building a Tagger for Twitter Data using spaCy
Friday, December 15, 2017
Starting with annotation tools: Brat use case
However, I faced some problems when dealing with Brat. The first was the file format employed by the tool, which is the standoff format. Ok, so far so good. My first issue was: how do I turn raw text into standoff format? After some trouble, I figured out the following steps to convert raw text to the format used by Brat.
1) First piece of information: Brat provides a set of scripts in its tools directory. So, if you want to convert from CoNLL-00, CoNLL-02, CoNLL-X, and some other formats, you have scripts like conll00tostandoff.py, conll02tostandoff.py, conllXtostandoff.py, and so on. Even after I found these scripts, I had one issue: the encoding. Brat employs python2 as its interpreter, and python2 doesn't deal well with encoding. So I had to modify the following lines in the script:
txt = "T%d\t%s %d %d\t%s" % (idnum, ttype, start, end, text)
return txt.encode('utf-8')
and
print >> txtout, doctext.encode('utf-8')
2) But what is the input for my conll02tostandoff.py script? It is an IOB text file. Let's see an example:
The O
doctor B-PER
said O
that O
he B-PER
likes O
Maria B-PER
Benitez. I-PER
Then I executed the following command: python2 conll02tostandoff.py test.conll > output.txt
3) It happens that the output of the previous step has two parts. The first part goes into a file with the .ann extension, and the second part goes into a file with the .txt extension. The .ann file holds the annotations for the .txt file. It's very simple to split these two parts (a small sketch of one way to do it follows these steps). Then, you can place your data files in Brat's data folder.
4) After you put the data files in a data directory, you can navigate through them. However, you will see that the annotations display an error message, which can be fixed by creating the annotation.conf file in your data directory. The instructions to build this file are here.
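As promised in step 3, here is a minimal sketch of how the combined output could be split into the .txt and .ann files. It assumes that every annotation line starts with a "T" identifier followed by a tab, exactly as in the txt variable shown in step 1, and that everything else belongs to the document text; the names test.txt and test.ann are just placeholders, so adapt them to your data.

import re

# Sketch: split the combined converter output (output.txt from step 2) into
# the raw-text file and the standoff annotation file that Brat expects.
# Assumption: annotation lines look like "T1<TAB>PER 4 10<TAB>doctor";
# everything else is document text.
ann_line = re.compile(r'^T\d+\t')

with open('output.txt') as combined, \
        open('test.txt', 'w') as txt_out, \
        open('test.ann', 'w') as ann_out:
    for line in combined:
        if ann_line.match(line):
            ann_out.write(line)
        else:
            txt_out.write(line)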
These steps are the basic actions you have to take in order to make Brat minimally functional. To navigate through the files locally, you can use your browser together with Brat's standalone server, whose instructions are here.
p.s.: To edit the annotations of your data, you need to be logged in; otherwise, you can only visualize the annotations. You set up the login during the standalone server startup.
Tuesday, March 7, 2017
Quick tutorial on the NLTK corpus reader for CoNLL data
from nltk.corpus.reader.conll import ConllCorpusReader
The NLTK documentation for the CoNLL API describes the first argument of the ConllCorpusReader constructor as root, which means the root directory of your data. So, for instance:
root = "/home/userx/dataconll/"
The next step is to call the constructor of ConllCorpusReader class:
ccorpus = ConllCorpusReader(root, ".conll", ('words', 'pos', 'tree'))
In this example, I want all files with the extension ".conll" inside the given root directory. I also have to specify which columns I want from my CoNLL files. Here I take the 'words', 'pos', and 'tree' columns, but you can select from the following columns: 'words', 'pos', 'tree', 'chunk', 'ne', 'ignore'. The description of each column is on the NLTK documentation website.
After that, you can access the methods for each file you want. For instance:
ccorpus.words('file2.conll')
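Putting it all together, a minimal sketch of a session could look like the lines below. The root path and the file name file2.conll are just the placeholders from the examples above, and tagged_words() and sents() are further reader methods you can call the same way.

from nltk.corpus.reader.conll import ConllCorpusReader

# directory that holds the .conll files (placeholder path from the example above)
root = "/home/userx/dataconll/"
ccorpus = ConllCorpusReader(root, ".conll", ('words', 'pos', 'tree'))

print(ccorpus.fileids())                    # every file matched under root
print(ccorpus.words('file2.conll'))         # flat list of tokens
print(ccorpus.tagged_words('file2.conll'))  # (token, POS tag) pairs
print(ccorpus.sents('file2.conll'))         # tokens grouped by sentence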
Thursday, July 14, 2016
Updating to a new version of morphologik-stemming
In my attempt to integrate LanguageTool and Cogroo, I faced compatibility issues related to the morphologik-stemming version. LanguageTool and Cogroo both employ morphologik-stemming, but the former uses the latest version and the latter uses an old one. This situation generated a lot of compatibility issues that I was unable to solve using traditional methods. So I decided to update Cogroo to the latest version of morphologik-stemming.
To update Cogroo to the latest version of morphologik-stemming, it was not enough to just change pom.xml and method names. Since version 2.x, morphologik-stemming uses a different dictionary format, and Cogroo only shipped an old compiled version of the morphologik-stemming dictionary. So I needed to recompile the Portuguese dictionaries using the jspell.br tool, which I also had to update to the latest morphologik-stemming dictionary format.
The new version of morphologik-stemming does not use tab2morph anymore. So, instead of performing the two steps the LanguageTool tutorial points out, you now only need to perform the following single step:
java -jar morfologik-tools-*-standalone.jar fsa_compile -i input.txt -o output.dict
I also had other difficulties with jspell itself, but I handled them by ignoring some scripts and performing some commands by hand.
Monday, June 6, 2016
NLTK and StanfordParser
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory
at edu.stanford.nlp.parser.common.ParserGrammar.
Caused by: java.lang.ClassNotFoundException: org.slf4j.LoggerFactory
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 1 more
Traceback (most recent call last):
File "test_stanford.py", line 23, in
sentences = dp.parse_sents(("Hello, My name is Evelin.", "What is your name?"))
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/parse/stanford.py", line 129, in parse_sents
cmd, '\n'.join(' '.join(sentence) for sentence in sentences), verbose))
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/parse/stanford.py", line 225, in _execute
stdout=PIPE, stderr=PIPE)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/internals.py", line 135, in java
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed :...
So, I investigated this issue in many Stack Overflow posts, the NLTK documentation, and the StanfordParser documentation. Nevertheless, most of the information I collected was about earlier versions of NLTK and earlier Stanford versions. After some debugging, I found out that one important file wasn't in the Java command built by the NLTK library. This file is slf4j-api.jar.
Now, let's start from the beginning. First, let's see what the code for using StanfordParser with NLTK looks like. You'll need the following imports in your code:
from nltk.parse.stanford import StanfordDependencyParser
import os
The os library is used to set the environment variables STANFORD_PARSER and CLASSPATH. You can do this as in the following lines:
os.environ['STANFORD_PARSER'] = '/path/to/your/stanford-parser/unzip/directory'
os.environ['CLASSPATH'] = '/path/to/your/stanford-parser/unzip/directory/'
After that, you can instantiate a StanfordDependencyParser like this:
dp = StanfordDependencyParser(model_path='/path/to/your/englishPCFG.ser.gz')
The file englishPCFG.ser.gz is inside stanford-parser-3.6.0-models.jar, so you can extract the model file from that jar.
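If I remember correctly, inside the models jar the model sits at edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz, so a command along these lines should extract it (any zip tool works too):

jar xf stanford-parser-3.6.0-models.jar edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz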
Finally, you should be able to parse your sentences with the following line of code:
sentences = dp.parse_sents(("Hello, My name is Evelin.", "What is your name?"))
Then, I got the error that I pasted at the beginning of this post. So, what did I do? I debugged the following two files of the NLTK API: /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/internals.py and /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/parse/stanford.py. I saw that the Java command built by the NLTK API considers only stanford-parser-3.6.0-models.jar and stanford-parser.jar in the classpath, but we also need slf4j-api.jar to execute StanfordParser. I tried to set CLASSPATH, but that didn't work for me, so I actually changed the stanford.py code. I added the following lines:
_MAIN_JAR = r'slf4j-api\.jar' # right after _JAR variable set up
# this code goes right before line: self._classpath = (stanford_jar, model_jar)
main_jar = max(
    find_jar_iter(
        self._MAIN_JAR, path_to_models_jar,
        env_vars=('STANFORD_MODELS', 'STANFORD_CORENLP'),
        searchpath=(), url=_stanford_url,
        verbose=verbose, is_regex=True
    ),
    key=lambda model_name: re.match(self._MAIN_JAR, model_name)
)
# and I changed ...
self._classpath = (stanford_jar, model_jar)
#to...
self._classpath = (stanford_jar, model_jar, main_jar)
I know that it is not an elegant solution, but it worked fine, so for now I think I will use it. Any suggestions are welcome.
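For completeness, once slf4j-api.jar is on the classpath (via this patch or by other means), consuming the parser output could look roughly like the sketch below. Note that in the NLTK version I used, parse_sents expects already-tokenized sentences, so for raw strings raw_parse_sents is probably the method you want; treat this as an untested sketch rather than the exact code from this post.

# Sketch: iterate over the dependency graphs produced by the parser.
# Assumes dp is the StanfordDependencyParser instance created earlier.
sentences = ["Hello, my name is Evelin.", "What is your name?"]
for parse_iter in dp.raw_parse_sents(sentences):
    for graph in parse_iter:          # each graph is an nltk DependencyGraph
        print(graph.to_conll(4))      # word, POS tag, head index, relation
        print(graph.tree())           # tree view rooted at the main predicate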
Wednesday, February 3, 2010
Nice feature of Konsole

For programmers who work with KDE, I discovered a nice Konsole feature. Unfortunately, my screen is a 13'' laptop =(. But if you type Ctrl+Shift+T you split the console horizontally, and if you type Ctrl+Shift+L you split it vertically.
To jump from one half of the Konsole to the other, just type Shift+Tab. You can close the split mode by typing Ctrl+Shift+S.
Of course, these are just keyboard shortcuts, but I am really enjoying this feature =)
Monday, February 1, 2010
TweetDeck and Debian Amd64
I have been anxious to use TweetDeck since my Debian 32-bit days, but I couldn't install Adobe AIR on my older laptop. So on this laptop I decided to try it again.
Googling, I found many people teaching how to install TweetDeck on Ubuntu and Debian amd64. I had many difficulties using their solutions. It seems that Adobe AIR doesn't support 64-bit Linux.
The first time, I tried to execute AdobeAir*.bin and got this error:
Error loading the runtime (libnss3.so: wrong ELF class: ELFCLASS64)
After some research, I found the right libraries to install with getlibs:
> getlibs -p libxslt1.1 libnss3-1d
After that, I tried to execute AdobeAir*.bin again and got this error:
Error loading the runtime (libnspr4.so: wrong ELF class: ELFCLASS64)
But none of the commands given by the blogs I found on Google worked on the command line. So I just downloaded the 32-bit package of libnspr4 from the Debian packages site and installed it with the following command:
getlibs -i libnspr4-0d_4.7.1-5_i386.deb
It worked, so I could install Adobe AIR on my laptop.
I hope I helped.
p.s.: I am sorry about any English mistakes in my writing.