Thursday, March 22, 2018

Building a Tagger for Twitter Data using spaCy

In this tutorial, I used Python 3.6.4 and spaCy 2.0.9.

According to the spaCy documentation, their word embedding model is trained on blogs, news, and comments. A few days ago I decided to use spaCy with Twitter data, but NLP practitioners know that the domain of the data is an important issue when we assemble our models. So I decided to update the spaCy tagger model with Twitter data.

First, I downloaded the data from the Twitter Part-of-Speech Data web site, which is a project of Carnegie Mellon University. A new POS dataset usually employs tags different from the Universal Dependencies ones, so we have to manually map the tags from the new dataset to the POS tags used by spaCy. The guidelines for the annotation of the Twitter data are in this link. I then assembled the following new tag_map, which is the first step in training a new tagger in the spaCy framework. If you do not agree with this mapping, please feel free to make suggestions in the comments.
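A partial sketch of such a mapping is shown below, with the CMU/ARK Twitter tags on the left and the Universal POS values spaCy expects on the right; my choices for the Twitter-specific tags (#, @, ~, U, E) are guesses and certainly debatable:

# Partial tag_map sketch: ARK Twitter POS tags -> Universal POS tags.
TAG_MAP = {
    'N': {'pos': 'NOUN'},    # common noun
    '^': {'pos': 'PROPN'},   # proper noun
    'O': {'pos': 'PRON'},    # pronoun (personal)
    'V': {'pos': 'VERB'},    # verb
    'A': {'pos': 'ADJ'},     # adjective
    'R': {'pos': 'ADV'},     # adverb
    'D': {'pos': 'DET'},     # determiner
    'P': {'pos': 'ADP'},     # pre/postposition
    '&': {'pos': 'CCONJ'},   # coordinating conjunction
    '!': {'pos': 'INTJ'},    # interjection
    '$': {'pos': 'NUM'},     # numeral
    ',': {'pos': 'PUNCT'},   # punctuation
    '#': {'pos': 'X'},       # hashtag
    '@': {'pos': 'X'},       # at-mention
    '~': {'pos': 'X'},       # discourse marker (e.g. RT)
    'U': {'pos': 'X'},       # URL or email address
    'E': {'pos': 'X'},       # emoticon
    # ... remaining ARK tags (T, X, S, Z, L, M, Y, G) omitted here
}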




However, I faced a tokenizer issue. A tweet is not tokenized like formal text, since it contains hashtags, links, and so on. Therefore, I customized my tokenizer along the lines of the sketch below. Suggestions for further customizations are also welcome.
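A minimal sketch of that customization, assuming we keep spaCy's default prefix, suffix, and infix rules and only add a token_match pattern so that hashtags, mentions, and URLs are kept as single tokens (the regular expression here is illustrative, not the one from my repository):

import re

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.blank('en')

# Illustrative pattern for hashtags, @-mentions, and URLs.
token_match_re = re.compile(r'(?:#\w+|@\w+|https?://\S+)')

prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex(nlp.Defaults.infixes)

nlp.tokenizer = Tokenizer(nlp.vocab,
                          rules=nlp.Defaults.tokenizer_exceptions,
                          prefix_search=prefix_re.search,
                          suffix_search=suffix_re.search,
                          infix_finditer=infix_re.finditer,
                          token_match=token_match_re.match)

# Inspect how a tweet is tokenized now.
print([t.text for t in nlp(u'ikr @someuser check http://example.com #nlp')])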

I also tried to save this new tokenizer to disk. However, for some reason, the to_disk method of the Tokenizer object only works if all the arguments were passed in the constructor, even if you use the exclude argument of to_disk. So I included the infix_finditer and prefix_search parameters in order to make the to_disk call work.
The rest of the code is straightforward. I followed the spaCy training tutorial web site to update my model, and then I had my Twitter tagger model.
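For reference, the update loop is essentially the one from spaCy's train_tagger.py example. A minimal sketch of the simplest variant, a blank English pipeline with a fresh tagger (the custom tokenizer above would be attached to the same nlp object), assuming TRAIN_DATA is a list of (text, {'tags': [...]}) pairs built from the CMU tweets:

import random

from spacy.util import minibatch

# Register the Twitter tag set with a new tagger (see the TAG_MAP sketch above).
tagger = nlp.create_pipe('tagger')
for tag, values in TAG_MAP.items():
    tagger.add_label(tag, values)
nlp.add_pipe(tagger)

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, losses=losses)
    print('epoch', epoch, losses)

# Save the whole pipeline, including the customized tokenizer.
nlp.to_disk('twitter_tagger_model')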

The complete code to train and save the Twitter tagger model using spaCy is on my GitHub.



Friday, December 15, 2017

Starting with annotation tools: Brat use case

Annotation tools are important frameworks for tackling annotation tasks, especially when experts from outside computer science are assigned to this kind of task. After all, it is difficult for, say, a lawyer to understand an XML or a JSON file. Thus, I would like to discuss the brat rapid annotation tool (Brat for short), which is a wonderful tool.

However, I faced some problems when dealing with Brat. The first is the file format employed by the tool, the standoff format. Ok, so far so good. My first issue was: how do I turn raw text into standoff format? After some trouble, I figured out the following steps to convert raw text into the format used by Brat.

1) First piece of information: Brat provides a set of scripts in its tools directory. So, if you want to convert from CoNLL-00, CoNLL-02, CoNLL-X, and some other formats, there are scripts such as conll00tostandoff.py, conll02tostandoff.py, and so on. Even after I found these scripts, I still had one issue: the encoding format. Brat uses Python 2 as its interpreter, and Python 2 does not deal well with encoding, so I had to modify the following lines in the script:

    txt = "T%d\t%s %d %d\t%s" % (idnum, ttype, start, end, text)
    return txt.encode('utf-8')

and

   print >> txtout, doctext.encode('utf-8')

2) But what is the input for the conll02tostandoff.py script? It is an IOB text file. Let's see an example:

The         O
doctor      B-PER
said        O
that        O
he          B-PER
likes       O
Maria       B-PER
Benitez.    I-PER

Then I executed the following command:

python2 conll02tostandoff.py test.conll > output.txt

3) The output of the previous step has two parts. The first part goes into a file with the .ann extension, and the second part goes into a file with the .txt extension. The .ann file holds the annotations for the .txt file. It is very simple to split these two parts. Then you can place your data files in Brat's data folder.
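For the IOB example above, the .ann part holds one standoff line per entity: an ID, then a tab, the entity type with its start and end character offsets, then another tab and the covered text. Assuming the sentence is written to the .txt file as a single line, it would look roughly like this:

T1	PER 4 10	doctor
T2	PER 21 23	he
T3	PER 30 44	Maria Benitez.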

4) After you put the data files in a data directory, you can navigate through them. However, you will see that the annotations display an error message, which can be fixed by creating the annotation.conf file in your data directory. The instructions to build this file are here.
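For the PER entity used in the example above, a minimal annotation.conf can be as simple as the sketch below; sections you do not need can simply be left empty:

[entities]
PER

[relations]

[events]

[attributes]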

These steps are the basic actions you have to take in order to make Brat minimally functional. To navigate through the files locally, you can use your browser together with Brat's standalone server, whose instructions are here.

p.s.: To edit the annotations of your data, you need to be logged in; otherwise, you can only visualize the annotations. You set up the login during the standalone server startup.


Tuesday, March 7, 2017

Quick tutorial on the NLTK corpus reader for CoNLL data

The first step to read CoNLL data using the NLTK corpus reader API is to import the appropriate class.

from nltk.corpus.reader.conll import ConllCorpusReader

The NLTK documentation for the CoNLL API describes the first argument of the ConllCorpusReader constructor as root, which is the root directory of your data. For instance:

root = "/home/userx/dataconll/"

The next step is to call the constructor of ConllCorpusReader class:

ccorpus = ConllCorpusReader(root, r'.*\.conll', ('words', 'pos', 'tree'))

In this example, the fileids pattern selects all files with the .conll extension in the root directory. I also have to specify which columns I want from my CoNLL files: here I take the 'words', 'pos', and 'tree' columns, but you can choose among the following columns: 'words', 'pos', 'tree', 'chunk', 'ne', and 'ignore'. The description of each column is on the NLTK documentation website.

After that, you can access the reader methods for each file you want. For instance:

ccorpus.words('file2.conll')
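With the 'words', 'pos', and 'tree' columns configured, the tagged accessors are available as well. A small sketch, assuming a file named file2.conll exists under the root directory:

# Tokens and (token, POS) pairs for a single file.
print(ccorpus.words('file2.conll')[:10])
print(ccorpus.tagged_words('file2.conll')[:10])

# Sentence-level access.
print(ccorpus.tagged_sents('file2.conll')[0])

# Or aggregate over every file matched by the fileids pattern.
print(len(ccorpus.sents()))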

Thursday, July 14, 2016

Updating to a new version of morfologik-stemming

My current PhD work is to build a framework to perform grammar analysis of Portuguese texts. LanguageTool and Cogroo are the tools I am using to check style, grammar, and spelling errors. LanguageTool is a stable tool developed by a community, so it has plenty of documentation and forum posts. Cogroo is a more recent tool built by a small group, so its documentation is sometimes incomplete and its last update was in 2013.

In my attempt to integrate LanguageTool and Cogroo, I faced compatibility issues related to the morfologik-stemming version. Both tools employ morfologik-stemming, but the former uses the latest version and the latter an old one. This generated a lot of compatibility issues that I was unable to solve using traditional methods, so I decided to update Cogroo to the latest version of morfologik-stemming.

Updating Cogroo to the latest version of morfologik-stemming was not just a matter of changing pom.xml and method names. Since version 2.x, morfologik-stemming uses a different dictionary format, and Cogroo only ships an old compiled version of the morfologik-stemming dictionary. So I needed to recompile the Portuguese dictionaries with the jspell.br tool, which I also had to update to the latest morfologik-stemming dictionary format.

The new version of morfologik-stemming does not use tab2morph anymore. So, instead of performing two steps as the LanguageTool tutorial points out, you now only need to perform the following single step:

java -jar morfologik-tools-*-standalone.jar fsa_compile -i input.txt -o output.dict

I also had other difficulties with jspell itself, but I handled them by ignoring some scripts and performing some commands by hand.

Monday, June 6, 2016

NLTK and StanfordParser

Recently I faced an issue when I was working with NLTK and the StanfordParser. According to the most recent documentation, which can be found here, it is enough to add the Stanford Parser jars to your CLASSPATH. However, this is what happened to me:

Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory
at edu.stanford.nlp.parser.common.ParserGrammar.(ParserGrammar.java:46)
Caused by: java.lang.ClassNotFoundException: org.slf4j.LoggerFactory
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 1 more

Traceback (most recent call last):
  File "test_stanford.py", line 23, in
    sentences = dp.parse_sents(("Hello, My name is Evelin.", "What is your name?"))
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/parse/stanford.py", line 129, in parse_sents
    cmd, '\n'.join(' '.join(sentence) for sentence in sentences), verbose))
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/parse/stanford.py", line 225, in _execute
    stdout=PIPE, stderr=PIPE)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/internals.py", line 135, in java
    raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed :...

So, I investigated this issue in many Stack Overflow posts, the NLTK documentation, and the StanfordParser documentation. Nevertheless, most of the information I collected was about earlier versions of NLTK and earlier Stanford versions. After some debugging, I found out that one important file wasn't in the Java command built by the NLTK library. This file is slf4j-api.jar.

Now, let's start from the beginning. First, let's see what the code to use the StanfordParser with NLTK looks like. You'll need the following imports in your code:

from nltk.parse.stanford import StanfordDependencyParser

import os

The os library is used to set the environment variables STANFORD_PARSER and CLASSPATH. You can do this with the following lines:

os.environ['STANFORD_PARSER'] = '/path/to/your/stanford-parser/unzip/directory'

os.environ['CLASSPATH'] = '/path/to/your/stanford-parser/unzip/directory/'

After that you can instantiate a StanfordDependencyParser like this:

dp = StanfordDependencyParser(model_path='/path/to/your/englishPCFG.ser.gz')

The file englishPCFG.ser.gz is inside stanford-parser-3.6.0-models.jar, so you can extract the model file from that jar.

Finally, you should be able to parse your sentences using the next line of code.

sentences = dp.parse_sents(("Hello, My name is Evelin.", "What is your name?"))
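As a side note (my own observation, unrelated to the Java error below): parse_sents expects sentences that are already tokenized into lists of words, so for plain strings like the ones above, raw_parse_sents is the method to call. For example:

# raw_parse_sents takes raw strings and lets the parser tokenize them.
result = dp.raw_parse_sents(["Hello, My name is Evelin.", "What is your name?"])
for parses in result:
    for graph in parses:
        print(graph.to_conll(4))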

Then I got the error that I pasted at the beginning of this post. So, what did I do? I debugged two files of the NLTK API: /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/internals.py and /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/parse/stanford.py. I saw that the Java command built by the NLTK API only includes stanford-parser-3.6.0-models.jar and stanford-parser.jar in the classpath, but we also need slf4j-api.jar to run the StanfordParser. I tried to set CLASSPATH, but that didn't work for me, so I actually changed the stanford.py code. I added the following lines:

_MAIN_JAR = r'slf4j-api\.jar'  # right after the _JAR variable is set up

# this code goes right before the line: self._classpath = (stanford_jar, model_jar)
main_jar = max(
    find_jar_iter(
        self._MAIN_JAR, path_to_models_jar,
        env_vars=('STANFORD_MODELS', 'STANFORD_CORENLP'),
        searchpath=(), url=_stanford_url,
        verbose=verbose, is_regex=True
    ),
    key=lambda model_name: re.match(self._MAIN_JAR, model_name)
)

# and I changed ...
self._classpath = (stanford_jar, model_jar)
# to ...
self._classpath = (stanford_jar, model_jar, main_jar)

I know that it is not an elegant solution, but it worked fine, so for now I think I will stick with it. Any suggestions are welcome.

Wednesday, February 3, 2010

A nice Konsole feature


For programmers who work with KDE, I discovered a nice Konsole feature. Unfortunately my screen is a 13'' laptop =(. But if you type Ctrl+Shift+T you split the console horizontally, and if you type Ctrl+Shift+L you split it vertically.

To jump from one half of the Konsole to the other, just type Shift+Tab. You can close the split mode by typing Ctrl+Shift+S.

Of course these are just keyboard shortcuts, but I am really enjoying this feature =)

 

Monday, February 1, 2010

TweetDeck and Debian Amd64

I have been anxious to use TweetDeck since my Debian 32-bit days, but I couldn't install Adobe AIR on my older laptop. So on my current laptop I decided to try it again.

Googling around, I found many people explaining how to install TweetDeck on Ubuntu and Debian amd64, but I had a lot of difficulty with their solutions. It seems that Adobe AIR doesn't support 64-bit Linux.

The first time I tried to execute AdobeAir*.bin, I got this error:

Error loading the runtime (libnss3.so: wrong ELF class: ELFCLASS64)

After some research I found the right libraries to install with getlibs:

> getlibs -p libxslt1.1 libnss3-1d

After that I tried to execute AdobeAir*.bin again and got this error:

Error loading the runtime (libnspr4.so: wrong ELF class: ELFCLASS64)

But none of the commands given by the blogs I found on Google worked on the command line. So I just downloaded the 32-bit package of libnspr4 from the Debian packages site and installed it with the following command:

getlibs -i libnspr4-0d_4.7.1-5_i386.deb

It worked, so I was able to install Adobe AIR on my laptop.

I hope I helped.

p.s.: I am sorry for any English mistakes in my writing.