Friday, December 15, 2017

Getting started with annotation tools: a Brat use case

Annotation tools are important for annotation tasks, especially when the annotators are not computer science experts. After all, it is difficult for a lawyer, for instance, to make sense of an XML or JSON file. In this post I discuss the brat rapid annotation tool (Brat for short), which is a wonderful tool.

However, I ran into some problems when working with Brat. The first concerns the file format the tool uses, which is the standoff format. So far so good, but my first question was: how do I turn raw text into standoff format? After some trouble, I figured out the following steps for converting raw text to the format used by Brat.

1) First piece of information: Brat provides a set of scripts in its tools directory. So, if you want to convert from CoNLL-00, CoNLL-02, CoNLL-X, or some other formats, there are scripts such as conll00tostandoff.py, conll02tostandoff.py, conllXtostandoff.py, and so on. Even after I found these scripts, I had one issue: character encoding. Brat runs on Python 2, and Python 2 does not handle non-ASCII text gracefully by default. So I had to modify the following lines in the script so that the output is explicitly encoded as UTF-8:

    txt = "T%d\t%s %d %d\t%s" % (idnum, ttype, start, end, text)
    return txt.encode('utf-8')

and

    print >> txtout, doctext.encode('utf-8')

2) But what is the input for the conll02tostandoff.py script? It is an IOB text file, with one token per line followed by its tag. Let's see an example:

The           O
doctor       B-PER
said           O
that           O
he             B-PER
likes         O
Maria       B-PER
Benitez.    I-PER

Then I executed the following command:

    python2 conll02tostandoff.py test.conll > output.txt

3) The output of the previous step has two parts. The first part belongs in a file with the .ann extension, and the second part in a file with the .txt extension. The .ann file holds the annotations for the .txt file, so the two files should share the same base name. Splitting the output into these two parts is simple. Then you can place your data files in the data folder of Brat.

4) After you put the data files in the data directory, you can browse them in Brat. However, the annotations will display an error message, which can be fixed by creating the annotation.conf file in your data directory. The instructions to build this file are here.
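
Steps 2 to 4 can also be sketched in a few lines of Python 3. This is not the converter that ships with Brat, only a minimal illustration of what the conversion produces, assuming tokens are joined by single spaces. The function name iob_to_standoff, the file names, and the temporary data directory are hypothetical, and the annotation.conf written at the end is the smallest one I would expect Brat to accept for a single PER entity type.

```python
import tempfile
from pathlib import Path

# The IOB example from step 2, one "token tag" pair per line.
IOB_LINES = """\
The O
doctor B-PER
said O
that O
he B-PER
likes O
Maria B-PER
Benitez. I-PER
""".splitlines()


def iob_to_standoff(lines):
    """Return (text, annotation_lines) for a list of 'token tag' lines."""
    text_parts = []
    spans = []      # each entry is [entity_type, start_offset, end_offset]
    current = None  # span currently being extended by I- tags, if any
    offset = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            current = None  # blank or malformed line: close any open span
            continue
        token, tag = parts[0], parts[-1]
        start, end = offset, offset + len(token)
        if tag.startswith("B-"):
            current = [tag[2:], start, end]
            spans.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            current[2] = end  # extend the open span over this token
        else:
            current = None  # "O" tag, or a stray I- tag (ignored in this sketch)
        text_parts.append(token)
        offset = end + 1  # tokens are joined with a single space
    text = " ".join(text_parts)
    annotations = [
        "T%d\t%s %d %d\t%s" % (i, etype, start, end, text[start:end])
        for i, (etype, start, end) in enumerate(spans, 1)
    ]
    return text, annotations


text, ann = iob_to_standoff(IOB_LINES)

# Hypothetical data directory; in practice this would live under Brat's data folder.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "test.txt").write_text(text + "\n", encoding="utf-8")
(data_dir / "test.ann").write_text("\n".join(ann) + "\n", encoding="utf-8")

# Minimal annotation.conf declaring the one entity type used above.
conf = "[entities]\nPER\n\n[relations]\n\n[events]\n\n[attributes]\n"
(data_dir / "annotation.conf").write_text(conf, encoding="utf-8")

print("\n".join(ann))
```

Each line of the .ann file has an ID, the entity type with its start and end character offsets, and the covered text, separated by tabs. The offsets refer to positions in the .txt file, which is why the two files have to stay in sync.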

These steps are the basic actions you need to take to get a minimally functional Brat setup. To browse the files locally, you can use your browser and Brat's standalone server, whose instructions are here.

P.S.: To edit the annotations of your data, you need to be logged in. Otherwise, you can only view the annotations. You set up the login during the standalone server setup.


Tuesday, March 7, 2017

Quick tutorial on the NLTK corpus reader for CoNLL data

The first step in reading CoNLL data with the NLTK reader API is to import the appropriate class.

from nltk.corpus.reader.conll import ConllCorpusReader

The NLTK documentation for the CoNLL API describes the first argument of the ConllCorpusReader constructor as root, which is the root directory of your data. So, for instance:

root = "/home/userx/dataconll/"

The next step is to call the constructor of the ConllCorpusReader class:

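ccorpus = ConllCorpusReader(root, r".*\.conll", ('words', 'pos', 'tree'))

In this example, I want all files with the extension ".conll" under the root directory. The second argument can be a list of file names or a regular expression that has to match the whole file name, hence the pattern r".*\.conll". The third argument names the columns of my CoNLL files, in the order in which they appear. In this example, the files have 'words', 'pos', and 'tree' columns, but the available column types are: 'words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore'. A column you do not need can be marked as 'ignore'. The description of each column type is on the NLTK documentation website.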

After that, you can call the reader's methods on each file you want. For instance:

ccorpus.words('file2.conll')
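
To try the reader end to end without real data, the sketch below writes a tiny two-column file to a temporary directory and reads it back. The sample content, the file name sample.conll, and the temporary root are made up for illustration, and the file pattern is written as the regular expression r".*\.conll" on the assumption that NLTK matches the pattern against the whole file name. It requires NLTK to be installed.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader.conll import ConllCorpusReader

# Hypothetical two-column sample (word, POS tag); a blank line separates sentences.
sample = "The DT\ndoctor NN\nsmiled VBD\n\nMaria NNP\nagreed VBD\n"

# Temporary root directory standing in for something like /home/userx/dataconll/.
root = tempfile.mkdtemp()
Path(root, "sample.conll").write_text(sample, encoding="utf-8")

# Column types are listed in the order the columns appear in the file.
ccorpus = ConllCorpusReader(root, r".*\.conll", ("words", "pos"))

file_ids = ccorpus.fileids()                          # files matched by the pattern
words = list(ccorpus.words("sample.conll"))           # flat list of tokens
tagged = list(ccorpus.tagged_words("sample.conll"))   # (word, POS) pairs
sentences = [list(s) for s in ccorpus.sents("sample.conll")]

print(file_ids)
print(words)
print(tagged)
print(sentences)
```

Methods such as tagged_words need both the 'words' and 'pos' columns to be declared in the constructor. Asking for a column that was not declared raises an error, so the column tuple has to reflect what the files actually contain.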