In this tutorial, I used Python 3.6.4 and spaCy 2.0.9.
According to the spaCy documentation, their word embedding models are trained on blogs, news, and comments. A few days ago I decided to use spaCy with Twitter data, but NLP practitioners know that the domain of the data is an important issue when we build our models. So I decided to update spaCy's tagger model with Twitter data.
First, I downloaded the data from the Twitter Part-of-Speech Data web site, a project from Carnegie Mellon University. A new POS dataset usually employs tags that differ from the Universal Dependencies tagset, so we have to manually map the tags from the new dataset to the POS tags employed by spaCy. The guidelines for the annotation of the Twitter data are in this link. I then assembled a new tag_map, which is the first step in training a new tagger with the spaCy framework. If you do not agree with this mapping, please feel free to make suggestions in the comments.
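Here is a sketch of what such a tag_map can look like for the ARK Twitter tagset, in the dictionary format spaCy 2.0 expects (fine-grained tag mapped to Universal POS attributes). The choices for the Twitter-specific tags (#, @, ~, U, E, G) are my own reading of the annotation guidelines, so treat them as debatable:

```python
# A sketch of a tag_map for the CMU ARK Twitter tagset.
# The Twitter-specific choices are assumptions, not ground truth.
TAG_MAP = {
    'N': {'pos': 'NOUN'},    # common noun
    'O': {'pos': 'PRON'},    # pronoun (personal/WH; not possessive)
    '^': {'pos': 'PROPN'},   # proper noun
    'S': {'pos': 'NOUN'},    # nominal + possessive
    'Z': {'pos': 'PROPN'},   # proper noun + possessive
    'V': {'pos': 'VERB'},    # verb
    'L': {'pos': 'PRON'},    # nominal + verbal contraction, e.g. "i'm"
    'M': {'pos': 'PROPN'},   # proper noun + verbal
    'A': {'pos': 'ADJ'},     # adjective
    'R': {'pos': 'ADV'},     # adverb
    '!': {'pos': 'INTJ'},    # interjection
    'D': {'pos': 'DET'},     # determiner
    'P': {'pos': 'ADP'},     # pre- or postposition
    '&': {'pos': 'CCONJ'},   # coordinating conjunction
    'T': {'pos': 'PART'},    # verb particle
    'X': {'pos': 'DET'},     # existential "there", predeterminers
    'Y': {'pos': 'DET'},     # X + verbal
    '#': {'pos': 'X'},       # hashtag
    '@': {'pos': 'X'},       # at-mention
    '~': {'pos': 'X'},       # discourse marker, e.g. "RT"
    'U': {'pos': 'X'},       # URL or email address
    'E': {'pos': 'SYM'},     # emoticon
    '$': {'pos': 'NUM'},     # numeral
    ',': {'pos': 'PUNCT'},   # punctuation
    'G': {'pos': 'X'},       # abbreviations, foreign words, symbols
}
```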
However, I faced a tokenizer issue. A tweet is not tokenized like formal text, since it contains hashtags, @-mentions, links, and so on. Therefore, I customized my tokenizer, along the lines of the sketch below. Again, suggestions for customizations are welcome.
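This is a minimal sketch of such a customization, assuming we want URLs, hashtags, and @-mentions kept as single tokens via the token_match hook. The regex is illustrative rather than the exact pattern from my script, and it assumes the small English model is installed:

```python
import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import (compile_prefix_regex, compile_suffix_regex,
                        compile_infix_regex)

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

# Illustrative pattern: keep URLs, hashtags, and @-mentions whole.
# Note that '#' is among spaCy's default prefixes, so depending on the
# version you may also need to drop it from the prefix rules for
# token_match to see the whole hashtag.
token_match_re = re.compile(r'(?:https?://\S+|#\w+|@\w+)')

prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex(nlp.Defaults.infixes)

nlp.tokenizer = Tokenizer(nlp.vocab,
                          rules=nlp.Defaults.tokenizer_exceptions,
                          prefix_search=prefix_re.search,
                          suffix_search=suffix_re.search,
                          infix_finditer=infix_re.finditer,
                          token_match=token_match_re.match)
```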
I tried to save this new tokenizer to disk. However, for some reason, the to_disk method of the Tokenizer object only works if we pass all the arguments in the constructor, even if you try to use the exclude argument of the to_disk method. So I included the infix_finditer and prefix_search parameters in the constructor to make the to_disk call work.
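With those arguments in place, saving and restoring the tokenizer looks roughly like this (the path is a placeholder):

```python
# Serializing only worked for me once the pattern arguments above were
# passed to the Tokenizer constructor.
nlp.tokenizer.to_disk('/tmp/twitter_tokenizer')

# Later, load it back into a pipeline with a compatible vocab:
nlp.tokenizer.from_disk('/tmp/twitter_tokenizer')
```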
The rest of the code is straightforward. I followed the spaCy tutorial web site to update my model, and then I had my Twitter tagger model.
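For reference, here is a condensed sketch of that training loop, based on the tagger example in the spaCy 2.0 training docs. I start from a blank English model because the ARK tagset differs from the labels of the pretrained tagger; the single TRAIN_DATA entry is only an illustrative tagged tweet standing in for the full CMU corpus:

```python
import random
import spacy

# TAG_MAP is the ARK-to-Universal mapping defined earlier in this post.
# TRAIN_DATA is illustrative; the real script builds it from the CMU
# Twitter POS corpus as (text, {'tags': [...]}) pairs.
TRAIN_DATA = [
    ("ikr smh he asked fir yo last name so he can add u on fb lololol",
     {'tags': ['!', 'G', 'O', 'V', 'P', 'D', 'A', 'N',
               'P', 'O', 'V', 'V', 'O', 'P', '^', '!']}),
]

nlp = spacy.blank('en')
tagger = nlp.create_pipe('tagger')
for tag, values in TAG_MAP.items():
    tagger.add_label(tag, values)   # register each ARK tag with its POS
nlp.add_pipe(tagger)

optimizer = nlp.begin_training()
for i in range(25):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(losses)

nlp.to_disk('/tmp/twitter_tagger_model')
```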
The complete code to train and save the Twitter tagger model using spaCy is on my GitHub.