Department of Computer Science
Instituto de Matematica e Estatistica
University of Sao Paulo -- Brazil
mfinger@ime.usp.br
Building large annotated corpora, such as the Tycho Brahe Corpus of Historical Portuguese, is feasible only if tasks such as part-of-speech tagging are automated. The best automatic part-of-speech taggers described in the literature were developed and tested for English.
However, the morphological richness of Portuguese forces us to use several times as many tags as are used for English. A complexity analysis of the tagging algorithm shows that adopting such a large number of tags makes it prohibitively inefficient.
In this work, we propose a new two-step approach to tagging texts in morphologically rich languages. We describe how this method affects the design of the tags, and how existing techniques must be adapted to deal with the greater number of tags found in such languages.