Semi-automatic annotation of CFGs and treebanks with feature-structure information
Josef van Genabith, Dublin City University

ABSTRACT

In this talk I'll report on joint work with Louisa Sadler (U. of Essex), Anette Frank (XRCE Grenoble) and Andy Way (DCU).

Treebanks which encode higher-level feature structure information, in addition to pure phrase structure information, are required as training resources for probabilistic unification grammars and data-driven parsing approaches. Manual construction of such treebanks is labour- and cost-intensive. As an alternative, one could envisage constructing new unification grammars, or scaling up existing ones, and then using them to analyze corpora. However, grammar development and scaling-up are themselves labour- and cost-intensive. What is more, even if a large-coverage grammar is available, it would typically come up with hundreds or thousands of candidate analyses for each sentence in an input text, from which a highly trained expert has to select.

We have developed an alternative method. The basic idea is simple: take an existing treebank, read off the CF-PSG following [Charniak,96], manually annotate its rules with f-structure annotations, provide macros for the lexical entries and then "reparse" the treebanked trees simply following the original c-structure annotations. During this reparsing process, the f-structure annotations are resolved, and an f-structure is produced. The process is deterministic if the annotations are, and costly manual inspection of candidate analyses is largely avoided. In our current research we further automate the annotation process. In one approach we write annotation templates and compile them out over rule sets extracted from the treebank. In another approach we directly rewrite treebank entries into feature structures using templates. Some of the resources generated can be inspected at
http://www.compapp.dcu.ie/~away/Treebank/treebank.html
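
As a rough illustration of the basic idea (not the actual implementation), the read-off and "reparse" steps can be sketched in Python as below. Everything in the sketch is an assumption made for the example: the toy tree, the tuple encoding of trees, the names read_off_rules, reparse, ANNOTATIONS and LEXICAL_MACROS, and the annotation notation, in which "=" stands for "the daughter's f-structure is the mother's" and any other string names the grammatical function under which the daughter's f-structure is embedded. Unification is reduced to a shallow merge that fails on conflicting values.

```python
from collections import Counter

# Toy treebank entry (an assumption for this sketch): a phrase is
# (label, [children]); a preterminal is (tag, word).
TREE = ("S", [("NP", [("NNP", "Mary")]),
              ("VP", [("VBD", "saw"), ("NP", [("NNP", "John")])])])


def read_off_rules(tree, rules=None):
    """Collect one CFG rule (mother -> daughter labels) per local tree."""
    rules = Counter() if rules is None else rules
    label, body = tree
    if isinstance(body, str):
        return rules  # preterminal: lexical entries are handled by macros
    rules[(label, tuple(child[0] for child in body))] += 1
    for child in body:
        read_off_rules(child, rules)
    return rules


# Hypothetical hand-written annotations, one entry per daughter of each rule.
ANNOTATIONS = {
    ("S", ("NP", "VP")): ("SUBJ", "="),
    ("VP", ("VBD", "NP")): ("=", "OBJ"),
    ("NP", ("NNP",)): ("=",),
}

# Hypothetical lexical macros, keyed by part-of-speech tag.
LEXICAL_MACROS = {
    "NNP": lambda word: {"PRED": word},
    "VBD": lambda word: {"PRED": word, "TENSE": "past"},
}


def reparse(tree):
    """Follow the given tree and resolve the annotations into an f-structure."""
    label, body = tree
    if isinstance(body, str):
        return LEXICAL_MACROS[label](body)
    annotations = ANNOTATIONS[(label, tuple(child[0] for child in body))]
    fstructure = {}
    for child, annotation in zip(body, annotations):
        child_f = reparse(child)
        if annotation == "=":
            # Head daughter: merge its features into the mother's f-structure.
            for attr, value in child_f.items():
                if attr in fstructure and fstructure[attr] != value:
                    raise ValueError(f"feature clash on {attr}")
                fstructure[attr] = value
        else:
            # Non-head daughter: embed its f-structure under a function name.
            if annotation in fstructure:
                raise ValueError(f"function {annotation} filled twice")
            fstructure[annotation] = child_f
    return fstructure


rules = read_off_rules(TREE)
fstructure = reparse(TREE)
```

Because reparse never searches for a tree but only walks the one already given in the treebank, each sentence yields at most one f-structure, which is the sense in which the process is deterministic if the annotations are.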