In this talk I'll report on joint work with Louisa Sadler (U. of Essex), Anette Frank (XRCE Grenoble) and Andy Way (DCU).
Treebanks which encode higher-level feature-structure information, in addition to pure phrase-structure information, are required as training resources for probabilistic unification grammars and data-driven parsing approaches. Manual construction of such treebanks is labour- and cost-intensive. As an alternative, one could envisage constructing new unification grammars, or scaling up existing ones, and then using them to analyze corpora. However, grammar development and scaling-up are themselves labour- and cost-intensive. What is more, even if a large-coverage grammar is available, it would typically come up with hundreds or thousands of candidate analyses for each sentence in an input text, from which a highly trained expert has to select.
We have developed an alternative method. The basic idea is simple:
take an existing treebank, read off its context-free phrase structure
grammar (CF-PSG) following [Charniak, 96], manually add f-structure
annotations to the grammar rules, provide macros for the lexical
entries and then "reparse" the treebanked trees simply following the
original c-structure
annotations. During this reparsing process, the f-structure
annotations are resolved, and an f-structure is produced. The process
is deterministic if the annotations are, so costly manual inspection
of candidate analyses is largely avoided. In our current
research we further automate the annotation process. In one approach
we write annotation templates and compile them out over rule sets
extracted from the treebank. In another approach we directly rewrite
treebank entries into feature structures using templates. Some of the
resources generated can be inspected at
http://www.compapp.dcu.ie/~away/Treebank/treebank.html
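
As a rough illustration of the basic idea (not the actual implementation), the following Python sketch reads off context-free rules from a toy tree, attaches hand-written annotations to them, and "reparses" the tree by following its existing structure. Everything in it is an assumption made for illustration: the tuple-based tree encoding, the tag names, the annotation notation ("=" for "the mother's f-structure is the daughter's", a function name such as SUBJ for "the daughter fills that function"), the lexical macros and the simplified PRED values.

```python
from collections import Counter

# Hypothetical toy tree: (label, child, ...); a preterminal is (tag, word).
TREE = ("S",
        ("NP", ("NNP", "Mary")),
        ("VP", ("VBZ", "sees"), ("NP", ("NNP", "John"))))


def is_preterminal(node):
    return len(node) == 2 and isinstance(node[1], str)


def read_off_rules(node, rules=None):
    """Count one phrasal CF rule (lhs, rhs) per local subtree."""
    if rules is None:
        rules = Counter()
    if is_preterminal(node):
        return rules  # lexical entries are handled by macros, not rules
    lhs, children = node[0], node[1:]
    rules[(lhs, tuple(child[0] for child in children))] += 1
    for child in children:
        read_off_rules(child, rules)
    return rules


# Hand-written annotations (illustrative), one per right-hand-side category.
ANNOTATIONS = {
    ("S", ("NP", "VP")): ("SUBJ", "="),
    ("VP", ("VBZ", "NP")): ("=", "OBJ"),
    ("NP", ("NNP",)): ("=",),
}

# Lexical macros (illustrative): map a tag and word to a small f-structure.
LEXICAL_MACROS = {
    "NNP": lambda word: {"PRED": word, "NUM": "sg"},
    "VBZ": lambda word: {"PRED": word, "TENSE": "pres"},
}


def unify(f, g):
    """Merge g into f in place; raise ValueError on conflicting values."""
    for key, value in g.items():
        if key not in f:
            f[key] = value
        elif isinstance(f[key], dict) and isinstance(value, dict):
            unify(f[key], value)
        elif f[key] != value:
            raise ValueError(f"clash on {key}: {f[key]!r} vs {value!r}")
    return f


def reparse(node):
    """Follow the given tree and resolve the annotations bottom-up."""
    if is_preterminal(node):
        tag, word = node
        return LEXICAL_MACROS[tag](word)
    lhs, children = node[0], node[1:]
    annotations = ANNOTATIONS[(lhs, tuple(child[0] for child in children))]
    fs = {}
    for child, annotation in zip(children, annotations):
        child_fs = reparse(child)
        if annotation == "=":
            unify(fs, child_fs)                # daughter shares mother's f-structure
        else:
            unify(fs, {annotation: child_fs})  # daughter fills a grammatical function
    return fs


if __name__ == "__main__":
    for (lhs, rhs), count in read_off_rules(TREE).items():
        print(count, lhs, "->", " ".join(rhs))
    print(reparse(TREE))
```

Because the tree is given, no search over alternative analyses takes place: each rule instance contributes exactly the annotations attached to it, which is why the outcome is deterministic whenever the annotations are.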