SEAL: Structural Enhancement of Automatically-acquired Lexicons

Overview This project is concerned with the development of practical, large-scale lexicons for use in computer applications that process natural language. While a number of large-scale lexical fragments already exist (hand-crafted or derived from dictionaries or corpora), few of them are practically useful. Reasons for this include insufficient content density (e.g. syntax but no semantics), inadequate internal structuring, or an inappropriate level of detail for the application in hand (too much irrelevant detail can greatly increase the processing load).

Rather than attempting to construct better lexicons from scratch, this project is developing tools which use these existing lexicons as base data, and support the development of new lexicons with greater content density, enhanced structure and application-specific level of detail. This is achieved by merging lexicons together, and by inducing additional structure, guided by insights from lexical representation theory.
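
As a rough illustration of how merging can raise content density, the sketch below combines a syntax-only lexicon with a semantics-only one. It is not the project's software: the dictionary-of-features representation, the function name merge_lexicons and the sample entries are all assumptions made for this example.

```python
def merge_lexicons(*lexicons):
    """Union the features of entries that share a headword.

    Each lexicon is assumed to be a dict mapping a word to a dict of
    feature names and values (a hypothetical common notation).
    Returns the merged lexicon and a list of conflicting values.
    """
    merged = {}
    conflicts = []
    for lexicon in lexicons:
        for word, features in lexicon.items():
            entry = merged.setdefault(word, {})
            for name, value in features.items():
                if name in entry and entry[name] != value:
                    # Keep the first value seen and report the clash.
                    conflicts.append((word, name, entry[name], value))
                else:
                    entry[name] = value
    return merged, conflicts


# Invented sample data: one lexicon with syntax, one with semantics.
syntactic = {
    "give": {"cat": "verb", "subcat": "ditransitive"},
    "dog": {"cat": "noun"},
}
semantic = {
    "give": {"sem": "transfer"},
    "dog": {"cat": "noun", "sem": "animal"},
}

merged, conflicts = merge_lexicons(syntactic, semantic)
```

Here each merged entry carries both syntactic and semantic features. Conflicting values are reported rather than silently overwritten, since deciding between two sources is likely to need domain-specific judgement.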

The work is being evaluated through two pilot applications in areas related to other research within the Institute, namely text generation and information extraction.

Background The recent commercially-motivated growth in interest in applied computational linguistics (or Language Engineering) has highlighted the need for realistic large-scale linguistic resources, notably grammars and lexicons. Significant research activity has thus been directed towards developing such resources, and a number of large-scale lexical fragments are beginning to emerge (such as ACQUILEX, CELEX, WordNet, XTAG). However, these lexicons remain of only limited utility in practical applications. Reasons for this include:
  • insufficient content density - most existing lexicons focus on one particular kind of lexical information (syntax, semantics, orthography etc.) and do not provide enough coverage of other aspects.
  • inadequate internal structuring - lexicons tend to be very 'flat', yet structure is important for maintenance and flexibility of representation and access.
  • inappropriate level of detail for the application in hand - too little detail limits the utility of the lexicon, while too much detail that is not relevant to the particular application task often incurs severe processing penalties.
The early expectation of this line of research was that it would ultimately lead to large general purpose lexicons suitable for use in a wide range of applications. To achieve that goal, a significant amount of manual post-processing to rectify the above problems might be tolerable as a one-off enterprise. However, a more recent trend suggests that the single common lexicon approach might not after all be the most effective way forward. Rather, different application areas might be better served by more tailored (but still large-scale) lexicons, each acquired from domain-specific corpora and knowledge sources. This focusses attention on the need for tools to address these problems many times over, in different task domains.

The project The present project is concerned with taking existing lexicons as base data, and producing new lexicons with greater content density, enhanced structure and optimal feature detail (relative to a given domain or task), by merging lexicons and by inducing additional structure. Our approach is motivated by the following assumptions:
  • that theoretical work on lexical description provides valuable insight into the structure of large-scale practical lexicons
  • that the application domain is limited - this keeps the problems manageable, and offers scope for investigation of domain-specific phenomena
The key features of the programme of research are:
  • the analysis of existing lexicons and structural comparison with current lexical theory
  • the development of software techniques and subsequently tools for the induction of additional structure in acquired lexicons
  • the evaluation of the methods developed in applications to natural language generation and information extraction
The approach taken is to map existing lexicons into a common notation and develop enhancement techniques in that notation. Both data-driven (inducing additional structure from the data provided) and theory-driven (seeking instances of theoretical proposals in the data) approaches are being explored. Examples of the kind of enhancements we are particularly interested in include the following:
  • construction of optimal inheritance hierarchies, guided by metrics relating to both efficiency (such as maximal generality) and linguistic theory (such as default cases being unmarked)
  • induction of feature theory, including feature typing, and properties such as mutual exclusivity, covariation and immutability
  • representation of lexical entry covariation as abstract lexical relationships
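
The first two kinds of enhancement can be sketched in a few lines, under strong simplifying assumptions. The code below is not the project's method: it assumes a flat lexicon held as a dictionary of feature dictionaries, builds only a single parent node, and picks defaults purely by frequency. The function names and sample entries are hypothetical.

```python
from collections import Counter
from itertools import combinations


def factor_defaults(entries):
    """Split flat entries into one parent node of defaults plus overrides.

    The most frequent value of each feature becomes the inherited
    default, so the common case is left unmarked on the entries.
    """
    counts = {}
    for features in entries.values():
        for name, value in features.items():
            counts.setdefault(name, Counter())[value] += 1
    defaults = {name: c.most_common(1)[0][0] for name, c in counts.items()}
    overrides = {
        word: {n: v for n, v in features.items() if defaults[n] != v}
        for word, features in entries.items()
    }
    return defaults, overrides


def mutually_exclusive(entries):
    """List pairs of feature names that never occur in the same entry."""
    names = sorted({n for features in entries.values() for n in features})
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if not any(a in f and b in f for f in entries.values())
    ]


# Invented sample data.
verbs = {
    "walk": {"cat": "verb", "past": "+ed"},
    "talk": {"cat": "verb", "past": "+ed"},
    "sing": {"cat": "verb", "past": "ablaut"},
}
flat = dict(verbs, dog={"cat": "noun", "plural": "+s"})

defaults, overrides = factor_defaults(verbs)
exclusive = mutually_exclusive(flat)
```

In this toy data only "sing" keeps an explicit "past" value, since the regular verbs inherit theirs from the parent node, and "past" and "plural" are found never to co-occur. A realistic tool would need to build multi-level hierarchies and weigh efficiency metrics against linguistic ones, which this sketch does not attempt.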

Staff The principal investigator is Roger Evans. Adam Kilgarriff is also working on the project.

Financial Support The project is supported by the EPSRC under grant GR/K/18931.

Publications R. Evans and A. Kilgarriff, "MRDs, Dictionaries, and How To Do Lexical Engineering", in Proceedings of the 2nd Language Engineering Convention, pp. 125-132, London, UK, 1995.

A. Kilgarriff, "Which words are particularly characteristic of a text? A survey of statistical approaches", in Proceedings, AISB Workshop on Language Engineering for Document Analysis and Recognition, Brighton, UK, 1996.

A. Kilgarriff and R. Salkie, "Corpus similarity and homogeneity via word frequency", in Proceedings of Euralex '96, Gothenburg, Sweden, 1996.

A. Kilgarriff, "Putting frequencies in the dictionary", International Journal of Lexicography. Forthcoming.


Maintained by Roger Evans (Roger.Evans@itri.brighton.ac.uk).
Last updated 12 January 1997

©Information Technology Research Institute
