<University of Brighton logotype>



Lexical Research

An important component of any language processing system is its lexicon: the actual words that the system knows about. The study of lexicons (how they should be organised, what words they should include, what information about those words they should contain, etc.) has recently become a very active research area throughout the language engineering community. The work of the lexicon group focuses on three key areas: lexical organisation, lexical engineering and corpora.

Lexical organisation
There are many reasons for being interested in how lexicons are or should be organised. For the theorist, the challenge is to construct a linguistically motivated organisation which concisely captures the 'right' generalisations about lexical phenomena. More practically, a well-organised lexicon is easier to understand, maintain and extend. And for the applied language engineer, a well-organised lexicon is the most appropriate basis for the development of 'hard-coded' modules that perform a particular lexical access task. Our research in this area, in collaboration with the University of Sussex, is concerned with fundamental issues of organisation, with a current focus on multilingual lexicons in particular. This work centres on the continuing development and use of the lexicon description language DATR, a non-monotonic knowledge representation language designed specifically for lexicons.
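To give a flavour of what non-monotonic (default) organisation means for a lexicon, here is a minimal Python sketch. It is not DATR and does not reproduce DATR's syntax or full semantics; the `Node` class, its `lookup` method and the example entries are hypothetical names invented for this illustration. It shows only two ideas: a general node states a default once, and a more specific node may override it.

```python
class Node:
    """A lexicon node: attribute paths mapped to values, plus an optional parent.

    Hypothetical illustration only; this is not the DATR language.
    """

    def __init__(self, name, statements, parent=None):
        self.name = name
        self.parent = parent
        # Paths are written as space-separated attributes, e.g. "past participle".
        self.statements = {tuple(p.split()): v for p, v in statements.items()}

    def lookup(self, path, origin=None):
        """Return the value for `path`, using defaults where nothing specific is stated."""
        if origin is None:
            origin = self  # the node the query started from
        query = tuple(path.split())
        # Prefer the longest stated prefix of the query path at this node.
        for length in range(len(query), -1, -1):
            prefix = query[:length]
            if prefix in self.statements:
                value = self.statements[prefix]
                # Callable values are rules evaluated against the original node.
                return value(origin) if callable(value) else value
        # Nothing stated here: inherit from the parent node, if any.
        if self.parent is not None:
            return self.parent.lookup(path, origin)
        raise KeyError(f"{origin.name}: no value for {path!r}")


# The general node states the regular rule once.
VERB = Node("VERB", {"cat": "verb", "past": lambda n: n.lookup("root") + "ed"})

# A regular verb states only its root; everything else is inherited.
walk = Node("Walk", {"root": "walk"}, parent=VERB)

# An irregular verb overrides the default where it differs.
sing = Node("Sing", {"root": "sing", "past": "sang", "past participle": "sung"}, parent=VERB)

print(walk.lookup("past"))             # regular form, from the default rule
print(walk.lookup("past participle"))  # falls back to the "past" default
print(sing.lookup("past"))             # overridden locally
```

The regular entry needs a single statement, while the irregular entry records only its exceptions. This is the kind of concise generalisation that a well-organised lexicon is meant to capture.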

Lexical engineering
Real language processing systems need large-scale lexicons. Most recent work on such lexicons has been very generic in nature, directed towards producing a single 'universal' lexicon that can be used in a wide range of systems. But it is gradually becoming accepted that, to be efficient and effective, a practical lexicon needs to be tuned to a particular application. While the kind of lexical resources developed so far may be useful as 'fall-back' lexicons, they are not suitable for the core domain-specific intensive processing of an application. In the SEAL (Structural Enhancement of Automatically-acquired Lexicons) project, we are looking at ways of using current lexical resources as the basis for the development of application-specific lexicons (rather than developing them from scratch, as most previous work has done). We are developing tools which use techniques such as merging and filtering information from multiple sources, and induction of organisational structure, guided by application-specific data to tune the resulting representations. This research is funded by the EPSRC, for three years from March 1995.
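As a rough illustration of merging and filtering, the following Python sketch combines two generic lexical resources and keeps only the entries attested in application-specific text. It does not describe the SEAL tools themselves; the function names, the priority rule (earlier sources win) and the toy data are all assumptions made for this example.

```python
from collections import Counter


def merge_lexicons(*sources):
    """Merge several word -> features dictionaries.

    Assumption for this sketch: earlier sources take priority, and later
    sources only fill in features that are still missing.
    """
    merged = {}
    for source in sources:
        for word, features in source.items():
            entry = merged.setdefault(word, {})
            for feature, value in features.items():
                entry.setdefault(feature, value)
    return merged


def filter_by_application(lexicon, application_text, min_count=1):
    """Keep only the entries whose word occurs in the application text."""
    counts = Counter(application_text.lower().split())
    return {word: entry for word, entry in lexicon.items()
            if counts[word] >= min_count}


# Toy generic resources (hypothetical data).
resource_a = {
    "valve": {"pos": "noun"},
    "open": {"pos": "verb"},
    "sonnet": {"pos": "noun"},
}
resource_b = {
    "valve": {"pos": "noun", "plural": "valves"},
    "open": {"pos": "adjective", "past": "opened"},
}

# Toy application-specific text (hypothetical data).
manual_text = "Open the valve then close the valve"

merged = merge_lexicons(resource_a, resource_b)
tuned = filter_by_application(merged, manual_text)
print(tuned)
```

Here the word "sonnet" is dropped because the application text never uses it, while "valve" gains a plural form from the second resource. A real system would need a more careful policy for conflicting information than the simple priority rule assumed here.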

The CONCEDE project takes a more lexicographical view of lexicons. This project aims to develop lexical knowledge bases for six Eastern European languages (Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene), drawing on standards developed in the Text Encoding Initiative.

Corpora
Computer-readable text is available as never before. This makes it possible to study language in many new ways. Our research focuses on the characterisation of bodies of text, or corpora, according to: how homogeneous they are and how similar to each other; word- and word-class-frequency distributions, and what they tell us about language structure; automatic and semi-automatic acquisition of lexicons from corpora. We have been using the 100-million-word British National Corpus for our investigations, and are currently leading further developments of this major national resource.
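The Python sketch below illustrates one simple way to compare two bodies of text through their word-frequency distributions. It is offered as an assumption-laden example rather than as the method used in our investigations: the chi-square-style statistic over the most frequent words, the whitespace tokenisation, and the names `word_frequencies` and `chi_square_statistic` are all choices made for this illustration.

```python
from collections import Counter


def word_frequencies(text):
    """Count word occurrences, using naive lower-casing and whitespace splitting."""
    return Counter(text.lower().split())


def chi_square_statistic(freq_a, freq_b, top_n=500):
    """Compare two frequency distributions over their jointly most common words.

    Lower values mean the two texts use common words in more similar
    proportions. Both texts must be non-empty.
    """
    size_a = sum(freq_a.values())
    size_b = sum(freq_b.values())
    if size_a == 0 or size_b == 0:
        raise ValueError("both frequency distributions must be non-empty")
    total = size_a + size_b
    joint = freq_a + freq_b
    statistic = 0.0
    for word, joint_count in joint.most_common(top_n):
        for freq, size in ((freq_a, size_a), (freq_b, size_b)):
            # Count expected if the word were spread evenly across both texts.
            expected = joint_count * size / total
            statistic += (freq[word] - expected) ** 2 / expected
    return statistic


# Toy texts (hypothetical data).
text_a = "the cat sat on the mat and the dog sat on the rug"
text_b = "the court ruled on the appeal and the judge ruled on the costs"

freq_a = word_frequencies(text_a)
freq_b = word_frequencies(text_b)

print(freq_a.most_common(3))
print(chi_square_statistic(freq_a, freq_a))  # a text compared with itself
print(chi_square_statistic(freq_a, freq_b))  # two different texts
```

The same comparison applied to two halves of a single corpus gives a rough indication of how homogeneous that corpus is.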

For further information, please contact Roger Evans (+44 1273 642902) - see our contact page for full contact details.


Maintained by Roger Evans (Roger.Evans@itri.brighton.ac.uk).
Last updated 20 October 1997

©Information Technology Research Institute
