An important component of any language processing system is its
lexicon: the actual words that the system knows about. The study of
lexicons (how they should be organised, what words they should include,
what information about those words they should contain etc.) has
recently become a very active research area throughout the language
engineering community. The work of the lexicon group is focused on three
key areas:
lexical organisation,
lexical engineering and
corpora.
Lexical organisation
There are many reasons for being interested in how lexicons are or
should be organised. For the theorist, the challenge is to construct
a linguistically motivated organisation which concisely captures
the `right' generalisations about lexical phenomena. More practically,
a well-organised lexicon is easier to understand, maintain and
extend. And for the applied language engineer, a well-organised
lexicon is the most appropriate basis for the development of
`hard-coded' modules that actually do some particular lexical access
task. Our research in this area, in collaboration with the
University of Sussex, is concerned with fundamental issues of
organisation, with a current focus on multilingual lexicons in
particular. This work centres on the continuing development and
use of the lexicon description language DATR, a non-monotonic
knowledge representation language designed specifically for lexicons.
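To illustrate what non-monotonic (default) inheritance means for a lexicon, here is a minimal Python sketch. It is not DATR syntax and not the group's implementation; the `LEXICON` table, the node names and the `lookup` and `past_tense` helpers are all hypothetical. The idea shown is only that a general node states a default once, and a specific entry can override it.

```python
# Hypothetical toy lexicon: each node names an optional parent and its local facts.
# This only illustrates default inheritance with overrides; it is not DATR.
LEXICON = {
    "VERB": {"parent": None, "facts": {"category": "verb", "past_suffix": "ed"}},
    "walk": {"parent": "VERB", "facts": {"root": "walk"}},
    "sing": {"parent": "VERB", "facts": {"root": "sing", "past": "sang"}},
}


def lookup(node, attribute):
    """Return the most specific value for attribute, searching up the hierarchy."""
    while node is not None:
        entry = LEXICON[node]
        if attribute in entry["facts"]:
            return entry["facts"][attribute]
        node = entry["parent"]
    raise KeyError(attribute)


def past_tense(word):
    """Use a locally stated past form if there is one, else the inherited default."""
    try:
        return lookup(word, "past")
    except KeyError:
        return lookup(word, "root") + lookup(word, "past_suffix")


print(past_tense("walk"))
print(past_tense("sing"))
```

The regular verb needs no entry beyond its root, while the irregular verb records only the one fact that differs from the default.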
Lexical engineering
Real language processing systems need large-scale lexicons. Most recent
work on such lexicons has been very generic in nature, directed towards
producing a single `universal' lexicon that can be used in a wide range
of systems. But it is gradually becoming accepted that to be efficient
and effective, a practical lexicon needs to be tuned to a particular
application. While the kinds of lexical resources developed so far may be
useful as `fall-back' lexicons, they are not suitable for the intensive,
domain-specific processing at the core of an application. In the SEAL (Structural Enhancement of
Automatically-acquired Lexicons) project, we are looking at ways of
using current lexical resources as the basis for development of
application-specific lexicons (rather than developing them from scratch,
as most previous work has). We are developing tools which use techniques
such as merging and filtering information from multiple sources, and
induction of organisational structure, guided by application-specific
data to tune the resulting representations. This research is funded by
the EPSRC, for three years from March 1995.
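The merging and filtering step mentioned above can be pictured with a small Python sketch. This is an assumption-laden illustration, not the SEAL tools: the function names, the dictionary layout and the rule that later sources win on conflict are all hypothetical choices made for the example.

```python
def merge_lexicons(*sources):
    """Combine several word -> information dictionaries into one.

    Assumption for this sketch: when two sources disagree about a feature,
    the source listed later wins.
    """
    merged = {}
    for source in sources:
        for word, info in source.items():
            merged.setdefault(word, {}).update(info)
    return merged


def filter_by_vocabulary(lexicon, domain_words):
    """Keep only the entries whose word occurs in the application's vocabulary."""
    return {word: info for word, info in lexicon.items() if word in domain_words}


# Hypothetical generic resources and a hypothetical application vocabulary.
source_a = {"bank": {"pos": "noun"}, "run": {"pos": "verb"}}
source_b = {"bank": {"sense": "financial institution"}, "loan": {"pos": "noun"}}
domain_words = {"bank", "loan"}

tuned = filter_by_vocabulary(merge_lexicons(source_a, source_b), domain_words)
print(tuned)
```

A real system would need a more careful conflict policy than "last source wins"; the point here is only the shape of the pipeline, with generic resources going in and an application-specific lexicon coming out.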
The CONCEDE project takes
a more lexicographical view of lexicons. This project aims to
develop lexical knowledge bases for six Eastern European languages
(Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene) drawing on
standards developed in the Text Encoding Initiative.
Corpora
Computer-readable text is available as never before. This makes it
possible to study language in many new ways. Our research focuses on the
characterisation of bodies of text, or corpora, according to how
homogeneous they are and how similar to each other; on word- and
word-class-frequency distributions, and what they tell us about language
structure; and on automatic and semi-automatic acquisition of lexicons from
corpora. We have been using the 100-million-word British National Corpus
for our investigations, and are currently leading further developments
of this major national resource.
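As a rough illustration of characterising texts by their word-frequency distributions, the Python sketch below counts word frequencies and compares two texts with cosine similarity. Cosine similarity is just one of many possible measures, chosen here for brevity; it is not presented as the measure used in this research, and the function names and whitespace tokenisation are assumptions of the example.

```python
from collections import Counter
import math


def word_frequencies(text):
    """Count lower-cased, whitespace-separated tokens (a deliberately crude tokeniser)."""
    return Counter(text.lower().split())


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two word-frequency vectors, from 0.0 to 1.0."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[word] * freq_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical miniature "corpora" for demonstration only.
text_a = word_frequencies("the cat sat on the mat")
text_b = word_frequencies("the dog sat on the log")
print(cosine_similarity(text_a, text_b))
```

The same comparison could be applied between halves of a single corpus to get a crude indication of how homogeneous it is.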
For further information, please contact
Roger Evans (+44 1273 642902) - see our contact page for full
contact details.
Maintained by Roger Evans (Roger.Evans@itri.brighton.ac.uk).
Last updated 20 October 1997
©Information Technology Research Institute