Second call for participation and papers

Pilot SENSEVAL

An evaluation exercise for word sense disambiguation programs

Workshop: SENSEVAL AND THE LEXICOGRAPHY LOOP

Sept 2-4 1998

Sussex, UK

Sponsored by ACL SIGLEX and EURALEX

Related pages: background | participants | Planning/Programme Committee | Workshop Submissions

There are now many automatic Word Sense Disambiguation (WSD) programs but it is currently very hard to determine which are better, which worse, and where the strengths and weaknesses of each lie. There is widespread agreement that the field urgently needs an evaluation framework. Under the auspices of ACL SIGLEX and EURALEX a pilot will take place in the course of 1998. As in ARPA evaluation exercises, the framework comprises:

  1. definition of task and scoring metric
  2. preparation of a set of manually tagged correct answers
  3. a dry run, with sample data distributed to participants
  4. distribution of test data to participants; participants sense-tag and return; taggings scored against correct answers
  5. workshop to discuss results, lessons learned, way forward
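Step 4 can be illustrated with a minimal sketch of scoring taggings against correct answers. This call does not define the scoring metric, so the exact-match precision and recall used here are an assumption, as are the function name `score_taggings`, the instance identifiers and the sense labels, all of which are hypothetical.

```python
def score_taggings(gold: dict, system: dict) -> dict:
    """Score system sense tags against manually tagged correct answers.

    Both arguments map an instance id to a single sense label.
    Exact-match scoring is an assumption, not the metric of the exercise.
    """
    # Only instances that appear in the gold answers count as attempts.
    attempted = [key for key in system if key in gold]
    correct = sum(1 for key in attempted if system[key] == gold[key])
    # Precision: share of attempted instances tagged correctly.
    precision = correct / len(attempted) if attempted else 0.0
    # Recall: share of all gold instances tagged correctly.
    recall = correct / len(gold) if gold else 0.0
    return {
        "attempted": len(attempted),
        "correct": correct,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical data: four gold instances, of which the system attempts three.
gold = {"s1.bank": "bank%1", "s2.bank": "bank%2",
        "s3.bank": "bank%1", "s4.bank": "bank%2"}
system = {"s1.bank": "bank%1", "s2.bank": "bank%1", "s3.bank": "bank%1"}

result = score_taggings(gold, system)
print(result)
```

Reporting precision and recall separately keeps a system that declines to tag hard instances distinguishable from one that attempts everything.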

We shall be undertaking evaluation for at least English, French, Italian and Spanish. For information on the French and Italian exercises, see ROMANSEVAL. For Spanish, mail Evelyne Viegas.

The workshop will be held at Herstmonceux Castle, Sussex, UK, on Sept 2-4 1998.

If you have a working WSD program (or will have one by Summer 1998), and would like to subject it to objective, quantitative evaluation, or if you have skills or resources that you would like to contribute to the exercise, first look here and then mail your expression of interest to the co-ordinator.

Details of tasks for English

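We intend (funding permitting) to run three distinct exercises: one for those who need sense-tagged training data, and two variants for those who do not. In the first variant of the no-training-data task, all the content words in a set of sentences are tagged (the "all-types" task, using WordNet senses, like SEMCOR). In the second variant, tagging is only performed on a few selected words (the "lexical-sample" task). In all, three tasks:

  1. with-training: for systems that need sense-tagged training data
  2. all-types: no training data; all the content words in a set of sentences are tagged, using WordNet senses
  3. lexical-sample: no training data; only a few selected words are tagged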

Systems that can perform all-types can perform lexical-sample, and ones that can perform lexical-sample can perform with-training (assuming appropriate lexicons are available). Inevitably, some algorithms do not fit the categories neatly: for example, some require human input for lexicon development, possibly corpus-aided, while others require only minimal quantities of training data. All I can say about this is that (1) it's as close as we can get to a level playing field, and (2) any comparison of scores must bear it in mind!

No untagged corpus material of the same genre as the evaluation material will be distributed. However, the evaluation material will be taken from a similar spread of genres to the British National Corpus (BNC). Limited downloads of the BNC can be made without a BNC licence here. Because the BNC is a general-purpose, mixed-genre corpus, various other corpus resources (preferably for British English) would also be suitable.

Detailed timetable for each task to follow.



Maintained by Adam Kilgarriff
Last updated 5 March 1998
©University of Brighton

Information Technology Research Institute