English SENSEVAL resources
in the public domain

Adam Kilgarriff
ITRI
University of Brighton

Introduction

This page is a guide to English-language materials used in the 1998 SENSEVAL Word Sense Disambiguation (WSD) evaluation exercise. Dictionary entries for the 35 selected words, and corpus instances manually sense-tagged according to those dictionary entries, are available. Oxford University Press have kindly agreed to place these materials in the public domain. Thanks are also due to the UK EPSRC who supported the manual re-tagging activity, under grant M03481.

HECTOR

The dictionary and corpus were both developed in the course of HECTOR, a joint Oxford University Press and Digital project that took place in the early 1990s. During the project a 20-million-word corpus (which also served as a pilot for the British National Corpus) was developed. For details of its composition see here.

For a sample of a few hundred words, the OUP lexicographers used the corpus to develop dictionary entries, and in the course of the lexicography, they associated a sense-tag with every instance of the word in the corpus. The rules and strategies they followed in the corpus-tagging and dictionary-writing are described in their Policy Guide.

Lexical sample

The 35 words used in English SENSEVAL, their distribution according to part of speech, and the numbers of test instances associated with each (N), were:

Nouns (-n)          Verbs (-v)          Adjectives (-a)     Indeterminates (-p)
word            N   word            N   word            N   word            N
accident      267   amaze          70   brilliant     229   band          302
behaviour     279   bet           177   deaf          122   bitter        373
bet           274   bother        209   floating       47   hurdle        323
disability    160   bury          201   generous      227   sanction      431
excess        186   calculate     217   giant          97   shake         356
float          75   consume       186   modest        270
giant         118   derive        216   slight        218
knee          251   float         229   wooden        195
onion         214   invade        207
promise       113   promise       224
rabbit        221   sack          178
sack           82   scrap         186
scrap         156   seize         259
shirt         184
steering      176
TOTAL        2756   TOTAL        2501   TOTAL        1406   TOTAL        1785

"Indeterminates" were words for which the task involved determining major word class, as well as determining word sense. For the other tasks, major word class was given.
For the word selection strategy, see the bibliography.
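
The relationship between words and tasks can be checked with a short Python sketch. The word lists are copied from the table above, and each task name is formed as the word plus its class suffix (for example sack-n and sack-v). The names TASK_CLASSES, task_names and split_task are illustrative only and are not part of the SENSEVAL distribution.

```python
# Words per class, copied from the lexical sample table.
TASK_CLASSES = {
    "n": ("accident behaviour bet disability excess float giant knee "
          "onion promise rabbit sack scrap shirt steering").split(),
    "v": ("amaze bet bother bury calculate consume derive float invade "
          "promise sack scrap seize").split(),
    "a": "brilliant deaf floating generous giant modest slight wooden".split(),
    "p": "band bitter hurdle sanction shake".split(),
}


def task_names():
    """Build task names such as 'sack-n' from the class lists."""
    return [f"{word}-{suffix}"
            for suffix, words in TASK_CLASSES.items()
            for word in words]


def split_task(task):
    """Split 'sack-n' into ('sack', 'n')."""
    word, _, suffix = task.rpartition("-")
    return word, suffix


tasks = task_names()
distinct_words = {split_task(task)[0] for task in tasks}
print(len(tasks), len(distinct_words))  # prints: 41 35
```

Several words (bet, float, giant, promise, sack, scrap) occur in more than one class, which is why the number of tasks exceeds the number of words.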

Dictionary entries

The dictionary entries are available in SGML (as described in the Policy Guide) and also in PostScript, for ease of reading. The SGML entries are available, tar'd and gzipped, here. A sample of three items (rabbit, generous and shake) is also available uncompressed here.

A PostScript version of the full 35-word dictionary is available (gzipped) here.

Test corpus

The test corpus was the corpus over which the SENSEVAL English evaluation took place. There is one file for each "task", where a task is identified by the word plus one of -n, -v, -a or -p, depending on whether the items were nouns, verbs, adjectives, or word-class-to-be-determined (see above). For several words there were multiple tasks, e.g. sack-n and sack-v. For the 35 words there were 41 tasks. The format for test instances is as below:
700001 
John Dos Passos wrote a poem that talked of `the <tag>bitter</> beat look, the scorn on the lip."   

700002
The beans almost double in size during roasting. 
Black beans are over roasted and will have a <tag>bitter</> flavour and insufficiently roasted beans are pale and give a colourless, tasteless drink. 
The files are available, tar'd and gzipped, here. A sample of three items (rabbit, generous and shake) is also available uncompressed here.
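
As a minimal sketch, the instance format above could be read in Python as follows. It assumes that every instance starts with a line containing only its numeric identifier, that the following non-blank lines are its context, and that the target word is marked with <tag>...</>. The function name parse_test_instances is hypothetical, and the real files may contain cases this sketch does not handle (for instance, a context line consisting only of digits).

```python
import re

# Matches the marked target word, e.g. <tag>bitter</>.
TAG_RE = re.compile(r"<tag>(.*?)</>")


def parse_test_instances(text):
    """Return a list of dicts with keys 'id', 'context' and 'target'."""
    instances = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate instances
        if line.isdigit():
            # Assumption: a digits-only line opens a new instance.
            instances.append({"id": line, "lines": []})
        elif instances:
            instances[-1]["lines"].append(line)
    parsed = []
    for inst in instances:
        context = " ".join(inst["lines"])
        match = TAG_RE.search(context)
        parsed.append({
            "id": inst["id"],
            "context": context,
            "target": match.group(1) if match else None,
        })
    return parsed


sample = """700001
John Dos Passos wrote a poem that talked of `the <tag>bitter</> beat look, the scorn on the lip."

700002
The beans almost double in size during roasting.
Black beans are over roasted and will have a <tag>bitter</> flavour and insufficiently roasted beans are pale and give a colourless, tasteless drink.
"""

for inst in parse_test_instances(sample):
    print(inst["id"], inst["target"])
```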

Gold standard taggings

The "gold standard" taggings give the sense tag(s) that the human sense-tagging team considered correct for each corpus instance. For each task there is a gold-standard file, with the following format.

700002:kind
700003:unstint or kind
700004:copious
This was drawn from the gold-standard file for generous-a. Each HECTOR sense is associated with a "mnemonic", and kind, copious and unstint are the mnemonics for three of the senses of generous. The sense tag for generous-a test-set instance 700002 is kind, that for 700003 is unstint or kind (human taggers were permitted to give disjunctive taggings), and that for 700004 is copious.
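
A gold-standard file in this format could be loaded with a few lines of Python. This is only a sketch: it assumes one "id:tags" entry per line and that disjunctive taggings are always joined by the word "or" surrounded by spaces, as in the example. The function name parse_gold is hypothetical.

```python
def parse_gold(text):
    """Map each instance id to the list of sense mnemonics given for it."""
    gold = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        instance_id, _, tags = line.partition(":")
        # Assumption: alternatives are separated by " or ".
        gold[instance_id.strip()] = [t.strip() for t in tags.split(" or ")]
    return gold


sample_gold = """700002:kind
700003:unstint or kind
700004:copious
"""

gold = parse_gold(sample_gold)
print(gold["700003"])  # prints: ['unstint', 'kind']
```

Keeping the tags as a list means a single-sense tagging and a disjunctive tagging can be handled uniformly.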

Each sense had a numerical unique identifier, as well as a mnemonic. The mapping between the two is available here.

The files are available, tar'd and gzipped, here. A sample of three items (rabbit, generous and shake) is also available uncompressed here.

Training corpus

For all but five of the test words, there was material in HECTOR which was not used for the evaluation or re-tagged as part of SENSEVAL, but which had been tagged in the course of HECTOR and was made available to SENSEVAL participants as a training set. The format for these instances is as below.
800004
Mr Purves is tight-lipped about what happens then. 
He recently vexed rumour-mongers, who <tag "520051">bet</> on a bid for Midlan sooner rather than later, by declining to disclose Hongkong Bank's inner reserves when the ban reported its 1989 results on March 13th.   

800005
Of our other leisure facilities, Blackbird Leys Leisure Centre, opened last autumn was given a six-month event-free trial and will now begin to attract the large events and exhibitions to subsidise its running costs. 
Finally, Mr Hugh-Jones loses his <tag "519914?">bet</>: 400,000 people attended Temple Cowley pools last year.  
The files are available, tar'd and gzipped, here. A sample of two items (generous and shake) is also available uncompressed here.
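
The training format differs from the test format only in that the tag carries a quoted value. A hedged Python sketch for reading it is given below. It assumes that the quoted value is a numeric sense identifier, optionally followed by "?", as in the two examples above. The meaning of the "?" is not stated on this page, so the sketch only records its presence. The function name parse_training_instances and the key names are hypothetical.

```python
import re

# Matches e.g. <tag "520051">bet</> or <tag "519914?">bet</>.
TRAIN_TAG_RE = re.compile(r'<tag "(\d+)(\?)?">(.*?)</>')


def parse_training_instances(text):
    """Return dicts with 'id', 'sense_id', 'has_question_mark', 'target'."""
    instances = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.isdigit():
            # Assumption: a digits-only line opens a new instance.
            instances.append({"id": line, "lines": []})
        elif instances:
            instances[-1]["lines"].append(line)
    parsed = []
    for inst in instances:
        match = TRAIN_TAG_RE.search(" ".join(inst["lines"]))
        parsed.append({
            "id": inst["id"],
            "sense_id": match.group(1) if match else None,
            "has_question_mark": bool(match and match.group(2)),
            "target": match.group(3) if match else None,
        })
    return parsed


sample_training = """800004
Mr Purves is tight-lipped about what happens then.
He recently vexed rumour-mongers, who <tag "520051">bet</> on a bid for Midlan sooner rather than later, by declining to disclose Hongkong Bank's inner reserves when the ban reported its 1989 results on March 13th.

800005
Finally, Mr Hugh-Jones loses his <tag "519914?">bet</>: 400,000 people attended Temple Cowley pools last year.
"""

for inst in parse_training_instances(sample_training):
    print(inst["id"], inst["sense_id"], inst["has_question_mark"])
```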

A further dataset in this format is available on request.

Results

An extensive table giving all systems' results for all words is available here (286 kb).

WordNet mappings

Mappings between WordNet senses and HECTOR senses have been produced. (User beware: such mappings are, in general, many-to-many, there are gaps, and using the mapping involves substantial information loss.) Mappings are available for WordNet 1.5 and WordNet 1.6.

Bibliography

The paper most fully describing the materials is: which is to appear in a Special SENSEVAL Issue of Computers and the Humanities in early 2000, alongside many other papers on the SENSEVAL exercise (which had tasks for English, French and Italian). Draft version available on request.

For the HECTOR project, see:

A background paper on SENSEVAL:

A paper detailing issues such as sampling strategies

This page is maintained by Adam Kilgarriff and was last updated on 6 July 1999.