HECTOR
The dictionary and corpus were both developed in the course of
HECTOR, a joint Oxford University Press and Digital project which took
place in the early 1990s. The project produced a 20-million-word corpus,
which also served as a pilot for the British National Corpus.
For details of its composition see
here.
For a sample of a few hundred words, the OUP lexicographers used the corpus to develop dictionary entries, and in the course of the lexicography, they associated a sense-tag with every instance of the word in the corpus. The rules and strategies they followed in the corpus-tagging and dictionary-writing are described in their Policy Guide.
Lexical sample
The 35 words used in English SENSEVAL, their distribution according to part of speech, and
the numbers of test instances associated with each (N), were:
| Nouns (-n) | N | Verbs (-v) | N | Adjectives (-a) | N | Indeterminates (-p) | N |
| accident | 267 | amaze | 70 | brilliant | 229 | band | 302 |
| behaviour | 279 | bet | 177 | deaf | 122 | bitter | 373 |
| bet | 274 | bother | 209 | floating | 47 | hurdle | 323 |
| disability | 160 | bury | 201 | generous | 227 | sanction | 431 |
| excess | 186 | calculate | 217 | giant | 97 | shake | 356 |
| float | 75 | consume | 186 | modest | 270 | | |
| giant | 118 | derive | 216 | slight | 218 | | |
| knee | 251 | float | 229 | wooden | 195 | | |
| onion | 214 | invade | 207 | | | | |
| promise | 113 | promise | 224 | | | | |
| rabbit | 221 | sack | 178 | | | | |
| sack | 82 | scrap | 186 | | | | |
| scrap | 156 | seize | 259 | | | | |
| shirt | 184 | | | | | | |
| steering | 176 | | | | | | |
| TOTAL | 2756 | TOTAL | 2501 | TOTAL | 1406 | TOTAL | 1785 |
"Indeterminates" were words for which the task involved determining
major word class, as well as determining word sense. For the other
tasks, major word class was given.
For the word selection strategy, see the bibliography.
Dictionary entries
The dictionary entries are available in SGML (as described in the
Policy Guide) and also in PostScript,
for ease of reading. The SGML files are available, tar'd and gzipped,
here. A sample of three items (rabbit, generous and shake) is also available uncompressed here.
A PostScript version of the full 35-word dictionary is available (gzipped) here.
Test Corpus
The test corpus was the corpus over which the SENSEVAL English
evaluation took place. There is one file for each "task", where a
task is identified as the word + one of -n, -v, -a, -p
depending on whether the items were nouns, verbs, adjectives, or
word-class-to-be-determined (see above). For several words there were
multiple tasks, e.g. sack-n and sack-v. For the 35 words
there were 41 tasks.
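As a rough illustration of the task naming scheme, the sketch below splits task names into word and class suffix and checks the counts quoted above. The task list is transcribed from the lexical-sample table; the names `TASKS`, `CLASS_NAMES` and `split_task` are invented for this example and are not part of the SENSEVAL distribution.

```python
# Task names transcribed from the lexical-sample table: word + "-" + class suffix.
TASKS = """
accident-n behaviour-n bet-n disability-n excess-n float-n giant-n knee-n
onion-n promise-n rabbit-n sack-n scrap-n shirt-n steering-n
amaze-v bet-v bother-v bury-v calculate-v consume-v derive-v float-v
invade-v promise-v sack-v scrap-v seize-v
brilliant-a deaf-a floating-a generous-a giant-a modest-a slight-a wooden-a
band-p bitter-p hurdle-p sanction-p shake-p
""".split()

# Suffix meanings as described in the text.
CLASS_NAMES = {"n": "noun", "v": "verb", "a": "adjective", "p": "indeterminate"}


def split_task(task):
    """Split a task name such as 'sack-n' into (word, class suffix)."""
    word, _, suffix = task.rpartition("-")
    if not word or suffix not in CLASS_NAMES:
        raise ValueError(f"not a task name: {task!r}")
    return word, suffix


# Several words (bet, float, giant, promise, sack, scrap) occur in two tasks.
words = {split_task(t)[0] for t in TASKS}
print(len(TASKS), "tasks over", len(words), "words")
```

Run as a script, this reports 41 tasks over 35 words, matching the figures given in the text.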
The format for test instances is as below:
700001 John Dos Passos wrote a poem that talked of `the <tag>bitter</> beat look, the scorn on the lip."
700002 The beans almost double in size during roasting. Black beans are over roasted and will have a <tag>bitter</> flavour and insufficiently roasted beans are pale and give a colourless, tasteless drink.
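A minimal sketch of how instances in this format might be read. It assumes that every instance starts with a six-digit identifier followed by whitespace and that no other standalone six-digit number occurs in the context; the exact file layout is not specified here, and `parse_test_instances` is a hypothetical helper, not a tool supplied with the data.

```python
import re

# Assumption: an instance begins with a six-digit ID at the start of the
# text or after whitespace, followed by whitespace.
INSTANCE_START = re.compile(r"(?:^|\s)(\d{6})\s+")
# The word to be disambiguated is wrapped as <tag>word</>.
TARGET = re.compile(r"<tag>(.*?)</>")


def parse_test_instances(text):
    """Return a list of dicts with the ID, target word and context of each instance."""
    # With one capturing group, re.split yields [prefix, id1, body1, id2, body2, ...].
    parts = INSTANCE_START.split(text)
    instances = []
    for inst_id, body in zip(parts[1::2], parts[2::2]):
        match = TARGET.search(body)
        instances.append({
            "id": inst_id,
            "target": match.group(1) if match else None,
            "context": body.strip(),
        })
    return instances
```

Calling `parse_test_instances` on the two instances shown above would give two records with the IDs 700001 and 700002, each with target word bitter.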
Gold Standard Taggings
The "gold standard" taggings give the
sense tag(s) that the human sense-tagging team considered correct for each corpus instance. For each task there
is a gold-standard file, with the following format.
700002:kind
700003:unstint or kind
700004:copious
This was drawn from the gold-standard file for generous-a. Each HECTOR sense is associated with a "mnemonic", and kind, copious and unstint are the mnemonics for three of the senses of generous. The sense tag for generous-a test-set instance 700002 is kind, that for 700003 is unstint or kind (human taggers were permitted to give disjunctive taggings) and that for 700004 is copious.
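The gold-standard format can be read with a few lines of code. The sketch below assumes one `id:tags` entry per line, with disjunctive taggings separated by the word "or" and no mnemonic itself containing " or "; `parse_gold_line` is a hypothetical name used only for illustration.

```python
def parse_gold_line(line):
    """Split a gold-standard line such as '700003:unstint or kind'.

    Returns (instance_id, [mnemonic, ...]); a disjunctive tagging
    yields more than one mnemonic.
    """
    inst_id, sep, tags = line.strip().partition(":")
    if not sep or not inst_id or not tags:
        raise ValueError(f"not a gold-standard line: {line!r}")
    # Assumption: alternatives are joined by " or ".
    return inst_id, [t.strip() for t in tags.split(" or ")]


for line in ["700002:kind", "700003:unstint or kind", "700004:copious"]:
    print(parse_gold_line(line))
```

Keeping the alternatives as a list preserves the disjunction, so a scorer can decide for itself how to credit a system answer that matches only one of the permitted senses.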
Each sense had a numerical unique identifier, as well as a mnemonic. The mapping between the two is available here.
The files are available, tar'd and gzipped, here. A sample of three items (rabbit, generous and shake) is also available uncompressed here.
800004 Mr Purves is tight-lipped about what happens then. He recently vexed rumour-mongers, who <tag "520051">bet</> on a bid for Midlan sooner rather than later, by declining to disclose Hongkong Bank's inner reserves when the ban reported its 1989 results on March 13th.
800005 Of our other leisure facilities, Blackbird Leys Leisure Centre, opened last autumn was given a six-month event-free trial and will now begin to attract the large events and exhibitions to subsidise its running costs. Finally, Mr Hugh-Jones loses his <tag "519914?">bet</>: 400,000 people attended Temple Cowley pools last year.
A further dataset in this format is available on request.
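For instances in which the tag carries a quoted value, such as `<tag "520051">bet</>`, the value and the tagged word can be pulled out as sketched below. The meaning of a trailing "?" in the value is not stated here, so the code only records its presence without interpreting it; `extract_tagged` and the field names are hypothetical.

```python
import re

# Matches <tag "VALUE">word</> and captures the value and the word.
TAGGED = re.compile(r'<tag "([^"]*)">(.*?)</>')


def extract_tagged(text):
    """Return one dict per tagged word found in the text."""
    found = []
    for raw, word in TAGGED.findall(text):
        found.append({
            "word": word,
            # Value with any trailing "?" removed.
            "tag_value": raw.rstrip("?"),
            # Recorded only; what the "?" signifies is not specified here.
            "has_question_mark": raw.endswith("?"),
        })
    return found
```

Applied to the two instances above, this would return the values 520051 and 519914 for the word bet, with the question-mark flag set only for the second.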
For the HECTOR project, see:
A background paper on SENSEVAL:
This page is maintained by Adam Kilgarriff and was last updated on 6 July 1999.