SENSEVAL SCORES

NOTE: The preliminary statistics reported here are still subject to amendment.

Last revision: 16 October 1998

Granularity of scoring

Each participating system is scored in the following ways on each task:

Coarse-grained scoring assimilates all subsense tags (corresponding to codes such as 1.1, 2.1) to main sense tags (corresponding to codes such as 1, 2) in both the answer file and in the key file. Hence a guess of 1.1 in the answer file counts as an exact match of a correct answer of 1, 1.1 or 1.2 in the key file under coarse-grained scoring.
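
The collapse described above can be sketched as follows. This is only an illustration, assuming sense tags are written as dotted codes such as "1.1"; the function names to_coarse and coarse_match are hypothetical and are not taken from the SENSEVAL scoring software.

```python
def to_coarse(tag):
    """Map a subsense code such as '1.1' to its main sense code '1'.

    Assumption: codes are dotted strings; a main sense code has no dot.
    """
    return tag.split(".", 1)[0]


def coarse_match(guess, key_tags):
    """True if the guess and any key tag share the same main sense."""
    return to_coarse(guess) in {to_coarse(k) for k in key_tags}


print(coarse_match("1.1", ["1.2"]))  # both collapse to main sense 1
print(coarse_match("2.1", ["1"]))    # different main senses
```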

Mixed-grained scoring gives full credit for a guess in the answer file if it is subsumed by an answer in the key file, and partial credit if it subsumes such an answer. A tag subsumes another tag if it is a main sense tag (corresponding to a code such as 2) and the other tag is a subsense tag under it (corresponding to a code such as 2.1). The amount of partial credit awarded depends on how many other sense tags a guess subsumes in addition to those that are correct. Details can be found at http://www.itri.brighton.ac.uk/events/senseval/mr.asc.

Fine-grained scoring counts only identical sense tags as a match. That is, even if the guess in the answer file subsumes or is subsumed by the correct answer in the key, no credit at all is given. For example, under fine-grained scoring, a guess of 2.1 receives no credit if the answer in the key file is 2.
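
To make the contrast with subsumption concrete, here is a minimal sketch, again assuming dotted codes such as "2" and "2.1". The names subsumes and fine_match are illustrative only. The amount of partial credit under mixed-grained scoring is deliberately not modelled, since it is specified in the separate document cited above.

```python
def subsumes(a, b):
    """True if a is a main sense code and b is a subsense code under it."""
    return "." not in a and b.startswith(a + ".")


def fine_match(guess, key_tags):
    """Fine-grained scoring: only an identical tag counts as a match."""
    return guess in key_tags


# A guess of 2.1 against a key of 2: the key subsumes the guess,
# but fine-grained scoring still gives no credit.
print(subsumes("2", "2.1"), fine_match("2.1", ["2"]))
```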

Minimal scoring

The minimal score of a system on a task is the score of the system on the subset of items in the task that are tagged with exactly one sense tag in the key. Hence, items that are tagged with multiple sense tags in the key, due to annotator disagreement or other complications, are ignored in this scoring. Minimal scoring is done for all granularities of scoring. Note, though, that items that are minimal under coarse-grained scoring are not necessarily minimal under fine-grained scoring: an item with multiple fine-grained sense tags in the key may collapse to a single coarse-grained sense tag if all the fine-grained tags are subsenses of the same main sense.
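
The dependence of minimality on granularity can be illustrated with a short sketch. It assumes dotted sense codes, and is_minimal is a hypothetical helper, not part of the scoring software.

```python
def is_minimal(key_tags, coarse=False):
    """True if the key holds exactly one distinct tag at the given granularity."""
    if coarse:
        # Collapse each subsense code (e.g. '1.2') to its main sense ('1').
        tags = {t.split(".", 1)[0] for t in key_tags}
    else:
        tags = set(key_tags)
    return len(tags) == 1


key = ["1.1", "1.2"]
print(is_minimal(key), is_minimal(key, coarse=True))
```

Here the item is skipped in fine-grained minimal scoring but included in coarse-grained minimal scoring, because both key tags fall under main sense 1.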

The results of minimal scoring do not appear in the task summaries but are recorded in the detailed results section of each task.

For additional details, see www.itri.brighton.ac.uk/events/senseval/score.html.

System categories

Systems are classified into three categories. Comparisons should only be made between systems of the same category.

A (All-words)
All-words systems disambiguate all content words (or, at least, all content words of a given grammatical category) in a text.

S (Supervised-training)
Supervised training systems require a substantial quantity (eg over 30) of sense-tagged instances of each word they are to disambiguate.

O (Other-training)
These systems do not require over 30 tagged training instances, but do require a learning phase to be applied for each word to be disambiguated. Such systems apply only to lexical samples, and scaling up from a system which disambiguates instances of, eg, 35 words to one which disambiguates a full vocabulary of, eg, 20,000 ambiguous words has not been done and would be non-trivial.

Baselines

For comparison, a number of baselines have been included in the scoring summaries and detailed results.

Two sets of baselines are used: those that make use of the corpus training data, and those that only make use of the definitions in the dictionary. The baselines which use training data are intended for comparison with the systems that likewise rely on supervised training, while the ones that use only the dictionary are suitable for comparisons with systems based on dictionaries alone (All-words category).

None of the baselines in either set draws on any form of linguistic knowledge, except for the baselines incorporating the phrase filter, which must be able to recognize the inflected forms of words, and which also make use of very rudimentary ordering constraints for multi-word expressions. However, all the baselines, like the systems, are free to exploit the pre-specified part-of-speech tags of the words to be disambiguated in the -n, -v and -a files. This information constitutes a shallow form of linguistic knowledge. Some of the baselines also make use of the root forms of the words to be disambiguated. The root form is given as the prefix of the name of the file that a test item occurs in. This information is therefore also available to all systems. If it were not given by the file name, however, a linguistic analysis would be required to determine it.

The phrase filter is a pre-processor that is added to some of the baselines in order to improve their handling of multi-word expressions. If it recognizes any multi-word expressions in the test sentence, the phrase filter rules out all sense tags for non-multi-words. It also rules out the sense tags for any multi-word expressions that it can't find evidence for. The remaining sense tags (which correspond either exclusively to multi-word expressions for which evidence is present, or else exclusively to non-multi-words) are passed along to the baseline algorithm, which chooses from among them based on whatever selection strategy it is using, without ever considering the sense tags that the phrase filter has eliminated.

To check for the presence of a multi-word expression, the phrase filter looks for a sequence of words in the test sentence which are in the same order as the words of the expression as it appears in the dictionary, and which include the test word to be disambiguated itself. Also, for most multi-words, the sequence of words must be consecutive, with no intervening material between words. For a few multi-words, though, intervening words or symbols are allowed, as long as the precedence order of the sequence is correct. These exceptional cases are listed in a table (coded manually based on a shallow analysis of the dictionary entries for multi-words), and mostly involve verb-particle constructions where a noun phrase may sometimes occur between the verb and the particle. The phrase filter also consults a table of morphological inflections, enabling it to detect the presence of inflected forms of the words in multi-word expressions.
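
The matching step described above might look roughly like the sketch below. Everything here is an assumption made for illustration: the function names, the representation of the inflection table as a dict from each expression word to its accepted forms, and the allow_gaps flag standing in for the manually coded exception table. For simplicity the sketch inspects only the leftmost match in the sentence.

```python
def match_positions(tokens, expr, forms, allow_gaps=False):
    """Return the token positions matching expr in order, or None.

    forms maps an expression word to the set of inflected forms accepted
    for it; a word missing from forms must match exactly.
    """
    def ok(token, word):
        return token in forms.get(word, {word})

    if allow_gaps:
        # Ordered match with intervening material allowed.
        positions, start = [], 0
        for word in expr:
            for i in range(start, len(tokens)):
                if ok(tokens[i], word):
                    positions.append(i)
                    start = i + 1
                    break
            else:
                return None
        return positions
    # Default: the words of the expression must be consecutive.
    for s in range(len(tokens) - len(expr) + 1):
        if all(ok(tokens[s + j], expr[j]) for j in range(len(expr))):
            return list(range(s, s + len(expr)))
    return None


def has_expression(tokens, test_index, expr, forms, allow_gaps=False):
    """True if expr is found and the match includes the test word itself."""
    positions = match_positions(tokens, expr, forms, allow_gaps)
    return positions is not None and test_index in positions


tokens = "she shook the dust off".split()
forms = {"shake": {"shake", "shakes", "shook", "shaken", "shaking"}}
print(has_expression(tokens, 1, ["shake", "off"], forms, allow_gaps=True))
print(has_expression(tokens, 1, ["shake", "off"], forms))
```

With gaps allowed, the verb-particle expression is found around the intervening noun phrase; with the default consecutive matching it is not.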

The random baseline is computed with equal weight for all sense tags in the dictionary that match a particular test word's root form and its part of speech. If the part of speech is not known (as is the case in the -p files), equal weight is given to all sense tags for all parts of speech. Sense tags for proper nouns, typos and the UNASSIGNABLE tag are left out.
(comparable to All-words systems)
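
As a rough sketch of the random baseline's weighting, assume the dictionary senses for a root form are given as (tag, part-of-speech) pairs. The labels "PROPER" and "TYPO" in the default exclusion set are hypothetical placeholders for the proper-noun and typo tags; only UNASSIGNABLE is named in this document.

```python
def random_baseline(senses, pos=None, excluded=("PROPER", "TYPO", "UNASSIGNABLE")):
    """Spread probability evenly over the candidate sense tags.

    senses: list of (tag, part_of_speech) pairs for one root form.
    pos: part of speech of the test word, or None if unknown (-p files).
    """
    candidates = [t for t, p in senses
                  if (pos is None or p == pos) and t not in excluded]
    if not candidates:
        return {}
    weight = 1.0 / len(candidates)
    return {t: weight for t in candidates}


senses = [("1", "n"), ("2", "n"), ("3", "v"), ("UNASSIGNABLE", "n")]
print(random_baseline(senses, pos="n"))
print(random_baseline(senses))
```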

The random-with-phrase-filter baseline is the random baseline coupled with the phrase filter pre-processor.
(comparable to All-words systems)

The random-main baseline is like the random baseline but limits its guesses to only the main-sense tags listed for a word or word/part-of-speech pair.
(comparable to All-words systems)

The most-examples baseline uses a dictionary-based strategy that assigns, for each test word's root form and part of speech, the sense for which the most examples are provided in the dictionary. If the part of speech is not known (as is the case in the -p files), all senses for that word, regardless of part of speech, are considered. If several senses of the word have an equal number of examples, these are all guessed, with equal weight assigned to each. If no candidate senses have examples, no guess is made.
(comparable to All-words systems)
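
The tie-handling and abstention rules of the most-examples baseline can be sketched as below. The input format (a dict from candidate sense tag to its number of dictionary examples, already filtered by root form and part of speech) and the function name are assumptions for illustration.

```python
def most_examples(example_counts):
    """Guess the sense(s) with the most dictionary examples.

    Returns a dict tag -> weight; ties share the weight equally, and an
    empty dict means the baseline abstains.
    """
    best = max(example_counts.values(), default=0)
    if best == 0:
        return {}  # no candidate sense has any examples
    winners = [t for t, n in example_counts.items() if n == best]
    return {t: 1.0 / len(winners) for t in winners}


print(most_examples({"1": 3, "2": 3, "3": 1}))
print(most_examples({"1": 0, "2": 0}))
```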

The most-examples-with-phrase-filter baseline is the most-examples baseline coupled with the phrase filter pre-processor.
(comparable to All-words systems)

The most-examples-subsumer baseline uses the same strategy as the most-examples baseline but, for each sense, also counts the examples of all the subsenses it subsumes, and chooses the senses with the highest tally.
(comparable to All-words systems)

The lesk baseline uses a simplification of the strategy suggested by Lesk (1986), choosing essentially the sense of a test word's root form and part of speech whose dictionary definition and example texts have the most words in common with the words around the instance to be disambiguated. More specifically, each unique word form that occurs in the same sentence as the ambiguous test word is counted as evidence towards each candidate dictionary sense whose entry also contains that word. If the part of speech is not known (as is the case in the -p files), all possible dictionary senses for the test word, regardless of part of speech, are considered. The sense with the highest evidence tally is chosen. The words in the context are not stemmed or corrected for case. Identical word forms that occur multiple times are only counted once. Words that occur in many dictionary definitions and example texts, such as the and of, count for less than rarer content words, because all words are weighted by their inverse document frequency (where each definition or example in the dictionary is counted as one separate document).
(comparable to All-words systems)
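
A minimal sketch of this kind of weighted overlap count follows. Several details are assumptions rather than facts about the actual baseline: the inverse document frequency is taken as log(N / df), which this document does not specify; the document collection is limited to the candidate senses passed in rather than the whole dictionary; and ties are broken by taking the first sense encountered. The function name lesk_sketch is hypothetical.

```python
import math


def lesk_sketch(context_words, entries):
    """Pick the sense whose entry best overlaps the context.

    entries: non-empty dict tag -> list of documents, where each document
    (one definition or one example) is a list of word forms.
    Returns (best_tag, scores).
    """
    docs = [set(d) for ds in entries.values() for d in ds]
    n = len(docs)

    def idf(word):
        df = sum(1 for d in docs if word in d)
        return math.log(n / df) if df else 0.0  # assumed idf formula

    scores = {}
    for tag, ds in entries.items():
        vocab = set().union(*map(set, ds))
        # set(context_words): repeated word forms are counted only once.
        scores[tag] = sum(idf(w) for w in set(context_words) if w in vocab)
    return max(scores, key=scores.get), scores


entries = {
    "1": [["a", "financial", "institution"], ["the", "bank", "lent", "money"]],
    "2": [["the", "side", "of", "a", "river"]],
}
best, scores = lesk_sketch(["money", "from", "the", "bank", "the"], entries)
print(best, scores)
```

In this toy example the word "the" occurs in two of the three documents and so contributes less than "money" or "bank", each of which occurs in only one.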

The lesk-with-phrase-filter baseline is the lesk baseline coupled with the phrase filter pre-processor.
(comparable to All-words systems)

The lesk-definitions baseline is like the lesk baseline, but it ignores the example texts in the dictionary, comparing words in the test item's context only to the words in each sense's dictionary definition proper. Its inverse document frequency computation differs from that of the lesk baseline since it disregards any words that only occur in examples, and only counts definitions as documents.
(comparable to All-words systems)

The lesk-definitions-with-phrase-filter baseline is the lesk-definitions baseline coupled with the phrase filter pre-processor.
(comparable to All-words systems)

The commonest baseline is computed by choosing the most frequent of the training-corpus sense tags that match a particular test word's root form and part of speech. If the part of speech is not known (as is the case in the -p files), all senses for that word, regardless of part of speech, are considered. The frequency calculation ignores cases involving multiple sense tags; the only sense tags that are counted are those that occur alone with an ambiguous training-corpus word, not those that occur together with a proper-noun tag, an UNASSIGNABLE tag or another dictionary sense tag. The commonest baseline abstains from guessing on the words for which no training-data sense tag frequencies are available.
(comparable to Supervised-training systems)
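
The frequency count used by the commonest baseline can be sketched as follows, assuming the training instances for a word (already filtered by root form and part of speech) are given as lists of tags, one list per instance. The function name is illustrative, and ties are resolved arbitrarily in favour of the tag seen first.

```python
from collections import Counter


def commonest(training_tags):
    """Return the most frequent tag among singly tagged training instances.

    training_tags: list of tag lists, one per training instance.
    Returns None (abstain) if no singly tagged instance is available.
    """
    counts = Counter(tags[0] for tags in training_tags if len(tags) == 1)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


data = [["1"], ["2"], ["2"], ["1", "2"], ["1", "UNASSIGNABLE"]]
print(commonest(data))
```

The last two instances carry multiple tags and so do not contribute to the tally.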

The commonest-for-inflected baseline guesses the training-corpus sense tags that occur most often with the inflected form of the test word that is to be disambiguated, and that match the part of speech of the test word (if this is known). No analysis of the word form is undertaken, nor is any case correction done for words that are capitalized, either because they occur at the start of a sentence or for other reasons. If the part of speech is not known (as is the case in the -p files), all senses for that word, regardless of part of speech, are considered. This baseline does not make a guess if corpus sense-tag frequencies for a word are not available.
(comparable to Supervised-training systems)

The commonest-for-inflected-with-phrase-filter baseline is the commonest-for-inflected baseline coupled with the phrase filter pre-processor.
(comparable to Supervised-training systems)

The lesk-plus-corpus baseline is like the lesk baseline, but it also considers information about the collocates of sense-tagged words in the training corpus. For each word in the sentence containing the test item, this baseline not only tests whether that word occurs in the dictionary entry for a candidate sense, but also checks to see if it appears in the same sentence as one of the instances of that sense in the training corpus. Words are weighted by their inverse document frequency in both the dictionary and the training corpus (where each definition or example in the dictionary is counted as one separate document, and also each set of training-corpus contexts for a sense tag is counted as a single additional document). For sense tags which do not appear in the training corpus, the baseline reverts to the strategy of the unsupervised lesk algorithm, but with the benefit of corpus-derived inverse document frequency weights for words.
(comparable to Supervised-training systems)

The lesk-plus-corpus-with-phrase-filter baseline is the lesk-plus-corpus baseline coupled with the phrase filter pre-processor.
(comparable to Supervised-training systems)

For more details, please see http://www.cis.upenn.edu/~josephr/senseval-results.ps.gz.

Human performance

There is one set of scores for each task corresponding to the annotations made by the human lexicographers who initially marked up the test corpus (independently of the construction of the gold standard). This should give a rough idea of the human performance on each task. In general, human performance figures are higher than those typically reported in the sense-disambiguation literature, but these results may not be meaningfully comparable to those of other experiments. Therefore, unwarranted significance should not be read into these figures.

Task summaries

A list of task summaries for the main task and each of the subtasks in the evaluation can be found below. Along with each summary, detailed results for the task are also reported.

Average system results are unweighted, that is, each system counts equally towards the average for a task, no matter how many or few items from the task that system attempts. Best and worst system results are ranked by precision; recall measures are only used for secondary ranking in case of a tie in the precision ranking. The best (or worst) system on a particular task under one scoring method may not be the same as the best (or worst) system under another scoring method. Therefore, the three scores in the best-system (or worst-system) row (for fine-grained, mixed-grained and coarse-grained scoring) do not necessarily all belong to the same system.
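
The averaging and ranking conventions above can be illustrated with a small sketch. The system names and figures are invented for the example, and the representation of results as (precision, recall) pairs is an assumption.

```python
def summarize(results):
    """results: dict system name -> (precision, recall) for one task
    under one scoring method."""
    # Tuples compare by precision first, then recall as the tie-breaker.
    best = max(results, key=lambda s: results[s])
    worst = min(results, key=lambda s: results[s])
    # Unweighted average: every system counts equally.
    mean_precision = sum(p for p, _ in results.values()) / len(results)
    return best, worst, mean_precision


results = {"sysA": (0.70, 0.50), "sysB": (0.70, 0.60), "sysC": (0.60, 0.90)}
print(summarize(results))
```

Here sysA and sysB tie on precision, so recall decides between them; sysC ranks last despite having the highest recall.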

Some systems are listed as attempting a fractional number of test items. This is because systems are allowed to guess multiple sense tags for each test item, as long as the total probability assigned to all guessed tags does not exceed 1.0. If a system assigns less than 1.0 probability mass cumulatively to its guesses for an item, though, it is counted as having made a fractional guess; the remainder of the probability mass is treated as counting towards an abstention from guessing.
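
A fractional attempt count of this kind can be computed as in the sketch below, which assumes each item's guesses are given as a dict from sense tag to probability. The function name and the small tolerance for rounding error are illustrative choices.

```python
def items_attempted(guesses):
    """Sum the probability mass a system assigns across test items.

    guesses: list of dicts tag -> probability, one dict per test item;
    an empty dict is a full abstention.
    """
    total = 0.0
    for item in guesses:
        mass = sum(item.values())
        if mass > 1.0 + 1e-9:
            raise ValueError("probability mass for an item exceeds 1.0")
        total += mass  # the remainder (1 - mass) counts as abstention
    return total


guesses = [{"1": 1.0}, {"1": 0.5, "2": 0.25}, {}]
print(items_attempted(guesses))
```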

Entropy measures are listed for each task. For tasks involving the disambiguation of only one word or word/part-of-speech pair, these are computed by taking the distribution of sense tags for that word or word/part-of-speech pair in the training and test corpora as a probability distribution, with the probability of each tag corresponding to the number of times the word or word/part-of-speech pair appears with that tag divided by the total number of times the word or word/part-of-speech pair appears. The entropy is then computed for this probability distribution on sense tags in the normal way. For tasks involving several words, the entropy is the average of the entropy for each word or word/part-of-speech pair involved, weighted by the number of times those words or word/part-of-speech pairs occur in the task. The fine-grained entropy is measured in this way on all sense tags, while the coarse-grained entropy is measured by first converting all subsense-level tags in the corpora to their corresponding main-sense-level tags, and then computing the probability distribution on only these. For more details, please see http://www.cis.upenn.edu/~josephr/senseval-results.ps.gz.
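
The computation can be sketched as follows. Two assumptions are made for illustration: the logarithm is taken to base 2, which this document does not state, and the same list of tag occurrences is used both for the distribution and for the weights. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(tags):
    """Entropy (in bits, by assumption) of the observed tag distribution."""
    n = len(tags)
    return -sum(c / n * math.log2(c / n) for c in Counter(tags).values())


def to_coarse(tags):
    """Convert subsense codes such as '1.2' to main sense codes."""
    return [t.split(".", 1)[0] for t in tags]


def task_entropy(tags_by_word):
    """Average of per-word entropies, weighted by occurrence counts."""
    total = sum(len(t) for t in tags_by_word.values())
    return sum(len(t) * entropy(t) for t in tags_by_word.values()) / total


tags = ["1.1", "1.2", "1.1", "1.2"]
print(entropy(tags), entropy(to_coarse(tags)))
print(task_entropy({"float-n": ["1", "2"], "onion-n": ["1", "1"]}))
```

The first line of output shows how a word can have non-zero fine-grained entropy but zero coarse-grained entropy when all its tags are subsenses of one main sense.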

If no systems of a particular category attempted a particular task, there is no summary table for systems of that category in the task summary.

The tasks trainable-verbs and untrainable-verbs are missing from the list of tasks because all verbs are in the trainable category. Hence trainable-verbs is equivalent to verbs and untrainable-verbs is a task with no items. The tasks untrainable-adjectives and untrainable-indeterminates are equivalent to deaf-a and hurdle-p respectively since these are the only adjective and indeterminate part-of-speech files in the evaluation set for which no training data was supplied.

For each task, a second set of statistics is also supplied. These reflect results submitted only after the main deadline for the evaluation.


eval (after deadline)
The main task, comprising all test items.

trainable (after deadline)
The subset of test items in files of words for which corpus training data was supplied.

untrainable (after deadline)
The complement of trainable.

multi-word (after deadline)
The subset of test items tagged with sense tags for word forms that are not derivable from the root form of a test word by a regular morphological process.

simple-word (after deadline)
The complement of multi-word.

unassignable (after deadline)
The subset of test items tagged with an UNASSIGNABLE sense tag in the key.

proper (after deadline)
The subset of test items tagged with a PROPER NOUN sense tag in the key.

nouns (after deadline)
All test items in files with -n suffix.

all-nouns (after deadline)
All test items in nouns, plus test items in files with -p suffix that were tagged with noun sense tags in the key.

verbs (after deadline)
All test items in files with -v suffix.

all-verbs (after deadline)
All test items in verbs, plus test items in files with -p suffix that were tagged with verb sense tags in the key.

adjectives (after deadline)
All test items in files with -a suffix.

all-adjectives (after deadline)
All test items in adjectives, plus test items in files with -p suffix that were tagged with adjective sense tags in the key.

indeterminates (after deadline)
All test items in files with -p suffix; the part of speech of the word to be disambiguated has not been predetermined.

determinates (after deadline)
All test items for ambiguous words with a predetermined part of speech; the complement of indeterminates.

trainable-nouns (after deadline)
The intersection of nouns and trainable.

all-trainable-nouns (after deadline)
The intersection of all-nouns and trainable.

untrainable-nouns (after deadline)
The intersection of nouns and untrainable.

all-untrainable-nouns (after deadline)
The intersection of all-nouns and untrainable.

trainable-adjectives (after deadline)
The intersection of adjectives and trainable.

all-trainable-adjectives (after deadline)
The intersection of all-adjectives and trainable.

all-untrainable-adjectives (after deadline)
The intersection of all-adjectives and untrainable.

trainable-indeterminates (after deadline)
The intersection of indeterminates and trainable.

trainable-determinates (after deadline)
The intersection of determinates and trainable.

low-polysemy (after deadline)
Items involving words whose polysemy is less than the median polysemy of 8.

high-polysemy (after deadline)
Items involving words whose polysemy is equal to or greater than the median polysemy of 8.

low-entropy (after deadline)
Items involving words whose entropy is less than the median entropy of 1.85.

high-entropy (after deadline)
Items involving words whose entropy is equal to or greater than the median entropy of 1.85.

accident-n (after deadline)

behaviour-n (after deadline)

bet-n (after deadline)

disability-n (after deadline)

excess-n (after deadline)

float-n (after deadline)

giant-n (after deadline)

knee-n (after deadline)

onion-n (after deadline)

promise-n (after deadline)

rabbit-n (after deadline)

sack-n (after deadline)

scrap-n (after deadline)

shirt-n (after deadline)

steering-n (after deadline)

amaze-v (after deadline)

bet-v (after deadline)

bother-v (after deadline)

bury-v (after deadline)

calculate-v (after deadline)

consume-v (after deadline)

derive-v (after deadline)

float-v (after deadline)

invade-v (after deadline)

promise-v (after deadline)

sack-v (after deadline)

scrap-v (after deadline)

seize-v (after deadline)

brilliant-a (after deadline)

deaf-a (after deadline)

floating-a (after deadline)

generous-a (after deadline)

giant-a (after deadline)

modest-a (after deadline)

slight-a (after deadline)

wooden-a (after deadline)

band-p (after deadline)

bitter-p (after deadline)

hurdle-p (after deadline)

sanction-p (after deadline)

shake-p (after deadline)