Last revision: 16 October 1998
Coarse-grained scoring assimilates all subsense tags
(corresponding to codes such as 1.1, 2.1) to main sense tags
(corresponding to codes such as 1, 2) in both the answer file and in
the key file. Hence a guess of 1.1 in the answer file counts as an
exact match of a correct answer of 1, 1.1 or 1.2 in the key file under
coarse-grained scoring.
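The collapse from subsense to main sense can be sketched as follows. This is only an illustration: it assumes sense tags are written directly as dotted codes such as `1.1`, and the function names `coarsen` and `coarse_match` are hypothetical, not part of the scoring software.

```python
def coarsen(tag: str) -> str:
    """Map a subsense code such as '1.1' to its main sense code '1'.

    Assumption: tags are dotted codes; a main sense code has no dot.
    """
    return tag.split(".", 1)[0]


def coarse_match(guess: str, key_tags: list) -> bool:
    """True if the guess and any key tag share the same main sense."""
    return coarsen(guess) in {coarsen(k) for k in key_tags}
```

Under this sketch, `coarse_match("1.1", ["1.2"])` holds because both tags collapse to `1`.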
Mixed-grained scoring gives full credit for a guess in the
answer file if it is subsumed by an answer in the key file, and
partial credit if it subsumes such an answer. A tag subsumes another
tag if it is a main sense tag (corresponding to a code such as
2) and the other tag is a subsense tag under it (corresponding
to a code such as 2.1). The amount of partial credit awarded depends
on how many other sense tags a guess subsumes in addition to those
that are correct. Details can be found at
http://www.itri.brighton.ac.uk/events/senseval/mr.asc.
Fine-grained scoring counts only identical sense tags as a
match. That is, even if the guess in the answer file subsumes or is
subsumed by the correct answer in the key, no credit at all is
given. For example, under fine-grained scoring, a guess of 2.1
receives no credit if the answer in the key file is 2.
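The subsumption relation and the fine-grained match can be sketched as below, again assuming that tags are written as dotted codes. The names `subsumes` and `fine_match` are hypothetical. The partial-credit formula for mixed-grained scoring is not reproduced here, since its details are given only in the document referenced above.

```python
def subsumes(a: str, b: str) -> bool:
    """True if a is a main sense tag and b is a subsense tag under it."""
    return "." not in a and b.startswith(a + ".")


def fine_match(guess: str, key_tags: list) -> bool:
    """Fine-grained scoring: only an identical tag counts as a match."""
    return guess in key_tags
```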
The results of minimal scoring do not appear in the task summaries
but are recorded in the detailed results section of each task.
For additional details, see
www.itri.brighton.ac.uk/events/senseval/score.html.
Two sets of baselines are used: those that make use of the corpus
training data, and those that only make use of the definitions in the
dictionary. The baselines which use training data are intended for
comparison with the systems that likewise rely on supervised training,
while the ones that use only the dictionary are suitable for
comparisons with systems based on dictionaries alone (All-words category).
None of the baselines in either set draws on any form of linguistic
knowledge, except for the baselines incorporating the phrase filter, which must be able to
recognize the inflected forms of words, and which also make use of
very rudimentary ordering constraints for multi-word expressions. However, all the
baselines, like the systems, are free to exploit the pre-specified
part-of-speech tags of the words to be disambiguated in the -n, -v and -a files. This information constitutes a
shallow form of linguistic knowledge. Some of the baselines also make
use of the root forms of the words to be disambiguated. The root form
is given as the prefix of the file name that a test item occurs
in. This information therefore is also available to be used by all
systems. If it were not given by the file name, however, a linguistic
analysis would be required to determine this information.
The phrase filter is a pre-processor that is added to some of
the baselines in order to improve their handling of multi-word expressions. If it recognizes any
multi-word expressions in the test sentence, the phrase filter rules
out all sense tags for non-multi-words. It also rules out the sense
tags for any multi-word expressions that it can't find evidence for.
The remaining sense tags (which correspond either exclusively to
multi-word expressions for which evidence is present, or else
exclusively to non-multi-words) are passed along to the baseline
algorithm, which chooses from among them based on whatever selection
strategy it is using, without ever considering the sense tags that the
phrase filter has eliminated.
To check for the presence of a multi-word expression, the phrase
filter looks for a sequence of words in the test sentence which are in
the same order as the words of the expression as it appears in the
dictionary, and which include the test word to be disambiguated
itself. Also, for most multi-words, the sequence of words must be
consecutive, with no intervening material between words. For a few
multi-words, though, intervening words or symbols are allowed, as long
as the precedence order of the sequence is correct. These exceptional
cases are listed in a table (coded manually based on a shallow
analysis of the dictionary entries for multi-words), and mostly
involve verb-particle constructions where a noun phrase may sometimes
occur between the verb and the particle. The phrase filter also
consults a table of morphological inflections, enabling it to detect
the presence of inflected forms of the words in multi-word
expressions.
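A rough sketch of the matching and filtering steps described above is given here. It is not the actual phrase filter: the data layout (candidates as pairs of a tag and an optional word list), the `inflections` table as a dictionary from root form to inflected forms, the `gap_ok` set standing in for the table of exceptional cases, and all function names are assumptions made for illustration.

```python
def matches(token, root, inflections):
    """A token matches a root if it is the root or a listed inflected form."""
    return token == root or token in inflections.get(root, ())


def _extend(tokens, start, words, step, inflections, allow_gaps):
    """Match words outward from start in direction step (+1 or -1)."""
    pos = start
    for word in words:
        pos += step
        if allow_gaps:
            # Skip intervening material until the next expression word.
            while 0 <= pos < len(tokens) and not matches(tokens[pos], word, inflections):
                pos += step
        if not (0 <= pos < len(tokens)) or not matches(tokens[pos], word, inflections):
            return False
    return True


def has_expression(tokens, target_idx, expression, test_root, inflections, allow_gaps=False):
    """True if the expression occurs in order and includes the test word itself."""
    for k, word in enumerate(expression):
        if word != test_root or not matches(tokens[target_idx], word, inflections):
            continue
        left = expression[:k][::-1]   # words before the test word, nearest first
        right = expression[k + 1:]    # words after the test word
        if (_extend(tokens, target_idx, left, -1, inflections, allow_gaps)
                and _extend(tokens, target_idx, right, +1, inflections, allow_gaps)):
            return True
    return False


def phrase_filter(candidates, tokens, target_idx, test_root, inflections, gap_ok=frozenset()):
    """candidates: list of (tag, expression_words or None for a non-multi-word)."""
    found = [tag for tag, expr in candidates
             if expr and has_expression(tokens, target_idx, expr, test_root,
                                        inflections, allow_gaps=tag in gap_ok)]
    if found:
        return found  # only multi-word senses with evidence survive
    return [tag for tag, expr in candidates if expr is None]
```

For example, with the tokens `she gave the book up` and the test word at `gave`, a candidate for `give up` is found only if gaps are allowed for that expression; otherwise the filter passes along the non-multi-word senses.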
The random baseline is computed with equal weight for all sense
tags in the dictionary that match a particular test word's root form
and its part of speech. If the part of speech is not known (as is the
case in the -p files), equal weight is
given to all sense tags for all parts of speech. Sense tags for proper
nouns, typos and the UNASSIGNABLE tag are left out.
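The equal weighting can be illustrated with a short sketch. The dictionary layout used here, a list of `(tag, root, pos, kind)` tuples, and the `EXCLUDED_KINDS` labels are hypothetical stand-ins for however the dictionary actually marks proper nouns, typos and the UNASSIGNABLE tag.

```python
# Hypothetical labels for the tag kinds that the random baseline leaves out.
EXCLUDED_KINDS = {"proper", "typo", "unassignable"}


def random_baseline(entries, root, pos=None):
    """Return {tag: weight} with equal weight for every eligible sense tag.

    entries: iterable of (tag, root, pos, kind) tuples (assumed layout).
    pos=None models the -p files, where the part of speech is not known.
    """
    tags = [tag for tag, r, p, kind in entries
            if r == root and (pos is None or p == pos) and kind not in EXCLUDED_KINDS]
    return {tag: 1.0 / len(tags) for tag in tags} if tags else {}
```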
The random-with-phrase-filter baseline is the random
baseline coupled with the phrase filter
pre-processor.
The random-main baseline is like the random baseline but
limits its guesses to only the main-sense tags listed for a word or
word/part-of-speech pair.
The most-examples baseline uses a dictionary-based strategy
that assigns, for each test word's root form and part of speech, the
sense for which the most examples are provided in the dictionary. If
the part of speech is not known (as is the case in the -p files), all senses for that word,
regardless of part of speech, are considered. If several senses of the
word have an equal number of examples, these are all guessed, with
equal weight assigned to each. If no candidate senses have examples,
no guess is made.
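The selection rule, including the tie and no-examples cases, might look like the sketch below. The input, a mapping from candidate sense tag to its number of dictionary examples, and the name `most_examples` are assumptions for illustration.

```python
def most_examples(example_counts):
    """Return {tag: weight} for the sense(s) with the most dictionary examples.

    example_counts: dict mapping each candidate sense tag to its example count.
    Ties share the weight equally; if no candidate has examples, no guess is made.
    """
    if not example_counts:
        return {}
    best = max(example_counts.values())
    if best == 0:
        return {}
    winners = [tag for tag, n in example_counts.items() if n == best]
    return {tag: 1.0 / len(winners) for tag in winners}
```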
The most-examples-with-phrase-filter baseline is the
most-examples baseline coupled with the phrase filter pre-processor.
The most-examples-subsumer baseline uses the same strategy but
counts, in addition to the examples for a given sense, also the
examples of all the subsenses it subsumes, and chooses the senses with
the highest tally.
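The subsumer tally can be sketched in the same style, assuming tags are dotted codes and that a main sense subsumes every tag that starts with its code followed by a dot. The function name is hypothetical.

```python
def most_examples_subsumer(example_counts):
    """Pick the sense(s) with the highest tally, where a main sense also
    counts the examples of the subsenses it subsumes.

    Assumption: tags are dotted codes such as '1' and '1.1'.
    """
    tally = {}
    for tag, n in example_counts.items():
        extra = 0
        if "." not in tag:  # only main senses subsume other tags
            extra = sum(m for other, m in example_counts.items()
                        if other.startswith(tag + "."))
        tally[tag] = n + extra
    best = max(tally.values(), default=0)
    if best == 0:
        return {}
    winners = [tag for tag, n in tally.items() if n == best]
    return {tag: 1.0 / len(winners) for tag in winners}
```

With counts of 1, 2, 2 and 4 for senses `1`, `1.1`, `1.2` and `2`, the plain most-examples rule would pick `2`, while this tally gives sense `1` a total of 5 and picks it instead.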
The lesk baseline uses a simplification of the strategy
suggested by Lesk (1986), choosing essentially the sense of a test
word's root form and part of speech whose dictionary definition and
example texts have the most words in common with the words around the
instance to be disambiguated. More specifically, each unique word form
that occurs in the same sentence as the ambiguous test word is counted
as evidence towards each candidate dictionary sense whose entry also
contains that word. If the part of speech is not known (as is the
case in the -p files), all possible
dictionary senses for the test word, regardless of part of speech, are
considered. The sense with the highest evidence tally is chosen. The
words in the context are not stemmed or corrected for case. Identical
word forms that occur multiple times are only counted once. Words that
occur in many dictionary definitions and example texts, such as
the and of, count for less than rarer content words,
because all words are weighted by their inverse document
frequency (where each definition or example in the dictionary is
counted as one separate document).
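The overlap count with inverse document frequency weighting can be sketched as follows. The exact weighting formula is not given in this description, so the common form log(N / df) is assumed here. The data layout (each sense mapped to a list of tokenized definitions and examples) and the names `idf_weights` and `lesk_scores` are also assumptions.

```python
import math


def idf_weights(documents):
    """Inverse document frequency, assumed here to be log(N / df).

    documents: list of token lists, one per definition or example.
    """
    n = len(documents)
    df = {}
    for doc in documents:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    return {word: math.log(n / count) for word, count in df.items()}


def lesk_scores(sentence_words, sense_docs, idf):
    """Score each sense by the IDF-weighted overlap with the sentence.

    sense_docs: dict mapping a sense tag to its list of token lists.
    Words are compared as they appear: no stemming or case correction.
    """
    context = set(sentence_words)  # identical word forms count only once
    scores = {}
    for tag, docs in sense_docs.items():
        entry = set().union(*docs)
        scores[tag] = sum(idf.get(word, 0.0) for word in context & entry)
    return scores
```

A caller would then choose `max(scores, key=scores.get)`. In the lesk-definitions variant, only the definitions would be passed in as documents.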
The lesk-with-phrase-filter baseline is the lesk
baseline coupled with the phrase filter
pre-processor.
The lesk-definitions baseline is like the lesk baseline,
but it ignores the example texts in the dictionary, comparing words in
the test item's context only to the words in each sense's dictionary
definition proper. Its inverse document frequency computation differs
from that of the lesk baseline since it disregards any words
that only occur in examples, and only counts definitions as documents.
The lesk-definitions-with-phrase-filter baseline is the
lesk-definitions baseline coupled with the phrase filter pre-processor.
The commonest baseline is computed by choosing the most
frequent of the training-corpus sense tags that match a particular
test word's root form and part of speech. If the part of speech is
not known (as is the case in the -p
files), all senses for that word, regardless of part of speech, are
considered. The frequency calculation ignores cases involving multiple
sense tags; the only sense tags that are counted are those that occur
alone with an ambiguous training-corpus word, not those that occur
together with a proper-noun tag, an UNASSIGNABLE tag or another
dictionary sense tag. The commonest baseline abstains from guessing on
the words for which no training-data sense tag frequencies are
available.
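The frequency count can be sketched as below. The input layout, one list of tags per training instance, is an assumption, as is the name `commonest`. How ties are broken is not stated above, so this sketch simply returns whichever top tag `Counter.most_common` yields first.

```python
from collections import Counter


def commonest(training_instances):
    """Return the most frequent sense tag, or None to abstain.

    training_instances: list of tag lists, one per training-corpus instance
    of the word (already restricted by part of speech where it is known).
    Only instances carrying exactly one tag are counted.
    """
    counts = Counter(tags[0] for tags in training_instances if len(tags) == 1)
    if not counts:
        return None  # no usable frequency data: abstain
    return counts.most_common(1)[0][0]
```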
The commonest-for-inflected baseline guesses the
training-corpus sense tags that occur most often with the inflected
form of the test word that is to be disambiguated, and that match the
part of speech of the test word (if this is known). No analysis of the
word form is undertaken, nor is any case correction done for words
that are capitalized, either because they occur at the start of a
sentence or for other reasons. If the part of speech is not known (as
is the case in the -p files), all senses
for that word, regardless of part of speech, are considered. This
baseline does not make a guess if corpus sense-tag frequencies for a
word are not available.
The commonest-for-inflected-with-phrase-filter baseline is the
commonest-for-inflected baseline coupled with the phrase filter pre-processor.
The lesk-plus-corpus baseline is like the lesk baseline,
but it also considers information about the collocates of sense-tagged
words in the training corpus. For each word in the sentence containing
the test item, this baseline not only tests whether that word occurs
in the dictionary entry for a candidate sense, but also checks to see
if it appears in the same sentence as one of the instances of that
sense in the training corpus. Words are weighted by their inverse
document frequency in both the dictionary and the training corpus
(where each definition or example in the dictionary is counted as one
separate document, and also each set of training-corpus contexts for a
sense tag is counted as a single additional document). For sense tags
which do not appear in the training corpus, the baseline reverts to
the strategy of the unsupervised lesk algorithm, but with the benefit
of corpus-derived inverse document frequency weights for words.
The lesk-plus-corpus-with-phrase-filter baseline is the
lesk-plus-corpus baseline coupled with the phrase filter pre-processor.
For more details, please see
http://www.cis.upenn.edu/~josephr/senseval-results.ps.gz.
Average system results are unweighted, that is, each system counts
equally towards the average for a task, no matter how many or few
items from the task that system attempts. Best and worst system
results are ranked by precision; recall measures are only used for
secondary ranking in case of a tie in the precision ranking. The best
(or worst) system on a particular task under one scoring method may
not be the same as the best (or worst) system under another scoring
method. Therefore, the three scores in the best-system (or
worst-system) row (for fine-grained,
mixed-grained and coarse-grained scoring) do not necessarily all
belong to the same system.
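The averaging and ranking rules can be illustrated with a short sketch for a single scoring method. The `(name, precision, recall)` tuple layout and the function name are assumptions. Applying the same precision-then-recall ordering to pick the worst system is also an assumption.

```python
def summarize(results):
    """results: list of (system_name, precision, recall) for one task
    under one scoring method.

    Returns (average_precision, best_system, worst_system). The average is
    unweighted: every system counts once, however many items it attempted.
    """
    avg_precision = sum(p for _, p, _ in results) / len(results)
    # Rank by precision; recall only breaks ties in precision.
    best = max(results, key=lambda r: (r[1], r[2]))
    worst = min(results, key=lambda r: (r[1], r[2]))
    return avg_precision, best[0], worst[0]
```

Running this once per scoring method shows why the best-system row can mix systems: each call may return a different name.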
Some systems are listed as attempting a fractional number of test
items. This is because systems are allowed to guess multiple sense
tags for each test item, as long as the total probability assigned to all
guessed tags does not exceed 1.0. If a system assigns less than 1.0
probability mass cumulatively to its guesses for an item, though, it
is counted as having made a fractional guess; the remainder of the
probability mass is treated as counting towards an abstention from
guessing.
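The fractional counting can be sketched as follows, assuming each answer is a mapping from guessed tag to probability. The function names are hypothetical.

```python
def attempt_mass(guesses):
    """Split one item's probability mass into (attempted, abstained).

    guesses: dict mapping each guessed sense tag to its probability.
    """
    total = sum(guesses.values())
    if total > 1.0 + 1e-9:
        raise ValueError("probabilities for one item must not exceed 1.0")
    return total, 1.0 - total


def items_attempted(all_guesses):
    """Sum the attempted mass over all items; the result may be fractional."""
    return sum(attempt_mass(g)[0] for g in all_guesses)
```

For three items answered with total masses of 0.75, 1.0 and 0 (no guess), the system is counted as attempting 1.75 items.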
Entropy measures are listed for each task. For tasks involving
the disambiguation of only one word or word/part-of-speech pair, these
are computed by taking the distribution of sense tags for that word or
word/part-of-speech pair in the training and test corpora as a
probability distribution, with the probability of each tag
corresponding to the number of times the word or word/part-of-speech
pair appears with that tag divided by the total number of times the
word or word/part-of-speech pair appears. The entropy is then computed
for this probability distribution on sense tags in the normal way. For
tasks involving several words, the entropy is the average of the
entropy for each word or word/part-of-speech pair involved, weighted
by the number of times those words or word/part-of-speech pairs occur
in the task. The fine-grained entropy is measured in this way
on all sense tags, while the coarse-grained entropy is measured
by first converting all subsense-level tags in the corpora to their
corresponding main-sense-level tags, and then computing the
probability distribution on only these. For more details, please see
http://www.cis.upenn.edu/~josephr/senseval-results.ps.gz.
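A sketch of the entropy computation is given below. The logarithm base is not stated here; base 2 (bits) is assumed. The data layout, tags as dotted codes, and the function names are likewise assumptions.

```python
import math
from collections import Counter


def entropy(tags, coarse=False):
    """Entropy (assumed base 2) of the sense-tag distribution for one word.

    tags: list of sense tags observed for the word in the corpora.
    coarse=True first maps each subsense code to its main sense code.
    """
    if coarse:
        tags = [t.split(".", 1)[0] for t in tags]
    counts = Counter(tags)
    n = len(tags)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def task_entropy(tags_by_word, occurrences, coarse=False):
    """Average of the per-word entropies, weighted by occurrences in the task.

    tags_by_word: dict word -> list of tags; occurrences: dict word -> count.
    """
    total = sum(occurrences[w] for w in tags_by_word)
    return sum(entropy(tags, coarse) * occurrences[w]
               for w, tags in tags_by_word.items()) / total
```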
If no systems of a particular category attempted a particular task,
there is no summary table for systems of that category in the task
summary.
The tasks trainable-verbs and untrainable-verbs are
missing from the list of tasks because all verbs are in the
trainable category. Hence trainable-verbs is equivalent to
verbs and untrainable-verbs is a task with no items. The tasks
untrainable-adjectives and untrainable-indeterminates
are equivalent to deaf-a and hurdle-p respectively, since these are the
only adjective and indeterminate part-of-speech files in the evaluation
set for which no training data was supplied.
For each task, a second set of statistics is also supplied. These
reflect results submitted only after the main deadline for the
evaluation.
eval
(after deadline)
trainable
(after deadline)
untrainable
(after deadline)
multi-word
(after deadline)
simple-word
(after deadline)
unassignable
(after deadline)
proper
(after deadline)
nouns
(after deadline)
all-nouns
(after deadline)
verbs
(after deadline)
all-verbs
(after deadline)
adjectives
(after deadline)
all-adjectives
(after deadline)
indeterminates
(after deadline)
determinates
(after deadline)
trainable-nouns
(after deadline)
all-trainable-nouns
(after deadline)
untrainable-nouns
(after deadline)
all-untrainable-nouns
(after deadline)
trainable-adjectives
(after deadline)
all-trainable-adjectives
(after deadline)
all-untrainable-adjectives
(after deadline)
trainable-indeterminates
(after deadline)
trainable-determinates
(after deadline)
low-polysemy
(after deadline)
high-polysemy
(after deadline)
low-entropy
(after deadline)
high-entropy
(after deadline)
Granularity of scoring
Each participating system is scored in the following ways on each task:
Minimal scoring
The minimal score of a system on a task is the score of the system on
the subset of items in the task that are tagged with exactly one
sense tag in the key. Hence, items that are tagged with multiple
sense tags in the key, due to annotator disagreement or other
complications, are ignored in this scoring. Minimal scoring is done
for all granularities of scoring. Items that are minimal under
coarse-grained scoring, though, are not necessarily also minimal under
fine-grained scoring, since an answer with multiple fine-grained sense
tags in the key may collapse to a single coarse-grained sense tag if
all the fine-grained tags are subsenses of the same main sense.
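The dependence of minimality on granularity can be shown with a short sketch, assuming key tags are written as dotted codes. The name `is_minimal` is hypothetical.

```python
def is_minimal(key_tags, coarse=False):
    """True if the key holds exactly one distinct sense tag for the item.

    coarse=True collapses each subsense code to its main sense code first,
    so an item can be minimal under coarse-grained scoring only.
    """
    tags = {t.split(".", 1)[0] for t in key_tags} if coarse else set(key_tags)
    return len(tags) == 1
```

An item keyed with both `1.1` and `1.2` is not minimal under fine-grained scoring, but it is minimal under coarse-grained scoring because both tags collapse to `1`.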
System categories
Systems are classified into three categories. Comparisons should only
be made between systems of the same category.
Baselines
For comparison, a number of baselines have been included in the
scoring summaries and detailed results.
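random (comparable to All-words systems)
random-with-phrase-filter (comparable to All-words systems)
random-main (comparable to All-words systems)
most-examples (comparable to All-words systems)
most-examples-with-phrase-filter (comparable to All-words systems)
most-examples-subsumer (comparable to All-words systems)
lesk (comparable to All-words systems)
lesk-with-phrase-filter (comparable to All-words systems)
lesk-definitions (comparable to All-words systems)
lesk-definitions-with-phrase-filter (comparable to All-words systems)
commonest (comparable to Supervised-training systems)
commonest-for-inflected (comparable to Supervised-training systems)
commonest-for-inflected-with-phrase-filter (comparable to Supervised-training systems)
lesk-plus-corpus (comparable to Supervised-training systems)
lesk-plus-corpus-with-phrase-filter (comparable to Supervised-training systems)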
Human performance
There is one set of scores for each task corresponding to the
annotations made by the human lexicographers who initially
marked up the test corpus (independently of the construction of the
gold standard). This should give a rough idea of the human performance
on each task. In general, human performance figures are higher than
those typically reported in the sense-disambiguation literature, but
these results may not be meaningfully comparable to those of other
experiments. Therefore, unwarranted significance should not be read
into these figures.
Task summaries
A list of task summaries for the main task and each of the subtasks in
the evaluation can be found below. Along with each summary, detailed
results for the task are also reported.
eval: The main task, comprising all test items.
trainable: The subset of test items in files of words for which corpus training data was supplied.
untrainable: The complement of trainable.
multi-word: The subset of test items tagged with sense tags for word forms that are not derivable from the root form of a test word by a regular morphological process.
simple-word: The complement of multi-word.
unassignable: The subset of test items tagged with an UNASSIGNABLE sense tag in the key.
proper: The subset of test items tagged with a PROPER NOUN sense tag in the key.
nouns: All test items in files with -n suffix.
all-nouns: All test items in nouns, plus test items in files with -p suffix that were tagged with noun sense tags in the key.
verbs: All test items in files with -v suffix.
all-verbs: All test items in verbs, plus test items in files with -p suffix that were tagged with verb sense tags in the key.
adjectives: All test items in files with -a suffix.
all-adjectives: All test items in adjectives, plus test items in files with -p suffix that were tagged with adjective sense tags in the key.
indeterminates: All test items in files with -p suffix; the part of speech of the word to be disambiguated has not been predetermined.
determinates: All test items for ambiguous words with a predetermined part of speech; the complement of indeterminates.
trainable-nouns: The intersection of nouns and trainable.
all-trainable-nouns: The intersection of all-nouns and trainable.
untrainable-nouns: The intersection of nouns and untrainable.
all-untrainable-nouns: The intersection of all-nouns and untrainable.
trainable-adjectives: The intersection of adjectives and trainable.
all-trainable-adjectives: The intersection of all-adjectives and trainable.
all-untrainable-adjectives: The intersection of all-adjectives and untrainable.
trainable-indeterminates: The intersection of indeterminates and trainable.
trainable-determinates: The intersection of determinates and trainable.
low-polysemy: Items involving words whose polysemy is less than the median polysemy of 8.
high-polysemy: Items involving words whose polysemy is equal to or greater than the median polysemy of 8.
low-entropy: Items involving words whose entropy is less than the median entropy of 1.85.
high-entropy: Items involving words whose entropy is equal to or greater than the median entropy of 1.85.