This paper should be read in conjunction with "A proposal for SENSEVAL scoring scheme" by Dan Melamed and Philip Resnik (hereafter MR), which proposes, and presents the mathematics for, a probabilistic approach to scoring.
Terminology: a case or instance refers to a corpus instance of a word to be tagged, with associated context. Word will be used to mean a lemma or dictionary headword.
Human taggers will sometimes give disjunctive taggings eg "sense 1 or sense 3". Also, human taggers frequently gave different tags in the first pass and in these cases, the data was sent to a fourth human tagger to edit out errors and wayward interpretations, and to reclassify other differences as disjunctions. Around 15% of gold-standard instances are disjunctions.
PROPOSAL: All scores are calculated in each of three ways:
Minimal to be used as the "score-of-reference", as it has the simplest interpretation.
These suffixes will be useful when we come to analyse the gold standard data. For scoring, however, the only one it is viable to use is the "in proper noun" suffix, which can be treated as "either the specified sense or PROPER". All other suffixes are to be ignored. One consequence is that there is not always a match between the POS of the sense and the POS of the corpus instance: where a verbal sense is used adjectivally (eg as a participle), the POS of the sense will be "v" but the POS of the instance will be "a".
I argue against this approach, as it is often not lexicographically valid to treat a sense as including its subsenses, even though their meanings are close (see float, senses 13 and 13.1).
PROPOSAL: Treat sense/subsense disagreements in the same way as other sorts of disagreements, except when scoring at the ord and main-sense levels only (see next section), in which case they will collapse to the same main sense.
The ord level may look tempting to IR people. However this level of distinction is only used for band, sack, scrap, slight (once its uses for phrasal verbs are excluded) out of the 35 test words.
Two modes of scoring: coarse looks only at ord and main-sense levels, fine looks at all levels. For coarse, all numbered- and lettered-subsense distinctions are collapsed to their parent main sense in both the Gold Standard and the system output. (This `collapsing' in the Gold Standard will mean that the set of instances to be used for Minimal scoring may increase, as subsense-level disjunctions in the gold standard will cease being disjunctions when viewed coarsely.)
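The collapsing step for coarse scoring can be sketched as follows. This is a minimal illustration, assuming that subsense labels extend their parent main-sense number with a dotted number or a letter (eg "13.1", "13a"); the function name and label format are assumptions for illustration, not the official HECTOR encoding.

```python
import re

def collapse_to_main_sense(sense: str) -> str:
    """Collapse a subsense label to its parent main sense.

    Assumes main senses are numbered ("13") and subsenses extend them
    with a dotted number ("13.1") or a letter ("13a") -- an assumed
    label format, for illustration only.
    """
    m = re.match(r"(\d+)", sense)
    return m.group(1) if m else sense

# A subsense-level disjunction can cease to be a disjunction when
# viewed coarsely, so the instance becomes usable for Minimal scoring:
fine_gold = {"13.1", "13a"}            # disjunctive at the fine level
coarse_gold = {collapse_to_main_sense(s) for s in fine_gold}
# coarse_gold == {"13"} -- a single tag at the coarse level
```

This also shows why the set of instances usable for Minimal scoring may grow under coarse scoring: the disjunction above disappears once both disjuncts map to main sense 13.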
In both cases all noun compounds and phrasal verbs are treated as distinct, as in HECTOR.
MR scoring pays heed to the hierarchical structure of the entry. However, phrasal verbs and noun compounds are addressed outside the semantic hierarchy, and ord labels are rare, so it is inappropriate to treat any levels above the main-sense level as hierarchical. Lexical entries will not be treated as having any hierarchy above the main sense level.
SENSEVAL will not make any use of the lettered-vs.-numbered subsense distinction.
Minor word-class distinctions (eg count vs. mass nouns, trans vs. intrans verbs) play no particular role in the scoring. If HECTOR distinguishes them, they are distinct for evaluation purposes, if not, not.
The one complication the human taggers have noted has only indirect bearing on the scoring, but I mention it here for completeness. It relates to variability in MWEs, eg, should cook in "Too many cooks!" be classified as the idiomatic sense where the full form of the idiom is "too many cooks spoil the broth"?
The HECTOR lexicography takes semantics as primary, with syntax taking a secondary role. Groupings are firstly on the basis of meaning. This seems broadly appropriate for WSD, which is a semantic tagging task. One outcome is that the syntactic coding occurring under the "gr" and "clues" tags in HECTOR lexical entries is not to be taken as definitive. The human taggers have noted many occasions where a corpus instance fits the meaning of a given sense but does not match the grammatical coding. In such cases their instructions have been to give precedence to the meaning. Also, the "HECTOR Lexicographical Policy and Procedures" document does not specify whether the default reading for grammatical codes is that they always apply when a word is being used in that sense, or that they are salient for the sense in some weaker way. The taggers' evidence suggests that the "always" reading would not be appropriate.
Minor word classes as specified under "gr" and "clues" should only be read as indicative, not as a necessary condition for the sense.
Systems may return single tags or multiple tags for an instance, and if multiple tags are returned, these may be either weighted or unweighted. Details of the format for returning results are specified separately.
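One convenient internal representation for such answers, sketched below, treats every answer as a sense-to-weight distribution, with unweighted multiple tags treated as equiprobable. This is an assumption in the spirit of the MR probabilistic scheme, not the official submission format; the function name is invented for illustration.

```python
def as_distribution(tags, weights=None):
    """Turn a system answer into a normalised sense->probability dict.

    `tags` is a single sense label or a list of labels. Unweighted
    multiple tags are treated as equiprobable -- an assumption, not
    the official SENSEVAL format.
    """
    if isinstance(tags, str):
        tags = [tags]
    if weights is None:
        weights = [1.0] * len(tags)
    total = sum(weights)
    return {t: w / total for t, w in zip(tags, weights)}

# Sense labels below are invented for illustration.
as_distribution("bank/1")                      # {"bank/1": 1.0}
as_distribution(["bank/1", "bank/2"])          # uniform: 0.5 each
as_distribution(["bank/1", "bank/2"], [3, 1])  # weighted: 0.75, 0.25
```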
Many systems will be disambiguating according to the dictionary they usually use, and then mapping to HECTOR senses. Since such mappings are never entirely one-to-one, the mapping involves information loss.
It is not desirable that systems are scored more harshly because they used a different inventory, and mapped, but, given the problems of dictionary-mapping, it is hard to avoid.
Mappings from WordNet 1.5 and WordNet 1.6 to HECTOR are available.
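The information loss involved in mapping can be made concrete with a small sketch. The table and sense labels below are wholly invented for illustration; the point is only that where a source sense maps to several HECTOR senses, a system's single confident answer is forced to become a multi-tag HECTOR answer.

```python
# Hypothetical mapping table (all labels invented for illustration):
# one source-inventory sense may correspond to several HECTOR senses.
WN_TO_HECTOR = {
    "shake%2:38:00": {"shake/1"},
    "shake%2:38:01": {"shake/2", "shake/3"},  # one-to-many: information loss
}

def map_answer(wn_sense):
    """Map a WordNet answer to the corresponding set of HECTOR tags.

    Where the mapping is one-to-many, the system must return multiple
    (unweighted) HECTOR tags even if it was certain of a single
    WordNet sense -- this is the information loss in question.
    """
    return WN_TO_HECTOR.get(wn_sense, set())
```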
I do not see this as presenting any problem. Research involving fewer person-months, or less experienced researchers, is likely to cover less of the data, but such research teams are to be encouraged to participate. Percentage-correct scores based on 25% of the whole dataset can be compared with ones based on the whole dataset (though clearly, if, eg, nouns are easier than verbs and a system does only nouns, it should be compared with other systems' performance on nouns only).
Other systems may fail to attempt to disambiguate an instance, not because they could not handle that kind of case in principle, but because there was insufficient evidence in that particular case. This is a different sort of issue, relating to IR recall. I do not know if any participating systems will operate in this way.
PROPOSAL: for each word, for each system that could have attempted to disambiguate it, a "percentage attempted" figure is provided alongside the "percentage correct". (Default is 100%.)
The percentage correct for that word is then calculated in six ways: minimal, generous, and stingy, each in both coarse and fine modes.
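The two headline figures can be sketched as below. This is only an illustrative sketch under stated assumptions: it assumes "percentage attempted" is taken over all instances and "percentage correct" over attempted instances, and it shows a single generous-style match (any disjunct of a disjunctive gold tag counts); the minimal and stingy variants, whose exact definitions live in the scoring scheme itself, would filter or match differently.

```python
def word_figures(instances):
    """Compute the two headline figures for one word.

    `instances` is a list of (gold_tags, answer) pairs: gold_tags is a
    set of sense labels (more than one = a disjunctive gold standard),
    answer is a sense label or None if the system did not attempt it.
    Matching is generous-style (any disjunct counts) -- an assumption;
    minimal and stingy variants would differ.
    """
    attempted = [(g, a) for g, a in instances if a is not None]
    correct = sum(1 for g, a in attempted if a in g)
    pct_attempted = 100.0 * len(attempted) / len(instances)
    pct_correct = 100.0 * correct / len(attempted) if attempted else 0.0
    return pct_attempted, pct_correct

# Sense labels invented for illustration.
figs = word_figures([
    ({"1"}, "1"),        # correct
    ({"1", "3"}, "3"),   # disjunctive gold: generous match
    ({"2"}, "1"),        # wrong
    ({"2"}, None),       # not attempted
])
# figs == (75.0, 66.66...)
```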
There will also be four figures for percentage-applicability: an overall figure, then a breakdown into three classes of reason why instances did not count.
The basic form of the results will be this set of 10 figures for each cell in an N-by-41 grid (where N is the number of participating systems and 41 is the number of tasks, each defined as a word and either a POS or "p", signifying "all-POSes"). Some cells will be empty because that system did not attempt that task. Each of the 41 tasks will be associated with between 47 and 431 instances.
Global figures, for a class (eg all nouns) or for the whole set can be produced in the same way as word-specific figures. Of course, we should be extremely wary of such figures as they will usually gloss over a multitude of very different results.
20 July 1998