Report on ELRA 10th Anniversary Workshop on Human Language Technology (HLT) Evaluation, Malta, 1-2 December 2005

Workshop:

The purpose of the workshop was to "bring together the HLT Evaluation key players to discuss HLT evaluation from various perspectives: general principles and purposes, technologies, past and on-going evaluation projects, worldwide initiatives, etc." The main objective was "fruitful brainstorming on HLT evaluation starting from what is being done today and what should be done better, differently, etc.", addressing all sectors of HLT.

In my view the event was a bit MT-heavy. There was only a single talk on evaluation in NLA, and nothing on summarisation. But there was a real buzz and excitement, and I had the impression that everybody found it a very fruitful event and thoroughly enjoyed themselves.

There are plans for a follow-up event at LREC'06, an HLT evaluation email list, possibly a state-of-the-art survey on HLT evaluation, and several other activities. A web portal for HLT evaluation has already been created.

Programme and other workshop information are available from the workshop website: http://www.elra.info/article.php3?id_article=106

Participants:

There were just under 40 participants from all over Europe (mainly from France, Italy, UK and Germany), one from Japan and two from the USA. About half of the participants were from universities, the others from various public and private organisations including the French Ministry of Research, IBM, Siemens, Nuance, Mitre, Vecsys, Linguatec and the European Commission. A list of participants is available from the website.

Main Topics and Comments in Presentations and Discussions:

1. The biggest buzz phrase of the workshop was "permanent European infrastructure for HLT evaluation". People in academia especially, but also in industry, felt strongly that we need such an infrastructure, analogous to some extent to NIST in the US.
However, NIST is part of the US Department of Commerce, and is a large body with many permanent employees. It's hard to see where that kind of political will and money would come from in the EU, and surprisingly no one said anything concrete about how or by whom it might be funded or organised. Some people felt that the EC should provide funding, but Mats Ljungkvist (there to represent the EC) was clear that he saw no chance of that, stressing in fact that any evaluation money would have to come out of general research funding, something many people there felt was not a good idea (see below).

2. In methodological terms, many speakers and commenters talked about "the need to evaluate HLT technology in real use". One example given of bad evaluation practice that ignores real use was Information Retrieval (IR): system response time must be less than 3 seconds for users to accept a system, yet it is not taken into account in evaluations. Generally, technology should be embedded (another buzz word) in actual applications and evaluated in this context. E.g. in voice-controlled sat nav evaluation, "you need to put the evaluator in the driving seat".

3. Evaluation of components vs. systems: overall, people felt that HLT should be evaluated at different levels. On the one hand there's a need to do black-box evaluation of software, of end-to-end systems. But it's also important to evaluate core technology and system components in their own right. E.g. in text-to-speech you need to evaluate not just the quality of the speech, but also the contribution of the speech component to an integrated system.

4. A lot of participants talked about "user-based" evaluation. On the one hand, this means not leaving the user out of the equation: we need to assess not only how good the output is, but also how satisfied the user is and how well they are able to perform a task.
On the other hand, the user adds another dimension of variation in many evaluation contexts and should be regarded as another embedded component, to be evaluated in their own right (e.g. some users are better/worse at information finding tasks, independently of the IR system they're using).

5. There was a general dislike (and distaste) for automatic evaluation metrics, in particular BLEU. It seemed to me that this was in part due to an incomplete understanding of how BLEU in particular works. E.g. one comment referred to "the single reference text" used with BLEU (which is intended to be used with several), and one slide contained the question "what does a BLEU score of 3.6 mean?" (BLEU scores range from 0 to 1). No one cited any results or studies that showed BLEU (or other metrics) to produce results that disagree with human judgments (whereas BLEU's proponents can point to long lists of citations in its favour).

6. There seems to be an expectation among some researchers that a set of reusable, task-independent evaluation features can in principle be found that could provide an answer to the question "how good is this (say) MT system in general?" One example that was mentioned was sets of SL/TL features for MT (yet to be discovered). However, even if such sets could be found, presumably evaluation tests would still have to be carried out for a range of different translation tasks, so feature-based evaluation may not inherently be able to provide more general results than automatic metrics.

7. A familiar irritation was felt with US superiority (actual as well as simply assumed by US researchers). People emphasised that Europe should set its own challenges, and that especially in MT, we should focus on and take advantage of the unique situation we find ourselves in here in Europe, with 20 official EU languages for which corpora are produced routinely, every day. Related to this, there was a general consensus concerning a permanent European HLT evaluation infrastructure (see above).
However, people did keep saying that we need something like the US organisation NIST (National Institute of Standards and Technology).

8. Many participants felt that evaluation should be carried out independently of, and separately from, (i) development of technology, and (ii) data creation, and that this division should moreover be reflected in separate institutions (as e.g. in US campaigns where NIST did evaluation, and LDC did data creation).

9. Concerning the organisation of evaluation campaigns, it was my impression that the majority felt that (i) data should (at least initially) be made available only to participants, and coupled with obligations, and (ii) systems, components and specific research activities should all be evaluated to allow many more groups to participate (e.g. in TC-STAR: Speech-to-Speech Translation (SST) systems, speech recognition, spoken language translation and speech synthesis modules, and e.g. cutting-edge research on expressive speech).
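
A note on point 5: the two properties of BLEU mentioned there (scores lie between 0 and 1, and the metric is designed for several reference translations) can be illustrated with a small sketch. This is a simplified, unsmoothed sentence-level variant written for this report, not the official implementation. The function names, the toy sentences and the `max_n=2` setting are all assumptions chosen for illustration; BLEU is normally computed over a whole test corpus with n-grams up to length 4.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    often as it occurs in the reference where it is most frequent."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights (illustrative)."""
    if not candidate:
        return 0.0
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean is zero if any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Effective reference length: the reference closest in length to the
    # candidate (ties broken in favour of the shorter reference).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Toy data (hypothetical): one system output, two reference translations.
candidate = "the cat sat on the mat".split()
ref_a = "the cat is on the mat".split()
ref_b = "a cat sat on a mat".split()

single = bleu(candidate, [ref_a], max_n=2)
multi = bleu(candidate, [ref_a, ref_b], max_n=2)
print(f"one reference:  {single:.3f}")
print(f"two references: {multi:.3f}")
```

Every factor in the score (the clipped precisions and the brevity penalty) is at most 1, so the product cannot exceed 1; a value such as 3.6 can only arise on a different scale, e.g. if scores are reported as percentages. In the toy example the score rises when the second reference is added, because wording that the first reference happens not to use is no longer penalised.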