|14.00||Opening Remarks (Co-chair)|
|14.15||Paul Rayson and Roger Garside, Lancaster University, UK||Comparing Corpora Using Frequency Profiling|
|14.40||George Tambouratzis, Stella Markantonatou, Nikolaos Hairetakis, Marina Vassiliou, Dimitrios Tambouratzis and George Carayannis, ILSP, Athens, Greece||Discriminating the registers and styles in the Modern Greek Language|
|15.05||Patrick Ruch and Arnaud Gaudinat, Geneva University Hospital and University of Geneva, Switzerland||Comparing Corpora and Lexical Ambiguity|
|15.45||Chikashi Nobata, Nigel Collier and Jun'ichi Tsujii, Kansai Advanced Research Center and University of Tokyo, Japan||Comparison between Tagged Corpora for the Named Entity Task|
|16.10||Douglas Roland, Daniel Jurafsky, Lise Menn, Susanne Gahl, Elizabeth Elder and Chris Riddoch; Colorado and Harvard Universities, USA||Verb Subcategorization Frequency Differences between Business-News and Balanced Corpora: the role of verb sense|
|16.35||Discussion||The role and importance of comparing corpora: the way forward|
Paul Rayson and Roger Garside, Lancaster University, UK
Comparing Corpora Using Frequency Profiling
This paper describes a method of comparing corpora which uses frequency profiling. The method can be used to discover key words in the corpora which differentiate one corpus from another. Using annotated corpora, it can be applied to discover key grammatical or word-sense categories. This can be used as a quick way in to find the differences between the corpora and is shown to have applications in the study of social differentiation in the use of English vocabulary, profiling of learner English and document analysis in the software engineering process.
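The abstract does not say which statistic the frequency profile is scored with. As a minimal sketch, assuming a log-likelihood ratio computed from each word's frequency in the two corpora, key words could be ranked as follows. The function names `log_likelihood` and `key_words` and the whitespace tokenisation in the usage note are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood score for one word (assumed statistic).

    a, b: frequency of the word in corpus 1 and corpus 2
    c, d: total number of tokens in corpus 1 and corpus 2
    """
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def key_words(tokens1, tokens2, top=10):
    """Rank words by how strongly their frequency differs between corpora."""
    f1, f2 = Counter(tokens1), Counter(tokens2)
    n1, n2 = len(tokens1), len(tokens2)
    scores = {
        w: log_likelihood(f1[w], f2[w], n1, n2)
        for w in set(f1) | set(f2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Calling `key_words("the cat sat on the mat".split(), "the dog sat on the log".split())` puts words that occur in only one corpus at the top, while words with identical relative frequency in both score zero. Replacing the word tokens with part-of-speech or semantic tags would give the same kind of profile over grammatical or word-sense categories.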
George Tambouratzis et al., ILSP, Athens, Greece
Discriminating the registers and styles in the Modern Greek Language
This paper reports on the discrimination of registers and styles in written Modern Greek. Our research has focused on Modern Greek political speech as recorded in the Greek Parliament Proceedings and investigates (i) the relationship between this register and other registers such as fiction and academic prose, as well as (ii) the variation of styles within this register. The application of clustering techniques indicates that the particular political speech texts form a cluster distinct from other registers. The use of discriminant analysis techniques indicates that the styles of individual speakers within the particular political speech register may be discriminated with a high degree of accuracy.
Patrick Ruch and Arnaud Gaudinat, Geneva University Hospital and University of Geneva, Switzerland
Comparing Corpora and Lexical Ambiguity
In this paper we compare two types of corpus, focusing on the lexical ambiguity of each. The first corpus consists mainly of newspaper articles and literature excerpts, while the second belongs to the medical domain. To conduct the study, we have used two different disambiguation tools. We first verify the performance of each system in its respective application domain. We then use these systems to assess and compare both the general ambiguity rate and the particularities of each domain. Quantitative results show that medical documents are lexically less ambiguous than unrestricted documents. Our conclusions show the importance of the application area in the design of NLP tools.
Chikashi Nobata, Nigel Collier and Jun'ichi Tsujii, Kansai Advanced Research Center and University of Tokyo, Japan
Comparison between Tagged Corpora for the Named Entity Task
We present two measures for comparing corpora, based on information theory statistics such as gain ratio as well as simple term-class frequency counts. We tested the predictions made by these measures about corpus difficulty in two domains (news and molecular biology), using the results of two well-used paradigms for NE, decision trees and HMMs, and found that gain ratio was the more reliable predictor.
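The abstract names gain ratio but does not say how it is applied to a corpus. As a minimal sketch, assuming each token is treated as a feature value and its named-entity class as the label, the standard gain ratio (information gain divided by split information) could be computed as below. The function names and the `(token, class)` pair representation are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a collection of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def gain_ratio(pairs):
    """Gain ratio of a feature with respect to a class label.

    pairs: iterable of (feature_value, class_label), e.g. (token, NE class).
    """
    pairs = list(pairs)
    n = len(pairs)
    class_counts = Counter(label for _, label in pairs)
    feature_counts = Counter(value for value, _ in pairs)

    h_class = entropy(class_counts.values())

    # Conditional entropy H(class | feature), weighted by feature frequency.
    h_cond = 0.0
    for value, count in feature_counts.items():
        sub = Counter(label for v, label in pairs if v == value)
        h_cond += count / n * entropy(sub.values())

    # Split information penalises features with many distinct values.
    split_info = entropy(feature_counts.values())
    if split_info == 0:
        return 0.0
    return (h_class - h_cond) / split_info
```

Under this reading, a corpus whose tokens predict their class well yields a gain ratio near 1, and one whose tokens carry no class information yields 0.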
Douglas Roland, Daniel Jurafsky, Lise Menn, Susanne Gahl, Elizabeth Elder and Chris Riddoch; Colorado and Harvard Universities, USA
Verb Subcategorization Frequency Differences between Business-News and Balanced Corpora: the role of verb sense
We explore the differences in verb subcategorization frequencies across several corpora in an effort to obtain stable cross-corpus subcategorization probabilities for use in norming psychological experiments. For the 64 single-sense verbs we looked at, subcategorization preferences were remarkably stable between British and American corpora, and between balanced corpora and financial news corpora. Where verbs did show differences, these were generally found between the balanced corpora and the financial news data. We show that all or nearly all of these shifts in subcategorization are realised via (often subtle) word sense differences. This is an interesting observation in itself, and also suggests that stable cross-corpus subcategorization frequencies may be found when verb sense is adequately controlled.