"Metrics for corpus similarity and homogeneity"

Adam Kilgarriff
University of Brighton

How similar are two corpora? A measure of corpus similarity would be very useful in NLP, for example for estimating the cost of porting a system from one domain to another. Corpus similarity can only be interpreted in the light of corpus homogeneity. The information-theoretic measure 'perplexity' has often been used to indicate the homogeneity of texts. In this paper we show how perplexity and cross-entropy can be used to measure corpus homogeneity and similarity. We then compare these measures with the word-frequency-based measures presented in Kilgarriff (1997). The difficulties of defining what we mean by 'corpus similarity' are briefly discussed. The various metrics are evaluated using purpose-built sets of "known-similarity corpora", a method which substantially circumvents the theoretical difficulties, and a chi-square-based measure is shown to be the best of those tested.
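
As a rough illustration of what a word-frequency-based, chi-square-style comparison of two corpora can look like, the sketch below sums the chi-square statistic over the most frequent words of the combined corpora. The function name `chi_square_distance`, the `top_n` parameter and its default, and the pre-tokenized input are assumptions made for illustration; this is not necessarily the exact formulation evaluated in the paper.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of both corpora.

    Hypothetical helper for illustration: lower values suggest that the two
    word-frequency profiles are more alike.
    """
    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)
    size_a = sum(counts_a.values())
    size_b = sum(counts_b.values())
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must contain at least one token")

    total = size_a + size_b
    joint = counts_a + counts_b  # word counts over the combined corpora

    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        for counts, size in ((counts_a, size_a), (counts_b, size_b)):
            # Expected count if the word were spread in proportion to corpus size.
            expected = joint_count * size / total
            observed = counts[word]
            chi2 += (observed - expected) ** 2 / expected
    return chi2
```

Under these assumptions, comparing a corpus with itself gives a score of zero, and the score grows as the two frequency profiles diverge. Because the raw statistic also grows with corpus size, it is best compared across corpus pairs of similar size.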