Architectures for Multilingual Lexical Representation

Carole Tiberius
ITRI, University of Brighton

Most work on multilingual lexicons so far has assumed monolingual lexicons linked only at the level of semantics.

Cahill and Gazdar (1999) argue that this approach might be appropriate for unrelated languages, but that it makes it impossible to capture useful generalisations about related languages. Closely related languages exhibit many similarities at all levels of lexical description - morphology, phonology, morphophonology, orthography, syntax, etc. - not just semantics. Compare, for example, the forms of the verb "sing" in Dutch, English, and German:

		sing - sang - sung		(English)
		zing - zong - gezongen		(Dutch)
		sing - sang - gesungen		(German)

Such similarities, if captured, can help to produce more robust natural language processing systems for these languages. Cahill and Gazdar describe an architecture which aims to encode and exploit lexical similarities between closely related languages. They applied this architecture in the PolyLex project to define a trilingual hierarchical lexicon for Dutch, English, and German that shares morphological, phonological, and morphophonological information between these languages.
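
The idea of sharing lexical information through inheritance can be illustrated with a minimal Python sketch. It is not the PolyLex implementation or its analysis: the class Node, the function forms, and the feature names (onset, vowel_past, part_prefix, and so on) are hypothetical, and the segmentation works on the orthographic forms shown above purely for illustration. A shared node states the common pattern once, and each language node overrides only the features on which it differs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A lexicon node that inherits any feature it lacks from its parent."""
    name: str
    parent: Node | None = None
    features: dict[str, str] = field(default_factory=dict)

    def get(self, feature: str) -> str:
        # Walk up the hierarchy until some node defines the feature.
        node: Node | None = self
        while node is not None:
            if feature in node.features:
                return node.features[feature]
            node = node.parent
        raise KeyError(f"{feature!r} is not defined for {self.name}")


# Hypothetical shared entry: everything stated here is a default.
SHARED = Node("sing/shared", features={
    "onset": "s", "coda": "ng",
    "vowel_pres": "i", "vowel_past": "a", "vowel_part": "u",
    "part_prefix": "ge", "part_suffix": "en",
})

# Each language node lists only its deviations from the shared defaults.
ENGLISH = Node("sing/English", SHARED, {"part_prefix": "", "part_suffix": ""})
DUTCH = Node("sing/Dutch", SHARED,
             {"onset": "z", "vowel_past": "o", "vowel_part": "o"})
GERMAN = Node("sing/German", SHARED)


def forms(node: Node) -> tuple[str, str, str]:
    """Assemble the present, past, and participle forms for one language."""
    def stem(vowel: str) -> str:
        return node.get("onset") + node.get(vowel) + node.get("coda")

    participle = (node.get("part_prefix") + stem("vowel_part")
                  + node.get("part_suffix"))
    return stem("vowel_pres"), stem("vowel_past"), participle


for language in (ENGLISH, DUTCH, GERMAN):
    print(language.name, " - ".join(forms(language)))
```

In this sketch the German node needs no overrides at all, while English and Dutch each state only a handful of differences, which is the kind of economy a shared hierarchy is meant to provide.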

In this talk, I look at the methodological and theoretical issues raised by the development of such multilingual inheritance-based lexicons, focusing in particular on how such a lexicon could best be structured. I will discuss three architectures: the structure-sharing model, the meta-features model, and the micro-features model. I will weigh the advantages and disadvantages of these models with reference to sample lexical fragments of Danish, Dutch, English, and Icelandic.

References

Cahill, L. and G. Gazdar. 1999. The POLYLEX architecture: multilingual lexicons for related languages. In Traitement Automatique des Langues, 40:1.