The purpose of the SOLE project is to explore ways in which high-level linguistic information can improve the quality of intonation in synthetic speech. To this end, we have constructed the SOLE concept-to-speech system, which consists of a natural language generation component, a speech synthesis system, and an XML-based annotation scheme that serves as an interface between them. The natural language generation system produces linguistic information about the text that it generates and automatically annotates the text with this information. The speech synthesis system then extracts the relevant linguistic information from the annotated text and uses it when producing the intonation.
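As a rough illustration of the interface described above, the following sketch shows how a synthesis component might pull per-word linguistic information out of annotated text. The element name `w`, the attributes `status` and `ref`, and the sample sentence are assumptions made up for this example; they are not the actual SOLE annotation scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are illustrative only
# and do not reproduce the real SOLE annotation scheme.
ANNOTATED = (
    '<utterance>'
    '<w>This</w> <w>brooch</w> <w>was</w> <w>designed</w> <w>by</w> '
    '<w status="new">Jessie</w> <w status="new">King</w>. '
    '<w status="given" ref="pronoun">She</w> <w>was</w> '
    '<w status="new">Scottish</w>.'
    '</utterance>'
)


def extract_features(xml_text):
    """Return (word, attributes) pairs for every <w> element, in document order."""
    root = ET.fromstring(xml_text)
    return [(w.text, dict(w.attrib)) for w in root.iter("w")]


features = extract_features(ANNOTATED)
for word, attrs in features:
    print(word, attrs)
```

In a setup like this, the generator writes the annotation once and the synthesiser reads only the attributes it knows how to use, so the two components stay loosely coupled.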
The SOLE system is designed to work as a portable museum guide: visitors to a museum carry a portable device which detects which exhibits they are looking at and gives a spoken explanation. SOLE generates its descriptions from a database of the museum exhibits' properties. As it keeps a record of which exhibits have already been visited, it is able to generate descriptions of new exhibits with reference to previous ones. This gives rise to a large number of discourse-level linguistic phenomena, such as various types of anaphoric reference (e.g., pronouns, definite descriptions, bridging references) and rhetorical relations (e.g., contrasting two exhibits or amplifying a particular property of an exhibit).
After choosing an initial set of linguistic constructs thought to have some influence on intonation, we developed the XML-based annotation scheme to serve as a general interface between natural language generation and speech synthesis systems. We then trained a CART model to recognise correlations between the annotation and accenting, so that the synthesis system can make use of the annotation when producing the intonation. As a result, many of the errors that the synthesiser would otherwise make in deciding when to accent or deaccent a word are absent from the SOLE output. I will discuss the current results and their implications for text-to-speech systems in cases where it is realistic to use statistical methods to exploit certain types of high-level linguistic information.
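To make the idea of learning accenting from annotation more concrete, here is a minimal sketch of the splitting step at the heart of a CART-style classifier: it picks the single feature test that minimises weighted Gini impurity. The feature names (`status`, `pos`) and the six training examples are invented for illustration; they are not SOLE data, and this is not the model that was actually trained.

```python
from collections import Counter

# Invented toy examples: (annotation features, is the word accented?).
# These are NOT SOLE data; they only illustrate the technique.
DATA = [
    ({"status": "new", "pos": "noun"}, True),
    ({"status": "new", "pos": "adj"}, True),
    ({"status": "given", "pos": "noun"}, False),
    ({"status": "given", "pos": "pronoun"}, False),
    ({"status": "new", "pos": "noun"}, True),
    ({"status": "given", "pos": "adj"}, False),
]


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(data):
    """Return (feature, value, weighted_gini) for the best test `feature == value`."""
    candidates = sorted({(f, v) for feats, _ in data for f, v in feats.items()})
    best = None
    for feature, value in candidates:
        left = [y for x, y in data if x.get(feature) == value]
        right = [y for x, y in data if x.get(feature) != value]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
        # Strict comparison keeps the first candidate when scores tie.
        if best is None or score < best[2]:
            best = (feature, value, score)
    return best


split = best_split(DATA)
print(split)
```

A full CART learner would apply this step recursively to each resulting subset and then prune the tree; on this toy data the given/new distinction alone separates accented from deaccented words.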