Probabilistic Deep Generation






Project Overview

Investigator: Anja Belz
Researchers: Irene Langkilde-Geary, Eric Kow
Duration: 3 years (start date: 08/05/2007)
Funded by: EPSRC (EP/E029116/1)
More information: Project proposal and workplan.

Acknowledgments:
The support, advice and help received in preparing the original project proposal are gratefully acknowledged, in particular from, in alphabetical order, Terri Epps (Brighton), Roger Evans (Brighton), Gerald Gazdar (Sussex), Rob Koeling (Sussex), Stuart Laing (Brighton), Jerry Loft (Brighton), Ehud Reiter (Aberdeen) and John Taylor (Brighton).

Project summary:
Computational methods for generating language lag behind computational methods for analysing language in several ways, most obviously in that they have barely been used commercially. The main reasons are that language generation systems take an inordinate amount of time to build, cannot be reused once built, and tend to be severely lacking in language variation, which is easily perceived as a lack of quality. The current situation in language generation research is reminiscent of language analysis research in the late 1980s, when symbolic and statistical methods briefly formed entirely separate research paradigms. Language analysis soon moved towards a merger of the two paradigms, on the realisation that symbolic methods lacked the efficiency and robustness that probabilistic methods could provide, while probabilistic methods would in turn benefit from the accuracy and subtlety of symbolic methods. A similar development is currently underway in machine translation where, after several years in which purely statistical methods dominated the field, researchers are now beginning to bring linguistic knowledge back in. The experience from these research fields suggests that higher quality can be achieved when the symbolic and statistical paradigms join forces. Recent research shows that this is likely to be true for language generation too. The aim of the Prodigy project is to develop, for the first time, a comprehensive, linguistically informed, probabilistic methodology for generating language that substantially improves development time, reusability and variation in language generation systems, and thereby enhances their commercial viability.



Data and software resources
  1. Prodigy-METEO Corpus pre-alpha release; data as used in Belz & Kow, 2009; join the Prodigy-METEO user mailing list by emailing Anja Belz (A.S.Belz (at) brighton.ac.uk).
  2. Prodigy-GEO Corpus: due to be released in 2011.


Project publications
  1. Anja Belz and Eric Kow (2010) Extracting Parallel Fragments from Comparable Corpora for Data-to-text Generation. In Proceedings of the 6th International Natural Language Generation Conference (INLG'10), pp. 167-171. [PDF]

  2. Anja Belz and Eric Kow (2010), Assessing the Trade-Off between System Building Cost and Output Quality in Data-to-Text Generation. In Krahmer, E., Theune, M. (eds.) Empirical Methods in Natural Language Generation, Vol. 5980 of Lecture Notes in Computer Science, Springer, pp. 180-200. (Extended version of Belz and Kow, 2009.) [pre-proof PDF]

  3. Anja Belz and Eric Kow (2009), System Building Cost vs. Output Quality in Data-to-Text Generation, Proceedings of the 12th European Workshop on Natural Language Generation (ENLG'09), pp. 16-24. ENLG'09 Best Paper Award. [.pdf]

  4. Anja Belz (2009) Prodigy-METEO: Pre-Alpha Release Notes (Nov 2009). Technical Report NLTG-09-01, Natural Language Technology Group, CMIS, University of Brighton. [PDF]

  5. Irene Langkilde-Geary (2009) A Constraint Programming Approach to Probabilistic Syntactic Processing. In Proceedings of the NAACL'09 Workshop on Integer Linear Programming for Natural Language Processing, pp. 36-37. [PDF]

  6. Anja Belz (2008). Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models. In Natural Language Engineering, 14(4), pp. 431-455. Cambridge University Press. [preproof: .pdf]

  7. Anja Belz (2007). Probabilistic generation of weather forecast texts. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT'07). [.pdf]

  8. I. Langkilde-Geary. Declarative Syntactic Processing of Natural Language Using Concurrent Constraint Programming and Probabilistic Dependency Modeling. In Proceedings of UCNLG+MT, 2007.

Other research using the Prodigy-METEO Corpus:
  1. Brian Langner (2010) Data-driven Natural Language Generation: Making Machines Talk Like Humans Using Natural Corpora. PhD Thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University. [.pdf]

  2. Gabor Angeli, Percy Liang and Dan Klein (2010) A Simple Domain-Independent Probabilistic Approach to Generation. In Proceedings of the 15th Conference on Empirical Methods in Natural Language Processing (EMNLP'10). [.pdf]



Previous publications relevant to Prodigy
  1. Anja Belz (2006). High-quality probabilistic generation of language, Technical Report NLTG-06-02, Natural Language Technology Group, CMIS, University of Brighton. [.pdf]
  2. Anja Belz (2006). pCRU: Probabilistic generation using representational underspecification, Technical Report NLTG-06-01, Natural Language Technology Group, CMIS, University of Brighton. [.pdf]
  3. Anja Belz (2005). Statistical generation: Three methods compared and evaluated. In Proceedings of the 10th European Workshop on Natural Language Generation (ENLG'05), pp. 15-23. [.pdf]
  4. Anja Belz (2005). Corpus-driven generation of weather forecasts. In Proceedings of the 3rd Corpus Linguistics Conference (CL'05). [.pdf]
  5. Anja Belz (2004). Context-Free Representational Underspecification for NLG. Technical Report No. ITRI-04-18. Information Technology Research Institute, University of Brighton, UK. [.pdf]
  6. Anja Belz (2004). Underspecification for NLG. In: Belz et al. (eds.) INLG04 Posters: Extended abstracts of posters presented at the 3rd International Conference on Natural Language Generation (INLG 2004), pp. 9-13. Technical Report No. ITRI-04-01. Information Technology Research Institute, University of Brighton, UK. [.pdf]
  7. Anja Belz (2004). Towards a General Framework for Underspecification in NLG and an Underspecification Language for MRS. Technical Report No. ITRI-04-17. Information Technology Research Institute, University of Brighton, UK. [.pdf]
  8. I. Langkilde-Geary. An Exploratory Application of Constraint Optimization in Mozart to Probabilistic Natural Language Processing. In Proceedings of the International Workshop on Constraint Solving and Language Processing (CSLP), Springer-Verlag LNAI, vol. 3438, 2005.
  9. I. Langkilde-Geary and J. Betteridge. A Factored Functional Dependency Transformation of the English Penn Treebank for Probabilistic Surface Generation, Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), 2006.
  10. I. Langkilde-Geary. An Empirical Verification of Coverage and Correctness for a General-Purpose Sentence Generator. Proceedings of the International Natural Language Generation Conference, 2002.
  11. I. Langkilde. Forest-based Statistical Sentence Generation. North American Meeting of the Association for Computational Linguistics (NAACL), 2000.
  12. I. Langkilde and K. Knight. Generation that Exploits Corpus-based Statistical Knowledge. Proceedings of COLING-ACL, 1998.
  13. I. Langkilde, and K. Knight. The Practical Value of N-Grams in Generation. International Natural Language Generation Workshop, 1998.



Last modified: Wed Nov 10 10:23:56 GMT 2010
Comments to: A dot S dot Belz at brighton dot ac dot uk