Linguistic challenges in automatic summarization technology


  • Elke Diedrichsen, Institute of Technology Blanchardstown



Keywords: automatic summarization, natural language processing, linguistics, syntax, discourse


Automatic summarization is a field of Natural Language Processing (NLP) that is increasingly used in industry. The goal of summarization is to produce a summary of one or more documents that retains the sense and the most important aspects of the source while reducing its length considerably, possibly to a user-defined size. Summarization is either extraction-based or abstraction-based. An extraction-based system copies words and sentences from the original source without modification, whereas an abstraction-based system can compress, fuse or paraphrase sections of the source document. As of today, most summarization systems are extractive. Automatic document summarization presents interesting challenges for NLP: it relies on coreference resolution, discourse analysis, named entity recognition (NER), information extraction (IE), natural language understanding, topic segmentation and recognition, word segmentation and part-of-speech tagging. This study surveys some current approaches to implementing automatic summarization technology and discusses the state of the art of the most important NLP tasks involved. We pay particular attention to current methods of sentence extraction and compression for single- and multi-document summarization, as these applications are based on theories of syntax and discourse, and their implementation therefore requires a solid background in linguistics. Summarization technologies are also used for image collection summarization and video summarization, but the scope of this paper is limited to document summarization.
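To make the notion of extraction-based summarization concrete, the following is a minimal sketch and not the method of any system discussed in the paper. It assumes a simple word-frequency scoring scheme, a naive punctuation-based sentence splitter and a tiny illustrative stop-word list; the names `split_sentences`, `content_words` and `extractive_summary` are hypothetical. The selected sentences are copied verbatim from the source and returned in their original order.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "are", "and", "or",
              "it", "that", "this", "on", "for", "with", "as", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lower-cased word tokens of a sentence, without stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def extractive_summary(text, max_sentences=2):
    """Return up to max_sentences source sentences, unmodified.

    Each sentence is scored by the average document-level frequency of its
    content words; the best-scoring sentences are kept in source order.
    """
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore original order
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    doc = ("Summarization systems shorten documents. "
           "Extractive systems copy sentences from documents. "
           "The weather was nice yesterday. "
           "Abstractive systems paraphrase sentences.")
    for sentence in extractive_summary(doc, max_sentences=2):
        print(sentence)
```

Because sentences are only selected and never rewritten, the output can contain dangling pronouns or broken discourse links. This is one reason why coreference resolution and discourse analysis matter even for extractive systems.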



Author Biography

Elke Diedrichsen, Institute of Technology Blanchardstown

Computational and Functional Linguistics Research Group

