A Lightweight Statistical Method for Terminology Extraction





automatic terminology extraction, corpus-based terminology processing, computational terminology, information extraction, dispersion measures


We propose a method for automatic terminology extraction, developed within a larger project that aims to automate part of the work involved in producing terminological databases. Terminology extraction is the key to drafting the macrostructure of a terminological resource (i.e., the list of entries), to which grammatical or semantic information can later be added at the microstructural level. To this end, we developed a statistical method that is conceptually simple compared to modern neural network approaches. The method is lightweight because it relies on term dispersion and co-occurrence statistics that can be computed on basic hardware. For the evaluation, we experimented with English and Spanish corpora of lexicography and linguistics (ca. 66 million tokens). Results improve on the baselines by almost 20%.
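As a rough illustration of what a dispersion-based statistic can look like, the following minimal sketch ranks candidate tokens by how widely they are spread across the documents of a domain corpus relative to a reference corpus. It is not the method evaluated in the paper: the function names, the whitespace tokenization, the ratio-based score, and the `smoothing` constant are all assumptions made for this example.

```python
from collections import Counter


def document_dispersion(docs):
    """Map each token to the fraction of documents that contain it."""
    doc_freq = Counter()
    for doc in docs:
        # Count each token once per document (naive whitespace tokenization).
        doc_freq.update(set(doc.lower().split()))
    n_docs = len(docs)
    return {token: count / n_docs for token, count in doc_freq.items()}


def rank_candidates(domain_docs, reference_docs, smoothing=0.01):
    """Rank tokens by domain dispersion relative to reference dispersion.

    The ratio and the smoothing constant are hypothetical choices for
    illustration only, not values taken from the paper.
    """
    domain = document_dispersion(domain_docs)
    reference = document_dispersion(reference_docs)
    scores = {
        token: value / (reference.get(token, 0.0) + smoothing)
        for token, value in domain.items()
    }
    # Highest score first: well dispersed in the domain, rare in the reference.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A token that appears in most domain documents but in few reference documents ends up near the top of the ranking, while function words that are well dispersed in both corpora score low. Only per-token document counts are needed, so a sketch like this runs on modest hardware.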



