Journal of Computer-Assisted Linguistic Research https://polipapers.upv.es/index.php/jclr <p style="text-align: justify; text-justify: inter-ideograph; margin: 0cm 0cm 6.0pt 0cm;"><strong>Journal of Computer-Assisted Linguistic Research (JCLR)</strong> is a double-blind peer-reviewed journal that publishes high-quality scientific articles on linguistic studies where computer tools or techniques play a major role. JCLR aims to promote the integration of computers into linguistic research. In particular, articles in JCLR make a clear contribution to research in which software plays a key role in representing and processing written or spoken data. Contributions submitted to JCLR must be in English or Spanish, but we welcome works on the study of any language. Topics of interest include corpus linguistics, computational linguistics, text mining, natural language processing, knowledge representation, discourse analysis, and language-resource construction, among many others.</p> Universitat Politècnica de València en-US Journal of Computer-Assisted Linguistic Research 2530-9455 <p><a href="http://creativecommons.org/licenses/by-nc-nd/4.0/" rel="license"><img src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" alt="Creative Commons License" /></a><br />This journal is licensed under a <a href="http://creativecommons.org/licenses/by-nc-nd/4.0/" rel="license">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a></p> A Comprehensive Review of Sign Language Production https://polipapers.upv.es/index.php/jclr/article/view/20983 <p>Sign languages are made up of phonological, morphological, syntactic and semantic levels of structure that satisfy the same social, cognitive and communicative purposes as other natural languages, and they represent the most widely used form of communication between hearing and deaf people.
Sign Language Production and Sign Language Recognition constitute the two halves of this process: Sign Language Production concerns the direction that goes from spoken language to its translation into Sign Language, while Sign Language Recognition deals with recognizing Sign Language itself. In this article, we review some of the most recent and important studies on Sign Language Production and discuss their limitations, advantages, and future developments.</p> Franco Tuveri Copyright (c) 2024 Journal of Computer-Assisted Linguistic Research https://creativecommons.org/licenses/by-nc-nd/4.0 2024-11-15 2024-11-15 8 1 22 10.4995/jclr.2024.20983 FAQ-Gen: An automated system to generate domain-specific FAQs to aid content comprehension https://polipapers.upv.es/index.php/jclr/article/view/21178 <p>Frequently Asked Questions (FAQs) refer to the most common inquiries about specific content. They serve as content comprehension aids by simplifying topics and enhancing understanding through succinct presentation of information. In this paper, we address FAQ generation as a well-defined Natural Language Processing task through the development of an end-to-end system leveraging text-to-text transformation models. We present a literature review covering traditional question-answering systems, highlighting their limitations when applied directly to the FAQ generation task. We propose a system capable of building FAQs from textual content tailored to specific domains, enhancing their accuracy and relevance. We utilise self-curated algorithms to obtain an optimal representation of information to be provided as input and also to rank the question-answer pairs to maximise human comprehension.
Qualitative human evaluation shows that the generated FAQs are well constructed and readable, while also using domain-specific constructs to highlight domain-based nuances and jargon in the original content.</p> Sahil Kale Gautam Khaire Jay Patankar Copyright (c) 2024 Journal of Computer-Assisted Linguistic Research https://creativecommons.org/licenses/by-nc-nd/4.0 2024-11-15 2024-11-15 8 23 49 10.4995/jclr.2024.21178 Does ChatGPT have sociolinguistic competence? https://polipapers.upv.es/index.php/jclr/article/view/21958 <p>Large language models are now able to generate content- and genre-appropriate prose with grammatical sentences. However, these targets do not fully encapsulate human-like language use. For example, they set aside the fact that human language use involves sociolinguistic variation that is regularly constrained by internal and external factors. This article tests whether one widely used LLM application, ChatGPT, is capable of generating such variation. I construct an English corpus of “sociolinguistic interviews” using the application and analyze the generation of seven morphosyntactic features. I show that the application largely fails to generate any variation at all when one variant is prescriptively incorrect, but that it is able to generate variable deletion of the complementizer <em>that</em> that is internally constrained, with variants occurring at human-like rates. ChatGPT fails, however, to properly generate externally constrained complementizer <em>that </em>deletion. I argue that these outcomes reflect bias both in the training data and in Reinforcement Learning from Human Feedback.
I suggest that testing whether an LLM can properly generate sociolinguistic variation is a useful metric for evaluating whether it generates human-like language.</p> Daniel Duncan Copyright (c) 2024 Journal of Computer-Assisted Linguistic Research https://creativecommons.org/licenses/by-nc-nd/4.0 2024-11-15 2024-11-15 8 51 75 10.4995/jclr.2024.21958 Two linguistic levels of lexical ambiguity and a unified categorical representation https://polipapers.upv.es/index.php/jclr/article/view/22348 <p>Lexical disambiguation is one of the oldest problems in natural language processing. There are three main types of lexical ambiguity: part-of-speech ambiguity, homonymy, and polysemy, which in practice are typically handled as two separate tasks. While this division suffices for engineering purposes, it does not align well with human intuition. In this article, I use lexical ambiguity as a representative case to demonstrate how insights from theoretical linguistics can be helpful for developing more human-like meaning and knowledge representations in natural language understanding. I revisit the three types of lexical ambiguity and propose a structured reclassification of them into two levels using the theoretical linguistic tool of root syntax. Recognizing the uneven expressive power of root syntax across these levels, I further translate the theoretical linguistic insights into the language of category theory, mainly using the tool of topos. The resulting unified categorical representation of lexical ambiguity preserves root-syntactic insights, has strong expressive power at both linguistic levels, and can potentially serve as a bridge between theoretical linguistics and natural language understanding.</p> Chenchen Song Copyright (c) 2024 Journal of Computer-Assisted Linguistic Research https://creativecommons.org/licenses/by-nc-nd/4.0 2024-11-15 2024-11-15 8 77 107 10.4995/jclr.2024.22348