Journal of Computer-Assisted Linguistic Research 2023-12-12T12:09:50+01:00 Carlos Periñán-Pascual Open Journal Systems <p style="text-align: justify; text-justify: inter-ideograph; margin: 0cm 0cm 6.0pt 0cm;"><strong>Journal of Computer-Assisted Linguistic Research (JCLR)</strong> is a double-blind peer-reviewed journal that publishes high-quality scientific articles on linguistic studies in which computer tools or techniques play a major role. JCLR aims to promote the integration of computers into linguistic research. In particular, articles in JCLR make a clear contribution to research in which software plays a key role in representing and processing written or spoken data. Contributions submitted to JCLR must be in English or Spanish, but we welcome work on the study of any language. Topics of interest include computational linguistics, text mining, natural language processing, discourse analysis, and language-resource construction, among many others.</p> Computer-based Reading Recall on Sociolinguistic Research 2023-07-04T19:24:04+02:00 Camila Franco Rodriguez <p>Global bilingual communities are a fascinating phenomenon that has received constant attention from different angles and disciplines. Sociolinguistic research has turned its attention to what motivates change in these globalized settings, while psycholinguistic research has focused on the cognitive aspects of L2 speakers. With the widespread use of computer-based methods, it seems natural to incorporate them into contemporary research as a way of understanding variation and change at a deeper level. Drawing on the data I have collected, I discuss in this article the importance of including computer-based tests as part of traditional variationist research. I argue that the traditional separation of methods and data collection has influenced the research process to the point where some new behaviors could be overlooked.
In this article I report on the relationship between cognitive adaptation and social experience in the Colombian bilingual community of Philadelphia, whose members become more proficient not only with age and length of L2 learning, but also depending on how welcoming their social circles are and how diverse their friendships and workplaces are.</p> 2023-12-12T00:00:00+01:00 Copyright (c) 2023 Journal of Computer-Assisted Linguistic Research On Methods of Data Standardization of German Social Media Comments 2023-12-11T12:32:35+01:00 Lidiia Melnyk Linda Feld <p>This article is part of a larger project that aims to identify discursive strategies in social media discourse on the topic of gender diversity, for which roughly 350,000 comments were scraped from the comment sections below YouTube videos relating to the topic in question. This article focuses on different methods of standardizing social media data in order to facilitate further processing. More specifically, the data are corrected in terms of casing, spelling, and punctuation. Different tools and models (LanguageTool, T5, seq2seq, GPT-2) were tested. The best outcome was achieved by the German GPT-2 model: it scored highest on all of the applied metrics (ROUGE, GLEU, BLEU), making it the best model for the task of Grammatical Error Correction in German social media data.</p> 2023-12-12T00:00:00+01:00 Copyright (c) 2023 Journal of Computer-Assisted Linguistic Research A Lightweight Statistical Method for Terminology Extraction 2023-12-11T12:36:57+01:00 Rogelio Nazar Nicolás Acosta <p>We propose a method for automatic terminology extraction in the context of a larger project devoted to automating part of the tasks involved in the production of terminological databases.
Terminology extraction is the key to drafting the macrostructure of a terminological resource (i.e., the list of entries), to which grammatical or semantic information can later be added at the microstructural level. To this end, we developed a statistical method that is conceptually simple compared to modern neural network approaches. It is lightweight because it is based on term dispersion and co-occurrence statistics that can be computed with basic hardware. For the evaluation, we experimented with corpora of lexicography and linguistics in English and Spanish of ca. 66 million tokens. Results improve on the baselines by almost 20%.</p> 2023-12-12T00:00:00+01:00 Copyright (c) 2023 Journal of Computer-Assisted Linguistic Research Self-supervision of Hallucinations in Large Language Models: LLteaM 2023-12-11T12:37:27+01:00 Sofía Correa Busquets Lucas Maccarini Llorens <p>Large language models like GPT and Claude have revolutionized the tech industry over the past year. However, as generative artificial intelligence, they are prone to hallucinations: a large language model hallucinates when it generates false or nonsensical text. As these models improve, their hallucinations become less obvious and more dangerous for users. This research explores the phenomenon in the context of automated email response for customer service. First, it proposes a taxonomy of hallucinations in large language models based on their linguistic nature, and second, a multi-agent system that allows for the self-supervision of such hallucinations. The system generates email responses but prevents their delivery if hallucinations are detected, thus reducing the risks of generative AI in production environments. Experiments with various state-of-the-art language models reveal that the operating costs of the only successful model currently exceed what is viable for operational deployment.
Moreover, a drastic performance drop after a recent update to GPT-3.5-turbo suggests likely shortcomings in industrial applications driven by retrieval-augmented generation. Overall, the research advocates for a Machine Linguistics to analyze the outputs of large language models, suggesting that such collaboration between linguistics and artificial intelligence could help mitigate the social risks of hallucination.</p> 2023-12-12T00:00:00+01:00 Copyright (c) 2023 Journal of Computer-Assisted Linguistic Research
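The data-standardization study above reports GLEU and BLEU scores for its Grammatical Error Correction models. As a rough illustration of what those metrics measure (not the authors' code), the core quantity behind both — clipped n-gram precision against a reference correction — can be sketched with the Python standard library alone; the German sentences below are invented examples:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(hypothesis, reference, max_n=2):
    """Clipped n-gram precision, the core quantity behind BLEU and GLEU.
    (The real metrics add geometric averaging, smoothing, and penalties.)"""
    matches, total = 0, 0
    for n in range(1, max_n + 1):
        hyp = Counter(ngrams(hypothesis, n))
        ref = Counter(ngrams(reference, n))
        # Clip each hypothesis n-gram count by its count in the reference.
        matches += sum(min(c, ref[g]) for g, c in hyp.items())
        total += sum(hyp.values())
    return matches / total if total else 0.0

# Invented example: a gold-standard correction and two model outputs.
gold = "Das ist ein gutes Video".split()
model_a = "Das ist ein gutes Video".split()   # perfect correction
model_b = "das is ein gutes video".split()    # casing/spelling left uncorrected

print(ngram_precision(model_a, gold))  # → 1.0
print(round(ngram_precision(model_b, gold), 3))  # → 0.333
```

A model that normalizes casing and spelling shares more n-grams with the reference, which is why these metrics reward the standardization task described in the abstract.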