Journal of Computer-Assisted Linguistic Research <p style="text-align: justify; text-justify: inter-ideograph; margin: 0cm 0cm 6.0pt 0cm;"><strong>Journal of Computer-Assisted Linguistic Research (JCLR)</strong> is a double-blind peer-reviewed journal that publishes high-quality scientific articles on linguistic studies where computer tools or techniques play a major role. JCLR aims to promote the integration of computers into linguistic research. In particular, articles in JCLR make a clear contribution to research in which software plays a key role in representing and processing written or spoken data. Contributions submitted to JCLR must be in English or Spanish, but we welcome works about the study of any language. Topics of interest include computational linguistics, text mining, natural language processing, discourse analysis, and language-resource construction, among many others.</p> Universitat Politècnica de València en-US Journal of Computer-Assisted Linguistic Research 2530-9455 <p>This journal is licensed under a <a href="" rel="license">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a></p> Classifying the Evolving Mask Debate: A Transferable Machine Learning Framework <p>Anti-maskers are a community of people who oppose the use of face masks on the grounds that they infringe on personal freedoms. This community has thoroughly exploited the convenience and reach of online social media platforms such as Facebook and Twitter to spread discordant information about the ineffectiveness of masks and the harm they cause in order to persuade people to shun their use. Automatic detection and demotion of anti-mask tweets is thus necessary to limit their damage. 
This is challenging because the mask dialogue continuously evolves with creative arguments that embed emerging knowledge about the virus, the changing socio-political landscape, and the present policies of public health officers and organizations. Therefore, this paper builds a transferable machine learning framework that can distinguish between anti-mask and pro-mask tweets in longitudinal data collected at four epochs during the pandemic. The framework extracts content, emotional, and engagement features that faithfully capture the patterns relevant to anti-mask rhetoric, but ignores those related to contextual details. It trains two ensemble learners and two neural network architectures using these features. Ensemble classifiers can identify anti-mask tweets with approximately 80% accuracy and F1-score on both individual and combined data sets. The invariant linguistic features extracted by the framework can thus form the basis of automated classifiers that can efficiently separate other types of falsehoods and misinformation from huge volumes of social media data.</p> Julia Warnken Swapna S. Gokhale Copyright (c) 2022 Journal of Computer-Assisted Linguistic Research 2022-11-23 2022-11-23 6 1 18 10.4995/jclr.2022.17493 Character Extraction and Character Type Identification from Summarised Story Plots <p>Identifying the characters in free-form text and understanding the roles and relationships between them is an evolving area of research. These tasks have a wide range of applications, from summarising narrations to understanding the social network from social media tweets, which can help in automation and improve the experience of AI systems such as chatbots. The aim of this research is twofold. Firstly, we aim to develop an effective method of extracting characters from a story summary, to develop a set of relevant features, and then to identify the character types using supervised learning algorithms. 
Secondly, we aim to examine the efficacy of unsupervised learning algorithms in type identification, as it is challenging to find a dataset with the predetermined list of characters, roles, and relationships that supervised learning requires. To do so, we used summarised plots of fictional stories to experiment with and evaluate our approach. Our character extraction approach improved on the performance reported by existing work, with an average F1-score of 0.86. Supervised learning algorithms successfully identified the character types and achieved an overall average F1-score of 0.94. However, the clustering algorithms identified more than three clusters, indicating that more research is needed to improve their efficacy.</p> Vardhini Srinivasan Aurelia Power Copyright (c) 2022 Journal of Computer-Assisted Linguistic Research 2022-11-23 2022-11-23 6 19 41 10.4995/jclr.2022.17835 Extracting Features from Textual Data in Class Imbalance Problems <p>We address class imbalance problems. These are classification problems where the target variable is binary, and one class dominates the other. A central objective in these problems is to identify features that yield models with high precision/recall values, the standard yardsticks for assessing such models. Our features are extracted from the textual data inherent in such problems. We use n-gram frequencies as features and introduce a discrepancy score that measures the efficacy of an n-gram in highlighting the minority class. The frequency counts of n-grams with the highest discrepancy scores are used as features to construct models with the desired metrics. According to the best practices followed by the services industry, many customer support tickets are audited and tagged as “contract-compliant” whereas some are tagged as “over-delivered”. Based on in-field data, we use a random forest classifier and perform a randomized grid search over the model hyperparameters. 
The model scoring is performed using an scoring function. Our objective is to minimize the follow-up costs by optimizing the recall score while maintaining a base-level precision score. The final optimized model achieves an acceptable recall score while staying above the target precision. We validate our feature selection method by comparing our model with one constructed using frequency counts of n-grams chosen randomly. We propose extensions of our feature extraction method to general classification (binary and multi-class) and regression problems. The discrepancy score is one measure of dissimilarity of distributions, and other (more general) measures that we formulate could potentially yield more effective models.</p> Sarang Aravamuthan Prasad Jogalekar Jonghae Lee Copyright (c) 2022 Journal of Computer-Assisted Linguistic Research 2022-11-23 2022-11-23 6 42 58 10.4995/jclr.2022.18200 Sentiment Analysis and Stance Detection on German YouTube Comments on Gender Diversity <p>This paper explores different options for detecting the stance of German YouTube comments regarding the topic of gender diversity and compares the respective results with those of sentiment analysis, showing that these are two very different NLP tasks focusing on distinct characteristics of the discourse. While an existing model (BERT) was used to analyze the comments’ sentiment, the comments’ stance was first annotated and then used to train different models – SVM with TF-IDF, DistilBERT, LSTM and CNN – for predicting the stance of unseen comments. The best results were achieved by the CNN, reaching 78.3% accuracy (92% after dataset normalization) on the test set. Whereas the most common stance identified in the comments is a neutral one (neither completely in favor of nor completely against gender diversity), the overall sentiment of the discourse turns out to be negative. 
This shows that the discourse around gender diversity in YouTube comments is, on the one hand, filled with strong opinions but, on the other, opens up a space for anonymously inquiring and learning about the topic and its implications. Our research thereby (1) contributes to the understanding and application of different NLP tasks used to predict the sentiment and stance of unstructured textual data, and (2) provides relevant insights into society’s attitudes towards a changing system of values and beliefs.</p> Lidiia Melnyk Linda Feld Copyright (c) 2022 Journal of Computer-Assisted Linguistic Research 2022-11-23 2022-11-23 6 59 86 10.4995/jclr.2022.18224 A Knowledge-Based Model for Polarity Shifters <p>Polarity shifting can be considered one of the most challenging problems in the context of Sentiment Analysis. Polarity shifters, also known as <em>contextual valence shifters</em> (Polanyi and Zaenen 2004), are treated as linguistic contextual items that can increase, reduce or neutralise the prior polarity of a word in an opinion, called the <em>focus</em>. The automatic detection of such items enhances the performance and accuracy of computational systems for opinion mining, but this challenge remains open, particularly for languages other than English. Taking a symbolic approach, we aim to advance the automatic processing of the polarity shifters that affect the opinions expressed in tweets, both in English and Spanish. To this end, we describe a novel knowledge-based model to deal with three dimensions of contextual shifters: <em>negation</em>, <em>quantification</em>, and <em>modality</em> (or irrealis).</p> Yolanda Blázquez-López Copyright (c) 2022 Journal of Computer-Assisted Linguistic Research 2022-11-23 2022-11-23 6 87 107 10.4995/jclr.2022.18807