On Methods of Data Standardization of German Social Media Comments





Keywords: Grammatical Error Correction, LanguageTool, data augmentation, seq2seq, T5, GPT-2


This article is part of a larger project that aims to identify discursive strategies in social media discourse on gender diversity. For that project, roughly 350,000 comments were scraped from the comment sections of YouTube videos on the topic. The article compares methods of standardizing such social media data to facilitate downstream processing. Specifically, the data are corrected for casing, spelling, and punctuation. Several tools and models were tested: LanguageTool, T5, a seq2seq model, and GPT-2. The German GPT-2 model achieved the best results: it scored highest on all applied metrics (ROUGE, GLEU, BLEU), making it the best of the tested approaches for Grammatical Error Correction in German social media data.
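To illustrate how overlap-based metrics of this kind reward corrections of casing and punctuation, the following minimal sketch computes clipped n-gram precision (the core quantity in BLEU) and n-gram recall (the core quantity in ROUGE-N) for a single comment. It is not the evaluation code used in the article: the function names, the whitespace tokenization, and the example sentences are assumptions made for illustration. Full BLEU additionally combines several n-gram orders and applies a brevity penalty, and GLEU also takes the uncorrected source into account.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(hypothesis, reference, n=1):
    """Clipped n-gram precision (BLEU-style) and recall (ROUGE-N-style).

    Tokenization is plain whitespace splitting and matching is
    case-sensitive, so casing and punctuation errors count as mismatches.
    """
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count per n-gram ("clipping").
    matches = sum((hyp & ref).values())
    precision = matches / max(sum(hyp.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    return precision, recall


# Hypothetical example: a gold-standard sentence, a raw comment,
# and a model output that restores casing and punctuation.
reference = "Ich finde das Video sehr gut."
raw = "ich finde das video sehr gut"
corrected = "Ich finde das Video sehr gut."

raw_scores = overlap_scores(raw, reference)
corrected_scores = overlap_scores(corrected, reference)
print("raw:", raw_scores)
print("corrected:", corrected_scores)
```

In this toy case the raw comment shares only three of six unigrams with the gold standard, because "ich", "video", and "gut" differ from it in casing or punctuation, while the corrected version matches it completely.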



