Thematic vocabulary selection for didactic purposes: evaluation of a quantitative approach


  • Jasper Degraeuwe Universiteit Gent
  • Patrick Goethals Universiteit Gent



corpus linguistics, vocabulary learning, automatic vocabulary selection, thematic vocabulary selection, absolute frequency, keyness, dispersion, Spanish as a foreign language (SFL)


The aim of this study is to evaluate the results of a quantitative approach to the thematic selection of vocabulary for didactic purposes. We describe in detail how three quantitative measures (absolute frequency, keyness and dispersion) are configured and combined to automate the selection of specific vocabulary from a specialized corpus. We then evaluate whether the automatic selection is confirmed by the judgements of teachers of Spanish as a foreign language (SFL). The results of this evaluation experiment show that in more than 85% of cases the output of the quantitative selection method is accepted by at least half of the teachers. An interrater reliability test supports this observation statistically, indicating substantial agreement (Cohen’s kappa = 0.61) between the judgements of the teachers and the automatic selection.
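As an illustration of the agreement statistic reported above, the following is a minimal sketch of how Cohen's kappa can be computed for two sets of binary judgements. It is not the authors' evaluation script: the function name `cohen_kappa`, the toy labels, and the choice to treat the teachers' judgements as a single rater are all assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal proportions per label.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Invented toy labels (1 = include the word, 0 = exclude it); not data from the study.
automatic = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
teachers = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]  # hypothetical aggregated teacher judgement
kappa = cohen_kappa(automatic, teachers)
print(f"kappa = {kappa:.2f}")
```

In this toy case the two raters agree on 8 of 10 items while chance agreement is 0.5, so kappa is about 0.6. Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the simple proportion of matching judgements.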



