Extracting Features from Textual Data in Class Imbalance Problems


  • Sarang Aravamuthan Ericsson
  • Prasad Jogalekar Ericsson
  • Jonghae Lee Ericsson




class imbalance, feature selection, n-gram frequency, NLP techniques, random forest classifier


We address class imbalance problems: classification problems in which the target variable is binary and one class dominates the other. A central objective in these problems is to identify features that yield models with high precision and recall, the standard yardsticks for assessing such models. Our features are extracted from the textual data inherent in such problems. We use n-gram frequencies as features and introduce a discrepancy score that measures how effectively an n-gram highlights the minority class. The frequency counts of the n-grams with the highest discrepancy scores are used as features to construct models with the desired metrics. Under the best practices followed by the services industry, many customer support tickets are audited and tagged as “contract-compliant”, whereas some are tagged as “over-delivered”. Using in-field data, we train a random forest classifier and perform a randomized grid search over the model hyperparameters. Model scoring is performed using a scoring function. Our objective is to minimize follow-up costs by optimizing the recall score while maintaining a base-level precision score. The final optimized model achieves an acceptable recall score while staying above the target precision. We validate our feature selection method by comparing our model with one constructed using frequency counts of randomly chosen n-grams. We propose extensions of our feature extraction method to general classification (binary and multi-class) and regression problems. The discrepancy score is one measure of the dissimilarity of distributions; other, more general measures that we formulate could potentially yield more effective models.
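The feature selection idea summarized above can be illustrated with a minimal sketch. The abstract does not define the discrepancy score, so the formula used here (the relative frequency of an n-gram in minority-class texts minus its relative frequency in majority-class texts) is only an assumed stand-in, not the authors' definition. The function names, the whitespace tokenizer, and the toy tickets are likewise hypothetical and chosen purely for illustration.

```python
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams (as tuples) of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def class_frequencies(texts, n):
    """Relative frequency of each n-gram over all texts of one class."""
    counts = Counter(g for t in texts for g in ngrams(t, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def discrepancy_scores(minority_texts, majority_texts, n):
    """ASSUMED score (not from the paper): minority minus majority relative frequency."""
    p_min = class_frequencies(minority_texts, n)
    p_maj = class_frequencies(majority_texts, n)
    return {g: p_min.get(g, 0.0) - p_maj.get(g, 0.0)
            for g in set(p_min) | set(p_maj)}


def top_ngrams(scores, k):
    """The k highest-scoring n-grams; ties are broken alphabetically."""
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


def featurize(text, selected, n):
    """Frequency counts of the selected n-grams in a single text."""
    counts = Counter(ngrams(text, n))
    return [counts[g] for g in selected]


# Toy tickets, invented for illustration only.
majority = [
    "replaced router per contract",
    "reset password per contract",
    "router reboot per contract",
]
minority = [
    "extra site visit free of charge",
    "free of charge upgrade provided",
]

scores = discrepancy_scores(minority, majority, n=2)
selected = top_ngrams(scores, k=2)
features = featurize("upgrade provided free of charge", selected, n=2)
```

The resulting count vectors could then be passed to any classifier, such as the random forest mentioned in the abstract. Swapping in a different dissimilarity measure only requires changing `discrepancy_scores`.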



