Tuning Language Processing Approaches for Pashto Texts Classification


Jawid Ahmad Baktash, Mursal Dawodi, University Avignon, France


Text classification has become a basic task in many applications, and much research has gone into automatic text classification for the majority of national and international languages. Local languages, however, still lack such systems. The main purpose of this study is to establish a novel automatic classification system for Pashto text. To this end, we collected Pashto documents and constructed a dataset. To discover the most effective approach, we compare several statistical and neural network models: DistilBERT-base-multilingual-cased, Multilayer Perceptron (MLP), Support Vector Machine, K Nearest Neighbor, decision tree, Gaussian naïve Bayes, multinomial naïve Bayes, random forest, and logistic regression. We also evaluate two feature extraction methods: bag of words and Term Frequency-Inverse Document Frequency (TFIDF). In single-label multi-class classification, the MLP classifier with TFIDF features obtained an average testing accuracy of 94%. MLP with TFIDF also gave the best F1-measure, 0.81. Experiments with pre-trained language representation models (such as DistilBERT) for classifying Pashto texts show that a language-specific tokenizer is needed to obtain reasonable results.


Pashto, DistilBERT, BERT, Multi-lingual BERT, Multi-layer Perceptron, Support Vector Machine, K Nearest Neighbor, Decision Tree, Random Forest, Logistic Regression, Gaussian Naïve Bayes, Multinomial Naïve Bayes, TFIDF, Unigram, Deep Neural Network, Classification
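
As a minimal illustration of the two feature extraction methods compared in this study, the following Python sketch computes bag-of-words counts and TFIDF weights for a toy corpus. It is not the implementation used in the experiments. The whitespace tokenizer, the placeholder English tokens, the function names, and the smoothed IDF variant log((1 + N) / (1 + df)) + 1 are all assumptions made for the example; real Pashto text would need its own tokenization and preprocessing.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents.

    Whitespace splitting is an assumption for this sketch only.
    """
    tokenized = [doc.split() for doc in docs]
    # Sorted vocabulary gives every term a fixed column index.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf(docs):
    """Return (vocabulary, TFIDF vectors) using raw counts as term frequency.

    The IDF variant here (smoothed, plus one) is one common choice and an
    assumption; the source does not state which variant was used.
    """
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    weighted = [[c * w for c, w in zip(vec, idf)] for vec in counts]
    return vocab, weighted


# Hypothetical toy corpus with placeholder tokens (not from the dataset).
docs = [
    "sport match team",
    "market price team",
    "sport sport goal",
]

vocab, bow_vectors = bag_of_words(docs)
_, tfidf_vectors = tfidf(docs)

for doc, bow, weights in zip(docs, bow_vectors, tfidf_vectors):
    print(doc)
    print("  bag of words:", dict(zip(vocab, bow)))
    print("  tfidf:       ", {t: round(w, 3) for t, w in zip(vocab, weights)})
```

In the bag-of-words vectors every occurrence of a term counts equally, whereas TFIDF gives a lower per-occurrence weight to terms that appear in many documents. Either kind of vector can then be passed to a classifier such as an MLP.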