TY  - JOUR
T1  - Feature Engineering for Arabic Text Classification
AU  - Khazal, Ghassan
AU  - Zamyatin, Alexander
JO  - Journal of Engineering and Applied Sciences
VL  - 14
IS  - 7
SP  - 2292
EP  - 2301
PY  - 2019
DA  - 2001/08/19
SN  - 1816-949x
DO  - jeasci.2019.2292.2301
UR  - https://makhillpublications.co/view-article.php?doi=jeasci.2019.2292.2301
KW  - text preprocessing
KW  - Arabic text classification
KW  - feature engineering
KW  - feature selection
KW  - stemming techniques
KW  - classification techniques
AB  - Arabic is one of the most complex languages: it has a rich vocabulary and a difficult structure that
differs from that of other languages. Arabic poses many challenges for text mining, one of which is how to
achieve the highest classification accuracy. In this research we propose a feature engineering approach that
finds the best combination of preprocessing procedures and an appropriate feature representation, which
directly affects the classification accuracy of Arabic text. Preprocessing and feature representation are the
main steps in any text classification framework, and this phase is very important when designing a text
classifier for such a sophisticated language. In this study, we used four classifiers:
Support Vector Machine (SVM), Decision Tree (DT), Naive Bayes (NB) and K-Nearest Neighbor
(KNN). Analysis and experimental results on Arabic text data reveal that preprocessing techniques, feature
representation and feature weighting have an important influence on classification accuracy. Choosing a
suitable combination of preprocessing tasks with an appropriate feature representation and classification
technique yields a good improvement in classification accuracy. The study shows that SVM (82.6%) and
KNN (78.33%) perform better on average than DT (57.49%) and NB (76.21%). SVM achieved an accuracy of 88.67%
with the combination of tokenization, filtering, normalization and light stemming, using TFIDF as the feature
representation. The KNN classifier achieved 88.00% with tokenization and filtering as preprocessing, TFIDF as
the feature representation and information gain for feature selection.
ER  -