TY  - JOUR
T1  - Feature Engineering for Arabic Text Classification
AU  - Khazal, Ghassan
AU  - Zamyatin, Alexander
JO  - Journal of Engineering and Applied Sciences
VL  - 14
IS  - 7
SP  - 2292
EP  - 2301
PY  - 2019
DA  - 2001/08/19
SN  - 1816-949x
DO  - jeasci.2019.2292.2301
UR  - https://makhillpublications.co/view-article.php?doi=jeasci.2019.2292.2301
KW  - text preprocessing
KW  - Arabic text classification
KW  - Feature engineering
KW  - feature selection
KW  - stemming techniques
KW  - classification techniques
AB  - Arabic is one of the most complex languages: it has a rich vocabulary and a structure that is difficult and different compared with other languages. The Arabic language poses many challenges in text mining; one of these challenges is how to achieve the highest classification accuracy. In this research we propose a feature engineering approach that finds the best combination of preprocessing procedures with an appropriate feature representation, which directly affects the classification accuracy of Arabic text. Preprocessing and feature representation are the main steps in any text classification framework. This phase is very important when designing any text classifier that deals with this sophisticated language. In this study, we used four classifiers: Support Vector Machine (SVM), Decision Tree (DT), Naive Bayes (NB) and K-Nearest Neighbor (KNN). Analysis and experimental results on Arabic text data reveal that preprocessing techniques, feature representation and weighting have an important influence on classification accuracy. Accuracy also depends on the combination chosen: a suitable combination of preprocessing tasks with an appropriate feature representation and classification technique provides a good improvement in classification accuracy. This study shows that SVM (82.6%) and KNN (78.33%) perform better on average than DT (57.49%) and NB (76.21%). SVM achieved an accuracy of 88.67% with the combination of tokenization, filtering, normalization and light stemming with TFIDF as the feature representation, and the KNN classifier achieved 88.00% using tokenization and filtering as preprocessing and TFIDF as the feature representation with information gain as feature selection.
ER  - 