TY  - JOUR
T1  - Machine Learning-Based Topical Web Crawler: An Ensemble Approach Incorporating Meta-Features
AU  - Kim, Tae Jun
AU  - Kim, Han-Joon
JO  - Journal of Engineering and Applied Sciences
VL  - 12
IS  - 18
SP  - 4651
EP  - 4656
PY  - 2017
DA  - 2001/08/19
SN  - 1816-949x
DO  - jeasci.2017.4651.4656
UR  - https://makhillpublications.co/view-article.php?doi=jeasci.2017.4651.4656
KW  - meta-features
KW  - web crawler
KW  - ensemble
KW  - Machine learning
KW  - filtering
KW  - extensive
AB  - A topical web crawler collects web pages that describe pre-specified topics. The pages collected by a topical crawler share the same or similar words, yet many of them can be irrelevant to the given topics. In particular, the performance of a topical crawler degrades for more specific topics. Successful topical crawling therefore requires an additional step that actively filters out pages irrelevant to the given topics. For this purpose, we propose an ensemble-style machine learning architecture that handles not only literal term features but also numeric meta-features to improve topical web crawling; in this work we aim to crawl web pages about ‘fire accidents’, as a specific topic, more precisely. For the fire accident topic, we found that significant meta-features for topical crawling include tag information, the number of words in the title, the number of person names, and the number of location names in a web page, among others. For the numeric meta-features we use logistic regression and random forest learning algorithms, and for the literal word features, Naive Bayes and support vector machine learning algorithms. Through extensive experiments on fire accident-related news articles, we show that the proposed method outperforms conventional ones.
ER  -