TY  - JOUR
T1  - Machine Learning-Based Topical Web Crawler: An Ensemble Approach Incorporating Meta-Features
AU  - Kim, Tae Jun
AU  - Kim, Han-Joon
JO  - Journal of Engineering and Applied Sciences
VL  - 12
IS  - 18
SP  - 4651
EP  - 4656
PY  - 2017
DA  - 2001/08/19
SN  - 1816-949x
DO  - jeasci.2017.4651.4656
UR  - https://makhillpublications.co/view-article.php?doi=jeasci.2017.4651.4656
KW  - meta-features
KW  - web crawler
KW  - ensemble
KW  - machine learning
KW  - filtering
KW  - extensive
AB  - A topical web crawler collects web pages that describe pre-specified topics. The pages collected by a topical crawler share the same or similar words, yet many of them can be irrelevant to the given topics. In particular, the performance of a topical crawler degrades as the topic becomes more specific. Successful topical crawling therefore requires an additional step that actively filters out pages irrelevant to the given topics. To this end, we propose an ensemble-style machine learning architecture that handles not only literal term features but also numeric meta-features to improve topical web crawling; in this work, we aim to crawl web pages about 'fire accidents', as a specific topic, more precisely. For the fire accident topic, we found that significant meta-features for topical crawling include tag information, the number of words in the title, the number of person names and the number of location names in a web page, among others. For the numeric meta-features, we use logistic regression and random forest learning algorithms; for the literal word features, we use Naive Bayes and support vector machine learning algorithms. Through extensive experiments on fire accident-related news articles, we show that the proposed method outperforms conventional ones.
ER  -