An Approach of Web Scraping on News Website based on Regular Expression

Bibliographic Details
Published in: 2018 2nd East Indonesia Conference on Computer and Information Technology (EIConCIT), pp. 203-207
Main Authors: Maududie, Achmad; Retnani, Windi Eka Yulia; Rohim, Muhamat Abdul
Format: Conference Proceeding
Language: English
Published: IEEE 01-11-2018
Description
Summary: The rapid growth of news documents creates a new problem when a news website does not provide a download service. This paper describes an approach for obtaining the title, publication date, author, clean article text, and URL of news articles from the HTML pages of three news websites, i.e., Detik, Tribunnews, and Liputan6, without manual copying and pasting. The approach consists of three steps: analyzing the news website structure, constructing regular expression (Regex) patterns, and implementing the patterns as a set of rules in web scraping. In the experiment, each news website required its own patterns for the article link, article title, article author, and publication date. For extracting the clean text of a news article, two kinds of patterns were used: a content pattern (for extracting the original article text) and a filter pattern (for eliminating non-news elements). On these three news websites, the non-news elements consisted of text advertisements, video advertisements, links, images, and scripts, with different patterns for each website. After all necessary patterns were generated and implemented as a set of rules, the web scraping module produced very good extraction results on Detik and Tribunnews (recall = 1, precision = 1, and F-measure = 100%), while Liputan6 scored slightly lower (recall = 0.95, precision = 0.95, and F-measure = 95%). The approach is found to be a simple and straightforward way to extract news articles, each consisting of a title, publication date, author, article text, and URL.
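The content-pattern and filter-pattern idea described in the summary can be illustrated with a minimal Python sketch. Every pattern, class name, and the sample HTML below is a hypothetical assumption made for illustration; none of them are the actual rules the authors built for Detik, Tribunnews, or Liputan6, which this record does not list.

```python
import re
from html import unescape

FLAGS = re.S | re.I

# Hypothetical per-site field patterns for an invented page layout
# (assumption: these are NOT the paper's actual patterns).
TITLE_PATTERN = re.compile(r'<h1[^>]*class="title"[^>]*>(.*?)</h1>', FLAGS)
AUTHOR_PATTERN = re.compile(r'<span[^>]*class="author"[^>]*>(.*?)</span>', FLAGS)
DATE_PATTERN = re.compile(r'<span[^>]*class="date"[^>]*>(.*?)</span>', FLAGS)

# Content pattern: captures the raw article body, still mixed with non-news elements.
CONTENT_PATTERN = re.compile(
    r'<div[^>]*class="article-body"[^>]*>(.*?)</div>\s*<!--\s*end-body\s*-->', FLAGS
)

# Filter patterns: remove non-news elements (scripts, ads, images, related links).
FILTER_PATTERNS = [
    re.compile(r'<script\b.*?</script>', FLAGS),
    re.compile(r'<div[^>]*class="ad"[^>]*>.*?</div>', FLAGS),
    re.compile(r'<img\b[^>]*>', FLAGS),
    re.compile(r'<a\b[^>]*class="related"[^>]*>.*?</a>', FLAGS),
]

TAG_PATTERN = re.compile(r'<[^>]+>')


def clean(fragment):
    """Strip remaining tags, decode HTML entities, and collapse whitespace."""
    text = TAG_PATTERN.sub(' ', fragment)
    return re.sub(r'\s+', ' ', unescape(text)).strip()


def first_match(pattern, html):
    """Return the cleaned first capture group, or None if the pattern fails."""
    match = pattern.search(html)
    return clean(match.group(1)) if match else None


def scrape_article(html, url):
    """Apply the field patterns, then the content pattern followed by the filters."""
    body = CONTENT_PATTERN.search(html)
    content = body.group(1) if body else ''
    for pattern in FILTER_PATTERNS:
        content = pattern.sub('', content)
    return {
        'title': first_match(TITLE_PATTERN, html),
        'author': first_match(AUTHOR_PATTERN, html),
        'date': first_match(DATE_PATTERN, html),
        'text': clean(content),
        'url': url,
    }


# Invented sample page used only to exercise the sketch.
SAMPLE_HTML = """
<h1 class="title">Sample Headline</h1>
<span class="author">A. Writer</span>
<span class="date">01-11-2018</span>
<div class="article-body">
<p>First paragraph.</p>
<div class="ad">Buy now!</div>
<script>track();</script>
<img src="x.png">
<p>Second paragraph. <a class="related" href="/other">Read also</a></p>
</div><!-- end-body -->
"""

if __name__ == '__main__':
    print(scrape_article(SAMPLE_HTML, 'https://example.com/news/1'))
```

In a setup like this, supporting another website would mean supplying a different set of field, content, and filter patterns, which matches the summary's observation that each site needed its own patterns.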
DOI: 10.1109/EIConCIT.2018.8878550