Automatic Extraction of Structured Web Data with Domain Knowledge

Bibliographic Details
Published in:	2012 IEEE 28th International Conference on Data Engineering pp. 726 - 737
Main Authors:	Derouiche, N., Cautis, B., Abdessalem, T.
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01-04-2012
Subjects:	Data mining; Feature extraction; HTML; Semantics; Silicon; Web pages; Wrapping

Description
Summary:	We present in this paper a novel approach for extracting structured data from the Web, whose goal is to harvest real-world items from template-based HTML pages (the structured Web). It illustrates a two-phase querying of the Web, in which an intentional description of the targeted data is first provided, in a flexible and widely applicable manner. The extraction process then leverages both the input description and the source structure. Our approach is domain-independent, in the sense that it applies to any relation, either flat or nested, describing real-world items. Extensive experiments on five different domains, and a comparison with the main state-of-the-art extraction systems from the literature, illustrate its flexibility and precision. We advocate via our technique that automatic extraction and integration of complex structured data can be done quickly and effectively when the redundancy of the Web meets knowledge about the data to be extracted.
ISBN:	9781467300421 146730042X
ISSN:	1063-6382 2375-026X
DOI:	10.1109/ICDE.2012.90