The World Wide Web has reached maturity. It not only hosts and serves a large number of pages but also offers a large amount of information potentially useful to individuals and businesses. Modern decision support can no longer be effective without timely and accurate access to this unprecedented source of data. However, unlike in a database, the structure of data available on the Web is not known a priori, and understanding it seems to require human intervention. Yet in many cases the combination of layout rules and simple domain knowledge makes it possible to understand such seemingly unstructured data automatically. In such cases we say that the data is semi-structured. Wrapper generation for the automatic extraction of information from the Web has therefore been a crucial challenge in recent years. Various authors have suggested different approaches for extracting semi-structured data from the Web, ranging from analyzing the layout and syntax of Web documents to learning extraction rules from user-provided training examples. In this paper, we propose to exploit the HTML structure of Web documents that contain information in the form of multiple homogeneous records. We use a tree alignment algorithm with a novel combination of heuristics to detect repeated patterns and infer extraction rules. The performance study shows that our approach is practical, yielding accurate results with acceptable performance.
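To make the idea of exploiting HTML structure to find multiple homogeneous records concrete, here is a minimal sketch in Python. It is not the tree alignment algorithm or the heuristics of the paper. It assumes a much cruder criterion: sibling subtrees count as records when their tag-only shapes are identical. Tree alignment would also tolerate inserted or missing nodes. All names (`Node`, `TreeBuilder`, `signature`, `find_record_region`, `extract_records`) and the sample page are hypothetical and chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def signature(node):
    """Tag-only shape of a subtree, e.g. 'li(b,i)'."""
    if not node.children:
        return node.tag
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def find_record_region(root, min_repeats=2):
    """Return (parent, records) for the node with the most same-shaped children."""
    best = (None, [])
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        counts = Counter(signature(c) for c in node.children)
        if not counts:
            continue
        sig, n = counts.most_common(1)[0]
        if n >= min_repeats and n > len(best[1]):
            best = (node, [c for c in node.children if signature(c) == sig])
    return best


def leaf_texts(node):
    """Collect the text of leaf elements in document order."""
    if not node.children:
        return [node.text] if node.text else []
    out = []
    for child in node.children:
        out.extend(leaf_texts(child))
    return out


def extract_records(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    _, records = find_record_region(builder.root)
    return [leaf_texts(r) for r in records]


PAGE = """
<html><body>
<h1>Books</h1>
<ul>
  <li><b>Dune</b><i>9.99</i></li>
  <li><b>Emma</b><i>4.50</i></li>
  <li><b>Ulysses</b><i>12.00</i></li>
</ul>
<p>Footer</p>
</body></html>
"""

if __name__ == "__main__":
    for record in extract_records(PAGE):
        print(record)
```

Exact signature matching fails as soon as one record carries an optional field, such as an extra tag for a discount. That kind of variation is what an alignment-based comparison of subtrees is meant to absorb.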