The World Wide Web has now entered its mature age. It not only hosts and serves a large number of pages but also offers a large amount of information potentially useful to individuals and businesses. Modern decision support can no longer be effective without timely and accurate access to this unprecedented source of data. However, unlike in a database, the structure of data available on the Web is not known a priori, and understanding it seems to require human intervention. Yet in many cases the combination of rules for interpreting layout with simple domain knowledge enables the automatic extraction of such data. In such cases we say that the data is semi-structured. In this paper, we present a framework that addresses the problem of extracting semi-structured data. The framework combines a syntactical extraction strategy with a set of mapping rules, heuristics and simple domain knowledge that maps a syntactical structure identified in Web documents to a conceptual/semantic structure. We present and analyse one instance of this framework, in which the syntactical extraction strategy exploits the HTML structure of Web documents using a Tree Alignment algorithm with a novel combination of heuristics to detect repeated patterns and infer rules for extracting relevant records. Then, using domain knowledge, we refine the extraction rules so that they not only extract data but also assign meaning to the extracted results.
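To make the idea of detecting repeated patterns in the HTML structure more concrete, the following is a minimal sketch and not the algorithm of the paper. It swaps in Simple Tree Matching, a more restricted top-down comparison, as a stand-in for Tree Alignment, and it ignores the heuristics and domain knowledge mentioned above. All names (`Node`, `TreeBuilder`, `find_records`) and the 0.7 similarity threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (illustrative subset, an assumption).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag-only tree; text content and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def size(node):
    return 1 + sum(size(c) for c in node.children)


def match(a, b):
    """Simple Tree Matching: size of the largest top-down, ordered mapping."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1] + match(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1


def similarity(a, b):
    """Normalised match score in [0, 1]; 1 means identical tag structure."""
    return 2 * match(a, b) / (size(a) + size(b))


def repeated_runs(node, threshold=0.7, found=None):
    """Collect runs of adjacent, structurally similar siblings."""
    if found is None:
        found = []
    run = []
    for child in node.children:
        if run and similarity(run[-1], child) >= threshold:
            run.append(child)
        else:
            if len(run) > 1:
                found.append((node.tag, run[0].tag, len(run)))
            run = [child]
    if len(run) > 1:
        found.append((node.tag, run[0].tag, len(run)))
    for child in node.children:
        repeated_runs(child, threshold, found)
    return found


def find_records(html):
    """Return (parent tag, record tag, count) for each repeated run found."""
    builder = TreeBuilder()
    builder.feed(html)
    return repeated_runs(builder.root)
```

Calling `find_records` on a page returns one tuple per run of similar siblings, which can serve as candidate record boundaries. Mapping the fields inside each record to a conceptual structure would still need the rules and domain knowledge described in the abstract.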