Recognizing structure in Web pages using similarity queries

Authors:
William W. Cohen
Venue:
AAAI '99/IAAI '99 Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference
Year:
1999

Abstract

We present general-purpose methods for recognizing certain types of structure in HTML documents. The methods are implemented using WHIRL, a "soft" logic that incorporates a notion of textual similarity developed in the information retrieval community. In an experimental evaluation on 82 Web pages, the structure ranked first by our method is "meaningful" (i.e., a structure that was used in a hand-coded "wrapper", or extraction program, for the page) nearly 70% of the time. This improves on the 50% obtained by an earlier method. With appropriate background information, the structure-recognition methods we describe can also be used to learn a wrapper from examples, or to maintain a wrapper as a Web page changes format. In these settings, the top-ranked structure is meaningful nearly 85% of the time.
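
The abstract does not say which similarity measure is used. As a rough illustration of the kind of textual similarity the information retrieval community developed, the sketch below computes cosine similarity over TF-IDF weighted term vectors. The weighting formula, the tokenizer, and all function names (`tokenize`, `tfidf_vectors`, `cosine`) are assumptions chosen for illustration; this is not WHIRL's implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (a dict) per document.

    Assumed weighting: log(1 + tf) * log(N / df). Other variants exist.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vec = {
            term: math.log(1 + count) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Normalize so that the dot product below equals the cosine.
        vectors.append({t: w / norm for t, w in vec.items()} if norm > 0 else {})
    return vectors


def cosine(u, v):
    # Dot product of two unit-length sparse vectors.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())
```

With such a measure, two strings that share rare terms (for instance, two differently formatted versions of the same name) score higher than strings with no terms in common, which score zero. That makes it possible to compare text fragments without requiring exact matches.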