We present general-purpose methods for recognizing certain types of structure in HTML documents. The methods are implemented using WHIRL, a "soft" logic that incorporates a notion of textual similarity developed in the information retrieval community. In an experimental evaluation on 82 Web pages, the structure ranked first by our method is "meaningful" (i.e., a structure that was used in a hand-coded "wrapper", or extraction program, for the page) nearly 70% of the time. This improves on a value of 50% obtained by an earlier method. With appropriate background information, the structure-recognition methods we describe can also be used to learn a wrapper from examples, or to maintain a wrapper as a Web page changes format. In these settings, the top-ranked structure is meaningful nearly 85% of the time.
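
As a rough illustration of the kind of textual similarity the information retrieval community uses, the sketch below scores short text fragments with TF-IDF weighted cosine similarity. This is an assumption made for illustration: the abstract does not say which weighting scheme WHIRL uses, and the function names `tfidf_vectors` and `cosine`, the whitespace tokenizer, and the smoothed IDF formula are all hypothetical choices, not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (a dict term -> weight) per document.

    Assumed weighting: (1 + log tf) * log((N + 1) / df). Tokenization is a
    plain lowercase whitespace split, which is a simplification.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {
            term: (1.0 + math.log(count)) * math.log((n_docs + 1) / df[term])
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Normalize so that cosine similarity reduces to a dot product.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(term, 0.0) for term, w in u.items())


if __name__ == "__main__":
    # Hypothetical text fragments, e.g. the contents of cells taken from a page.
    fragments = ["Acme Widget Co", "acme widget company", "Zeta Holdings"]
    vecs = tfidf_vectors(fragments)
    print(cosine(vecs[0], vecs[1]))  # shares two terms, so the score is positive
    print(cosine(vecs[0], vecs[2]))  # shares no terms
```

A "soft" match of this kind lets two fragments count as similar when they share rare terms, rather than requiring exact string equality.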