Thresher: automating the unwrapping of semantic content from the World Wide Web

Authors:
Andrew Hogue; David Karger
Affiliations:
Google Inc., New York, NY and MIT CSAIL, Cambridge, MA; MIT CSAIL, Cambridge, MA
Venue:
WWW '05 Proceedings of the 14th international conference on World Wide Web
Year:
2005

References: 10
Cited by: 36

A hierarchical approach to wrapper induction

Proceedings of the third annual conference on Autonomous Agents
The Tree-to-Tree Correction Problem

Journal of the ACM (JACM)
Annotea: an open RDF infrastructure for shared Web annotations

Proceedings of the 10th international conference on World Wide Web
Wrapper verification

World Wide Web
Sticky notes for the semantic web

Proceedings of the 8th international conference on Intelligent user interfaces
New Tools for the Semantic Web

EKAW '02 Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web
Information Extraction with HMM Structures Learned by Stochastic Optimization

Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
Using urls and table layout for web classification tasks

Proceedings of the 13th international conference on World Wide Web
Automatic web news extraction using tree edit distance

Proceedings of the 13th international conference on World Wide Web
Lightweight structured text processing

ATEC '99 Proceedings of the annual conference on USENIX Annual Technical Conference

Documentum ECI self-repairing wrappers: performance analysis

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
A Survey of Web Information Extraction Systems

IEEE Transactions on Knowledge and Data Engineering
Summarizing personal web browsing sessions

UIST '06 Proceedings of the 19th annual ACM symposium on User interface software and technology
Enabling web browsers to augment web sites' filtering and sorting functionalities

UIST '06 Proceedings of the 19th annual ACM symposium on User interface software and technology
Structured Data Extraction from the Web Based on Partial Tree Alignment

IEEE Transactions on Knowledge and Data Engineering
Adapting Web information extraction knowledge via mining site-invariant and site-dependent features

ACM Transactions on Internet Technology (TOIT)
Piggy Bank: Experience the Semantic Web inside your web browser

Web Semantics: Science, Services and Agents on the World Wide Web
U-REST: an unsupervised record extraction system

Proceedings of the 16th international conference on World Wide Web
Mining templates from search result records of search engines

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Joint optimization of wrapper generation and template detection

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Magpie: Experiences in supporting Semantic Web browsing

Web Semantics: Science, Services and Agents on the World Wide Web
Relations, cards, and search templates: user-guided web data integration and layout

Proceedings of the 20th annual ACM symposium on User interface software and technology
Extracting lists of data records from semi-structured web pages

Data & Knowledge Engineering
Grubber: Allowing End-Users to Develop XML-Based Wrappers for Web Data Sources

APWeb/WAIM '09 Proceedings of the Joint International Conferences on Advances in Data and Web Management
Pattern-Based Annotation of HTML-Streams

ESWC 2009 Heraklion Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications
Automatic wrapper generation using tree matching and partial tree alignment

AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
SemNews: a semantic news framework

AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
Efficient record-level wrapper induction

Proceedings of the 18th ACM conference on Information and knowledge management
Information extraction for search engines using fast heuristic techniques

Data & Knowledge Engineering
Finding and Extracting Data Records from Web Pages

Journal of Signal Processing Systems
Semantic annotation for knowledge management: Requirements and a survey of the state of the art

Web Semantics: Science, Services and Agents on the World Wide Web
The paths more taken: matching DOM trees to search logs for accurate webpage clustering

Proceedings of the 19th international conference on World wide web
WMS-extracting multiple sections data records from search engine results pages

Proceedings of the 2010 ACM Symposium on Applied Computing
Digging the wild web: an interactive tool for web data consolidation

WISE'07 Proceedings of the 8th international conference on Web information systems engineering
Best of both: using semantic web technologies to enrich user interaction with the web and vice versa

SOFSEM'08 Proceedings of the 34th conference on Current trends in theory and practice of computer science
Web news extraction based on path pattern mining

FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 7
ObjectRunner: lightweight, targeted extraction and querying of structured web data

Proceedings of the VLDB Endowment
Integrating keywords and semantics on document annotation and search

OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems: Part II
Towards a unified solution: data record region detection and segmentation

Proceedings of the 20th ACM international conference on Information and knowledge management
Piggy bank: experience the semantic web inside your web browser

ISWC'05 Proceedings of the 4th international conference on The Semantic Web
A theoretical analysis of alignment and edit problems for trees

ICTCS'05 Proceedings of the 9th Italian conference on Theoretical Computer Science
Sift: an end-user tool for gathering web content on the go

Proceedings of the 2012 ACM symposium on Document engineering
Mix-n-Match: building personal libraries from web content

TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries
Web news extraction via path ratios

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model

ACM Transactions on the Web (TWEB)

Abstract

We describe Thresher, a system that lets non-technical users teach their browsers how to extract semantic web content from HTML documents on the World Wide Web. Users specify examples of semantic content by highlighting them in a web browser and describing their meaning. We then use the tree edit distance between the DOM subtrees of these examples to create a general pattern, or wrapper, for the content, and allow the user to bind RDF classes and predicates to the nodes of these wrappers. By overlaying matches to these patterns on standard documents inside the Haystack semantic web browser, we enable a rich semantic interaction with existing web pages, "unwrapping" semantic data buried in the pages' HTML. By allowing end-users to create, modify, and utilize their own patterns, we hope to speed adoption and use of the Semantic Web and its applications.
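The abstract's core idea, comparing the DOM subtrees of user-highlighted examples by tree edit distance, can be illustrated with a minimal sketch. This is not Thresher's implementation: it uses a simplified top-down variant in which a node may only be matched to a node at the same depth under matched parents, so it can overestimate the general tree edit distance. The `Node` class, the unit costs, and the sample rows are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(n: Node) -> int:
    """Number of nodes in the subtree rooted at n (cost of inserting or deleting it)."""
    return 1 + sum(size(c) for c in n.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Restricted (top-down) edit distance between two ordered trees.

    Assumed unit costs: relabeling a node costs 1; inserting or deleting
    a whole subtree costs its node count.
    """
    relabel = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    # dp[i][j]: cheapest way to turn the first i children of a
    # into the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),   # delete a's child subtree
                dp[i][j - 1] + size(b.children[j - 1]),   # insert b's child subtree
                dp[i - 1][j - 1]                           # match the two children
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + dp[m][n]


# Two hypothetical table rows that differ only in the markup of the second cell.
row1 = Node("tr", [
    Node("td", [Node("a", [Node("#text")])]),
    Node("td", [Node("#text")]),
])
row2 = Node("tr", [
    Node("td", [Node("a", [Node("#text")])]),
    Node("td", [Node("b", [Node("#text")])]),
])

print(top_down_distance(row1, row2))
```

A small distance between two highlighted examples suggests they share a template. The matched nodes in the cheapest alignment are natural candidates for the fixed part of a wrapper, while the nodes that had to be relabeled, inserted, or deleted mark the positions that vary from record to record.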