Efficient record-level wrapper induction

Authors:
Shuyi Zheng;Ruihua Song;Ji-Rong Wen;C. Lee Giles
Affiliations:
Pennsylvania State University, University Park, PA, USA;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Pennsylvania State University, University Park, PA, USA
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Citing 18
Cited 9

Recent trends in hierarchic document clustering: a critical review

Information Processing and Management: an International Journal
A hierarchical approach to wrapper induction

Proceedings of the third annual conference on Autonomous Agents
Generating finite-state transducers for semi-structured data extraction from the Web

Information Systems - Special issue on semistructured data
IEPAD: information extraction based on pattern discovery

Proceedings of the 10th international conference on World Wide Web
A flexible learning system for wrapping tables and lists in HTML documents

Proceedings of the 11th international conference on World Wide Web
A brief survey of web data extraction tools

ACM SIGMOD Record
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Data extraction and label assignment for web databases

WWW '03 Proceedings of the 12th international conference on World Wide Web
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources

ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Mining data records in Web pages

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic web news extraction using tree edit distance

Proceedings of the 13th international conference on World Wide Web
Tree-Structured Template Generation for Web Pages

WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence
Fully automatic wrapper generation for search engines

WWW '05 Proceedings of the 14th international conference on World Wide Web
Thresher: automating the unwrapping of semantic content from the World Wide Web

WWW '05 Proceedings of the 14th international conference on World Wide Web
Web wrapper induction: a brief survey

AI Communications
Joint optimization of wrapper generation and template detection

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Pictor: an interactive system for importing data from a website

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Automatic extraction of web data records containing user-generated content

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Unsupervised user-generated content extraction by dependency relationships

WISE'11 Proceedings of the 12th international conference on Web information system engineering
Towards a unified solution: data record region detection and segmentation

Proceedings of the 20th ACM international conference on Information and knowledge management
Extracting data records from web using suffix tree

Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics
Turn the page: automated traversal of paginated websites

ICWE'12 Proceedings of the 12th international conference on Web Engineering
Towards web-scale structured web data extraction

Proceedings of the sixth ACM international conference on Web search and data mining
SearchResultFinder: federated search made easy

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model

ACM Transactions on the Web (TWEB)
Accelerating Structured Web Crawling without Losing Data

Proceedings of International Conference on Information Integration and Web-based Applications & Services

Quantified Score

Hi-index	0.00

Visualization

Abstract

Web information is often presented in the form of record, e.g., a product record on a shopping website or a personal profile on a social utility website. Given a host webpage and related information needs, how to identify relevant records as well as their internal semantic structures is critical to many online information systems. Wrapper induction is one of the most effective methods for such tasks. However, most traditional wrapper techniques have issues dealing with web records since they are designed to extract information from a page, not a record. We propose a record-level wrapper system. In our system, we use a novel ``broom'' structure to represent both records and generated wrappers. With such representation, our system is able to effectively extract records and identify their internal semantics at the same time. We test our system on 16 real-life websites from four different domains. Experimental results demonstrate 99\% extraction accuracy in terms of F1-Value.