Exploiting content redundancy for web information extraction

Authors:
Pankaj Gulhane;Rajeev Rastogi;Srinivasan H. Sengamedu;Ashwin Tengli
Affiliations:
Yahoo! Labs, Bangalore;Yahoo! Labs, Bangalore;Yahoo! Labs, Bangalore;Microsoft IDC, Bangalore
Venue:
Proceedings of the VLDB Endowment
Year:
2010

Citing 22
Cited 7

Block edit models for approximate string matching

Theoretical Computer Science - Special issue: Latin American theoretical informatics
Integration of heterogeneous databases without common domains using queries based on textual similarity

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Snowball: extracting relations from large plain-text collections

DL '00 Proceedings of the fifth ACM conference on Digital libraries
Automatic segmentation of text into structured records

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Hierarchical Wrapper Induction for Semistructured Information Sources

Autonomous Agents and Multi-Agent Systems
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Extracting Patterns and Relations from the World Wide Web

WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Interactive deduplication using active learning

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Text joins in an RDBMS for web data integration

WWW '03 Proceedings of the 12th international conference on World Wide Web
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Mining reference tables for automatic text segmentation

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Web data extraction based on partial tree alignment

WWW '05 Proceedings of the 14th international conference on World Wide Web
The volume and evolution of web page templates

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Markov logic networks

Machine Learning
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Record linkage: similarity measures and algorithms

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Simultaneous record detection and attribute labeling in web data extraction

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Duplicate Record Detection: A Survey

IEEE Transactions on Knowledge and Data Engineering
Introduction to Information Retrieval

Introduction to Information Retrieval
Incorporating site-level knowledge to extract structured data from web forums

Proceedings of the 18th international conference on World wide web
Extracting data records from the web using tag path clustering

Proceedings of the 18th international conference on World wide web

Collective extraction from heterogeneous web lists

Proceedings of the fourth ACM international conference on Web search and data mining
Highly efficient algorithms for structural clustering of large websites

Proceedings of the 20th international conference on World wide web
Wrapper Generation for Overlapping Web Sources

WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Exploiting attribute redundancy for web entity data extraction

ICADL'11 Proceedings of the 13th international conference on Asia-pacific digital libraries: for cultural heritage, knowledge dissemination, and future creation
An analysis of structured data on the web

Proceedings of the VLDB Endowment
Collective information extraction with context-specific consistencies

ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
Extraction and integration of partially overlapping web sources

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

We propose a novel extraction approach that exploits content redundancy on the web to extract structured data from template-based web sites. We start by populating a seed database with records extracted from a few initial sites. We then identify values within the pages of each new site that match attribute values contained in the seed set of records. To match attribute values with diverse representations across sites, we define a new similarity metric that leverages the templatized structure of attribute content. Specifically, our metric discovers the matching pattern between attribute values from two sites, and uses this to ignore extraneous portions of attribute values when computing similarity scores. Further, to filter out noisy attribute value matches, we exploit the fact that attribute values occur at fixed positions within template-based sites. We develop an efficient Apriori-style algorithm to systematically enumerate attribute position configurations with sufficient matching values across pages. Finally, we conduct an extensive experimental study with real-life web data to demonstrate the effectiveness of our extraction approach.