We propose a novel extraction approach that exploits content redundancy on the web to extract structured data from template-based web sites. We start by populating a seed database with records extracted from a few initial sites. We then identify values within the pages of each new site that match attribute values contained in the seed set of records. To match attribute values with diverse representations across sites, we define a new similarity metric that leverages the templatized structure of attribute content. Specifically, our metric discovers the matching pattern between attribute values from two sites, and uses this to ignore extraneous portions of attribute values when computing similarity scores. Further, to filter out noisy attribute value matches, we exploit the fact that attribute values occur at fixed positions within template-based sites. We develop an efficient Apriori-style algorithm to systematically enumerate attribute position configurations with sufficient matching values across pages. Finally, we conduct an extensive experimental study with real-life web data to demonstrate the effectiveness of our extraction approach.
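The position-based filtering step can be illustrated with a small level-wise (Apriori-style) enumeration. This is a minimal sketch under stated assumptions, not the algorithm as the paper specifies it: pages are modeled as dictionaries from a position string (for example a tag path) to its text, a configuration is a set of (attribute, position) pairs, and the hypothetical `matches` function uses plain substring containment as a stand-in for the similarity metric described above. All function names and the sample data are illustrative.

```python
from itertools import combinations


def matches(page_value, seed_value):
    # Assumption: substring containment stands in for the paper's similarity metric.
    return seed_value.lower() in page_value.lower()


def support(config, pages, seed_records):
    """Count pages on which a single seed record matches every (attribute, position) pair."""
    count = 0
    for page in pages:
        for record in seed_records:
            if all(pos in page and attr in record and matches(page[pos], record[attr])
                   for attr, pos in config):
                count += 1
                break  # one matching record per page is enough
    return count


def frequent_configs(pages, seed_records, min_support):
    """Level-wise enumeration: grow configurations one pair at a time,
    keeping only those supported by at least min_support pages."""
    attrs = sorted({a for r in seed_records for a in r})
    positions = sorted({p for page in pages for p in page})

    # Level 1: single (attribute, position) pairs.
    level = {}
    for a in attrs:
        for p in positions:
            c = frozenset([(a, p)])
            s = support(c, pages, seed_records)
            if s >= min_support:
                level[c] = s

    result = {}
    while level:
        result.update(level)
        size = len(next(iter(level)))
        next_level = {}
        for c1, c2 in combinations(level, 2):
            cand = c1 | c2
            if len(cand) != size + 1 or cand in next_level:
                continue
            # Each attribute and each position may appear at most once.
            if len({a for a, _ in cand}) != len(cand):
                continue
            if len({p for _, p in cand}) != len(cand):
                continue
            # Apriori pruning: every sub-configuration must itself be frequent.
            if not all(frozenset(sub) in level for sub in combinations(cand, size)):
                continue
            s = support(cand, pages, seed_records)
            if s >= min_support:
                next_level[cand] = s
        level = next_level
    return result


# Hypothetical seed database and pages from a new template-based site.
seed = [
    {"title": "The Matrix", "director": "Lana Wachowski"},
    {"title": "Inception", "director": "Christopher Nolan"},
]
pages = [
    {"/h1": "The Matrix (1999)", "/div[1]": "Directed by Lana Wachowski",
     "/div[2]": "Advertisement"},
    {"/h1": "Inception (2010)", "/div[1]": "Directed by Christopher Nolan",
     "/div[2]": "The Matrix is also popular"},
]
configs = frequent_configs(pages, seed, min_support=2)
best = max(configs, key=len)
```

In this toy data, the stray mention of a seed title at `/div[2]` on the second page is supported by only one page and is discarded, while the configuration that places the title at `/h1` and the director at `/div[1]` is supported by both pages. Pruning relies on support being anti-monotone: adding a pair to a configuration can never increase the number of pages that satisfy it.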