Extracting data records from the web using tag path clustering

Authors:
Gengxin Miao; Junichi Tatemura; Wang-Pin Hsiung; Arsany Sawires; Louise E. Moser
Affiliations:
University of California, Santa Barbara, Santa Barbara, CA, USA; NEC Laboratories America, Cupertino, CA, USA; NEC Laboratories America, Cupertino, CA, USA; NEC Laboratories America, Cupertino, CA, USA; University of California, Santa Barbara, Santa Barbara, CA, USA
Venue:
Proceedings of the 18th international conference on World wide web
Year:
2009

Abstract

Fully automatic methods that extract lists of objects from the Web have been studied extensively. Record extraction, the first step of this object extraction process, identifies a set of Web page segments, each of which represents an individual object (e.g., a product). State-of-the-art methods suffice for simple search, but they often fail to handle more complicated or noisy Web page structures due to a key limitation: their greedy manner of identifying a list of records through pairwise comparison (i.e., similarity match) of consecutive segments. This paper introduces a new method for record extraction that captures a list of objects in a more robust way based on a holistic analysis of a Web page. The method focuses on how a distinct tag path appears repeatedly in the DOM tree of the Web document. Instead of comparing a pair of individual segments, it compares a pair of tag path occurrence patterns (called visual signals) to estimate how likely it is that these two tag paths represent the same list of objects. The paper introduces a similarity measure that captures how closely the visual signals appear and interleave. Clustering of tag paths is then performed based on this similarity measure, and sets of tag paths that form the structure of data records are extracted. Experiments show that this method achieves higher accuracy than previous methods.
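
The pipeline described in the abstract (tag path occurrence patterns, a pairwise similarity, then clustering) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made purely for illustration: each visual signal is represented as the list of document-order positions at which a tag path occurs; `interleave_score` is a toy stand-in for the paper's similarity measure, which the abstract does not define; and `cluster_paths` uses simple threshold-based grouping instead of whatever clustering algorithm the paper uses. The names `TagPathCollector`, `tag_path_signals`, `interleave_score`, `cluster_paths`, `threshold`, and `min_count` are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records, for each distinct root-to-node tag path, the positions
    (in document order) at which that path occurs."""

    def __init__(self):
        super().__init__()
        self.stack = []                       # currently open tags
        self.position = 0                     # running index over tag occurrences
        self.occurrences = defaultdict(list)  # tag path -> list of positions

    def _record(self, tag):
        path = "/".join(self.stack + [tag])
        self.occurrences[path].append(self.position)
        self.position += 1

    def handle_starttag(self, tag, attrs):
        self._record(tag)
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        self._record(tag)  # self-closing tag: record it, never push it

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_path_signals(html):
    """Return {tag path: sorted occurrence positions} for an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.occurrences)


def interleave_score(a, b):
    """Toy similarity in [0, 1] (an assumption, not the paper's measure):
    the fraction of adjacent pairs in the merged occurrence sequence
    that switch from one tag path to the other."""
    merged = sorted([(p, 0) for p in a] + [(p, 1) for p in b])
    labels = [label for _, label in merged]
    switches = sum(1 for x, y in zip(labels, labels[1:]) if x != y)
    return switches / (len(labels) - 1) if len(labels) > 1 else 0.0


def cluster_paths(signals, threshold=0.8, min_count=2):
    """Group repeated tag paths whose toy similarity reaches `threshold`
    (single-link grouping via union-find, a stand-in clustering step)."""
    paths = [p for p, pos in signals.items() if len(pos) >= min_count]
    parent = {p: p for p in paths}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    for i, p in enumerate(paths):
        for q in paths[i + 1:]:
            if interleave_score(signals[p], signals[q]) >= threshold:
                parent[find(p)] = find(q)

    groups = defaultdict(set)
    for p in paths:
        groups[find(p)].add(p)
    return [g for g in groups.values() if len(g) > 1]


SAMPLE_PAGE = """<html><body><h1>Shop</h1><ul>
<li><b>A</b><i>1</i></li>
<li><b>B</b><i>2</i></li>
<li><b>C</b><i>3</i></li>
</ul><p>footer</p><p>more</p></body></html>"""

for group in cluster_paths(tag_path_signals(SAMPLE_PAGE)):
    print(sorted(group))
```

On this sample page the three tag paths under `li` occur in strict alternation, so they end up in one group, while the two trailing `p` elements repeat but do not interleave with the list and are left out. A real page would need a more careful similarity than this toy score, which ignores how far apart the occurrences are.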