Leveraging spatial join for robust tuple extraction from web pages

Authors:
Wook-Shin Han;Wooseong Kwak;Hwanjo Yu;Jeong-Hoon Lee;Min-Soo Kim
Venue:
Information Sciences: an International Journal
Year:
2014

Abstract

Extracting tuples from HTML pages is an important problem in many web applications. Commercial tuple extraction systems have had some success by treating HTML pages as tree structures and using XPath queries to locate the attributes of tuples in those pages. However, such systems are vulnerable to small changes in the web pages. In this paper, we propose a robust tuple extraction system that uses spatial relationships among elements rather than XPath queries. Spatial information (e.g., 2-D coordinates) of elements is maintained in the DOM tree when a web page is rendered in a browser. Our system regards elements in the rendered page as spatial objects in 2-D space and executes spatial joins to extract target elements. Since humans also identify an element in a web page by its relative spatial location, a system that extracts elements by their spatial relationships could possibly be as robust as manual extraction. To specify and execute spatial joins, we propose a new query language, RAQuery, based on topological relationships between spatial objects in 2-D space. We then propose spatial join algorithms that efficiently process RAQuery queries using the novel notions of group match and prunable relation group. Next, we propose a tuple construction algorithm that builds tuples from the elements extracted by the spatial joins and can construct tuples even when the web page specifies no boundary HTML elements for them. Extensive experiments using real HTML pages confirm that our solutions are far more robust than existing tuple extraction systems without sacrificing performance.
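
To make the idea of joining rendered elements by spatial relationship concrete, the following is a minimal sketch in Python. It is not RAQuery and not the paper's join algorithm (group match and prunable relation groups are not modeled); it is a naive nested-loop join over bounding boxes. The `Box` type, the `right_of` predicate, the `spatial_join` helper, and the sample coordinates are all hypothetical names and values introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a rendered element (hypothetical type).

    Coordinates follow the usual browser convention: x grows to the right,
    y grows downward.
    """
    label: str
    left: float
    top: float
    right: float
    bottom: float


def right_of(a: Box, b: Box) -> bool:
    """Assumed predicate: b starts at or after a's right edge and the two
    boxes share some vertical extent (roughly, they sit on the same row)."""
    overlaps_vertically = a.top < b.bottom and b.top < a.bottom
    return b.left >= a.right and overlaps_vertically


def spatial_join(
    anchors: Sequence[Box],
    candidates: Sequence[Box],
    predicate: Callable[[Box, Box], bool],
) -> List[Tuple[Box, Box]]:
    """Naive nested-loop spatial join: keep every pair the predicate accepts."""
    return [(a, c) for a in anchors for c in candidates if predicate(a, c)]


# Made-up layout: two "Price:" captions, each with a value on the same row,
# plus an unrelated footer element further down the page.
anchors = [
    Box("Price:", 10, 100, 60, 120),
    Box("Price:", 10, 140, 60, 160),
]
values = [
    Box("$12.00", 70, 100, 120, 120),
    Box("$8.50", 70, 140, 120, 160),
    Box("Footer", 10, 400, 300, 420),
]

pairs = spatial_join(anchors, values, right_of)
for caption, value in pairs:
    print(caption.label, value.label)
```

Because the predicate depends only on relative positions, the pairing in this sketch is unaffected if the page wraps either element in an additional container, which is the kind of change that can invalidate a fixed XPath expression.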