Extracting tuples from HTML pages has been an important issue in various web applications. Commercial tuple extraction systems have had some success by treating HTML pages as tree structures and using XPath queries to locate the attributes of tuples in the pages. However, such systems are vulnerable to small changes in the web pages. In this paper, we propose a robust tuple extraction system that uses spatial relationships among elements rather than XPath queries. Spatial information about elements (e.g., 2-D coordinates) is maintained in the DOM tree when a web page is rendered in a browser. Our system treats elements in the rendered page as spatial objects in 2-D space and executes spatial joins to extract target elements. Since humans also identify an element in a web page by its relative spatial location, a system that extracts elements by their spatial relationships could potentially be as robust as manual extraction. To specify and execute spatial joins, we propose a new query language, RAQuery, based on topological relationships between spatial objects in 2-D space. We then propose spatial join algorithms that process RAQuery queries efficiently using the novel notions of group match and prunable relation group. Next, we propose a tuple construction algorithm that builds tuples from the elements extracted by the spatial joins, and that can construct tuples even when the web page specifies no boundary HTML elements for them. Extensive experiments on real HTML pages confirm that our solutions are far more robust than existing tuple extraction systems without sacrificing performance.
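To make the idea of extracting elements by spatial joins more concrete, the following is a minimal sketch in Python. It is not RAQuery and does not implement the group match or prunable relation group techniques named above. It is a naive nested-loop join over axis-aligned bounding boxes, and the `Box` type, the `right_of` predicate, the `spatial_join` helper, and the sample coordinates are all assumptions made for illustration. In a real browser the coordinates might come from something like the DOM's `getBoundingClientRect()`, which is also an assumption here rather than a detail given in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Box:
    """Bounding box of a rendered element (hypothetical type, y grows downward)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float


def right_of(a: Box, b: Box) -> bool:
    """True if b starts at or after a's right edge and overlaps a vertically."""
    horizontally_after = b.left >= a.right
    vertical_overlap = b.top < a.bottom and a.top < b.bottom
    return horizontally_after and vertical_overlap


def spatial_join(
    lefts: Iterable[Box],
    rights: Iterable[Box],
    predicate: Callable[[Box, Box], bool],
) -> List[Tuple[Box, Box]]:
    """Naive nested-loop spatial join: every pair that satisfies the predicate."""
    rights = list(rights)  # allow repeated iteration over the inner input
    return [(a, b) for a in lefts for b in rights if predicate(a, b)]


# Illustrative data: two "Price:" labels, each with a value rendered to its right.
labels = [
    Box("price-1", left=10, top=100, right=60, bottom=120),
    Box("price-2", left=10, top=140, right=60, bottom=160),
]
values = [
    Box("$10", left=70, top=100, right=120, bottom=120),
    Box("$25", left=70, top=140, right=120, bottom=160),
]

pairs = spatial_join(labels, values, right_of)
pair_names = [(a.name, b.name) for a, b in pairs]
print(pair_names)
```

Because the predicate depends only on rendered positions, wrapping a value in an extra `<span>` or moving it to a different depth in the DOM tree would leave the result unchanged as long as the layout stays the same, whereas a fixed XPath to the value could stop matching. The nested loop compares every pair of elements, which is the cost that an efficient spatial join algorithm would aim to avoid.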