Automatic extraction of dynamic record sections from search engine result pages

Authors:
Hongkun Zhao; Weiyi Meng; Clement Yu
Affiliations:
SUNY at Binghamton, Binghamton, NY; SUNY at Binghamton, Binghamton, NY; University of Illinois at Chicago, Chicago, IL
Venue:
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Year:
2006

Abstract

A result page returned by a search engine may organize its search results into multiple dynamically generated sections in response to a user query. Such a page often also contains information irrelevant to the query, such as information about the site hosting the search engine. In this paper, we present a method that automatically generates wrappers for extracting search result records from all dynamic sections of the result pages returned by search engines. The method has two novel features: (1) it aims to explicitly identify all dynamic sections, including those not seen on the sample result pages used to generate the wrapper, and (2) it addresses the problem of correctly differentiating sections from records. Experimental results indicate that this method is very promising. Automatic search result record extraction is critical for applications that need to interact with search engines, such as the automatic construction and maintenance of metasearch engines and deep Web crawling.