Mining templates from search result records of search engines

Authors:
Hongkun Zhao; Weiyi Meng; Clement Yu
Affiliations:
State University of New York at Binghamton; State University of New York at Binghamton; University of Illinois at Chicago
Venue:
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2007

Abstract

Metasearch engines, comparison-shopping systems, and Deep Web crawling applications need to extract the search result records enwrapped in the result pages that search engines return in response to user queries. The search result records from a given search engine are usually formatted according to a template. Precisely identifying this template can greatly help to extract and annotate the data units within each record correctly. In this paper, we propose a graph model to represent the record template and develop a domain-independent statistical method to automatically mine the record template for any search engine from sample search result records. Our approach can identify both template tags (HTML tags) and template texts (non-tag texts), and it explicitly addresses the mismatches between the tag structures and the data structures of search result records. Our experimental results indicate that this approach is very effective.
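To make the distinction between template tokens and data units concrete, the following is a minimal sketch and not the graph model or statistical method proposed in the paper. It assumes a much simpler heuristic: a token (an HTML tag name or a non-tag word) is treated as part of the template if it occurs in a large fraction of the sample records. The function names `tokenize` and `mine_template_tokens`, the `min_support` threshold, and the sample records are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Matches either a whole HTML tag or a run of non-tag text.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")
# Extracts the optional closing slash and the tag name from a tag.
TAG_NAME_RE = re.compile(r"<\s*(/?)\s*([A-Za-z0-9]+)")


def tokenize(record_html):
    """Split one record into tag tokens (attributes dropped) and text words."""
    tokens = []
    for piece in TOKEN_RE.findall(record_html):
        if piece.startswith("<"):
            match = TAG_NAME_RE.match(piece)
            if match:  # comments and malformed tags are skipped
                tokens.append("<" + match.group(1) + match.group(2).lower() + ">")
        else:
            tokens.extend(piece.split())
    return tokens


def mine_template_tokens(records, min_support=0.8):
    """Return tokens that occur in at least min_support of the records.

    Assumption for this sketch: high record frequency indicates a
    template token, low frequency indicates a data unit.
    """
    record_freq = Counter()
    for record in records:
        # Count each token once per record.
        record_freq.update(set(tokenize(record)))
    total = len(records)
    return {tok for tok, count in record_freq.items() if count / total >= min_support}


# Hypothetical sample search result records from one search engine.
sample_records = [
    '<a href="u1">Cheap flights</a><br>Price: $120<br>',
    '<a href="u2">Hotel deals</a><br>Price: $80<br>',
    '<a href="u3">Car rental</a><br>Price: $45<br>',
]

template_tokens = mine_template_tokens(sample_records)
print(sorted(template_tokens))
```

In this toy input the tags and the label `Price:` recur in every record, while titles and amounts vary, so only the former pass the threshold. A frequency count like this ignores token order and cannot handle the mismatches between tag structure and data structure that the abstract mentions; those are the cases the paper's graph model is meant to address.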