Fully automatic wrapper generation for search engines

Authors:
Hongkun Zhao;Weiyi Meng;Zonghuan Wu;Vijay Raghavan;Clement Yu
Affiliations:
SUNY at Binghamton, Binghamton, NY;SUNY at Binghamton, Binghamton, NY;Univ. of Louisiana at Lafayette, Lafayette, LA;Univ. of Louisiana at Lafayette, Lafayette, LA;University of Illinois at Chicago, Chicago, IL
Venue:
WWW '05 Proceedings of the 14th international conference on World Wide Web
Year:
2005

Abstract

When a query is submitted to a search engine, the engine returns a dynamically generated result page containing the result records, each of which usually consists of a link to, and/or a snippet of, a retrieved Web page. Such a result page often also contains information irrelevant to the query, such as advertisements and information about the site hosting the search engine. In this paper, we present a technique for automatically producing wrappers that extract search result records from dynamically generated result pages returned by search engines. Automatic search result record extraction is very important for many applications that need to interact with search engines, such as automatic construction and maintenance of metasearch engines and deep Web crawling. The novel aspect of the proposed technique is that it utilizes both the visual content features of the result page as displayed in a browser and the HTML tag structures of the page's HTML source file. Experimental results indicate that this technique can achieve very high extraction accuracy.
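
To make the idea of a wrapper concrete, here is a minimal sketch in Python. It is not the technique presented in the paper: it uses only the HTML tag structure (no visual features, which would require a rendering engine) and assumes that the result records are the links sharing the most frequent tag path on the page. The names `LinkPathCollector`, `extract_result_links`, and the sample page are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LinkPathCollector(HTMLParser):
    """Record the tag path (e.g. 'html/body/ul/li/a') of every link."""

    def __init__(self):
        super().__init__()
        self.stack = []                          # currently open tags
        self.links_by_path = defaultdict(list)   # tag path -> list of hrefs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links_by_path["/".join(self.stack)].append(href)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray close tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def extract_result_links(html):
    """Assumption: result records are the links on the most frequent tag path."""
    collector = LinkPathCollector()
    collector.feed(html)
    if not collector.links_by_path:
        return []
    return max(collector.links_by_path.values(), key=len)


# Hypothetical result page: navigation links, three records, one advertisement.
PAGE = """
<html><body>
  <div id="nav"><a href="/home">Home</a> <a href="/about">About</a></div>
  <ul>
    <li><a href="http://a.example/1">First</a><p>snippet one</p></li>
    <li><a href="http://b.example/2">Second</a><p>snippet two</p></li>
    <li><a href="http://c.example/3">Third</a><p>snippet three</p></li>
  </ul>
  <p class="ad"><a href="http://ads.example/x">Sponsored</a></p>
</body></html>
"""

print(extract_result_links(PAGE))
```

On this sample page the navigation links and the advertisement sit on less frequent tag paths, so only the three record links are returned. A heuristic this simple fails when irrelevant links outnumber the records or share their tag path, which is one reason a fuller technique would combine tag structure with further evidence such as the visual features mentioned in the abstract.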