ViPER: augmenting automatic information extraction with visual perceptions

Authors:
Kai Simon;Georg Lausen
Affiliations:
Universität Freiburg, Freiburg, Germany;Universität Freiburg, Freiburg, Germany
Venue:
Proceedings of the 14th ACM international conference on Information and knowledge management
Year:
2005

Abstract

In this paper we address the problem of unsupervised Web data extraction. We show that unsupervised Web data extraction becomes feasible for pages that are made up of repetitive patterns, as is the case, e.g., for search engine result pages. The extraction rules are generated automatically, without any training or human interaction, by operating on the DOM tree or the flat tag token sequence of a single page.

Our contribution to automatic data extraction is twofold. First, we identify and rank potential repetitive patterns with respect to the user's visual perception of the Web page, since the location and size of matching elements within a Web page constitute important criteria for defining relevance. Second, matching subsequences of the pattern with the highest weight are aligned with global multiple sequence alignment techniques. Experimental results show that our system is able to achieve high accuracy in distilling and aligning regularly structured objects inside complex Web pages.
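To make the alignment step more concrete, the following is a minimal sketch of global alignment applied to flat tag token sequences. It is not the authors' implementation: the abstract does not say which multiple sequence alignment technique is used, so this example substitutes the classic pairwise Needleman-Wunsch algorithm as the simplest member of the global alignment family. The function name `global_align`, the scoring values, and the two sample records are assumptions chosen for illustration only.

```python
from typing import List, Optional, Tuple

GAP = None  # marks a position where one record has no matching token


def global_align(
    a: List[str],
    b: List[str],
    match: int = 1,      # assumed score for identical tokens
    mismatch: int = -1,  # assumed penalty for differing tokens
    gap: int = -1,       # assumed penalty for inserting a gap
) -> Tuple[List[Optional[str]], List[Optional[str]]]:
    """Pairwise global alignment (Needleman-Wunsch) of two token sequences."""
    n, m = len(a), len(b)

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] aligned to a gap
                score[i][j - 1] + gap,      # b[j-1] aligned to a gap
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a: List[Optional[str]] = []
    row_b: List[Optional[str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append(GAP)
            i -= 1
        else:
            row_a.append(GAP)
            row_b.append(b[j - 1])
            j -= 1

    row_a.reverse()
    row_b.reverse()
    return row_a, row_b


# Two hypothetical result records as flat tag token sequences;
# the second one carries an extra <br> token.
record_a = ["<tr>", "<td>", "<a>", "</a>", "</td>", "</tr>"]
record_b = ["<tr>", "<td>", "<a>", "</a>", "<br>", "</td>", "</tr>"]

aligned_a, aligned_b = global_align(record_a, record_b)
for tok_a, tok_b in zip(aligned_a, aligned_b):
    print(f"{tok_a or '-':8} {tok_b or '-'}")
```

In this sketch the optional `<br>` token of the second record is paired with a gap in the first, so the remaining tokens line up column by column. Aligning more than two records at once, as the abstract describes, would require a multiple alignment strategy on top of this pairwise step.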