The Lixto project: exploring new frontiers of web data extraction

Authors:
Julien Carme;Michal Ceresna;Oliver Frölich;Georg Gottlob;Tamir Hassan;Marcus Herzog;Wolfgang Holzinger;Bernhard Krüpl
Affiliations:
Database and Artificial Intelligence Group, Vienna University of Technology, Wien, Austria;Database and Artificial Intelligence Group, Vienna University of Technology, Wien, Austria;Database and Artificial Intelligence Group, Vienna University of Technology, Wien, Austria;Oxford University Computing Laboratory, Oxford, United Kingdom;Database and Artificial Intelligence Group, Vienna University of Technology, Wien, Austria;Database and Artificial Intelligence Group, Vienna University of Technology, Wien, Austria;Database and Artificial Intelligence Group, Vienna University of Technology, Wien, Austria;Database and Artificial Intelligence Group, Vienna University of Technology, Wien, Austria
Venue:
BNCOD'06 Proceedings of the 23rd British National Conference on Databases, conference on Flexible and Efficient Information Handling
Year:
2006

Abstract

The Lixto project is an ongoing research effort in the area of Web data extraction. The project originally set out to develop a logic-based extraction language and a tool for visually defining extraction programs from sample Web pages, but its scope has been extended over time. Today, new issues are being investigated, such as employing learning algorithms for the definition of extraction programs, automatically extracting data from Web pages featuring a table-centric visual appearance, and extracting from alternative document formats such as PDF.