DEByE - Data extraction by example

Authors:
Alberto H. F. Laender;Berthier Ribeiro-Neto;Altigran S. da Silva
Affiliations:
Department of Computer Science, Federal University of Minas Gerais, ICEx-UFMG, Caixa Postal 702, Belo Horizonte MG, Brazil;Department of Computer Science, Federal University of Minas Gerais, ICEx-UFMG, Caixa Postal 702, Belo Horizonte MG, Brazil;Department of Computer Science, Federal University of Minas Gerais, ICEx-UFMG, Caixa Postal 702, Belo Horizonte MG, Brazil
Venue:
Data & Knowledge Engineering
Year:
2002


Abstract

In this paper we present DEByE (Data Extraction By Example), an approach to extracting data from Web sources based on a small set of examples specified by the user. The novelty is that the user specifies examples according to a structure of their own choosing, and that this structure is described at example-specification time. To specify the examples, the user interacts with a tool we developed, which adopts nested tables as its visual paradigm. Nested tables are simple and intuitive, and they shield the user from technical details of the extraction problem (such as HTML tags, formatting operators, and learning automata). The examples provided by the user are then used to generate patterns that allow data to be extracted from new documents. For the extraction, DEByE adopts a new bottom-up procedure that we propose, which our experiments show to be very effective on a variety of Web sources.
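
The general idea of example-driven, bottom-up extraction can be illustrated with a small Python sketch. This is not the DEByE algorithm or its pattern language, which the abstract does not detail. It is a toy illustration under stated assumptions: patterns are simply the few characters surrounding each example value, and records are assembled by document order. The function names (learn_pattern, extract_atoms, assemble), the context width, and the sample pages are all hypothetical.

```python
import re

def learn_pattern(sample, example, width=4):
    """Build a regex from the text surrounding one user-marked example value.

    Assumption: a fixed-width left/right context is enough to locate values.
    """
    start = sample.index(example)
    end = start + len(example)
    left = sample[max(0, start - width):start]
    right = sample[end:end + width]
    return re.compile(re.escape(left) + r"(.+?)" + re.escape(right))

def extract_atoms(page, patterns):
    """Bottom-up step 1: find every atomic value, tagged with attribute and offset."""
    atoms = []
    for attr, pattern in patterns.items():
        for m in pattern.finditer(page):
            atoms.append((m.start(1), attr, m.group(1)))
    return sorted(atoms)  # document order

def assemble(atoms, first_attr):
    """Bottom-up step 2: group atoms into records; first_attr opens a new record."""
    records = []
    for _, attr, value in atoms:
        if attr == first_attr or not records:
            records.append({})
        # Each attribute holds a list, so repeated values nest inside the record.
        records[-1].setdefault(attr, []).append(value)
    return records

# One example per attribute, marked by the user on a sample page (hypothetical data).
sample = "<li><b>Dune</b> <i>Herbert</i></li>"
examples = {"title": "Dune", "author": "Herbert"}
patterns = {attr: learn_pattern(sample, value) for attr, value in examples.items()}

# A new page from the same source, including a record with two authors.
new_page = ("<li><b>Emma</b> <i>Austen</i></li>"
            "<li><b>Good Omens</b> <i>Pratchett</i> <i>Gaiman</i></li>")
records = assemble(extract_atoms(new_page, patterns), "title")
print(records)
```

In this sketch the atomic values are located first and only then grouped into larger objects, which is the sense in which it is bottom-up. The list-valued author attribute plays the role of an inner table nested within each record.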