On Precision and Recall of Multi-Attribute Data Extraction from Semistructured Sources

Authors:
Guizhen Yang; Saikat Mukherjee; I. V. Ramakrishnan
Venue:
ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Year:
2003

Citing 16
Cited 2

On the complexity of learning strings and sequences

Theoretical Computer Science
On finding minimal, maximal, and consistent sequences over a binary alphabet

Theoretical Computer Science
Template-based wrappers in the TSIMMIS system

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Learning to Understand Information on the Internet: An Example-Based Approach

Journal of Intelligent Information Systems - Special issue: next generation information technologies and systems
Wrapper generation for semi-structured Internet sources

ACM SIGMOD Record
NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Ontology-based extraction and structuring of information from data-rich unstructured documents

Proceedings of the seventh international conference on Information and knowledge management
A hierarchical approach to wrapper induction

Proceedings of the third annual conference on Autonomous Agents
Record-boundary discovery in Web documents

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Generating finite-state transducers for semi-structured data extraction from the Web

Information Systems - Special issue on semistructured data
Learning Information Extraction Rules for Semi-Structured and Free Text

Machine Learning - Special issue on natural language learning
The Complexity of Some Problems on Subsequences and Supersequences

Journal of the ACM (JACM)
A flexible learning system for wrapping tables and lists in HTML documents

Proceedings of the 11th international conference on World Wide Web
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
On the complexity of schema inference from web pages in the presence of nullable data attributes

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management

On the complexity of schema inference from web pages in the presence of nullable data attributes

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Hearsay: enabling audio browsing on hypertext content

Proceedings of the 13th international conference on World Wide Web

Abstract

Machine learning techniques for data extraction from semistructured sources exhibit different precision and recall characteristics. However, to date, the formal relationship between learning algorithms and their impact on these two metrics remains unexplored. This paper proposes a formalization of precision and recall of extraction and investigates the complexity-theoretic aspects of learning algorithms for multi-attribute data extraction based on this formalism. We show that there is a tradeoff between precision/recall of extraction and computational efficiency and present experimental results to demonstrate the practical utility of these concepts in designing scalable data extraction algorithms for improving recall without compromising on precision.