In this paper, we propose the use of semistructured constraints in wrappers to mitigate the impact of poor extraction accuracy on the data quality of a Cooperative Information System (CIS). Wrappers are a critical element of CISs whenever the constituent information systems publish semistructured text, such as forms, reports, and memos, rather than structured databases. The accuracy of CIS data that stem from text depends on the accuracy of the wrappers as well as that of the underlying sources. Wrapper specification is the process of defining patterns (i.e., regular expressions) to extract information from semistructured text. Wrapper verification is the process of ensuring extraction accuracy, that is, that the extracted information faithfully reflects the underlying source. We focus on the problem of extraction accuracy. Because we use constraints on semistructured data for both wrapper specification and verification, extraction and verification are performed simultaneously. We apply the concept to wrappers for a CIS of arbitration decisions issued under the Uniform Domain Name Dispute Resolution Policy (UDRP). UDRP decisions are currently distributed across arbitration authorities on three continents. The accuracy of data extracted using constraint-based specification and verification is measured by Type I and Type II errors.
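To make the idea of combined specification and verification concrete, the following is a minimal sketch in Python, not the paper's implementation. Each field pairs an extraction pattern with a constraint that the extracted value must satisfy, so a value is accepted only if it both matches and verifies. The field names, patterns, constraints, and sample layout are all hypothetical. The error counts assume that a Type I error is an accepted value that disagrees with the ground truth and that a Type II error is a ground-truth value the wrapper failed to accept; the abstract does not define these terms.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Field:
    """One wrapper field: an extraction pattern plus a verification constraint."""
    name: str
    pattern: re.Pattern                  # must contain exactly one capture group
    constraint: Callable[[str], bool]    # must hold for an extracted value


# Hypothetical fields for a UDRP-style decision text (illustrative only).
FIELDS = [
    Field("case_no", re.compile(r"Case No\.\s*(\S+)"),
          lambda v: re.fullmatch(r"D\d{4}-\d{4}", v) is not None),
    Field("domain", re.compile(r"Domain Name:\s*(\S+)"),
          lambda v: re.fullmatch(r"[a-z0-9-]+(\.[a-z0-9-]+)+", v) is not None),
    Field("decision", re.compile(r"Decision:\s*(\w+)"),
          lambda v: v in {"Transferred", "Denied", "Cancelled"}),
]


def extract(text: str) -> Tuple[Dict[str, str], List[str]]:
    """Extract every field and verify it in the same pass.

    Returns the accepted values and a list of constraint violations.
    """
    values: Dict[str, str] = {}
    violations: List[str] = []
    for field in FIELDS:
        match = field.pattern.search(text)
        if match is None:
            violations.append(f"{field.name}: no match")
            continue
        value = match.group(1)
        if field.constraint(value):
            values[field.name] = value
        else:
            violations.append(f"{field.name}: rejected value {value!r}")
    return values, violations


def error_counts(extracted: Dict[str, str], truth: Dict[str, str]) -> Tuple[int, int]:
    """Count errors under the assumed definitions stated in the lead-in."""
    type_1 = sum(1 for name, value in extracted.items() if truth.get(name) != value)
    type_2 = sum(1 for name in truth if name not in extracted)
    return type_1, type_2


# Hypothetical sample documents.
GOOD = "Case No. D2000-0001\nDomain Name: example.com\nDecision: Transferred\n"
BAD = "Case No. 2000/1\nDomain Name: example.com\nDecision: Pending\n"
TRUTH = {"case_no": "D2000-0001", "domain": "example.com", "decision": "Transferred"}
```

Calling `extract(GOOD)` accepts all three fields, whereas `extract(BAD)` accepts only the domain name and reports the other two fields as violations instead of passing unverified values downstream.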