Automatic segmentation of text into structured records

Authors:
Vinayak Borkar;Kaustubh Deshmukh;Sunita Sarawagi
Affiliations:
Indian Institute of Technology, Bombay;University of Washington, Seattle and Indian Institute of Technology, Bombay;Indian Institute of Technology, Bombay
Venue:
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Year:
2001

Abstract

In this paper we present a method for automatically segmenting unformatted text records into structured elements. Several useful data sources today are human-generated as continuous text, whereas convenient use requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem: transforming dirty addresses, stored in large corporate databases as a single text field, into subfields like “City” and “Street”. Existing tools rely on hand-tuned, domain-specific rule-based systems. We describe a tool, DATAMOLD, that learns to automatically extract structure when seeded with a small number of training examples. The tool extends Hidden Markov Models (HMMs) to build a powerful probabilistic model that combines multiple sources of information, including the sequence of elements, their length distribution, distinguishing words from the vocabulary, and an optional external data dictionary. Experiments on real-life datasets yielded accuracy of 90% on Asian addresses and 99% on US addresses. In contrast, existing information extraction methods based on rule-learning techniques yielded considerably lower accuracy.
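
To make the HMM idea concrete, here is a minimal sketch of address segmentation with a plain first-order HMM and Viterbi decoding. It is not DATAMOLD itself: the state names, the coarse token symbols, the street-word list, and every probability below are hypothetical values chosen by hand for illustration, whereas the tool described above learns its parameters from training examples and also uses length distributions and an optional dictionary, which this sketch omits.

```python
import math
from itertools import groupby

# Hypothetical element types for an address (assumed, not from the paper).
STATES = ["HouseNo", "Street", "City", "Zip"]
# Hypothetical list of words that hint at a street name.
STREET_WORDS = {"street", "st", "road", "rd", "avenue", "ave", "lane"}

# Hand-set illustrative probabilities; a real system would estimate these
# from labeled training examples.
START = {"HouseNo": 0.7, "Street": 0.2, "City": 0.05, "Zip": 0.05}
TRANS = {
    "HouseNo": {"HouseNo": 0.05, "Street": 0.85, "City": 0.05, "Zip": 0.05},
    "Street": {"HouseNo": 0.02, "Street": 0.55, "City": 0.40, "Zip": 0.03},
    "City": {"HouseNo": 0.02, "Street": 0.03, "City": 0.35, "Zip": 0.60},
    "Zip": {"HouseNo": 0.05, "Street": 0.05, "City": 0.05, "Zip": 0.85},
}
EMIT = {
    "HouseNo": {"number": 0.85, "5digit": 0.05, "street_word": 0.02, "word": 0.08},
    "Street": {"number": 0.05, "5digit": 0.01, "street_word": 0.44, "word": 0.50},
    "City": {"number": 0.01, "5digit": 0.01, "street_word": 0.03, "word": 0.95},
    "Zip": {"number": 0.10, "5digit": 0.85, "street_word": 0.01, "word": 0.04},
}


def symbol(token):
    """Map a raw token to one of four coarse observation symbols."""
    if token.isdigit():
        return "5digit" if len(token) == 5 else "number"
    if token.lower() in STREET_WORDS:
        return "street_word"
    return "word"


def viterbi(tokens):
    """Return the most probable state sequence for the tokens."""
    if not tokens:
        return []
    obs = [symbol(t) for t in tokens]
    # score[s]: best log-probability of any path that ends in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # one back-pointer table per position after the first
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        back.append(ptr)
    # Follow back-pointers from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def segment(text):
    """Split text on whitespace and merge adjacent tokens sharing a label."""
    tokens = text.split()
    labels = viterbi(tokens)
    pairs = zip(labels, tokens)
    return [
        (label, " ".join(tok for _, tok in group))
        for label, group in groupby(pairs, key=lambda pair: pair[0])
    ]


if __name__ == "__main__":
    print(segment("221 Baker Street London 94301"))
```

With these made-up parameters, the sample input is labeled as HouseNo "221", Street "Baker Street", City "London", Zip "94301". Working in log space avoids numeric underflow on longer records; the transition table is what lets the model prefer the usual element order even when a token such as "Baker" is ambiguous on its own.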