In this paper we present a method for automatically segmenting unformatted text records into structured elements. Many useful data sources today are human-generated as continuous text, whereas convenient usage requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem: transforming dirty addresses, stored in large corporate databases as a single text field, into subfields like “City” and “Street”. Existing tools rely on hand-tuned, domain-specific rule-based systems. We describe a tool, DATAMOLD, that learns to automatically extract structure when seeded with a small number of training examples. The tool extends Hidden Markov Models (HMMs) to build a powerful probabilistic model that combines multiple sources of information, including the sequence of elements, their length distribution, distinguishing words from the vocabulary, and an optional external data dictionary. Experiments on real-life datasets yielded accuracy of 90% on Asian addresses and 99% on US addresses. In contrast, existing information extraction methods based on rule-learning techniques yielded considerably lower accuracy.
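To make the idea of HMM-based segmentation concrete, the following is a minimal sketch of Viterbi decoding over a toy address model. It is not DATAMOLD's actual model: the state names, the hand-picked probabilities in `START`, `TRANS`, and `VOCAB`, the `UNSEEN` smoothing constant, and the `<num>` token class are all assumptions made for illustration. A learned system would estimate these values from training examples, and the sketch does not model length distributions or an external dictionary.

```python
import math

# Hypothetical states (address elements); not taken from the paper.
STATES = ["HouseNo", "Street", "City"]

# Illustrative, hand-picked probabilities; a real system would learn these.
START = {"HouseNo": 0.8, "Street": 0.15, "City": 0.05}
TRANS = {
    "HouseNo": {"HouseNo": 0.05, "Street": 0.90, "City": 0.05},
    "Street":  {"HouseNo": 0.05, "Street": 0.55, "City": 0.40},
    "City":    {"HouseNo": 0.05, "Street": 0.05, "City": 0.90},
}
# Per-state "distinguishing words"; unseen words get a small constant.
VOCAB = {
    "HouseNo": {"<num>": 0.9},
    "Street": {"main": 0.3, "street": 0.3, "road": 0.2},
    "City": {"springfield": 0.5, "mumbai": 0.3},
}
UNSEEN = 1e-3


def normalize(token):
    """Map all-digit tokens to a shared class; lowercase everything else."""
    return "<num>" if token.isdigit() else token.lower()


def emit(state, token):
    """Emission probability of a token in a state, with crude smoothing."""
    return VOCAB[state].get(normalize(token), UNSEEN)


def viterbi(tokens):
    """Return (token, state) pairs for the most likely state sequence."""
    if not tokens:
        return []
    # Each column maps state -> (best log-probability, previous state).
    best = [{s: (math.log(START[s]) + math.log(emit(s, tokens[0])), None)
             for s in STATES}]
    for tok in tokens[1:]:
        prev = best[-1]
        col = {}
        for s in STATES:
            score, arg = max((prev[r][0] + math.log(TRANS[r][s]), r)
                             for r in STATES)
            col[s] = (score + math.log(emit(s, tok)), arg)
        best.append(col)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for col in reversed(best[1:]):
        state = col[state][1]
        path.append(state)
    path.reverse()
    return list(zip(tokens, path))


if __name__ == "__main__":
    for token, label in viterbi("42 Main Street Springfield".split()):
        print(f"{token}\t{label}")
```

Decoding runs in log space to avoid numerical underflow on longer records. Under these toy parameters, the transition table encodes the expected ordering of elements (house number, then street, then city), while the vocabulary supplies the word-level evidence.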