Creating relational data from unstructured and ungrammatical data sources

Authors:
Matthew Michelson;Craig A. Knoblock
Affiliations:
University of Southern California, Information Sciences Instistute, Marina del Rey, CA;University of Southern California, Information Sciences Instistute, Marina del Rey, CA
Venue:
Journal of Artificial Intelligence Research
Year:
2008

Citing 35
Cited 12

Mining association rules between sets of items in large databases

SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
Making large-scale support vector machine learning practical

Advances in kernel methods
Learning Information Extraction Rules for Semi-Structured and Free Text

Machine Learning - Special issue on natural language learning
Relational learning of pattern-match rules for information extraction

AAAI '99/IAAI '99 Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence
Conceptual-model-based data extraction from multiple-record Web pages

Data & Knowledge Engineering
Efficient clustering of high-dimensional data sets with application to reference matching

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Data integration using similarity joins and a word-based information representation language

ACM Transactions on Information Systems (TOIS)
Automatic segmentation of text into structured records

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
A study of smoothing methods for language models applied to Ad Hoc information retrieval

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Machine Learning

Machine Learning
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

Data Mining and Knowledge Discovery
Hierarchical Wrapper Induction for Semistructured Information Sources

Autonomous Agents and Multi-Agent Systems
Agents and the Semantic Web

IEEE Intelligent Systems
MnM: Ontology Driven Semi-automatic and Automatic Support for Semantic Markup

EKAW '02 Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web
S-CREAM - Semi-automatic CREAtion of Metadata

EKAW '02 Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
TAILOR: A Record Linkage Tool Box

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Towards the self-annotating web

Proceedings of the 13th international conference on World Wide Web
Mining reference tables for automatic text segmentation

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Support vector machine learning for interdependent and structured output spaces

ICML '04 Proceedings of the twenty-first international conference on Machine learning
An integrated, conditional model of information extraction and coreference with application to citation matching

UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence
Large Margin Methods for Structured and Interdependent Output Variables

The Journal of Machine Learning Research
Automatically utilizing secondary sources to align information across sources

AI Magazine - Special issue on semantic integration
Integrating Unstructured Data into Relational Databases

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Efficiently linking text documents with relevant structured information

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Adaptive Blocking: Learning to Scale Up Record Linkage

ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Unsupervised information extraction from unstructured, ungrammatical data sources on the World Wide Web

International Journal on Document Analysis and Recognition
Learning blocking schemes for record linkage

AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 1
Adaptive information extraction from text by rule induction and generalisation

IJCAI'01 Proceedings of the 17th international joint conference on Artificial intelligence - Volume 2
Semantic annotation of unstructured and ungrammatical text

IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence
Automatic creation and simplified querying of semantic web content: an approach based on information-extraction ontologies

ASWC'06 Proceedings of the First Asian conference on The Semantic Web

Information Extraction

Foundations and Trends in Databases
Exploiting background knowledge to build reference sets for information extraction

IJCAI'09 Proceedings of the 21st international jont conference on Artifical intelligence
Answering table augmentation queries from unstructured lists on the web

Proceedings of the VLDB Endowment
Generalized expectation criteria for bootstrapping extractors using record-text alignment

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1
Constructing reference sets from unstructured, ungrammatical text

Journal of Artificial Intelligence Research
Matching unstructured product offers to structured product specifications

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Automatically generating data linkages using a domain-independent candidate selection approach

ISWC'11 Proceedings of the 10th international conference on The semantic web - Volume Part I
Aggregating web offers to determine product prices

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
An approach for named entity recognition in poorly structured data

ESWC'12 Proceedings of the 9th international conference on The Semantic Web: research and applications
Web-based closed-domain data extraction on online advertisements

Information Systems
Scalable and domain-independent entity coreference: establishing high quality data linkages across heterogeneous data sources

ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part II
Accuracy vs. Speed: Scalable Entity Coreference on the Semantic Web with On-the-Fly Pruning

WI-IAT '12 Proceedings of the The 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01

Quantified Score

Hi-index	0.00

Visualization

Abstract

In order for agents to act on behalf of users, they will have to retrieve and integrate vast amounts of textual data on the World Wide Web. However, much of the useful data on the Web is neither grammatical nor formally structured, making querying dificult. Examples of these types of data sources are online classifieds like Craigslist1 and auction item listings like eBay.2 We call this unstructured, ungrammatical data "posts." The unstructured nature of posts makes query and integration dificult because the attributes are embedded within the text. Also, these attributes do not conform to standardized values, which prevents queries based on a common attribute value. The schema is unknown and the values may vary dramatically making accurate search dificult. Creating relational data for easy querying requires that we define a schema for the embedded attributes and extract values from the posts while standardizing these values. Traditional information extraction (IE) is inadequate to perform this task because it relies on clues from the data, such as structure or natural language, neither of which are found in posts. Furthermore, traditional information extraction does not incorporate data cleaning, which is necessary to accurately query and integrate the source. The two-step approach described in this paper creates relational data sets from unstructured and ungrammatical text by addressing both issues. To do this, we require a set of known entities called a "reference set." The first step aligns each post to each member of each reference set. This allows our algorithm to define a schema over the post and include standard values for the attributes defined by this schema. The second step performs information extraction for the attributes, including attributes not easily represented by reference sets, such as a price. In this manner we create a relational structure over previously unstructured data, supporting deep and accurate queries over the data as well as standard values for integration. Our experimental results show that our technique matches the posts to the reference set accurately and eficiently and outperforms state-of-the-art extraction systems on the extraction task from posts.