In order for agents to act on behalf of users, they will have to retrieve and integrate vast amounts of textual data on the World Wide Web. However, much of the useful data on the Web is neither grammatical nor formally structured, making querying difficult. Examples of these types of data sources are online classifieds like Craigslist and auction item listings like eBay. We call this unstructured, ungrammatical data "posts." The unstructured nature of posts makes query and integration difficult because the attributes are embedded within the text. Also, these attributes do not conform to standardized values, which prevents queries based on a common attribute value. The schema is unknown and the values may vary dramatically, making accurate search difficult. Creating relational data for easy querying requires that we define a schema for the embedded attributes and extract values from the posts while standardizing these values. Traditional information extraction (IE) is inadequate to perform this task because it relies on clues from the data, such as structure or natural language, neither of which is found in posts. Furthermore, traditional information extraction does not incorporate data cleaning, which is necessary to accurately query and integrate the source. The two-step approach described in this paper creates relational data sets from unstructured and ungrammatical text by addressing both issues. To do this, we require a set of known entities called a "reference set." The first step aligns each post to each member of each reference set. This allows our algorithm to define a schema over the post and include standard values for the attributes defined by this schema. The second step performs information extraction for the attributes, including attributes not easily represented by reference sets, such as a price. In this manner we create a relational structure over previously unstructured data, supporting deep and accurate queries over the data as well as standard values for integration. Our experimental results show that our technique matches the posts to the reference set accurately and efficiently and outperforms state-of-the-art extraction systems on the extraction task from posts.
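The two-step idea can be illustrated with a toy sketch. This is not the method evaluated in the paper: the matching score here is plain Jaccard token overlap, chosen only for brevity, and the reference set, the example post, the function names (`align`, `extract`) and the price pattern are all hypothetical. It shows how aligning a post to a reference record supplies both a schema and standardized values, while an attribute such as price, which a reference set cannot enumerate, is extracted separately.

```python
import re

# Hypothetical reference set: known entities with standardized attribute values.
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Camry"},
]

# Assumed price format for illustration: a dollar sign followed by digits/commas.
PRICE = re.compile(r"\$\s?(\d[\d,]*)")


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align(post, reference_set):
    """Step 1 (sketch): choose the reference record that best overlaps the post."""
    post_tokens = tokens(post)

    def score(record):
        return jaccard(post_tokens, tokens(" ".join(record.values())))

    best = max(reference_set, key=score)
    # No shared tokens at all means no usable match.
    return best if score(best) > 0 else None


def extract(post, reference_set):
    """Step 2 (sketch): standardized values plus attributes outside the reference set."""
    row = dict(align(post, reference_set) or {})
    match = PRICE.search(post)
    row["price"] = int(match.group(1).replace(",", "")) if match else None
    return row


row = extract("93 civic 5spd runs great obo $1,800", REFERENCE_SET)
```

For the example post, `row` carries the standardized values "Honda" and "Civic" even though the post never mentions the make, which is what allows a query on a common attribute value to find it.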