Applications of corpus-based semantic similarity and word segmentation to database schema matching

Authors:
Aminul Islam;Diana Inkpen;Iluju Kiringa
Affiliations:
School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada;School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada;School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2008

Abstract

In this paper, we present a method for database schema matching: the problem of identifying the elements of two given schemas that correspond to each other. Schema matching is useful in e-commerce exchanges, in data integration and warehousing, and in semantic web applications. We first present two corpus-based methods: one determines the semantic similarity of two target words, and the other performs automatic word segmentation. We then present a name-based, element-level database schema matching method that exploits both. Our word similarity method uses pointwise mutual information (PMI) to sort lists of important neighbor words of the two target words; the words common to both lists are selected, and their PMI values are aggregated to calculate the relative similarity score. Our word segmentation method uses corpus type frequency information to choose the type with maximum length and frequency from "desegmented" text. If any non-matching portions of the text remain, it also applies a modified forward-backward matching technique that uses maximum length frequency and entropy rate. The schema matching method uses a single property (i.e., element name) and nevertheless achieves a measure score comparable to that of methods that use multiple properties (e.g., element name, text description, data instance, context description). It also uses normalized and modified versions of the longest common subsequence string matching algorithm, with weight factors to allow for a balanced combination. We validate our methods with experimental studies, the results of which suggest that these methods can be a useful addition to the set of existing methods.
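
To make two of the building blocks named in the abstract concrete, the following is a minimal Python sketch of pointwise mutual information computed from raw counts and of a length-normalized longest common subsequence score. It is an illustration under stated assumptions, not the authors' implementation: the function names, the count-based probability estimates, and the particular normalization (squared LCS length divided by the product of the two string lengths) are assumptions, and the abstract does not specify the modified LCS variants or the weight factors.

```python
import math


def pmi(joint_count, count_x, count_y, total):
    """Pointwise mutual information (base 2) from raw counts.

    Assumption: probabilities are estimated as count / total, where
    `total` is the number of observation windows in the corpus.
    """
    if joint_count == 0:
        return float("-inf")  # the two words never co-occur
    p_xy = joint_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a, b):
    """LCS length squared over the product of the string lengths.

    Assumption: this normalization is one common choice; it maps the
    score into [0, 1] so that it can be combined with other measures.
    """
    if not a or not b:
        return 0.0
    return lcs_length(a, b) ** 2 / (len(a) * len(b))
```

For two hypothetical element names such as `custname` and `customername`, `normalized_lcs` rewards the shared character sequence while penalizing the difference in length, which is the kind of string evidence a name-based matcher could weigh against a corpus-based semantic similarity score.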