Text similarity using Google tri-grams

Authors:
Aminul Islam;Evangelos Milios;Vlado Kešelj
Affiliations:
Faculty of Computer Science, Dalhousie University, Halifax, Canada;Faculty of Computer Science, Dalhousie University, Halifax, Canada;Faculty of Computer Science, Dalhousie University, Halifax, Canada
Venue:
Canadian AI'12 Proceedings of the 25th Canadian conference on Advances in Artificial Intelligence
Year:
2012

Abstract

The purpose of this paper is to propose an unsupervised approach for measuring the similarity of texts that can compete with supervised approaches. Finding the inherent properties of similarity between texts using a corpus in the form of a word n-gram data set is competitive with other text similarity techniques in terms of performance and practicality. Experimental results on a standard data set show that the proposed unsupervised method outperforms the state-of-the-art supervised method, and the improvement achieved is statistically significant at the 0.05 level. The approach is language-independent; it can be applied to other languages as long as n-grams are available.