UKP: computing semantic textual similarity by combining multiple content similarity measures

Authors:
Daniel Bär;Chris Biemann;Iryna Gurevych;Torsten Zesch
Affiliations:
Ubiquitous Knowledge Processing Lab (UKP-TUDA), Technische Universität Darmstadt;Ubiquitous Knowledge Processing Lab (UKP-TUDA), Technische Universität Darmstadt;Ubiquitous Knowledge Processing Lab (UKP-TUDA), Technische Universität Darmstadt and Ubiquitous Knowledge Processing Lab (UKP-DIPF) German Institute for Educational Research and Educational I ...;Ubiquitous Knowledge Processing Lab (UKP-TUDA), Technische Universität Darmstadt and Ubiquitous Knowledge Processing Lab (UKP-DIPF) German Institute for Educational Research and Educational I ...
Venue:
SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
Year:
2012

Abstract

We present the UKP system, which performed best in the Semantic Textual Similarity (STS) task at SemEval-2012 on two of the three evaluation metrics. It uses a simple log-linear regression model, fitted on the task's training data, to combine multiple text similarity measures of varying complexity. These range from simple character and word n-grams and common subsequences to complex features such as Explicit Semantic Analysis vector comparisons and the aggregation of word similarity based on lexical-semantic resources. In addition, we employ a lexical substitution system and statistical machine translation to introduce additional lexemes, which alleviates lexical gaps. Our final models, one per dataset, each consist of a log-linear combination of about 20 features, selected from the more than 300 features implemented.
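
To make the idea of combining simple content similarity measures more concrete, the following is a minimal sketch, not the UKP system itself. It computes three of the simpler measure types named in the abstract (character n-gram overlap, word n-gram overlap, and a longest common subsequence ratio) and combines them with a linear model over log-transformed feature values. The function names, the Jaccard normalization, the log(1 + f) transform, and the example weights are all assumptions made for illustration. In a real setting the weights and bias would be estimated by regression on training data.

```python
import math


def char_ngrams(text, n):
    """Set of character n-grams of the lowercased text."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def word_ngrams(text, n):
    """Set of word n-grams (as tuples) of the lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def features(t1, t2):
    """Three illustrative similarity features, each in [0, 1]."""
    w1, w2 = t1.lower().split(), t2.lower().split()
    return [
        jaccard(char_ngrams(t1, 3), char_ngrams(t2, 3)),  # character trigrams
        jaccard(word_ngrams(t1, 1), word_ngrams(t2, 1)),  # word unigrams
        lcs_length(w1, w2) / max(len(w1), len(w2), 1),    # word-level LCS ratio
    ]


def log_linear_score(feats, weights, bias):
    """Linear model over log-transformed features.

    Assumption: log(1 + f) is used so that a feature value of 0 is defined.
    """
    return bias + sum(w * math.log(1.0 + f) for w, f in zip(weights, feats))


if __name__ == "__main__":
    f = features("A cat sat on the mat", "A dog sat on the mat")
    # Hypothetical weights and bias; real values would come from training.
    print(f, log_linear_score(f, weights=[1.5, 2.0, 1.0], bias=0.2))
```

Each feature is kept in the range 0 to 1 so that the log transform behaves uniformly across measures. Adding a further measure only requires appending another value to the list returned by `features` and a matching weight.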