Semantic duplicate identification with parsing and machine learning

Authors:
Sven Hartrumpf;Tim vor der Brück;Christian Eichhorn
Affiliations:
Intelligent Information and Communication Systems, FernUniversität in Hagen, Hagen, Germany;Intelligent Information and Communication Systems, FernUniversität in Hagen, Hagen, Germany;Lehrstuhl Informatik 1, Technische Universität Dortmund, Dortmund, Germany
Venue:
TSD'10 Proceedings of the 13th international conference on Text, speech and dialogue
Year:
2010

Abstract

Identifying duplicate texts is important in many areas, such as plagiarism detection, information retrieval, text summarization, and question answering. Current approaches are mostly surface-oriented (or use only shallow syntactic representations) and treat each text merely as a list of tokens. In this work, however, we describe a deep, semantically oriented method based on semantic networks that are derived by a syntactico-semantic parser. For each sentence of a given base text, identical or similar semantic networks are retrieved efficiently by using a specialized index. To detect many kinds of paraphrases, the semantic networks of a candidate text are varied by applying inferences: lexico-semantic relations, relation axioms, and meaning postulates. Important phenomena occurring in difficult duplicates are discussed. The deep approach profits from background knowledge, whose acquisition from corpora is explained briefly. The deep duplicate recognizer is combined with two shallow duplicate recognizers in order to guarantee high recall for texts that are not fully parsable. The evaluation shows that the combined approach preserves recall and considerably increases precision in comparison to traditional shallow methods.
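
The combination of a deep recognizer with a shallow fallback can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the word-bigram Jaccard measure stands in for a generic surface-oriented recognizer, the deep score is passed in as a plain number (or None when a text could not be parsed), and the fixed fallback rule replaces whatever combination the authors actually use, which the title suggests is learned.

```python
import re
from typing import Optional


def tokens(text: str) -> list[str]:
    """Lowercase word tokens; a deliberately crude, hypothetical tokenizer."""
    return re.findall(r"\w+", text.lower())


def shallow_score(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of word n-gram sets (a surface-oriented measure)."""
    def ngrams(toks: list[str]) -> set[tuple[str, ...]]:
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    ga, gb = ngrams(tokens(a)), ngrams(tokens(b))
    if not ga or not gb:
        # Texts shorter than n tokens yield no n-grams; treat as no overlap.
        return 0.0
    return len(ga & gb) / len(ga | gb)


def combined_score(deep: Optional[float], shallow: float) -> float:
    """Assumed fallback rule: trust the deep score when a parse exists,
    otherwise use the shallow score so unparsable texts are still covered."""
    return shallow if deep is None else deep


if __name__ == "__main__":
    s = shallow_score("the cat sat down", "the cat sat up")
    print(s)                          # shallow similarity of the two texts
    print(combined_score(None, s))    # no parse available: shallow score is used
    print(combined_score(0.9, s))     # parse available: deep score is used
```

The fallback only shows why shallow recognizers protect recall on texts that are not fully parsable; it says nothing about how the reported precision gain is obtained.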