A graph approach to measuring text distance

  • Authors:
  • Vivian Yuen-Chong Tsang

  • Affiliations:
  • University of Toronto (Canada)

  • Venue:
  • A graph approach to measuring text distance
  • Year:
  • 2008


Abstract

Text comparison is a key step in many natural language processing (NLP) applications in which texts can be classified on the basis of their semantic distance (how similar or different the texts are). For example, comparing the local context of an ambiguous word with that of a known word can help identify the sense of the ambiguous word. Typically, a distributional measure is used to capture the implicit semantic distance between two pieces of text. In this thesis, we introduce an alternative method that measures the semantic distance between texts as a combination of distributional information and relational/ontological knowledge. Specifically, we propose a novel distance measure within a network-flow formalism that combines these two distinct components so that they are not treated as separate and orthogonal pieces of information. First, we represent each text as a collection of frequency-weighted concepts within a relational thesaurus. Then, we use a network-flow method that provides an efficient way of measuring the semantic distance between two texts by taking advantage of the inherently graphical structure of an ontology. We evaluate our method on a variety of NLP tasks. In our task-based evaluation, we find that our method performs well on two of three tasks. We introduce a novel measure intended to capture how well our network-flow method performs on a dataset (represented as a collection of frequency-weighted concepts). In our analysis, we find that an integrated approach is more effective than a purely distributional or purely graphical analysis in explaining the performance inconsistency. Finally, we address a complexity issue that arises from the overhead required to incorporate more sophisticated concept-to-concept distances into the network-flow framework. We propose a graph transformation method that generates a pared-down network requiring less time to process. Our analysis indicates that the new method achieves a significant speed improvement without seriously hampering performance.
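The abstract does not give the exact formulation, so the following is only a minimal sketch of the general idea, not the thesis's actual method. It assumes a toy undirected ontology with unit-cost edges, uses hop count as the concept-to-concept distance, and treats the text distance as the cheapest way to move the normalized concept frequencies of one text onto those of the other (a transportation problem solved by successive cheapest augmenting paths). All names (`flow_distance`, `hop_counts`, the toy taxonomy) are hypothetical.

```python
from collections import deque
from fractions import Fraction


def hop_counts(edges, source):
    """Hop distance from `source` to every reachable concept (breadth-first search)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist, queue = {source: 0}, deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def normalize(counts):
    """Turn integer concept counts into exact relative frequencies summing to 1."""
    total = sum(counts.values())
    return {c: Fraction(n, total) for c, n in counts.items() if n > 0}


def flow_distance(edges, counts_a, counts_b):
    """Minimum cost of shipping text A's concept mass onto text B's (assumed setup)."""
    supply, demand = normalize(counts_a), normalize(counts_b)
    sources, sinks = list(supply), list(demand)
    n = len(sources) + len(sinks) + 2
    s, t = 0, n - 1                      # super source and super sink
    graph = [[] for _ in range(n)]       # arc = [head, capacity, cost, reverse index]

    def add_arc(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, Fraction(0), -cost, len(graph[u]) - 1])

    first_sink = 1 + len(sources)
    for i, a in enumerate(sources, start=1):
        add_arc(s, i, supply[a], 0)
        hops = hop_counts(edges, a)
        for j, b in enumerate(sinks, start=first_sink):
            if b in hops:                # only connect concepts that are reachable
                add_arc(i, j, Fraction(1), hops[b])
    for j, b in enumerate(sinks, start=first_sink):
        add_arc(j, t, demand[b], 0)

    sent, cost_total = Fraction(0), Fraction(0)
    while True:
        # Bellman-Ford: cheapest path from s to t in the residual network.
        dist, prev = [None] * n, [None] * n
        dist[s] = 0
        for _ in range(n - 1):
            changed = False
            for u in range(n):
                if dist[u] is None:
                    continue
                for k, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > 0 and (dist[v] is None or dist[u] + cost < dist[v]):
                        dist[v], prev[v], changed = dist[u] + cost, (u, k), True
            if not changed:
                break
        if dist[t] is None:
            break                        # no augmenting path remains
        # Find the bottleneck capacity along the path, then push that much flow.
        push, v = None, t
        while v != s:
            u, k = prev[v]
            push = graph[u][k][1] if push is None else min(push, graph[u][k][1])
            v = u
        v = t
        while v != s:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        sent += push
        cost_total += push * dist[t]
    if sent != 1:
        raise ValueError("some concept mass could not be routed (disconnected graph)")
    return cost_total


# Toy taxonomy and two tiny "texts" given as concept counts (illustrative data only).
toy_edges = [("entity", "animal"), ("entity", "plant"),
             ("animal", "dog"), ("animal", "cat"), ("plant", "tree")]
print(flow_distance(toy_edges, {"dog": 2, "cat": 1}, {"cat": 1, "tree": 1}))
```

Exact fractions are used instead of floats so that the routed mass can be checked against 1 without tolerance issues. A real system would replace the hop-count cost with a richer concept-to-concept distance, which is the overhead the graph transformation described in the abstract is meant to reduce.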