Context representation using word sequences extracted from a news corpus

  • Authors:
  • Hiroshi Sekiya, Takeshi Kondo, Makoto Hashimoto, Tomohiro Takagi

  • Affiliations:
  • Department of Computer Science, Meiji University, 1-1-1, Higashi Mita, Tama-ku, Kawasaki-shi, Kanagawa 214-8571, Japan (all authors)

  • Venue:
  • International Journal of Approximate Reasoning
  • Year:
  • 2007


Abstract

Ambiguity is one of the most difficult problems in handling word senses with computers. Word senses vary dynamically depending on context, so we need to specify the context in order to identify them. However, context itself varies with the specificity and viewpoint of the topic. Consequently, depending on such varying contexts, people attend to only part of the attributes of the entity indicated by the word's dictionary definition. Handling word senses on a computer can be split into two steps: the first is to determine all the different senses of every word, and the second is to assign each occurrence of a word to the appropriate sense. In this paper, we propose a method focusing on the first step: generating atomic conceptual fuzzy sets from word sequences. Both the contexts identified by word sequences and the atomic conceptual fuzzy sets that express the word senses related to those contexts can then be shown concretely. We used the Reuters collection, consisting of 800,000 news articles, and automatically extracted word sequences and generated fuzzy sets using the confabulation model (a prediction method similar to the n-gram model) and five statistical measures as relations. We compared the compatibility between the confabulation model and each measure, and found that cogency and mutual information were the most effective in representing context. We demonstrate the usefulness of the word sequences for identifying context.
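The two measures the abstract singles out can be illustrated on bigram counts. The sketch below is not the authors' implementation: the toy corpus is hypothetical, cogency is taken in its usual confabulation-theory sense as the conditional probability P(a | b) of the antecedent word given the consequent, and mutual information is computed pointwise over bigrams; the paper's exact definitions and smoothing may differ.

```python
from collections import Counter
from math import log2

# Toy corpus standing in for the Reuters news articles (hypothetical data).
corpus = [
    "interest rate rise hits stock market".split(),
    "central bank raises interest rate".split(),
    "stock market falls on rate fears".split(),
]

# Count unigrams and adjacent word pairs (bigrams).
unigrams = Counter(w for doc in corpus for w in doc)
bigrams = Counter((a, b) for doc in corpus for a, b in zip(doc, doc[1:]))
n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def cogency(a, b):
    """P(a | b): how strongly seeing word b supports its predecessor a.
    Uses the unigram count of b as an approximation of its count as a
    bigram consequent (assumption made for this sketch)."""
    return bigrams[(a, b)] / unigrams[b]

def pmi(a, b):
    """Pointwise mutual information of the bigram (a, b): positive when
    the pair co-occurs more often than chance would predict."""
    p_ab = bigrams[(a, b)] / n_bi
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    return log2(p_ab / (p_a * p_b))
```

For example, `cogency("interest", "rate")` is 2/3 on this toy corpus (the bigram occurs twice, "rate" three times), and `pmi("interest", "rate")` is positive, marking the pair as a candidate word sequence for context representation.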