Key semantics extraction by dependency tree mining

  • Authors:
  • Satoshi Morinaga;Hiroki Arimura;Takahiro Ikeda;Yosuke Sakao;Susumu Akamine

  • Affiliations:
  • NEC Corporation, Kawasaki, Kanagawa, Japan;Hokkaido University, Sapporo, Hokkaido, Japan;NEC Corporation, Kawasaki, Kanagawa, Japan;NEC Corporation, Kawasaki, Kanagawa, Japan;NEC Corporation, Kawasaki, Kanagawa, Japan

  • Venue:
  • Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
  • Year:
  • 2005

Abstract

We propose a new text mining system that extracts characteristic contents from given documents. We define Key semantics as characteristic sub-structures of syntactic dependencies in the given documents, and consider the following three tasks in this paper: 1) Key semantics extraction: extracting characteristic syntactic dependency structures not only as ordered trees but also as unordered trees and free trees; 2) Redundancy reduction: deleting from the extraction result redundant dependency structures, such as sub-structures of other structures or structures equivalent to them; and 3) Phrase/sentence reconstruction: generating a natural-language phrase or sentence corresponding to each extracted structure.

Our system is a combination of natural language processing techniques and tree mining techniques. The system consists of the following five units: 1) syntactic dependency analysis unit, 2) input filters, 3) characteristic ordered subtree extraction unit, 4) output filters, and 5) phrase/sentence reconstruction unit. Although the third unit always extracts ordered trees, the overall behavior of the system can be switched among the extraction of ordered trees, unordered trees, or free trees depending on which of the input filters are applied in the second step. The output filters delete redundant trees from the extraction result for efficient knowledge discovery. Finally, phrases or sentences corresponding to the extracted subtrees are reconstructed by utilizing the input documents.

We demonstrate the validity of our system by showing experimental results using real data collected at a help desk and the TDT pilot corpus.
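
The abstract does not say how the input and output filters are implemented. As a rough illustration only, one way an input filter could let an ordered-tree miner behave like an unordered-tree miner is to rewrite every tree into a canonical sibling order before mining, and the same canonical form could let an output filter drop equivalent structures. The sketch below is in Python; the tuple-based tree representation and the names `canonicalize` and `drop_equivalent` are assumptions made for this example, not the authors' method.

```python
# Hypothetical representation: a tree is (label, children), where
# children is a tuple of trees. This is an assumption for the sketch,
# not the data structure used in the paper.

def canonicalize(tree):
    """Recursively sort children so that sibling order no longer matters."""
    label, children = tree
    return (label, tuple(sorted(canonicalize(child) for child in children)))

def drop_equivalent(patterns):
    """Keep only the first pattern of each group equal up to sibling order."""
    seen = set()
    kept = []
    for pattern in patterns:
        key = canonicalize(pattern)
        if key not in seen:
            seen.add(key)
            kept.append(pattern)
    return kept

# Two toy dependency structures that differ only in sibling order.
a = ("install", (("printer", ()), ("driver", ())))
b = ("install", (("driver", ()), ("printer", ())))
```

Here `a` and `b` are different as ordered trees but share one canonical form, so `drop_equivalent([a, b])` keeps only `a`. Free trees and sub-structure redundancy would need further handling that this sketch does not attempt.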