Estimating sentence types in computer related new product bulletins using a decision tree

  • Authors:
  • Tokunaga Hidekazu;Atlam El-Sayed;Fuketa Masao;Morita Kazuhiro;Tsuda Kazuhiko;Jun-ichi Aoe

  • Affiliations:
  • Department of Information Science and Intelligent Systems, University of Tokushima, Tokushima 770-8506, Japan;Department of Information Science and Intelligent Systems, University of Tokushima, Tokushima 770-8506, Japan and Department of Statistics and Computer science, Faculty of Science, Tanta Universit ...;Department of Information Science and Intelligent Systems, University of Tokushima, Tokushima 770-8506, Japan;Department of Information Science and Intelligent Systems, University of Tokushima, Tokushima 770-8506, Japan;Department of Information Science and Intelligent Systems, University of Tokushima, Tokushima 770-8506, Japan;Department of Information Science and Intelligent Systems, University of Tokushima, Tokushima 770-8506, Japan

  • Venue:
  • Information Sciences—Informatics and Computer Science: An International Journal
  • Year:
  • 2004

Abstract

Numerous news articles about new computer-related products are available on the Internet. Information extraction and automatic text summarization are necessary for the effective use of these articles. The present paper shows that estimating four sentence types (HATSUBAI [sales], SHIYO [specifications], KOZO [structure], KINO [function]) is effective as preprocessing for information extraction and automatic text summarization. Moreover, this paper introduces a technique for estimating these sentence types using a decision tree. As attributes, this decision tree uses not proper nouns or technical terms but verbal nouns and numeratives at the end of sentences, as well as other general words. Since sub-setting attribute values is important for creating the decision tree, the sub-setting procedure of the representative decision tree algorithm C4.5 was revised: the gain ratio criterion was changed, and the hill climbing method was replaced with a genetic algorithm. A decision tree was created from 1539 sentences of learning data, and 299 sentences of test data were then classified by the tree. The number of incorrectly estimated sentences was 81 when C4.5 was used without revision, but this number decreased to 70 after the sub-setting was revised.
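
To illustrate why sub-setting attribute values matters, the following is a minimal Python sketch of the standard C4.5 gain ratio evaluated for one attribute under a given grouping of its values. It is not the revised criterion or the genetic algorithm from the paper, whose details the abstract does not give. The function names, the `grouping` parameter, and the toy attribute values and labels are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, grouping):
    """Standard C4.5 gain ratio for one attribute after sub-setting.

    values   -- attribute value observed for each sentence
    labels   -- sentence type of each sentence
    grouping -- dict mapping each attribute value to a subset name
                (hypothetical representation of a sub-setting)
    """
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(grouping[value], []).append(label)

    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder

    # Split information penalizes splits with many small branches.
    split_info = -sum(
        len(s) / total * log2(len(s) / total) for s in subsets.values()
    )
    return gain / split_info if split_info > 0 else 0.0


# Invented toy data: one sentence-final attribute with three values.
values = ["a", "a", "b", "b", "c", "c"]
labels = ["HATSUBAI", "HATSUBAI", "HATSUBAI", "HATSUBAI", "KINO", "KINO"]

one_branch_per_value = {"a": "a", "b": "b", "c": "c"}
merged_a_and_b = {"a": "ab", "b": "ab", "c": "c"}

print(gain_ratio(values, labels, one_branch_per_value))
print(gain_ratio(values, labels, merged_a_and_b))
```

In this toy case both groupings separate the classes perfectly, but merging the two values that predict the same sentence type yields fewer branches and therefore a higher gain ratio. A search procedure such as hill climbing or a genetic algorithm would explore many candidate groupings of this kind and keep the best-scoring one.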