Automatic semantic sequence extraction from unrestricted non-tagged texts

Authors:
Shiho Nobesawa; Hiroaki Saito; Masakazu Nakanishi
Affiliations:
Keio University, Yokohama, Japan; Keio University, Yokohama, Japan; Keio University, Yokohama, Japan
Venue:
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1
Year:
2000

Abstract

Morphological processing, syntactic parsing and other useful tools have been proposed in the field of natural language processing (NLP). Many of these tools take dictionary-based approaches, so they often perform poorly on texts written in casual wording or texts that contain many domain-specific terms, because their dictionaries lack the necessary vocabulary. In this paper we propose a simple method to obtain domain-specific sequences from unrestricted texts using statistical information only. This method is language-independent. We conducted sequence-extraction experiments on email texts in Japanese and succeeded in extracting significant semantic sequences from the test corpus. We also ran ChaSen, a Japanese dictionary-based morphological parser, on the test corpus and examined how well our system extracts semantic sequences that ChaSen did not recognize. Our system correctly detected 69.06% of the unknown words.
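
The abstract does not spell out which statistics the method uses. As a minimal sketch of the general idea of statistics-only, dictionary-free sequence extraction (not the authors' actual algorithm), the Python below assumes pointwise mutual information between adjacent characters as the association score and cuts the text wherever that score drops below a threshold. The function names `adjacent_pmi` and `extract_sequences` and the default threshold of 0.0 are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Build a PMI scorer for adjacent character pairs from untagged lines.

    Assumption: PMI over adjacent characters stands in for whatever
    statistic the paper uses; the abstract does not specify it.
    """
    unigrams = Counter()
    bigrams = Counter()
    for line in corpus:
        unigrams.update(line)                 # single-character counts
        bigrams.update(zip(line, line[1:]))   # adjacent-pair counts
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(a, b):
        if bigrams[(a, b)] == 0:
            # Pair never observed together: treat as maximally unassociated.
            return float("-inf")
        p_ab = bigrams[(a, b)] / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        return math.log2(p_ab / (p_a * p_b))

    return pmi


def extract_sequences(text, pmi, threshold=0.0):
    """Split text into sequences, cutting where adjacent PMI < threshold."""
    if not text:
        return []
    sequences = []
    current = text[0]
    for a, b in zip(text, text[1:]):
        if pmi(a, b) >= threshold:
            current += b                      # strongly linked: extend
        else:
            sequences.append(current)         # weak link: close the sequence
            current = b
    sequences.append(current)
    return sequences


# Tiny illustration: "ab" and "cd" each occur as units in the corpus.
pmi = adjacent_pmi(["ab", "cd"])
print(extract_sequences("abcd", pmi))
```

Because the scorer works on raw characters and needs no dictionary or word segmentation, the same code applies unchanged to Japanese text, which is the sense in which such an approach is language-independent. In practice the threshold would have to be tuned on the target corpus.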