Robust segmentation of Japanese text into a lattice for parsing

Authors:
Gary Kacmarcik;Chris Brockett;Hisami Suzuki
Affiliations:
Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA
Venue:
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1
Year:
2000

Citing 5
Cited 2

Natural Language Processing: The Plnlp Approach

Natural Language Processing: The Plnlp Approach
MindNet: acquiring and structuring semantic information from text

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Japanese morphological analyzer using word co-occurrence: JTAG

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
A stochastic Japanese morphological analyzer using a forward-DP backward-A* N-best search algorithm

COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 1
Using a broad-coverage parser for word-breaking in Japanese

COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2

Using a broad-coverage parser for word-breaking in Japanese

COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
English-Japanese example-based machine translation using abstract linguistic representations

COLING-MTIA '02 Proceedings of the 2002 COLING workshop on Machine translation in Asia - Volume 16

Quantified Score

Hi-index	0.00

Visualization

Abstract

We describe a segmentation component that utilizes minimal syntactic knowledge to produce a lattice of word candidates for a broad coverage Japanese NL parser. The segmenter is a finite state morphological analyzer and text normalizer designed to handle the orthographic variations characteristic of written Japanese, including alternate spellings, script variation, vowel extensions and word-internal parenthetical material. This architecture differs from conventional Japanese wordbreakers in that it does not attempt to simultaneously attack the problems of identifying segmentation candidates and choosing the most probable analysis. To minimize duplication of effort between components and to give the segmenter greater freedom to address orthography issues, the task of choosing the best analysis is handled by the parser, which has access to a much richer set of linguistic information. By maximizing recall in the segmenter and allowing a precision of 34.7%, our parser currently achieves a breaking accuracy of ~97% over a wide variety of corpora.