Using a broad-coverage parser for word-breaking in Japanese

Authors:
Hisami Suzuki; Chris Brockett; Gary Kacmarcik
Affiliations:
Microsoft Research, Redmond, WA; Microsoft Research, Redmond, WA; Microsoft Research, Redmond, WA
Venue:
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
Year:
2000

Abstract

We describe a method of word segmentation in Japanese in which a broad-coverage parser selects the best word sequence while producing a syntactic analysis. This technique differs substantially from traditional statistics- or heuristics-based models, which attempt to select the best word sequence before handing it to the syntactic component. By dividing the task of finding the best word sequence into the identification of words (in the word-breaking component) and the selection of the best sequence (a by-product of parsing), we have been able to simplify the task of each component and achieve high accuracy over a wide variety of data. The word-breaking accuracy of our system is currently around 97-98%.
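
The division of labor described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the lexicon, the part-of-speech tags, the function names, and the regular expression that stands in for a broad-coverage parser are all hypothetical and chosen only for illustration. The first step proposes every dictionary word at every position without choosing among them; the second step keeps only the word sequences that the toy "grammar" accepts.

```python
import re
from typing import Dict, List, Tuple

Word = Tuple[str, str]  # (surface form, part-of-speech tag)

# Hypothetical toy lexicon; entries and tags are illustrative only.
LEXICON: Dict[str, str] = {
    "東": "NOUN",
    "東京": "NOUN",
    "京都": "NOUN",
    "都": "SUFFIX",
    "に": "PARTICLE",
    "住む": "VERB",
}

# Stand-in for a parser (an assumption, not the paper's grammar):
# a sequence "parses" if its tag string matches this pattern.
TOY_GRAMMAR = re.compile(r"NOUN( SUFFIX)? PARTICLE VERB")


def find_words(text: str) -> Dict[int, List[Tuple[int, Word]]]:
    """Word-breaking step: record every lexicon word at every start offset."""
    lattice: Dict[int, List[Tuple[int, Word]]] = {}
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            surface = text[start:end]
            if surface in LEXICON:
                lattice.setdefault(start, []).append(
                    (end, (surface, LEXICON[surface]))
                )
    return lattice


def all_sequences(
    text: str, lattice: Dict[int, List[Tuple[int, Word]]], start: int = 0
) -> List[List[Word]]:
    """Enumerate every word sequence in the lattice that covers the text."""
    if start == len(text):
        return [[]]
    sequences: List[List[Word]] = []
    for end, word in lattice.get(start, []):
        for rest in all_sequences(text, lattice, end):
            sequences.append([word] + rest)
    return sequences


def select_by_parsing(text: str) -> List[List[Word]]:
    """Selection step: keep only sequences the toy grammar accepts."""
    lattice = find_words(text)
    return [
        seq
        for seq in all_sequences(text, lattice)
        if TOY_GRAMMAR.fullmatch(" ".join(pos for _, pos in seq))
    ]


if __name__ == "__main__":
    for seq in select_by_parsing("東京都に住む"):
        print(" / ".join(surface for surface, _ in seq))
```

In this toy setting the word-breaking step yields two candidate sequences for 東京都に住む, one beginning 東 / 京都 and one beginning 東京 / 都, and only the latter satisfies the stand-in grammar. A real parser would make this choice on the basis of a full syntactic analysis rather than a flat tag pattern.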