Splitting noun compounds via monolingual and bilingual paraphrasing: a study on Japanese katakana words

Authors:
Nobuhiro Kaji; Masaru Kitsuregawa
Affiliations:
University of Tokyo, Tokyo, Japan; University of Tokyo, Tokyo, Japan
Venue:
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Year:
2011


Abstract

Word boundaries within noun compounds are not marked by white spaces in a number of languages, unlike in English, and it is beneficial for various NLP applications to split such noun compounds. In the case of Japanese, noun compounds made up of katakana words (i.e., transliterated foreign words) are particularly difficult to split, because katakana words are highly productive and are often out-of-vocabulary. To overcome this difficulty, we propose using monolingual and bilingual paraphrases of katakana noun compounds for identifying word boundaries. Experiments demonstrated that splitting accuracy is substantially improved by extracting such paraphrases from unlabeled textual data, the Web in our case, and then using that information for constructing splitting models.
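
As a rough illustration of how paraphrases might signal word boundaries, the sketch below scores each candidate split of a katakana compound by how often the same string is seen with an explicit separator at the proposed boundaries. Everything here is an assumption made for illustration and not the authors' model: the separator set (a middle dot and the particle の), the helper names `candidate_splits`, `paraphrase_score`, and `split_compound`, and the made-up counts that stand in for Web-derived statistics. Only a monolingual cue is shown; a bilingual cue could plausibly be added to the same score.

```python
from itertools import combinations

def candidate_splits(compound, max_parts=3):
    """Yield every way to cut `compound` into 1..max_parts contiguous pieces."""
    n = len(compound)
    for k in range(max_parts):  # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [compound[i:j] for i, j in zip(bounds, bounds[1:])]

def paraphrase_score(parts, paraphrase_counts, separators=("・", "の")):
    """Sum the counts of paraphrases that place a separator at each boundary.

    The separator set is a hypothetical choice for this sketch.
    """
    if len(parts) == 1:
        return 0  # the unsplit compound has no paraphrase evidence
    return sum(paraphrase_counts.get(sep.join(parts), 0) for sep in separators)

def split_compound(compound, paraphrase_counts):
    """Return the best-supported split; keep the compound whole if none is supported."""
    # max() returns the first maximal item, and the unsplit candidate comes first,
    # so the compound stays intact when no paraphrase is observed.
    return max(
        candidate_splits(compound),
        key=lambda parts: paraphrase_score(parts, paraphrase_counts),
    )

# Made-up counts standing in for statistics gathered from unlabeled text.
counts = {"アンチョビ・パスタ": 12, "アンチョ・ビパスタ": 1}
result = split_compound("アンチョビパスタ", counts)
```

Exhaustive enumeration is only practical for short strings; it is used here to keep the sketch small, not as a recommendation.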