A corpus-based approach to automatic compound extraction

Authors:
Keh-Yih Su; Ming-Wen Wu; Jing-Shin Chang
Affiliations:
National Tsing-Hua University, Hsinchu, Taiwan, R.O.C.; Behavior Design Corporation, Hsinchu, Taiwan, R.O.C.; National Tsing-Hua University, Hsinchu, Taiwan, R.O.C.
Venue:
ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics
Year:
1994

Citing 5
Cited 15

Probability and statistics
Word association norms, mutual information, and lexicography

Computational Linguistics
Acquisition of lexical information: from a large textual Italian corpus

COLING '90 Proceedings of the 13th conference on Computational linguistics - Volume 3
Surface grammatical analysis for the extraction of terminological noun phrases

COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 3

Translating collocations for bilingual lexicons: a statistical approach

Computational Linguistics
Termight: Coordinating Humans and Machines in Bilingual Terminology Acquisition

Machine Translation
Unit Completion for a Computer-aided Translation Typing System

Machine Translation
A Corpus-Based Learning Method of Compound Noun Indexing Rules for Korean

Information Retrieval
Unit completion for a computer-aided translation typing system

ANLC '00 Proceedings of the sixth conference on Applied natural language processing
Corpus-based learning of compound noun indexing

RANLPIR '00 Proceedings of the ACL-2000 workshop on Recent advances in natural language processing and information retrieval: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 11
Using co-occurrence statistics as an information source for partial parsing of Chinese

CLPW '00 Proceedings of the second workshop on Chinese language processing: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 12
Biological relation extraction and query answering from MEDLINE abstracts using ontology-based text mining

Data & Knowledge Engineering
Disyllabic Chinese Word Extraction Based on Character Thesaurus and Semantic Constraints in Word-Formation

TSD '08 Proceedings of the 11th international conference on Text, Speech and Dialogue
Research on Automatic Chinese Multi-word Term Extraction Based on Term Component

ICCPOL '09 Proceedings of the 22nd International Conference on Computer Processing of Oriental Languages. Language Technology for the Knowledge-based Economy
Research on Automatic Chinese Multi-word Term Extraction Based on Integration of Web Information and Term Component

WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 03
A human-computer collaboration approach to improve accuracy of an automated English scoring system

IUNLPBEA '10 Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications
Tree pattern expression for extracting information from syntactically parsed text corpora

Data Mining and Knowledge Discovery
An ontology-based pattern mining system for extracting information from biological texts

RSFDGrC'05 Proceedings of the 10th international conference on Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing - Volume Part II

Abstract

An automatic compound retrieval method is proposed to extract compounds within a text message. It uses n-gram mutual information, relative frequency count and parts of speech as the features for compound extraction. The problem is modeled as a two-class classification problem based on the distributional characteristics of n-gram tokens in the compound and the non-compound clusters. The recall and precision using the proposed approach are 96.2% and 48.2% for bigram compounds and 96.6% and 39.6% for trigram compounds for a testing corpus of 49,314 words. A significant cutdown in processing time has been observed.
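The features and evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`bigram_features`, `extract_candidates`, `precision_recall`), the plain threshold rule standing in for the paper's two-class classifier, the threshold values, and the definition of relative frequency as a bigram's share of all bigrams are assumptions made for illustration. Part-of-speech features are omitted.

```python
import math
from collections import Counter


def bigram_features(tokens):
    """Map each adjacent word pair to (mutual information, relative frequency).

    Assumption: relative frequency is the bigram count divided by the
    total number of bigrams; the paper may normalize differently.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    features = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        # Pointwise mutual information: log2 P(x, y) / (P(x) P(y)).
        mi = math.log2(p_xy / (p_x * p_y))
        features[(x, y)] = (mi, p_xy)
    return features


def extract_candidates(features, min_mi, min_rf):
    """Hypothetical stand-in classifier: keep bigrams above both thresholds."""
    return {
        bigram
        for bigram, (mi, rf) in features.items()
        if mi >= min_mi and rf >= min_rf
    }


def precision_recall(predicted, gold):
    """Precision and recall of predicted compounds against a gold set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    text = "machine translation is hard machine translation is fun"
    feats = bigram_features(text.split())
    found = extract_candidates(feats, min_mi=2.0, min_rf=0.2)
    print(sorted(found))
    print(precision_recall(found, {("machine", "translation")}))
```

A loose threshold keeps nearly every true compound but also admits many non-compounds, which is the same high-recall, lower-precision pattern the abstract reports.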