Heuristic and rule-based knowledge acquisition: classification of numeral strings in text

Authors:
Kyongho Min;Stephen MacDonell;Yoo-Jin Moon
Affiliations:
School of Computer and Information Sciences, Auckland University of Technology, New Zealand;School of Computer and Information Sciences, Auckland University of Technology, New Zealand;Department of Management Information Systems, Hankook University of Foreign Studies, Korea
Venue:
PKAW'06 Proceedings of the 9th Pacific Rim Knowledge Acquisition international conference on Advances in Knowledge Acquisition and Management
Year:
2006

Abstract

This paper describes the rule-based classification of numeral strings, that is, numerals and strings that include numerals, composed of a number and one or more semantic units that indicate a SPEED, NUMBER, or other measure. Classification operates at three levels: morphological, syntactic, and semantic. The approach employs three interpretation processes: word trigram construction with a tokeniser, rule-based processing of number strings, and n-gram based classification. We extracted numeral strings from 378 online newspaper articles and found that, on average, they comprised about 2.2% of the words in the articles. N-gram rules for disambiguating the meanings of the numeral strings were extracted manually from a training set of 886 numeral strings, and the approach was tested on the remaining 3251 strings. We implemented two heuristic disambiguation methods based on each category's frequency statistics collected from the sample data; their precision ratios were 86.8% and 86.3% respectively. This paper focuses on the acquisition and performance of the different types of rules applied to numeral string classification.
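
The three interpretation processes named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the whitespace tokeniser, the `UNIT_RULES` table, the padding symbols, and the fallback to the most frequent category are all assumptions made for illustration. Only the category labels SPEED and NUMBER come from the abstract.

```python
import re
from collections import Counter

HAS_DIGIT = re.compile(r"\d")

# Hypothetical unit-to-category rules; the paper's actual rules are not given here.
UNIT_RULES = {"km/h": "SPEED", "mph": "SPEED"}


def numeral_trigrams(text):
    """Return (left, numeral, right) word trigrams centred on tokens containing a digit.

    Assumes a plain whitespace tokeniser and pads the sentence edges.
    """
    padded = ["<s>"] + text.split() + ["</s>"]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
        if HAS_DIGIT.search(padded[i])
    ]


def classify(trigram, category_counts):
    """Apply a unit rule to the right-hand context word, else fall back on frequency.

    category_counts is a Counter of category frequencies from sample data
    (an assumed stand-in for the frequency-based heuristics in the abstract).
    """
    _left, _numeral, right = trigram
    unit = right.lower().strip(".,;:")
    if unit in UNIT_RULES:
        return UNIT_RULES[unit]
    return category_counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up frequencies, for demonstration only.
    counts = Counter({"NUMBER": 5, "SPEED": 2})
    for tri in numeral_trigrams("The car reached 120 km/h on 3 laps ."):
        print(tri, classify(tri, counts))
```

In this sketch the trigram supplies the context for the rule lookup, and the frequency table is consulted only when no rule fires, which mirrors the general idea of a rule stage backed by a statistical heuristic.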