Comparison of numeral strings interpretation: rule-based and feature-based n-gram methods
AI'06 Proceedings of the 19th Australian joint conference on Artificial Intelligence: advances in Artificial Intelligence
This paper describes the interpretation of numerals and of strings that include numerals, that is, a number combined with words or symbols that indicate whether the string denotes a SPEED, a LENGTH, or some other category. The interpretation is done at three levels: lexical, syntactic, and semantic. The system employs three interpretation processes: a word trigram constructor with a tokeniser, a rule-based processor of number strings, and n-gram based disambiguation of meanings. We extracted numeral strings from 378 online newspaper articles and found that, on average, they made up about 2.2% of the words in the articles. We used 287 of these articles as unseen test data (3251 numeral strings) and the remaining 91 articles as sample data, which provided 886 numeral strings from which we manually extracted n-gram constraints for disambiguating the meanings of numeral strings. We implemented six disambiguation methods based on category frequency statistics collected from the sample data and on the number of word trigram constraints of each category. Precision for the six methods on the test data ranged from 85.6% to 87.9%.
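To make the idea of word trigram constraints concrete, the following is a minimal Python sketch of how a numeral string might be assigned a category from its left and right neighbours, with category frequency used as a fallback and tie-breaker. It is an illustration only and not one of the six methods evaluated in the paper: the function names, the `NUMERAL` pattern, the constraint table, and the frequency counts are all hypothetical, and the rule-based processing of number strings is omitted.

```python
import re
from collections import Counter

# Hypothetical pattern for a numeral token: digits with optional internal separators.
NUMERAL = re.compile(r"^\d+(?:[.,:/]\d+)*$")

# Hypothetical hand-written trigram constraints: category -> context words
# allowed immediately to the left or right of the numeral.
CONSTRAINTS = {
    "SPEED": {"right": {"km/h", "mph"}},
    "LENGTH": {"right": {"km", "m", "metres", "miles"}},
    "YEAR": {"left": {"in", "since"}},
}

# Invented category frequencies standing in for statistics from sample data.
CATEGORY_COUNTS = Counter({"QUANTITY": 10, "YEAR": 4, "LENGTH": 3, "SPEED": 1})


def tokenise(text):
    """Split on whitespace and strip surrounding punctuation (a simplification)."""
    tokens = [t.strip(".,;()\"'") for t in text.split()]
    return [t for t in tokens if t]


def numeral_trigrams(tokens):
    """Yield (left, numeral, right) for each numeral token, padding the edges."""
    padded = ["<s>"] + [t.lower() for t in tokens] + ["</s>"]
    for i in range(1, len(padded) - 1):
        if NUMERAL.match(padded[i]):
            yield padded[i - 1], padded[i], padded[i + 1]


def disambiguate(trigram, constraints=CONSTRAINTS, counts=CATEGORY_COUNTS):
    """Pick a category whose constraints match the context of the numeral."""
    left, _, right = trigram
    matches = [
        category
        for category, rule in constraints.items()
        if left in rule.get("left", ()) or right in rule.get("right", ())
    ]
    if not matches:
        # No constraint fires: fall back to the most frequent category.
        return counts.most_common(1)[0][0]
    # Several constraints fire: prefer the more frequent category.
    return max(matches, key=lambda category: counts[category])


def interpret(text):
    """Return (numeral, category) pairs for every numeral token in the text."""
    return [(tri[1], disambiguate(tri)) for tri in numeral_trigrams(tokenise(text))]


if __name__ == "__main__":
    print(interpret("The car reached 120 km/h after 3 km in 1999."))
```

In this sketch the constraints are matched on single neighbouring words; a fuller treatment would also weigh how many trigram constraints each category has, as the abstract indicates the evaluated methods do.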