Joining statistics with NLP for text categorization

Authors:
Paul S. Jacobs
Affiliations:
GE Research and Development Center, Schenectady, NY
Venue:
ANLC '92 Proceedings of the third conference on Applied natural language processing
Year:
1992

Abstract

Automatic news categorization systems have produced high accuracy, consistency, and flexibility using some natural language processing techniques. These knowledge-based categorization methods are more powerful and accurate than statistical techniques. However, the phrasal pre-processing and pattern matching methods that seem to work for categorization have the disadvantage of requiring a fair amount of knowledge-encoding by human beings. In addition, they work much better at certain tasks, such as identifying major events in texts, than at others, such as determining what sort of business or product is involved in a news event.

Statistical methods for categorization, on the other hand, are easy to implement and require little or no human customization. But they don't offer any of the benefits of natural language processing, such as the ability to identify relationships and enforce linguistic constraints.

Our approach has been to use statistics in the knowledge acquisition component of a linguistic pattern-based categorization system, using statistical methods, for example, to associate words with industries and identify phrases that carry information about businesses or products. Instead of replacing knowledge-based methods with statistics, statistical training replaces knowledge engineering. This has resulted in high accuracy, shorter customization time, and good prospects for the application of the statistical methods to problems in lexical acquisition.
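
The abstract does not say which statistic is used to associate words with industries. As an illustration only, the sketch below assumes pointwise mutual information (PMI) computed over document frequencies in a small labeled corpus. The function names, the `min_count` threshold, and the toy documents are hypothetical and are not taken from the paper; the words that score highest would then be candidates for a pattern-based lexicon, not a replacement for it.

```python
import math
from collections import Counter


def word_category_associations(docs, min_count=2):
    """Score (word, category) pairs by PMI over document frequencies.

    docs: iterable of (category, tokens) pairs.
    min_count: words seen in fewer documents are skipped (assumed threshold).
    """
    word_cat = Counter()  # documents of a category that contain the word
    word_tot = Counter()  # documents that contain the word
    cat_tot = Counter()   # documents per category
    n_docs = 0
    for category, tokens in docs:
        n_docs += 1
        cat_tot[category] += 1
        for word in set(tokens):  # count each word once per document
            word_tot[word] += 1
            word_cat[(word, category)] += 1

    scores = {}
    for (word, category), joint in word_cat.items():
        if word_tot[word] < min_count:
            continue
        # PMI = log2( P(word, category) / (P(word) * P(category)) )
        ratio = joint * n_docs / (word_tot[word] * cat_tot[category])
        scores[(word, category)] = math.log2(ratio)
    return scores


def top_words(scores, category, k=3):
    """Return up to k words most strongly associated with a category."""
    ranked = sorted(
        ((s, w) for (w, c), s in scores.items() if c == category),
        reverse=True,
    )
    return [w for _, w in ranked[:k]]


# Toy corpus, invented purely for illustration.
TOY_DOCS = [
    ("airlines", "the carrier added flights and new jets".split()),
    ("airlines", "the carrier cut flights after the merger".split()),
    ("banking", "the bank raised loans after the merger".split()),
    ("banking", "the bank reported new loans".split()),
]

TOY_SCORES = word_category_associations(TOY_DOCS)
```

With this toy corpus, words that occur only in one category (such as "carrier" for airlines) receive a positive score, while words spread evenly across categories (such as "the") score zero, which is the behavior one would want before handing candidate terms to the pattern matcher.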