Information extraction as a basis for high-precision text classification

Authors:
Ellen Riloff; Wendy Lehnert
Affiliations:
Univ. of Massachusetts, Amherst; Univ. of Massachusetts, Amherst
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
1994

Abstract

We describe an approach to text classification that represents a compromise between traditional word-based techniques and in-depth natural language processing. Our approach uses a natural language processing task called “information extraction” as a basis for high-precision text classification. We present three algorithms that use varying amounts of extracted information to classify texts. The relevancy signatures algorithm uses linguistic phrases; the augmented relevancy signatures algorithm uses phrases and local context; and the case-based text classification algorithm uses larger pieces of context. Relevant phrases and contexts are acquired automatically using a training corpus. We evaluate the algorithms on the basis of two test sets from the MUC-4 corpus. All three algorithms achieved high precision on both test sets, with the augmented relevancy signatures algorithm and the case-based algorithm reaching 100% precision with over 60% recall on one set. Additionally, we compare the algorithms on a larger collection of 1700 texts and describe an automated method for empirically deriving appropriate threshold values. The results suggest that information extraction techniques can support high-precision text classification and, in general, that using more extracted information improves performance. As a practical matter, we also explain how the text classification system can be easily ported across domains.
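As a rough illustration of the phrase-based idea, the sketch below learns a set of "signature" phrases from labeled training texts and labels a new text relevant if it contains any of them. The abstract does not state how phrases are selected, so the two thresholds (`min_precision`, `min_count`), the function names, and the toy phrases are all assumptions made for this example, not the authors' implementation. Real input would come from an information extraction system rather than hand-written phrase sets.

```python
from collections import Counter

def learn_signatures(training, min_precision=0.9, min_count=2):
    """Pick phrases that occur often enough and mostly in relevant texts.

    training: list of (set_of_phrases, is_relevant) pairs.
    Both thresholds are hypothetical parameters for this sketch.
    """
    total = Counter()
    relevant = Counter()
    for phrases, is_relevant in training:
        for phrase in set(phrases):
            total[phrase] += 1
            if is_relevant:
                relevant[phrase] += 1
    return {
        phrase
        for phrase, n in total.items()
        if n >= min_count and relevant[phrase] / n >= min_precision
    }

def classify(phrases, signatures):
    """A text is labeled relevant if it contains any learned signature."""
    return any(phrase in signatures for phrase in phrases)

def precision_recall(test, signatures):
    """Compute precision and recall of the classifier on labeled texts."""
    tp = fp = fn = 0
    for phrases, is_relevant in test:
        predicted = classify(phrases, signatures)
        if predicted and is_relevant:
            tp += 1
        elif predicted:
            fp += 1
        elif is_relevant:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data, invented for illustration only.
TRAIN = [
    ({"was bombed", "was killed"}, True),
    ({"was bombed", "exploded"}, True),
    ({"was killed", "soldiers"}, False),
    ({"was killed", "in combat"}, False),
    ({"exploded"}, True),
]
TEST = [
    ({"was bombed"}, True),
    ({"was killed"}, True),
    ({"soldiers"}, False),
]

signatures = learn_signatures(TRAIN)
precision, recall = precision_recall(TEST, signatures)
```

Raising `min_precision` or `min_count` in this sketch keeps only the most reliable phrases, which tends to trade recall for precision; that trade-off is the kind of behavior a threshold-selection procedure would need to balance.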