An evaluation of phrasal and clustered representations on a text categorization task

Authors:
David D. Lewis
Affiliations:
Center for Information and Language Studies, University of Chicago, Chicago, IL
Venue:
SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
1992

Citing 14
Cited 116

Quantified Score

Hi-index	0.02

Abstract

Syntactic phrase indexing and term clustering have been widely explored as text representation techniques for text retrieval. In this paper we study the properties of phrasal and clustered indexing languages on a text categorization task, enabling us to study their properties in isolation from query interpretation issues. We show that optimal effectiveness occurs when using only a small proportion of the indexing terms available, and that effectiveness peaks at a higher feature set size and lower effectiveness level for syntactic phrase indexing than for word-based indexing. We also present results suggesting that traditional term clustering methods are unlikely to provide significantly improved text representations. An improved probabilistic text categorization method is also presented.