This dissertation introduces a new theoretical model for text classification systems, including systems for document retrieval, automated indexing, electronic mail filtering, and similar tasks. The Concept Learning model emphasizes the role of manual and automated feature selection and classifier formation in text classification. It enables drawing on results from statistics and machine learning in explaining the effectiveness of alternate representations of text, and specifies desirable characteristics of text representations. The use of syntactic parsing to produce indexing phrases has been widely investigated as a possible route to better text representations. Experiments with syntactic phrase indexing, however, have never yielded significant improvements in text retrieval performance. The Concept Learning model suggests that the poor statistical characteristics of a syntactic indexing phrase representation negate its desirable semantic characteristics. The application of term clustering to this representation to improve its statistical properties while retaining its desirable meaning properties is proposed. Standard term clustering strategies from information retrieval (IR), based on co-occurrence of indexing terms in documents or groups of documents, were tested on a syntactic indexing phrase representation. In experiments using a standard text retrieval test collection, small effectiveness improvements were obtained. As a means of evaluating representation quality, a text retrieval test collection introduces a number of confounding factors. In contrast, the text categorization task allows much cleaner determination of text representation properties. In preparation for the use of text categorization to study text representation, a more effective and theoretically well-founded probabilistic text categorization algorithm was developed, building on work by Maron, Fuhr, and others.
Text categorization experiments supported a number of predictions of the Concept Learning model about properties of phrasal representations, including dimensionality properties not previously measured for text representations. However, in carefully controlled experiments using syntactic phrases produced by Church's stochastic bracketer, in conjunction with reciprocal nearest neighbor clustering, term clustering was found to produce essentially no improvement in the properties of the phrasal representation. New cluster analysis approaches are proposed to remedy the problems found in traditional term clustering methods.
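
As a rough illustration of the reciprocal nearest neighbor clustering mentioned above, the following minimal sketch merges two indexing phrases only when each is the other's most similar phrase. It is not the dissertation's implementation: the cosine similarity over per-document occurrence counts, the toy data, and all names (`phrase_docs`, `cosine`, `reciprocal_nearest_neighbors`) are assumptions made for this example.

```python
from math import sqrt

# Hypothetical toy data: phrase -> {document id: occurrence count}.
phrase_docs = {
    "text retrieval": {"d1": 2, "d2": 1},
    "document retrieval": {"d1": 1, "d2": 1},
    "information retrieval": {"d1": 1},
    "syntactic parsing": {"d3": 2, "d4": 1},
    "noun phrase": {"d3": 1, "d4": 1},
    "mail filtering": {"d5": 1},
}


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(u[d] * v[d] for d in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def reciprocal_nearest_neighbors(vectors):
    """Return sorted pairs of phrases that are each other's nearest neighbor."""
    nearest = {}
    for term, vec in vectors.items():
        best, best_sim = None, 0.0
        # Sorted iteration plus a strict '>' makes tie-breaking deterministic.
        for other in sorted(vectors):
            if other == term:
                continue
            sim = cosine(vec, vectors[other])
            if sim > best_sim:
                best, best_sim = other, sim
        nearest[term] = best  # None when nothing co-occurs with this phrase

    pairs = set()
    for term, nn in nearest.items():
        if nn is not None and nearest.get(nn) == term:
            pairs.add(tuple(sorted((term, nn))))
    return sorted(pairs)


clusters = reciprocal_nearest_neighbors(phrase_docs)
print(clusters)
```

In this toy data, "information retrieval" is closest to "text retrieval", but the preference is not mutual, so it stays unclustered. This conservative pairing keeps clusters small, which may limit how much the statistical properties of a sparse phrasal representation can improve.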