Automatic extraction of domain-specific stopwords from labeled documents

Authors:
Masoud Makrehchi;Mohamed S. Kamel
Affiliations:
Pattern Analysis and Machine Intelligence Lab, Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada;Pattern Analysis and Machine Intelligence Lab, Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada
Venue:
ECIR'08: Proceedings of the 30th European Conference on IR Research on Advances in Information Retrieval
Year:
2008

Abstract

We discuss the automatic extraction of a domain-specific stopword list from a large labeled corpus. Most studies remove stopwords using a standard stopword list together with high and low document frequency thresholds. This paper proposes a new approach to stopword extraction based on the notion of backward filter-level performance and a sparsity measure of the training data. First, we discuss the motivation for updating existing lists or building new ones. Second, based on the proposed backward filter-level performance, we examine the effectiveness of high document frequency filtering for stopword reduction. Finally, we propose a new method for building general and domain-specific stopword lists. The method assumes that a set of candidate stopwords must have minimum information content and prediction capacity, both of which can be estimated from classifier performance. The proposed approach is compared extensively with other methods, including inverse document frequency and information gain. In this comparative study, the proposed approach gives more promising results, filtering out most stopwords with minimal information loss.
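
As a rough illustration of the two baseline criteria the abstract names (inverse document frequency and information gain), the sketch below scores every term in a tiny labeled corpus and treats the lowest-scoring terms as stopword candidates. It does not implement the paper's backward filter-level performance or sparsity measure, which the abstract does not specify. The function names, the toy corpus, and the tie-breaking rule are assumptions made for this example only.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts)


def term_scores(docs, labels):
    """Return {term: (idf, information_gain)} for tokenized, labeled documents.

    Hypothetical helper: terms are scored on presence/absence per document.
    """
    n = len(docs)
    class_counts = Counter(labels)
    base = entropy(class_counts.values())
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        # Class distribution of the documents that contain the term.
        present = Counter(l for s, l in zip(doc_sets, labels) if term in s)
        # Class distribution of the documents that do not contain it.
        absent = class_counts - present
        df = sum(present.values())
        conditional = (df / n) * entropy(present.values()) \
            + ((n - df) / n) * entropy(absent.values())
        scores[term] = (math.log(n / df), base - conditional)
    return scores


def candidate_stopwords(docs, labels, k):
    """Return the k terms with the lowest information gain.

    Ties are broken by lower IDF, then alphabetically (an assumption).
    """
    scores = term_scores(docs, labels)
    return sorted(scores, key=lambda t: (scores[t][1], scores[t][0], t))[:k]


# Toy corpus, invented for illustration.
docs = [
    ["the", "goal", "match"],
    ["the", "team", "goal"],
    ["the", "stock", "market"],
    ["the", "market", "price"],
]
labels = ["sport", "sport", "finance", "finance"]

print(candidate_stopwords(docs, labels, 1))
```

In this toy corpus the term "the" occurs in every document, so both its IDF and its information gain are zero and it is ranked first. A term such as "goal", which appears only in one class, carries high information gain and would be kept.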