This paper addresses the automatic extraction of domain-specific stopword lists from a large labeled corpus. Most studies remove stopwords using a standard stopword list together with high and low document-frequency thresholds. We propose a new approach to stopword extraction based on the notions of backward filter-level performance and the sparsity of the training data. First, we discuss the motivation for updating existing lists or building new ones. Second, using the proposed backward filter-level performance, we examine how effective high document-frequency filtering is for stopword reduction. Finally, we propose a new method for building general and domain-specific stopword lists. The method assumes that a set of candidate stopwords must have minimal information content and predictive capacity, both of which can be estimated from classifier performance. The proposed approach is compared extensively with other methods, including inverse document frequency and information gain. In this comparison, the proposed approach gives more promising results, filtering out most stopwords while keeping information loss to a minimum.
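To make the two baseline criteria named in the abstract concrete, the following is a minimal sketch of scoring terms by inverse document frequency and information gain on a labeled corpus. It is not the paper's method: the toy corpus, the function names (`idf_scores`, `information_gain`, `candidate_stopwords`), and the thresholds are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def idf_scores(docs):
    """Return {term: log(N / df)} for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs."""
    present = [y for doc, y in zip(docs, labels) if term in doc]
    absent = [y for doc, y in zip(docs, labels) if term not in doc]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional

def candidate_stopwords(docs, labels, max_idf, max_ig):
    """Terms that are both very frequent (low IDF) and uninformative (low IG).

    The thresholds are hypothetical tuning parameters, not values from the paper.
    """
    idf = idf_scores(docs)
    return sorted(
        term for term, score in idf.items()
        if score <= max_idf and information_gain(docs, labels, term) <= max_ig
    )

# Hypothetical toy corpus: two classes, four short documents.
DOCS = [
    "the match ended in a draw".split(),
    "the team won the cup".split(),
    "the market fell sharply".split(),
    "the bank raised rates".split(),
]
LABELS = ["sport", "sport", "finance", "finance"]

if __name__ == "__main__":
    print(candidate_stopwords(DOCS, LABELS, max_idf=0.1, max_ig=0.01))
```

In this toy corpus only "the" occurs in every document, so it has zero IDF and zero information gain and is the sole candidate returned. The classifier-based criterion proposed in the paper is not reproduced here, since the abstract does not specify how backward filter-level performance is computed.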