Vocabulary filtering for term weighting in archived question search

Authors:
Zhao-Yan Ming;Kai Wang;Tat-Seng Chua
Affiliations:
Department of Computer Science, School of Computing, National University of Singapore;Department of Computer Science, School of Computing, National University of Singapore;Department of Computer Science, School of Computing, National University of Singapore
Venue:
PAKDD'10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I
Year:
2010

Citing 10
Cited 1

Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Probabilistic models of information retrieval based on measuring the divergence from randomness

ACM Transactions on Information Systems (TOIS)
An information-theoretic perspective of tf—idf measures

Information Processing and Management: an International Journal
Finding similar questions in large question and answer archives

Proceedings of the 14th ACM international conference on Information and knowledge management
Interesting nuggets and their impact on definitional question answering

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
TF-IDF uncovered: a study of theories and probabilities

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Retrieval models for question and answer archives

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
A syntactic tree matching approach to finding similar questions in community-based qa services

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Domain-specific keyphrase extraction

IJCAI'99 Proceedings of the 16th international joint conference on Artificial intelligence - Volume 2
Automatic extraction of domain-specific stopwords from labeled documents

ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval

Discovering high quality answers in community question answering archives using a hierarchy of classifiers

Information Sciences: an International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper proposes the notion of vocabulary filtering in a term weighting framework that consists of three filters at the document level, collection level, and vocabulary level. While term frequency and document frequency along with their variations are respectively the dominant term weighting factors at the document level and collection level, vocabulary level factors are seldom considered in current models. In a way, stopword removal can be seen as a vocabulary level filter, but it is not well integrated into the current term-weighting models. In this paper, we propose a vocabulary filtering and multi-level term weighting model by integrating point-wise divergence based measure into the commonly used TF-IDF model. With our proposed model, the specificity of the vocabulary is captured as a new factor in term weighting, and stopwords are naturally handled within the model rather than being removed according to a separately constructed list. Experiments conducted on searching for similar questions in a large community-based question answering archive show that: (a)our proposed term weighting model with multiple levels is consistently better than those with single level for retrieval task; (b)the proposed vocabulary filter well distinguishes salient and trivial terms, and can be utilized to construct stopword lists.