Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods represent text documents with the simple Bag-of-Words (BOW) model. A document, however, is an organized structure consisting of various text segments, or passages. Single-term analysis of this kind treats the whole document as a single semantic unit and thus ignores other semantic units such as sentences and passages. In this paper, we attempt to take advantage of the underlying subtopic structure of text documents and investigate whether clustering can be improved if the text segments of two documents are used when calculating the similarity between them. We concentrate on examining the effects of combining the suggested inter-document similarities (based on inter-passage similarities) with traditional inter-document similarities, following a simple approach. Experimental results on standard data sets suggest an improvement in the clustering of text documents.
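The idea of combining a passage-based similarity with a whole-document BOW similarity can be illustrated with a minimal sketch. The abstract does not specify the segmentation method, the way inter-passage similarities are aggregated, or the combination rule, so everything below is an assumption made for illustration: `split_passages` treats blank-line-separated paragraphs as passages, `passage_similarity` averages each passage's best match in the other document, and `combined_similarity` mixes the two scores linearly with a hypothetical weight `alpha`.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words term-frequency vector, stored as a Counter."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def split_passages(text):
    """Assumed segmenter: blank-line-separated paragraphs stand in for passages."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def passage_similarity(doc_a, doc_b):
    """Assumed aggregation: mean best-match passage cosine, averaged over both directions."""
    pa = [bow(p) for p in split_passages(doc_a)]
    pb = [bow(p) for p in split_passages(doc_b)]
    if not pa or not pb:
        return 0.0
    a_to_b = sum(max(cosine(x, y) for y in pb) for x in pa) / len(pa)
    b_to_a = sum(max(cosine(y, x) for x in pa) for y in pb) / len(pb)
    return (a_to_b + b_to_a) / 2


def combined_similarity(doc_a, doc_b, alpha=0.5):
    """Assumed combination: linear mix of whole-document and passage-based scores."""
    whole = cosine(bow(doc_a), bow(doc_b))
    return alpha * whole + (1 - alpha) * passage_similarity(doc_a, doc_b)
```

Under these assumptions, `alpha=1.0` reduces to the traditional whole-document BOW similarity and `alpha=0.0` uses passage-level evidence only. The resulting pairwise scores could be fed to any similarity-based clustering algorithm; which algorithm and weighting the paper actually uses is not stated in the abstract.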