Investigating usage of text segmentation and inter-passage similarities to improve text document clustering

Authors:
Shashank Paliwal;Vikram Pudi
Affiliations:
Center for Data Engineering, International Institute of Information Technology, Hyderabad, India;Center for Data Engineering, International Institute of Information Technology, Hyderabad, India
Venue:
MLDM'12 Proceedings of the 8th international conference on Machine Learning and Data Mining in Pattern Recognition
Year:
2012

Abstract

Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods represent text documents using the simple Bag-of-Words (BOW) model. A document, however, is an organized structure consisting of various text segments or passages. Single-term analysis of this kind treats the whole document as a single semantic unit and thus ignores other semantic units such as sentences and passages. In this paper, we attempt to take advantage of the underlying subtopic structure of text documents and investigate whether clustering can be improved if the text segments of two documents are used when calculating the similarity between them. We concentrate on examining the effects of combining the proposed inter-document similarities (based on inter-passage similarities) with traditional inter-document similarities, following a simple approach to the combination. Experimental results on standard data sets suggest an improvement in the clustering of text documents.
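
The abstract does not say how passages are obtained or how the two similarity scores are combined, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes that passages are blank-line-separated paragraphs, that inter-passage similarity is a symmetrised average of best-match cosine scores, and that the two scores are mixed linearly with a hypothetical weight `alpha`. All function names are made up for illustration.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words term counts over lowercased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_passages(text):
    """Assumption: a passage is a paragraph separated by blank lines."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def passage_similarity(doc_a, doc_b):
    """Assumption: average best-match passage cosine, taken in both directions."""
    pa = [bow(p) for p in split_passages(doc_a)]
    pb = [bow(p) for p in split_passages(doc_b)]
    if not pa or not pb:
        return 0.0
    a_to_b = sum(max(cosine(x, y) for y in pb) for x in pa) / len(pa)
    b_to_a = sum(max(cosine(y, x) for x in pa) for y in pb) / len(pb)
    return (a_to_b + b_to_a) / 2


def combined_similarity(doc_a, doc_b, alpha=0.5):
    """Assumption: linear mix of whole-document and passage-based similarity."""
    whole = cosine(bow(doc_a), bow(doc_b))
    return alpha * whole + (1 - alpha) * passage_similarity(doc_a, doc_b)
```

With `alpha=1.0` the score reduces to the plain BOW cosine similarity, so the weight controls how much the passage-level evidence contributes. The resulting pairwise scores could then be passed to any similarity-based clustering algorithm.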