Topic-based document segmentation with probabilistic latent semantic analysis

Authors:
Thorsten Brants;Francine Chen;Ioannis Tsochantaridis
Affiliations:
Palo Alto Research Center, Palo Alto, CA;Palo Alto Research Center, Palo Alto, CA;Brown University, Providence, RI
Venue:
Proceedings of the eleventh international conference on Information and knowledge management
Year:
2002

Abstract

This paper presents a new method for topic-based document segmentation, i.e., the identification of boundaries between parts of a document that bear on different topics. The method combines the use of the Probabilistic Latent Semantic Analysis (PLSA) model with the method of selecting segmentation points based on the similarity values between pairs of adjacent blocks. The use of PLSA allows for a better representation of sparse information in a text block, such as a sentence or a sequence of sentences. Furthermore, segmentation performance is improved by combining different instantiations of the same model, either using different random initializations or different numbers of latent classes. Results on commonly available data sets are significantly better than those of other state-of-the-art systems.
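
The selection of segmentation points from adjacent-block similarities, and the combination of several model instantiations, can be illustrated with a minimal sketch. It is not the authors' implementation: PLSA training is omitted, the per-block topic distributions are invented for illustration, and the cosine measure, the depth score, the averaging of similarity curves, and the `min_depth` threshold are all assumptions that the abstract does not specify.

```python
import math


def cosine(p, q):
    """Cosine similarity between two block representations (assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def adjacent_similarities(blocks):
    """Similarity between each pair of adjacent blocks: one value per gap."""
    return [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]


def average_curves(curves):
    """Combine several model instantiations by averaging their similarity
    curves gap by gap (one possible combination scheme, assumed here)."""
    return [sum(vals) / len(vals) for vals in zip(*curves)]


def depth(sims, i):
    """How far gap i dips below the nearest peak on each side."""
    left = i
    while left > 0 and sims[left - 1] >= sims[left]:
        left -= 1
    right = i
    while right < len(sims) - 1 and sims[right + 1] >= sims[right]:
        right += 1
    return (sims[left] - sims[i]) + (sims[right] - sims[i])


def pick_boundaries(sims, min_depth=0.2):
    """Return indices b such that a boundary falls just before block b.
    min_depth is a hypothetical threshold, not a value from the paper."""
    return [i + 1 for i in range(len(sims)) if depth(sims, i) >= min_depth]


# Invented topic distributions for six text blocks, as two hypothetical
# PLSA instantiations might produce them (2 and 3 latent classes).
model_a = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
           [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
model_b = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.7, 0.2, 0.1],
           [0.1, 0.2, 0.7], [0.1, 0.3, 0.6], [0.1, 0.2, 0.7]]

combined = average_curves([adjacent_similarities(model_a),
                           adjacent_similarities(model_b)])
boundaries = pick_boundaries(combined)
print(boundaries)  # [3]: the topic changes between blocks 2 and 3
```

Averaging works on the similarity curves rather than on the block vectors, so instantiations with different numbers of latent classes can be combined even though their vectors have different lengths.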