An iterative approach to text segmentation

Authors:
Fei Song;William M. Darling;Adnan Duric;Fred W. Kroon
Affiliations:
School of Computer Science, University of Guelph, Guelph, Ontario, Canada;School of Computer Science, University of Guelph, Guelph, Ontario, Canada;School of Computer Science, University of Guelph, Guelph, Ontario, Canada;PryLynx Corporation, Kitchener, Ontario, Canada
Venue:
ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
Year:
2011

Citing 9
Cited 0

Statistical Models for Text Segmentation

Machine Learning - Special issue on natural language learning
A critique and improvement of an evaluation metric for text segmentation

Computational Linguistics
Topic segmentation: algorithms and applications

Topic segmentation: algorithms and applications
Advances in domain independent linear text segmentation

NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Multi-paragraph segmentation of expository text

ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics
A statistical model for domain-independent text segmentation

ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Bayesian unsupervised topic segmentation

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Hierarchical text segmentation from multi-scale lexical cohesion

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
A dynamic programming model for text segmentation based on min-max similarity

AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present divSeg, a novel method for text segmentation that iteratively splits a portion of text at its weakest point in terms of the connectivity strength between two adjacent parts. To search for the weakest point, we apply two different measures: one is based on language modeling of text segmentation and the other, on the interconnectivity between two segments. Our solution produces a deep and narrow binary tree - a dynamic object that describes the structure of a text and that is fully adaptable to a user's segmentation needs. We treat it as a separate task to flatten the tree into a broad and shallow hierarchy either through supervised learning of a document set or explicit input of how a text should be segmented. The rich structure of our created tree further allows us to segment documents at varying levels such as topic, sub-topic, etc. We evaluated our new solution on a set of 265 articles from Discover magazine where the topic structures are unknown and need to be discovered. Our experimental results show that the iterative approach has the potential to generate better segmentation results than several leading baselines, and the separate flattening step allows us to adapt the results to different levels of details and user preferences.