Domain-independent text segmentation using anisotropic diffusion and dynamic programming

Authors:
Xiang Ji;Hongyuan Zha
Affiliations:
The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA
Venue:
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Year:
2003

Citing 22
Cited 22

Scale-Space and Edge Detection Using Anisotropic Diffusion

IEEE Transactions on Pattern Analysis and Machine Intelligence
Approaches to passage retrieval in full text information systems

SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Passage-level evidence in document retrieval

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Automatic text decomposition using text segments and text themes

Proceedings of the the seventh ACM conference on Hypertext
Statistical Models for Text Segmentation

Machine Learning - Special issue on natural language learning
Topic segmentation with an aspect hidden Markov model

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Clustering Algorithms

Clustering Algorithms
Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Correlating multilingual documents via bipartite graph modeling

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Topic-based document segmentation with probabilistic latent semantic analysis

Proceedings of the eleventh international conference on Information and knowledge management
A critique and improvement of an evaluation metric for text segmentation

Computational Linguistics
Text Segmentation by Topic

ECDL '97 Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries
Discourse Segmentation in Aid of Document Summarization

HICSS '00 Proceedings of the 33rd Hawaii International Conference on System Sciences-Volume 3 - Volume 3
TextTiling: segmenting text into multi-paragraph subtopic passages

Computational Linguistics
Advances in domain independent linear text segmentation

NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Text segmentation with multiple surface linguistic cues

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Optimal multi-paragraph text segmentation by dynamic programming

ACL '98 Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2
Text segmentation based on similarity between words

ACL '93 Proceedings of the 31st annual meeting on Association for Computational Linguistics
Multi-paragraph segmentation of expository text

ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics
Statistical models for topic segmentation

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
A statistical model for domain-independent text segmentation

ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Extracting shared topics of multiple documents

PAKDD'03 Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining

Thread detection in dynamic text message streams

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Improving Text Segmentation Using Latent Semantic Analysis: A Reanalysis of Choi, Wiemer-Hastings, and Moore (2001)

Computational Linguistics
Multi-task text segmentation and alignment based on weighted mutual information

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Minimum cut model for spoken lecture segmentation

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Semantic passage segmentation based on sentence topics for question answering

Information Sciences: an International Journal
Topic segmentation with shared topic detection and alignment of multiple documents

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Finding Text Boundaries and Finding Topic Boundaries: Two Different Tasks?

GoTAL '08 Proceedings of the 6th international conference on Advances in Natural Language Processing
TOWARD A MORE GLOBAL AND COHERENT SEGMENTATION OF TEXTS

Applied Artificial Intelligence
Text segmentation with LDA-based Fisher kernel

HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Topic segmentation algorithms for text summarization and passage retrieval: an exhaustive evaluation

AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2
Context-based message expansion for disentanglement of interleaved text conversations

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Efficient linear text segmentation based on information retrieval techniques

Proceedings of the International Conference on Management of Emergent Digital EcoSystems
Word distribution based methods for minimizing segment overlaps

TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue
A dynamic programming model for text segmentation based on min-max similarity

AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
Evaluating hierarchical discourse segmentation

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Improved latent concept expansion using hierarchical markov random fields

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Multi-document topic segmentation

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Coverage-based methods for distributional stopword selection in text segmentation

TSD'10 Proceedings of the 13th international conference on Text, speech and dialogue
Improving text segmentation with non-systematic semantic relation

CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part I
Segmenting eBay item descriptions into coherent sections

Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
An automatic approach for efficient text segmentation

KES'06 Proceedings of the 10th international conference on Knowledge-Based Intelligent Information and Engineering Systems - Volume Part I
A hybrid linear text segmentation algorithm using hierarchical agglomerative clustering and discrete particle swarm optimization

Integrated Computer-Aided Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents a novel domain-independent text segmentation method, which identifies the boundaries of topic changes in long text documents and/or text streams. The method consists of three components: As a preprocessing step, we eliminate the document-dependent stop words as well as the generic stop words before the sentence similarity is computed. This step assists in the discrimination of the sentence semantic information. Then the cohesion information of sentences in a document or a text stream is captured with a sentence-distance matrix with each entry corresponding to the similarity between a sentence pair. The distance matrix can be represented with a gray-scale image. Thus, a text segmentation problem is converted into an image segmentation problem. We apply the anisotropic diffusion technique to the image representation of the distance matrix to enhance the semantic cohesion of sentence topical groups as well as sharpen topical boundaries. At last, the dynamic programming technique is adapted to find the optimal topical boundaries and provide a zoom-in and zoom-out mechanism for topics access by segmenting text in variable numbers of sentence topical groups. Our approach involves no domain-specific training, and it can be applied to texts in a variety of domains. The experimental results show that our approach is effective in text segmentation and outperforms several state-of-the-art methods.