Two-part segmentation of text documents

Authors:
Deepak P.;Karthik Visweswariah;Nirmalie Wiratunga;Sadiq Sani
Affiliations:
IBM Research - India, Bangalore, India;IBM Research - India, Bangalore, India;Robert Gordon University, Aberdeen, United Kingdom;Robert Gordon University, Aberdeen, United Kingdom
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Abstract

We consider the problem of segmenting text documents that have a two-part structure, such as a problem part and a solution part. Documents of this genre include incident reports, which typically describe the events relating to a problem, followed by those pertaining to the solution that was tried. Segmenting such documents into their two component parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation due to the lexical inter-relatedness of the segments. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness using language models and translation models respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model documents as being generated starting from the problem part, which consists of words sampled from the problem language model, followed by the solution part, whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms state-of-the-art text segmentation algorithms in the accuracy of segmentation, and that this improved accuracy translates well to improved usability in Case-Based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise, and empirically illustrate that it is noise tolerant and degrades gracefully with increasing amounts of noise.
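
The generative story in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the estimation procedure, the smoothing, or how the solution language model and the translation model are combined. The sketch assumes a fixed mixing weight (`mix`), unigram language models stored as plain dictionaries, a small probability floor for unseen words, a uniform-alignment IBM Model 1 style average over problem words, and an exhaustive search over split points. The function name `best_split` and all probabilities are hypothetical toy values.

```python
import math

def best_split(tokens, problem_lm, solution_lm, translation, mix=0.5, floor=1e-6):
    """Return (k, log_likelihood); tokens[:k] is the problem part,
    tokens[k:] the solution part. All parameters are illustrative assumptions."""
    best_k, best_score = None, -math.inf
    for k in range(1, len(tokens)):
        problem, solution = tokens[:k], tokens[k:]
        # Problem words are scored under the problem language model.
        score = sum(math.log(problem_lm.get(w, floor)) for w in problem)
        for w in solution:
            # IBM Model 1 style term: average probability of w being a
            # "translation" of each problem word (uniform alignment).
            trans = sum(translation.get((p, w), 0.0) for p in problem) / len(problem)
            # A solution word comes from the solution language model or
            # from the translation model (assumed fixed-weight mixture).
            score += math.log(mix * solution_lm.get(w, floor) + (1 - mix) * trans)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Toy models with made-up probabilities (not learned from any corpus).
tokens = ["printer", "jammed", "replaced", "roller"]
problem_lm = {"printer": 0.4, "jammed": 0.4, "replaced": 0.01, "roller": 0.01}
solution_lm = {"replaced": 0.4, "roller": 0.3, "printer": 0.01, "jammed": 0.01}
# translation[(problem_word, solution_word)] = P(solution_word | problem_word)
translation = {("jammed", "roller"): 0.5, ("printer", "replaced"): 0.2}

k, score = best_split(tokens, problem_lm, solution_lm, translation)
print(tokens[:k], tokens[k:])
```

With these toy values the highest-scoring split puts "printer jammed" in the problem part and "replaced roller" in the solution part. The translation term rewards solution words that are strongly associated with words already placed in the problem part, which is the lexical inter-relatedness the abstract refers to.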