Segmentation and detection at IBM: hybrid statistical models and two-tiered clustering

Authors:
S. Dharanipragada;M. Franz;J. S. McCarley;T. Ward;W.-J. Zhu
Affiliations:
IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY;IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY;IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY;IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY;IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY
Venue:
Topic detection and tracking
Year:
2002

Abstract

IBM's story segmentation system combines decision tree and maximum entropy models, which take a variety of lexical, prosodic, semantic, and structural features as inputs. Both types of models are source-specific, and we substantially lower the segmentation cost Cseg by combining them. IBM's topic detection system introduces a minimal hierarchy into the clustering: each cluster consists of one or more microclusters. We investigate the importance of merging microclusters and propose a merging strategy that improves our performance.
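
The two-tiered idea (clusters made of microclusters, with an optional merging step) can be illustrated with a minimal sketch. This is not the authors' merging strategy, which the abstract does not specify. It assumes each microcluster is represented as a bag of term counts, that similarity is plain cosine similarity, and that two microclusters in the same cluster are merged greedily whenever their similarity reaches a threshold. The names `cosine`, `merge_microclusters`, and `threshold` are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_microclusters(cluster, threshold):
    """Greedily merge the microclusters of one cluster.

    `cluster` is a list of microclusters, each a mapping of term -> count.
    Assumption (not from the paper): a pair is merged when the cosine
    similarity of their summed term counts is at least `threshold`.
    """
    micro = [Counter(m) for m in cluster]  # copy so the input is untouched
    merged = True
    while merged:
        merged = False
        for i in range(len(micro)):
            for j in range(i + 1, len(micro)):
                if cosine(micro[i], micro[j]) >= threshold:
                    # Fold microcluster j into i, then rescan from the start.
                    micro[i] += micro.pop(j)
                    merged = True
                    break
            if merged:
                break
    return micro


# One cluster with three microclusters; the first two share vocabulary.
cluster = [Counter(a=2, b=1), Counter(a=1, b=1), Counter(z=3)]
result = merge_microclusters(cluster, threshold=0.8)
```

In this toy input the first two microclusters are similar enough to be merged, while the third shares no terms with them and stays separate. A real system would likely use weighted term vectors and a tuned threshold instead of raw counts.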