An investigation of linguistic features and clustering algorithms for topical document clustering

Authors:
Vasileios Hatzivassiloglou;Luis Gravano;Ankineedu Maganti
Affiliations:
Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY;Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY;Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY
Venue:
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2000

Abstract

We investigate four hierarchical clustering methods (single-link, complete-link, groupwise-average, and single-pass) and two linguistically motivated text features (noun phrase heads and proper names) in the context of document clustering. A statistical model for combining similarity information from multiple sources is described and applied to DARPA's Topic Detection and Tracking phase 2 (TDT2) data. This model, based on log-linear regression, alleviates the need for extensive search in order to determine optimal weights for combining input features. Through an extensive series of experiments with more than 40,000 documents from multiple news sources and modalities, we establish that both the choice of clustering algorithm and the introduction of the additional features have an impact on clustering performance. We apply our optimal combination of features to the TDT2 test data, obtaining partitions of the documents that compare favorably with the results obtained by participants in the official TDT2 competition.
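
To make the two main ideas in the abstract concrete, here is a minimal sketch of a log-linear (logistic) combination of per-feature similarities driving a single-pass clustering loop. It is an illustration under stated assumptions, not the authors' implementation: the feature names (`words`, `names`), the hand-set weights and bias, the threshold, and the choice of comparing a document against the average similarity to a cluster's members are all hypothetical. In the paper the weights come from regression on training data rather than being set by hand.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def combined_similarity(doc_a, doc_b, weights, bias):
    """Logistic combination of per-feature cosine similarities.

    `weights` and `bias` are illustrative values set by hand (assumption);
    a fitted model would estimate them from labeled document pairs.
    """
    z = bias + sum(w * cosine(doc_a[f], doc_b[f]) for f, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))


def single_pass(docs, weights, bias, threshold=0.5):
    """Assign each document, in arrival order, to the most similar existing
    cluster if that similarity reaches `threshold`; otherwise start a new one.

    Cluster similarity is the mean pairwise similarity to its members,
    which is one possible choice among several (assumption).
    """
    clusters = []  # each cluster is a list of document indices
    for i, doc in enumerate(docs):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = sum(
                combined_similarity(doc, docs[j], weights, bias) for j in cluster
            ) / len(cluster)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


# Toy documents: one bag of words and one bag of proper names per document.
docs = [
    {"words": Counter("election vote senate vote".split()),
     "names": Counter(["Washington"])},
    {"words": Counter("senate vote election result".split()),
     "names": Counter(["Washington"])},
    {"words": Counter("earthquake rescue damage".split()),
     "names": Counter(["Tokyo"])},
]
WEIGHTS = {"words": 4.0, "names": 2.0}  # hypothetical, not fitted
BIAS = -3.0                             # hypothetical, not fitted

clusters = single_pass(docs, WEIGHTS, BIAS)
print(clusters)
```

With these toy inputs the two election stories share vocabulary and a proper name, so their combined score clears the threshold and they are grouped together, while the earthquake story starts its own cluster. Because the combination is a single fitted function of the feature similarities, no grid search over feature weights is needed, which is the advantage the abstract attributes to the regression model.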