Language model-based document clustering using random walks

Authors:
Güneş Erkan
Affiliations:
University of Michigan, Ann Arbor, MI
Venue:
HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics
Year:
2006

Abstract

We propose a new document vector representation specifically designed for the document clustering task. Instead of the traditional term-based vectors, a document is represented as an n-dimensional vector, where n is the number of documents in the cluster. The value at each dimension of the vector is closely related to the generation probability based on the language model of the corresponding document. Inspired by the recent graph-based NLP methods, we reinforce the generation probabilities by iterating random walks on the underlying graph representation. Experiments with k-means and hierarchical clustering algorithms show significant improvements over the alternative tf·idf vector representation.
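
The abstract does not spell out the construction, so the following is only a minimal sketch of the idea it describes, under stated assumptions: unigram language models with Jelinek-Mercer smoothing (the weight `lam` is an illustrative choice), generation probabilities normalized by document length, and a plain t-step random walk over the resulting row-stochastic matrix. The function names `generation_matrix` and `walk` and the toy documents are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def generation_matrix(docs, lam=0.5):
    """Row i holds the length-normalized probabilities of generating
    document i from the smoothed unigram model of each document j,
    rescaled so that the row sums to 1. Documents must be non-empty."""
    counts = [Counter(d) for d in docs]
    corpus = Counter()
    for c in counts:
        corpus.update(c)
    corpus_len = sum(corpus.values())

    def p_word(w, j):
        # Jelinek-Mercer smoothing (an assumed choice): mix the
        # document model with the corpus-wide background model.
        p_doc = counts[j][w] / len(docs[j])
        p_bg = corpus[w] / corpus_len
        return lam * p_doc + (1 - lam) * p_bg

    matrix = []
    for i, doc in enumerate(docs):
        row = []
        for j in range(len(docs)):
            log_p = sum(tf * math.log(p_word(w, j))
                        for w, tf in counts[i].items())
            # Geometric mean per token, so long documents are not penalized.
            row.append(math.exp(log_p / len(doc)))
        total = sum(row)
        matrix.append([v / total for v in row])
    return matrix

def walk(matrix, steps):
    """Return the `steps`-step transition probabilities of the random walk."""
    n = len(matrix)
    result = matrix
    for _ in range(steps - 1):
        result = [[sum(result[i][k] * matrix[k][j] for k in range(n))
                   for j in range(n)] for i in range(n)]
    return result

# Toy corpus: two documents about graphs, two about language models.
docs = [
    "random walk on a graph".split(),
    "graph random walk clustering".split(),
    "language model for retrieval".split(),
    "smoothing a language model".split(),
]
vectors = walk(generation_matrix(docs), steps=2)
```

Each row of `vectors` is an n-dimensional representation of one document, with n equal to the number of documents, and could be passed to k-means or a hierarchical clustering routine in place of a tf·idf vector. In this toy corpus the two graph documents assign more weight to each other than to the language-model documents, and vice versa.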