Efficient summarization-aware search for online news articles

Authors:
Wisam Dakka; Luis Gravano
Affiliations:
Columbia University, New York City, NY; Columbia University, New York City, NY
Venue:
Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries
Year:
2007

Abstract

News portals gather and organize news articles published daily on the Internet. Typically, news articles are clustered into 'events' and each cluster is displayed with a short description of its contents. A particularly interesting choice for describing the contents of a cluster is a machine-generated multi-document summary of the articles in the cluster. Such summaries are informative and help news readers to identify and explore only clusters of interest. Naturally, multi-document clusters and summaries are also valuable to help users navigate the results of keyword-search queries. Unfortunately, current document summarizers are still slow; as a result, search strategies that define document clusters and their multi-document summaries online, in a query-specific manner, are prohibitively expensive. In contrast, search strategies that only return offline, query-independent document clusters are efficient, but might return clusters whose (query-independent) summaries are of little relevance to the queries. In this paper, we present an efficient Hybrid search strategy to address the limitations of fully online and fully offline summarization-aware search approaches. Extensive experiments involving user relevance judgments and real news articles show that the quality of our Hybrid results is high, and that these results are computed in substantially less time than with the fully online strategy. We have implemented our strategy and made it available on the Newsblaster news summarization system, which crawls and summarizes news articles from a variety of web sources on a daily basis.