Topic modeling for mediated access to very large document collections

Authors:
Gheorghe Muresan; David J. Harper
Affiliations:
Department of Library and Information Science, Rutgers University, New Brunswick, NJ; School of Computing, The Robert Gordon University, Aberdeen AB25 1HG, Scotland, United Kingdom
Venue:
Journal of the American Society for Information Science and Technology
Year:
2004

Abstract

Clear and precise queries are a necessity when searching very large document collections, especially when query-based retrieval is the only means of exploration. We propose system-mediated information access as a solution to users' well-documented inability to formulate good queries. Our approach rests on two main assumptions: first, that document clustering can reveal the topical, semantic structure of a problem domain represented by a specialized "source collection," and, second, that statistical language models can convey content. Taking the role of the human mediator or intermediary searcher, a mediation system interacts with the user and supports her exploration of a relatively small source collection, chosen to be representative of the problem domain. Based on the user's selection of relevant "exemplary" documents and clusters from this source collection, the system builds a language model of her information need. This model is subsequently used to derive "mediated queries," which are expected to convey the user's information need precisely and comprehensively, and which the user can submit to search any large and heterogeneous "target collections." We present results of experiments that simulated various mediation strategies and compared the effect on mediation effectiveness of a variety of parameters, such as the similarity measure, the weighting scheme, and the clustering method. The results provide both upper bounds on the performance that real end users can potentially reach and a comparison of the effectiveness of these strategies. The experimental evidence suggests that information retrieval mediated through a clustered specialized collection has the potential to improve effectiveness significantly.
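
The step from selected exemplary documents to a mediated query can be illustrated with a minimal sketch. The abstract does not specify the estimation or term-selection method, so everything below is an assumption made for illustration: a maximum-likelihood unigram model, a score equal to each term's contribution to the Kullback-Leibler divergence between the topic model and the source-collection model, and the hypothetical function names `tokenize`, `unigram_model`, and `mediated_query`.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately simple: lowercase, split on whitespace, keep alphabetic tokens.
    return [tok for tok in text.lower().split() if tok.isalpha()]


def unigram_model(docs):
    # Maximum-likelihood unigram model: p(t) = count(t) / total token count.
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def mediated_query(exemplary_docs, source_collection, k=5):
    # Assumed selection rule (not taken from the paper): rank terms by
    # p(t|topic) * log(p(t|topic) / p(t|source)) and keep the top k.
    topic = unigram_model(exemplary_docs)
    background = unigram_model(source_collection)
    scores = {
        term: p * math.log(p / background[term])
        for term, p in topic.items()
        if term in background
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Toy source collection; the "user" marks two documents as exemplary.
source = [
    "clustering groups similar documents",
    "language models estimate word probabilities",
    "football match results today",
    "clustering documents with language models",
]
exemplary = [source[0], source[3]]
query_terms = mediated_query(exemplary, source, k=2)
print(query_terms)
```

In this sketch the exemplary documents are drawn from the source collection, so every topic term also has a background probability. The resulting term list could then be submitted as a query to a separate target collection.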