Enhancing information retrieval through statistical natural language processing: a study of collocation indexing

Authors:
Ofer Arazy; Carson Woo
Affiliations:
The University of Alberta, Edmonton, AB, Canada; Sauder School of Business, University of British Columbia, Vancouver, BC, Canada
Venue:
MIS Quarterly
Year:
2007

Abstract

Although the management of information assets, and specifically of the text documents that make up 80 percent of these assets, can provide organizations with a competitive advantage, the ability of information retrieval (IR) systems to deliver relevant information to users is severely hampered by the difficulty of disambiguating natural language. The word ambiguity problem has been addressed with moderate success in restricted settings, but it remains the main challenge in general settings, which are characterized by large, heterogeneous document collections. In this paper, we provide preliminary evidence for the usefulness of statistical natural language processing (NLP) techniques, and specifically of collocation indexing, for IR in general settings. We investigate the effect of three key parameters on collocation indexing performance: directionality, distance, and weighting. We build on previous work in IR to (1) advance our knowledge of key design elements for collocation indexing, (2) demonstrate gains in retrieval precision from the use of statistical NLP for general-settings IR, and (3) provide practitioners with a useful cost-benefit analysis of the methods under investigation.
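The three parameters named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define how directionality, distance, or weighting are operationalized. So the function name `extract_collocations`, its parameters, and the inverse-distance weighting are all assumptions made for illustration only. The sketch counts word pairs that co-occur within a token window (distance), optionally treats "A B" and "B A" as the same pair (directionality), and optionally gives closer pairs more weight (weighting).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def extract_collocations(
    tokens: List[str],
    max_distance: int = 2,
    directional: bool = True,
    weight_by_distance: bool = False,
) -> Dict[Tuple[str, str], float]:
    """Score word pairs co-occurring within `max_distance` tokens.

    Hypothetical illustration only; parameter names and the
    inverse-distance weighting are assumptions, not the paper's method.
    """
    scores: Dict[Tuple[str, str], float] = defaultdict(float)
    for i, left in enumerate(tokens):
        # Look ahead at most `max_distance` positions.
        last = min(i + max_distance, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            right = tokens[j]
            if left == right:
                continue  # skip a word paired with itself
            if directional:
                pair = (left, right)  # word order is preserved
            else:
                # Order-insensitive: store the pair alphabetically.
                pair = (left, right) if left < right else (right, left)
            # Assumed weighting: closer pairs contribute more.
            scores[pair] += 1.0 / (j - i) if weight_by_distance else 1.0
    return dict(scores)


if __name__ == "__main__":
    text = "information retrieval systems index information retrieval"
    for pair, score in sorted(extract_collocations(text.split(), 1).items()):
        print(pair, score)
```

In a real index, the resulting pairs would be stored alongside (or instead of) single-term entries, and the score would typically be combined with a collection-level statistic; that step is omitted here because the abstract does not describe it.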