Salton Award Lecture - Information retrieval and computer science: an evolving relationship

  • Authors:
  • W. Bruce Croft

  • Affiliations:
  • University of Massachusetts, Amherst, MA

  • Venue:
  • Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2003

Abstract

Following the tradition of these acceptance talks, I will be giving my thoughts on where our field is going. Any discussion of the future of information retrieval (IR) research, however, needs to be placed in the context of its history and relationship to other fields. Although IR has had a very strong relationship with library and information science, its relationship to computer science (CS) and its relative standing as a sub-discipline of CS has been more dynamic. IR is quite an old field, and when a number of CS departments were forming in the 60s, it was not uncommon for a faculty member to be pursuing research related to IR. Early ACM curriculum recommendations for CS contained courses on information retrieval, and encyclopedias described IR and database systems as different aspects of the same field.

By the 70s, there were only a few IR researchers in CS departments in the U.S., database systems was a separate (and thriving) field, and many felt that IR had stagnated and was largely irrelevant. The truth, in fact, was far from that. The IR research community was a small, but dedicated, group of researchers in the U.S. and Europe who were motivated by a desire to understand the process of information retrieval and to build systems that would help people find the right information in text databases. This was (and is) a hard goal and led to different evaluation metrics and methodologies than the database community. Progress in the field was hampered by a lack of large-scale testbeds, and tests were limited to databases containing at most a few hundred document abstracts.

In the 80s AI boom, IR was still not a mainstream area, despite its focus on a human task involving natural language. IR focused on a statistical approach to language rather than the much more popular knowledge-based approach.
The fact that IR conferences mix papers on effectiveness as measured by human judgments with papers measuring performance of file organizations for large-scale systems has meant that IR has always been difficult to classify into simple categories such as "systems" or "AI" that are often used in CS departments.

Since the early 90s, just about everything has changed. Large, full-text databases were finally made available for experimentation through DARPA funding and TREC. This has had an enormous positive impact on the quantity and quality of IR research. The advent of the Web search engine has validated the longstanding claims made by IR researchers that simple queries and ranking were the right techniques for information access in a largely unstructured information world. What has not changed is that there are still relatively few IR researchers in CS departments. There are, however, many more people in CS departments doing IR-related research, which is just about the same thing. Conferences in databases, machine learning, computational linguistics, and data mining publish a number of IR papers done by people who would not primarily consider themselves as IR researchers.

Given that there is an increasing diffusion of IR ideas into the CS community, it is worth stating what IR, as a field of CS, has accomplished. Search engines have become the infrastructure for much of information access in our society. IR has provided the basic research on the algorithms and data structures for these engines, and continues to develop new capabilities such as cross-lingual search, distributed search, question answering, and topic detection and tracking. IR championed the statistical approach to language long before it was accepted by other researchers working on language technologies.
Statistical NLP is now mainstream, and results from that field are being used to improve IR systems (in question answering, for example). IR focused on evaluation as a research area, and developed an evaluation methodology based on large, standardized testbeds and comparison with human judgments that has been adopted by researchers in a number of other language technology areas. IR, because of its focus on measuring success based on human judgments, has always acknowledged the importance of the user and interaction as a part of information access. This led to a number of contributions to the design of query and search interfaces and learning techniques based on user feedback.

Although these achievements are important, the long-term goals of the IR field have not yet been met. What are those goals? One possibility that is often mentioned is the MEMEX of Vannevar Bush [1]. Another, more recent, statement of long-term challenges was made in the report of the IR Challenges Workshop [2]:

  • Global Information Access: Satisfy human information needs through natural, efficient interaction with an automated system that leverages world-wide structured and unstructured data in any language.
  • Contextual Retrieval: Combine search technologies and knowledge about query and user context into a single framework in order to provide the most appropriate answer for a user's information need.

These goals are, in fact, very similar to long-term challenges coming out of other CS fields. For example, Jim Gray, a Turing Award winner from the database area, mentioned in his address a personal and world MEMEX as long-term goals for his field and CS in general [3]. IR's long-term goals are clearly important long-term goals for the whole of CS, and achieving those goals will involve everyone interested in the general area of information management and retrieval.
Rather than talking about what IR can do in isolation to progress towards its goals, I would prefer to talk about what IR can do in collaboration with other areas. There are many examples of potential collaborative research areas. Collaborations with researchers from the NLP and information extraction communities have been developing for some time in order to study topics such as advanced question answering. On the other hand, not enough has been done to work with the database community to develop probabilistic retrieval models for unstructured, semi-structured, and structured data. There have been a number of attempts to combine IR and database functionality, none of which has been particularly successful. Most recently, some groups have been working on combining IR search with XML documents, but what is needed is a comprehensive examination of the issues and problems by teams from both areas working together, and the creation of new testbeds that can be used to evaluate proposed models. The time is right for such collaborations.

Another example of where database, IR, and networking people can work together is in the development of distributed, heterogeneous information systems. This requires significant new research in areas like peer-to-peer architectures, semantic heterogeneity, automatic metadata generation, and retrieval models. If the information systems described above are extended to include new data types such as video, images, sound, and the whole range of scientific data (such as from the biosciences, geoscience, and astronomy), then a broad range of new challenges is added that needs to be tackled in collaboration with people who know about these types of data. There should also be more cooperation between the data mining, IR, and summarization communities to tackle the core problem of defining what is new and interesting in streams of data. These and other similar collaborations will be the basis for the future development of the IR field.
We will continue to work on research problems that specifically interest us, but this research will increasingly be in the context of larger efforts. IR concepts and IR research will be an important part of the evolving mix of CS expertise that will be used to solve the "grand" challenges.