Information retrieval in technical documents: from the user's query to the information-unit tagging

  • Authors: Céline Paganelli; Evelyne Mounier

  • Affiliations: Université Stendhal, Grenoble, France; Université Stendhal, Grenoble, France

  • Venue: Proceedings of the 21st annual international conference on Documentation
  • Year: 2003

Abstract

Information retrieval within voluminous textual documents raises specific problems, such as the choice of the retrieval unit and the relevance of each response. For the selection of the retrieval unit, several solutions have been proposed, such as exploiting the logical structure of the document. In most cases, the relevance of a retrieval unit is measured using criteria such as the number of occurrences of query terms in the document and their position in the document.

Few systems are designed in a user-centered way and adapted to the task they are supposed to assist: usually, these systems are based on paper-based documentation recorded electronically, combined with a standard information retrieval module. Sysrit (technical information retrieval system), a system under development, is aimed at users who are expert at searching technical documents. The design of Sysrit is based on observations made on these users. In this system, a technical document is automatically segmented into paragraphs (called information units). In order to improve the relevance of the responses given to the users, Sysrit proposes to tag the information units. Indeed, we make the assumption that a response is all the more relevant when it belongs to the same category as the query.

We show that queries and information units can first be categorized into two types: the object type (which corresponds to object descriptions) and the pro type (which concerns procedural descriptions). A detailed study of the object type shows that it is heterogeneous and covers different sub-types: object descriptions (do), definitions (dfi) and specification descriptions (df). Following experimental validation with expert users, we first proposed categorizing each information unit as either object or pro, and then sub-categorizing the object units as do, dfi or df. Here we focus on queries more than on information units. A corpus analysis and a validation by expert users confirm that this categorization can also be used to characterize queries. Moreover, the results of this analysis enable us to propose rules to automatically recognize and tag each type of query.
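
The pipeline described in the abstract (segment a document into paragraph-level information units, tag units and queries as object or pro, and prefer responses in the same category as the query) might be sketched roughly as follows. This is only an illustration under stated assumptions, not Sysrit's implementation: the abstract does not give the actual tagging rules, so the cue-word list `PROCEDURAL_CUES`, the term-overlap scoring, and all function names here are hypothetical stand-ins.

```python
import re
from dataclasses import dataclass

# Hypothetical cue words: the paper's real tagging rules are not given in the abstract.
PROCEDURAL_CUES = ("how to", "how do", "procedure", "steps", "install", "configure", "replace")


@dataclass
class InformationUnit:
    text: str
    category: str  # "object" or "pro"


def categorize(text: str) -> str:
    """Toy rule: label text 'pro' if it contains a procedural cue, else 'object'."""
    lowered = text.lower()
    return "pro" if any(cue in lowered for cue in PROCEDURAL_CUES) else "object"


def segment(document: str) -> list[InformationUnit]:
    """Split on blank lines so that each paragraph becomes one tagged information unit."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    return [InformationUnit(p, categorize(p)) for p in paragraphs]


def rank(query: str, units: list[InformationUnit]) -> list[InformationUnit]:
    """Keep units sharing at least one term with the query; rank same-category units first."""
    query_category = categorize(query)
    query_terms = set(re.findall(r"\w+", query.lower()))

    scored = []
    for unit in units:
        unit_terms = set(re.findall(r"\w+", unit.text.lower()))
        overlap = len(query_terms & unit_terms)
        if overlap > 0:
            # Assumption from the abstract: a response in the query's category is more relevant.
            scored.append(((unit.category == query_category, overlap), unit))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [unit for _, unit in scored]


if __name__ == "__main__":
    manual = (
        "The pump is a centrifugal device with two valves.\n\n"
        "How to replace the pump: first close the valves, then remove the pump.\n\n"
        "The warranty lasts two years."
    )
    for unit in rank("how to replace the pump", segment(manual)):
        print(unit.category, "|", unit.text)
```

In this sketch the category match acts as the primary sort key and term overlap as the tie-breaker; a real system could weight the two differently, and would also need the do, dfi and df sub-types, which are omitted here.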