Descriptive modelling of text classification and its integration with other IR tasks

  • Authors:
  • Miguel Martinez-Alvarez

  • Affiliations:
  • Queen Mary, University of London, London, United Kingdom

  • Venue:
  • Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Nowadays, Information Retrieval (IR) systems have to deal with multiple sources of data available in different formats. Datasets can consist of complex and heterogeneous objects with relationships between them. In addition, information needs can vary wildly and they can include different tasks. As a result, the importance of flexibility in IR systems is rapidly growing. This fact is specially important in environments where the information required at different moments is very different and its utility may be contingent on timely implementation. In these cases, how quickly a new problem is solved is as important as how well you solve it. Current systems are usually developed for specific cases. It implies that too much engineering effort is needed to adapt them when new knowledge appears or there are changes in the requirements. Furthermore, heterogeneous and linked data present greater challenges, as well as the simultaneous application of different tasks. This research proposes the usage of descriptive approaches for three different purposes: the modelling of the specific task of Text Classification (TC), focusing on knowledge and complex data exploitation; the flexible application of models to different tasks; and the simultaneously application of different IR-tasks. This investigation will contribute to the long-term goal of achieving a descriptive and composable IR technology that provides a modular framework that knowledge engineers can compose into a task-specific solution. The ultimate goal is to develop a flexible framework that offers classifiers, retrieval models, information extractors, and other functions. In addition, those functional blocks could be customised to satisfy user needs. Descriptive approaches allow a high-level definition of algorithms which are, in some cases, as compact as mathematical formulations. One of the expected benefits is to make the implementation clearer and the knowledge transfer easier. They allow models from different tasks to be defined as modules that can be "concatenated", processing the information as a pipeline where some of the outputs of one module are the input of the following one. This combination involves minimum engineering effort due to the paradigm's "Plug & Play" capabilities offered by its functional syntax. This solution provides the flexibility needed to customise and quickly combine different IR-tasks and/or models. Classification is a desired candidate for being part of a flexible IR framework because it can be required in several situations for different purposes. In particular, descriptive approaches will improve its modelling with complex and heterogeneous objects. Furthermore, we aim to show how this approach allows to apply TC models for ad-hoc retrieval (and vice versa) and their simultaneous application for complex information needs. The main hypothesis of this research is that a seamless approach for modelling TC and its integration with other IR-tasks will provide a general framework for rapid prototyping and modelling of solutions for specific users. In addition, it will allow new complex models that take into account relationships and inference from large ontologies. The importance of flexibility for information systems and the exploitation of complex information and knowledge from heterogeneous sources are the main points for discussion. The main challenges are expressiveness and scalability. Abstraction improves flexibility and maintainability. However, it limits the modelling power. A balance between abstraction and expressiveness has to be reached. On the other hand, scalability has been traditionally a challenge for descriptive modelling. Our goal is to prove the feasibility of our approach for real-scale environments.