Adaptive information extraction for document annotation in Amilcare

  • Authors:
  • Fabio Ciravegna; Alexiei Dingli; Yorick Wilks; Daniela Petrelli

  • Affiliations:
  • University of Sheffield, Sheffield, UK (all authors)

  • Venue:
  • SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2002

Abstract

Amilcare is a tool for Adaptive Information Extraction (IE) designed to support active annotation of documents for the Semantic Web (SW). It can be used either for unsupervised document annotation or as a support for human annotation. Amilcare is portable to new applications/domains without any knowledge of IE, as it only requires users to annotate a small training corpus with the information to be extracted. It is based on (LP)2, a supervised learning strategy for IE able to cope with different text types, from newspaper-like texts to rigidly formatted Web pages, and even a mixture of them [1, 5].

Adaptation starts with the definition of a tag set for annotation, possibly organized as an ontology. Users then manually annotate a small training corpus. Amilcare provides a default mouse-based interface called Melita, where annotations are inserted by first selecting a tag from the ontology and then identifying the text area to annotate with the mouse. Unlike similar annotation tools [4, 5], Melita actively supports training corpus annotation. While users annotate texts, Amilcare runs in the background, learning how to reproduce the inserted annotations. Induced rules are silently applied to new texts and their results are compared with the user annotations. When its rules reach a (user-defined) level of accuracy, Melita presents new texts with a preliminary annotation derived from the rule application. Users then only have to correct mistakes and add missing annotations. User corrections are fed back to the learner for retraining. This technique focuses the slow and expensive user activity on cases not yet covered, avoiding the need to annotate cases where satisfactory effectiveness has already been reached. Moreover, validating extracted information is a much simpler (and less error-prone) task than tagging bare texts, which speeds up the process considerably. At the end of the corpus annotation process the system is trained and the application can be delivered. MnM [6] and Ontomat Annotizer [7] are two annotation tools adopting Amilcare's learner.

In this demo we simulate the annotation of a small corpus and show how and when Amilcare is able to support users in the annotation process, focusing on the way the user can control the tool's proactivity and intrusiveness. We also quantify such support with data derived from a number of experiments on corpora, focusing on training corpus size and the correctness of suggestions as the corpus grows.
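The workflow described above (silent background learning, a user-defined accuracy threshold, and corrections fed back for retraining) can be pictured as a simple bootstrapping loop. The sketch below is purely illustrative: the names `Learner`, `annotate_corpus`, `threshold`, and the data classes are hypothetical and do not correspond to Amilcare's or Melita's actual implementation or API.

```python
# Minimal sketch of a Melita-style active-annotation loop.
# All names here are hypothetical illustrations, not Amilcare's real API.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    tag: str     # tag from the user-defined tag set / ontology
    start: int   # start character offset in the document text
    end: int     # end character offset in the document text


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


class Learner:
    """Stand-in for a supervised IE learner such as (LP)2."""

    def train(self, docs: List[Document]) -> None:
        pass  # induce extraction rules from the annotated texts

    def predict(self, text: str) -> List[Annotation]:
        return []  # apply induced rules to an unseen text

    def accuracy(self, docs: List[Document]) -> float:
        return 0.0  # agreement between rule output and user annotations


def annotate_corpus(corpus: List[Document],
                    learner: Learner,
                    annotate: Callable[[Document], Document],
                    threshold: float = 0.8) -> List[Document]:
    """The learner trains silently in the background and starts
    pre-annotating documents once its accuracy passes the threshold."""
    done: List[Document] = []
    for doc in corpus:
        if done and learner.accuracy(done) >= threshold:
            # Rules are reliable enough: pre-annotate so the user only
            # corrects mistakes and adds missing annotations.
            doc.annotations = learner.predict(doc.text)
        corrected = annotate(doc)   # manual annotation / correction step
        done.append(corrected)
        learner.train(done)         # corrections fed back for retraining
    return done
```

The key design point the sketch tries to capture is that suggestions are only surfaced once the induced rules reach the user-chosen accuracy level, so the user's effort stays concentrated on cases the learner does not yet cover.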