A Parallel Corpus Labeled Using Open and Restricted Domain Ontologies

  • Authors:
  • Ester Boldrini;Sergio Ferrández;Ruben Izquierdo;David Tomás;Jose Luis Vicedo

  • Affiliations:
  • Natural Language Processing and Information Systems Group, Department of Software and Computing Systems, University of Alicante, Spain (all authors)

  • Venue:
  • CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2009

Abstract

The analysis and creation of annotated corpora are fundamental for implementing natural language processing solutions based on machine learning. In this paper we present a parallel corpus of 4500 questions in Spanish and English in the tourism domain, obtained from real users. With the aim of training a question answering system, the questions were labeled with the expected answer type according to two different ontologies. The first is an open domain ontology based on Sekine's Extended Named Entity Hierarchy, while the second is a restricted domain ontology specific to the tourism field. Because the two ontologies have different characteristics, we had to resolve many problematic cases and adapt our annotation to the characteristics of each one. We present an analysis of the domain coverage of these ontologies and the inter-annotator agreement results. Finally, we use a question classification system to evaluate the labeling of the corpus.