Semantic annotation of unstructured and ungrammatical text

  • Authors:
  • Matthew Michelson; Craig A. Knoblock

  • Affiliations:
  • University of Southern California, Information Sciences Institute, Marina del Rey, CA; University of Southern California, Information Sciences Institute, Marina del Rey, CA

  • Venue:
  • IJCAI'05: Proceedings of the 19th International Joint Conference on Artificial Intelligence
  • Year:
  • 2005

Abstract

There are vast amounts of free text on the internet that are neither grammatical nor formally structured, such as item descriptions on eBay or internet classifieds like Craigslist. These sources of data, called "posts," are full of useful information for agents scouring the Semantic Web, but they lack the semantic annotation that would make them searchable. Annotating these posts is difficult because the text generally exhibits little formal grammar and the structure of the posts varies. However, by leveraging collections of known entities and their common attributes, called "reference sets," we can annotate these posts despite their lack of grammar and structure. To use this reference data, we align a post to a member of the reference set, and then exploit this matched member during information extraction. We compare this extraction approach to more traditional information extraction methods that rely on structural and grammatical characteristics, and we show that our approach outperforms traditional methods on this type of data.
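
The two steps named in the abstract, aligning a post to a reference-set member and then using that member to guide extraction, can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify how matching or extraction is done, so the sketch swaps in plain Jaccard token overlap for alignment and exact token lookup for extraction. The reference set, the sample post, and the names `tokenize`, `jaccard`, `align`, and `annotate` are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def align(post, reference_set):
    """Return the reference record whose attribute values best overlap the post.

    Assumption: simple token overlap stands in for whatever matching
    procedure the paper actually uses.
    """
    post_tokens = tokenize(post)

    def score(record):
        return jaccard(post_tokens, tokenize(" ".join(record.values())))

    return max(reference_set, key=score)


def annotate(post, record):
    """Map each attribute of the matched record to the post tokens that mention it."""
    post_tokens = tokenize(post)
    labels = {}
    for attr, value in record.items():
        value_tokens = set(tokenize(value))
        found = [t for t in post_tokens if t in value_tokens]
        if found:
            labels[attr] = found
    return labels


# Hypothetical reference set of known entities and their attributes.
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Camry"},
]

# Hypothetical ungrammatical, unstructured post.
post = "93 civic 5spd runs great obo"

match = align(post, REFERENCE_SET)
annotation = annotate(post, match)

print(match)       # {'make': 'Honda', 'model': 'Civic'}
print(annotation)  # {'model': ['civic']}
```

In this toy case the post never mentions the make, yet the matched record supplies "Honda" as a normalized attribute value. That is the kind of benefit the abstract attributes to reference sets: annotation does not have to depend on the grammar or layout of the post itself.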