Effective named entity recognition for idiosyncratic web collections

  • Authors:
  • Roman Prokofyev;Gianluca Demartini;Philippe Cudré-Mauroux

  • Affiliations:
  • University of Fribourg, Fribourg, Switzerland;University of Fribourg, Fribourg, Switzerland;University of Fribourg, Fribourg, Switzerland

  • Venue:
  • Proceedings of the 23rd international conference on World wide web
  • Year:
  • 2014

Abstract

Named Entity Recognition (NER) plays an important role in a variety of online information management tasks, including text categorization, document clustering, and faceted search. While recent NER systems can achieve near-human performance on certain documents such as news articles, they remain highly domain-specific and thus cannot effectively identify entities such as original technical concepts in scientific documents. In this work, we propose novel approaches for NER on distinctive document collections (such as scientific articles) based on n-gram inspection and classification. We design and evaluate several entity recognition features, ranging from well-known part-of-speech tags to n-gram co-location statistics and decision trees, to classify candidates. In addition, we show how external knowledge bases (either domain-specific like DBLP or generic like DBpedia) can be leveraged to improve the effectiveness of NER for idiosyncratic collections. We evaluate our system on two test collections created from a set of Computer Science and Physics papers and compare it against state-of-the-art supervised methods. Experimental results show that a careful combination of the features we propose yields up to 85% NER accuracy over scientific collections and substantially outperforms state-of-the-art approaches such as those based on maximum entropy.
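
As a rough illustration of what n-gram inspection with a knowledge-base feature might look like, the Python sketch below enumerates candidate n-grams from a token list and attaches a few simple features to each. It is not the system described in the abstract: the function names (`ngram_candidates`, `candidate_features`), the three features, and the toy knowledge base are all assumptions made for illustration, and a real pipeline would pass such features to a trained classifier.

```python
from collections import Counter


def ngram_candidates(tokens, max_n=3):
    """Yield every contiguous n-gram (as a tuple) with 1 <= n <= max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def candidate_features(tokens, knowledge_base, max_n=3):
    """Map each distinct candidate n-gram to a small feature dictionary.

    `knowledge_base` is a hypothetical stand-in for an external resource
    such as DBLP or DBpedia: here it is just a set of lowercase phrases.
    """
    counts = Counter(ngram_candidates(tokens, max_n))
    return {
        " ".join(gram): {
            "length": len(gram),      # number of tokens in the candidate
            "frequency": freq,        # occurrences in this token list
            "in_kb": " ".join(gram).lower() in knowledge_base,
        }
        for gram, freq in counts.items()
    }


# Toy input; the sentence and the knowledge base are invented for this example.
tokens = "we train a decision tree on n-gram features of each decision tree".split()
kb = {"decision tree"}
features = candidate_features(tokens, kb)
print(features["decision tree"])
```

In this toy setting the knowledge-base lookup is an exact string match; matching against a real resource would likely need normalization and fuzzy matching, which are omitted here.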