Entity extraction, linking, classification, and tagging for social media: a Wikipedia-based approach

  • Authors:
  • Abhishek Gattani;Digvijay S. Lamba;Nikesh Garera;Mitul Tiwari;Xiaoyong Chai;Sanjib Das;Sri Subramaniam;Anand Rajaraman;Venky Harinarayan;AnHai Doan

  • Affiliations:
  • @WalmartLabs;@WalmartLabs;@WalmartLabs;LinkedIn;@WalmartLabs;University of Wisconsin-Madison;@WalmartLabs;Cambrian Ventures;Cambrian Ventures;@WalmartLabs and University of Wisconsin-Madison

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2013


Abstract

Many applications that process social data, such as tweets, must extract entities from tweets (e.g., "Obama" and "Hawaii" in "Obama went to Hawaii"), link them to entities in a knowledge base (e.g., Wikipedia), classify tweets into a set of predefined topics, and assign descriptive tags to tweets. Few solutions exist today to solve these problems for social data, and they are limited in important ways. Further, even though several industrial systems such as OpenCalais have been deployed to solve these problems for text data, little if anything has been published about them, and it is unclear whether any of these systems has been tailored for social media. In this paper we describe in depth an end-to-end industrial system that solves these problems for social data. The system has been developed and used heavily over the past three years, first at Kosmix, a startup, and later at WalmartLabs. We show how our system uses a Wikipedia-based global "real-time" knowledge base that is well suited for social data, how we interleave the tasks in a synergistic fashion, how we generate and use contexts and social signals to improve task accuracy, and how we scale the system to the entire Twitter firehose. We describe experiments that show that our system outperforms current approaches. Finally, we describe applications of the system at Kosmix and WalmartLabs, and lessons learned.
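
To make the extraction-and-linking task from the abstract concrete, below is a minimal illustrative sketch, not the authors' system: it matches tweet substrings against a small, hypothetical mention-to-Wikipedia dictionary (a stand-in for the paper's Wikipedia-based knowledge base) and returns linked entities for the running example "Obama went to Hawaii".

```python
# Minimal illustrative sketch of the extraction + linking task described above.
# NOT the authors' system: the dictionary, scoring, and matching strategy here
# are hypothetical stand-ins for the paper's Wikipedia-based knowledge base.

from typing import NamedTuple

class LinkedEntity(NamedTuple):
    mention: str          # surface string found in the tweet
    wikipedia_title: str  # Wikipedia page the mention is linked to
    score: float          # toy prominence score (stand-in for real ranking signals)

# Hypothetical mention -> (Wikipedia title, prominence) dictionary; a real system
# would derive millions of entries from Wikipedia anchor text and redirects.
MENTION_DICT = {
    "obama": ("Barack Obama", 0.95),
    "hawaii": ("Hawaii", 0.90),
}

def extract_and_link(tweet: str) -> list[LinkedEntity]:
    """Greedy longest-match extraction over lowercased tokens,
    linking each matched mention to its dictionary entry."""
    tokens = tweet.lower().replace(",", " ").split()
    results, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first (capped at 3 tokens for brevity).
        for span in range(min(3, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + span])
            if mention in MENTION_DICT:
                title, score = MENTION_DICT[mention]
                results.append(LinkedEntity(mention, title, score))
                i += span
                break
        else:
            i += 1  # no entity starts here; move on
    return results

if __name__ == "__main__":
    print(extract_and_link("Obama went to Hawaii"))
    # -> [LinkedEntity(mention='obama', wikipedia_title='Barack Obama', score=0.95),
    #     LinkedEntity(mention='hawaii', wikipedia_title='Hawaii', score=0.90)]
```

The sketch covers only extraction and linking; the paper's further steps (topic classification, tag assignment, use of contexts and social signals, and scaling to the Twitter firehose) are outside this toy example.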