SCAD: collective discovery of attribute values

  • Authors:
  • Anton Bakalov;Ariel Fuxman;Partha Pratim Talukdar;Soumen Chakrabarti

  • Affiliations:
  • University of Massachusetts, Amherst, MA, USA;Microsoft Research, Mountain View, CA, USA;Microsoft Research, Mountain View, CA, USA;IIT, Bombay, India

  • Venue:
  • Proceedings of the 20th international conference on World wide web
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Search engines today offer a rich user experience, no longer restricted to "ten blue links". For example, the query "Canon EOS Digital Camera" returns a photo of the digital camera, and a list of suitable merchants and prices. Similar results are offered in other domains like food, entertainment, travel, etc. All these experiences are fueled by the availability of structured data about the entities of interest. To obtain this structured data, it is necessary to solve the following problem: given a category of entities with its schema, and a set of Web pages that mention and describe entities belonging to the category, build a structured representation for the entity under the given schema. Specifically, collect structured numerical or discrete attributes of the entities. Most previous approaches regarded this as an information extraction problem on individual documents, and made no special use of numerical attributes. In contrast, we present an end-to-end framework which leverages signals not only from the Web page context, but also from a collective analysis of all the pages corresponding to an entity, and from constraints related to the actual values within the domain. Our current implementation uses a general and flexible Integer Linear Program (ILP) to integrate all these signals into holistic decisions over all attributes. There is one ILP per entity and it is small enough to be solved in under 38 milliseconds in our experiments. We apply the new framework to a setting of significant practical importance: catalog expansion for Commerce search engines, using data from Bing Shopping. Finally, we present experiments that validate the effectiveness of the framework and its superiority to local extraction.