SCAD: collective discovery of attribute values

Authors:
Anton Bakalov;Ariel Fuxman;Partha Pratim Talukdar;Soumen Chakrabarti
Affiliations:
University of Massachusetts, Amherst, MA, USA;Microsoft Research, Mountain View, CA, USA;Microsoft Research, Mountain View, CA, USA;IIT, Bombay, India
Venue:
Proceedings of the 20th international conference on World wide web
Year:
2011

Abstract

Search engines today offer a rich user experience that is no longer restricted to "ten blue links". For example, the query "Canon EOS Digital Camera" returns a photo of the camera together with a list of suitable merchants and prices. Similar results are offered in other domains such as food, entertainment, and travel. All these experiences are fueled by the availability of structured data about the entities of interest. To obtain this structured data, it is necessary to solve the following problem: given a category of entities with its schema, and a set of Web pages that mention and describe entities belonging to the category, build a structured representation for each entity under the given schema; specifically, collect the structured numerical or discrete attributes of the entities. Most previous approaches regarded this as an information extraction problem on individual documents and made no special use of numerical attributes. In contrast, we present an end-to-end framework that leverages signals not only from the Web page context, but also from a collective analysis of all the pages corresponding to an entity, and from constraints related to the actual values within the domain. Our current implementation uses a general and flexible Integer Linear Program (ILP) to integrate all these signals into holistic decisions over all attributes. There is one ILP per entity, and it is small enough to be solved in under 38 milliseconds in our experiments. We apply the new framework to a setting of significant practical importance: catalog expansion for commerce search engines, using data from Bing Shopping. Finally, we present experiments that validate the effectiveness of the framework and its superiority to local extraction.
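The idea of combining page-context signals, cross-page evidence, and domain constraints into one decision per entity can be illustrated with a toy example. The sketch below is not the authors' formulation: it replaces the ILP solver with brute-force enumeration over a tiny candidate set, and every name in it (select_values, local_score, support, alpha, the focal-length attributes and their scores) is a hypothetical assumption made for illustration only.

```python
from itertools import product

def select_values(candidates, local_score, support, constraints, alpha=1.0):
    """Choose exactly one candidate value per attribute for a single entity.

    Maximizes the sum of a local (page-context) score and a collective
    (cross-page support) score, subject to domain constraints. Brute-force
    enumeration stands in for an ILP solver; feasible only for tiny inputs.
    """
    attrs = sorted(candidates)
    best, best_score = None, float("-inf")
    for combo in product(*(candidates[a] for a in attrs)):
        assignment = dict(zip(attrs, combo))
        # Discard assignments that violate any domain constraint.
        if not all(check(assignment) for check in constraints):
            continue
        score = sum(local_score[(a, v)] + alpha * support[(a, v)]
                    for a, v in assignment.items())
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score

# Hypothetical candidate values extracted from several pages about one camera.
candidates = {"min_focal_mm": [18.0, 55.0], "max_focal_mm": [18.0, 55.0]}

# Hypothetical per-candidate scores from a local, single-page extractor.
local_score = {("min_focal_mm", 18.0): 0.40, ("min_focal_mm", 55.0): 0.60,
               ("max_focal_mm", 18.0): 0.55, ("max_focal_mm", 55.0): 0.45}

# Hypothetical fraction of the entity's pages that support each candidate.
support = {("min_focal_mm", 18.0): 0.7, ("min_focal_mm", 55.0): 0.3,
           ("max_focal_mm", 18.0): 0.2, ("max_focal_mm", 55.0): 0.8}

# Hypothetical domain constraint: minimum focal length cannot exceed maximum.
constraints = [lambda a: a["min_focal_mm"] <= a["max_focal_mm"]]

# Local-only baseline: pick the best-scoring value for each attribute alone.
local_only = {a: max(vals, key=lambda v: local_score[(a, v)])
              for a, vals in candidates.items()}

joint, joint_score = select_values(candidates, local_score, support, constraints)
print("local only:", local_only)
print("joint:     ", joint)
```

With these made-up numbers, the local-only baseline picks a minimum focal length larger than the maximum, while the joint selection returns a consistent pair. A real system would hand the same objective and constraints to an ILP solver rather than enumerate assignments.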