Discovering missing values in semi-structured databases

  • Authors:
  • Xing Yi;James Allan;Victor Lavrenko

  • Affiliations:
  • Drive University of Massachusetts, Amherst, MA;Drive University of Massachusetts, Amherst, MA;Drive University of Massachusetts, Amherst, MA

  • Venue:
  • Large Scale Semantic Access to Content (Text, Image, Video, and Sound)
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

We explore the problem of discovering multiple missing values in a semi-structured database. For this task, we formally develop Structured Relevance Model (SRM) built on one hypothetical generative model for semi-structured records. SRM is based on the idea that plausible values for a given field could be inferred from the context provided by the other fields in the record. Small-scale experiments on IMDb (Internet Movie Database) show that SRM matched three state-of-the-art relational learning approaches on the movie label prediction tasks. Large-scale experiments on a snapshot of the National Science Digital Library (NSDL) repository show that SRM is highly effective at discovering possible values for free-text fields even with quite modest amounts of training data, compared with state-of-the-art machine learning approaches.