Extracting information networks from the blogosphere

  • Authors: Yuval Merhav; Filipe Mesquita; Denilson Barbosa; Wai Gen Yee; Ophir Frieder

  • Affiliations: Illinois Institute of Technology, Chicago, IL; University of Alberta; University of Alberta; Orbitz Worldwide, Chicago, IL; Georgetown University, Washington, D.C.

  • Venue: ACM Transactions on the Web (TWEB)
  • Year: 2012

Abstract

We study the problem of automatically extracting information networks, formed by recognizable entities and the relations among them, from social media sites. Our approach uses state-of-the-art natural language processing tools to identify entities and extract sentences that relate such entities, and then uses text-clustering algorithms to identify the relations within the information network. We propose a new term-weighting scheme that significantly improves on the state of the art in the task of relation extraction, both when used in conjunction with the standard tf·idf scheme and when used as a pruning filter. We describe an effective method for identifying benchmarks for open information extraction that relies on a curated online database and that is comparable to the hand-crafted evaluation datasets in the literature. From this benchmark, we derive a much larger dataset that mimics realistic conditions for the task of open information extraction. We report on extensive experiments on both datasets, which shed light not only on the accuracy levels achieved by state-of-the-art open information extraction tools, but also on how to tune such tools for better results.
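The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that entity pairs and the tokens between them have already been extracted, weights those tokens with the standard tf·idf scheme only (the abstract does not specify the proposed weighting scheme, so it is not reproduced here), and substitutes a simple greedy threshold clustering for whatever text-clustering algorithm the paper uses. The function names, the threshold value, and the placeholder contexts are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(contexts):
    """Build a sparse tf-idf vector for each entity pair.

    contexts maps an entity pair to the list of tokens found between
    the two entities (an assumed input format, not the paper's).
    """
    n = len(contexts)
    df = Counter()
    for tokens in contexts.values():
        df.update(set(tokens))  # document frequency: count each term once per pair
    vectors = {}
    for pair, tokens in contexts.items():
        tf = Counter(tokens)
        vectors[pair] = {t: c * math.log(n / df[t]) for t, c in tf.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vectors, threshold=0.3):
    """Greedy single-pass clustering (an illustrative stand-in).

    Each pair joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []
    for pair, vec in vectors.items():
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(pair)
                break
        else:
            clusters.append([pair])
    return clusters


# Toy, made-up contexts with placeholder entity names.
contexts = {
    ("PersonA", "PersonB"): ["married", "to"],
    ("PersonC", "PersonD"): ["is", "married", "to"],
    ("CompanyX", "CompanyY"): ["acquired"],
    ("CompanyZ", "CompanyW"): ["has", "acquired"],
}

clusters = cluster(tfidf_vectors(contexts))
for members in clusters:
    print(members)
```

On this toy input the two "married" pairs fall into one cluster and the two "acquired" pairs into another; each cluster stands for one candidate relation in the extracted network. In a realistic setting the threshold and the clustering algorithm would need tuning, which is the kind of question the experiments mentioned in the abstract address.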