Adaptation of Apriori to MapReduce to Build a Warehouse of Relations between Named Entities across the Web

  • Authors:
  • Jean-Daniel Cryans; Sylvie Ratté; Roger Champagne

  • Venue:
  • DBKDA '10 Proceedings of the 2010 Second International Conference on Advances in Databases, Knowledge, and Data Applications
  • Year:
  • 2010

Abstract

The Semantic Web has made it possible to extract useful content from the Internet, a task that may require infrastructure spanning the Web. With Hadoop, a free implementation of the MapReduce programming paradigm introduced by Google, this data can be processed reliably across hundreds of servers. This article describes how the Apriori algorithm was adapted to MapReduce to search for relations between named entities in the thousands of Web pages arriving daily from RSS feeds. First, every feed is polled five times per day and each entry is registered in a database with MapReduce. Second, the entries are read and their content is sent to the OpenCalais Web service for named-entity detection; for each Web page, the set of all itemsets found is generated and stored in the database. Third, all generated itemsets, from first to last, are counted and their support is recorded. Finally, various analytical tasks are executed to present the relationships found. Our tests show that the third step, executed over 3,000,000 itemsets, ran 4.5 times faster on five servers than on a single machine. This approach allows processing to be distributed easily and automatically over as many machines as are available, and makes it possible to handle datasets that no single server, however powerful, could manage alone. We believe this work is a step forward in processing Semantic Web data efficiently and effectively.
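The support-counting step described above maps naturally onto the MapReduce model: each page's itemsets are emitted as (itemset, 1) pairs, and a reduce phase sums the counts. The following is a minimal single-process sketch of that idea, not the authors' Hadoop implementation; the entity names and pages are invented examples.

```python
from collections import defaultdict
from itertools import combinations

def map_page(entities):
    """Map step: emit (itemset, 1) for every non-empty subset of the
    named entities detected on one page (the candidate itemsets)."""
    entities = sorted(set(entities))  # canonical order so subsets compare equal
    for size in range(1, len(entities) + 1):
        for itemset in combinations(entities, size):
            yield itemset, 1

def reduce_counts(mapped_pairs):
    """Reduce step: sum the emitted counts per itemset to get its support."""
    support = defaultdict(int)
    for itemset, count in mapped_pairs:
        support[itemset] += count
    return dict(support)

# Invented sample input: named entities found on three Web pages.
pages = [
    ["Google", "Hadoop"],
    ["Google", "Hadoop", "Yahoo"],
    ["Google"],
]
pairs = (pair for page in pages for pair in map_page(page))
support = reduce_counts(pairs)
# ("Google",) occurs on all three pages; ("Google", "Hadoop") on two.
```

In a real Hadoop job the map and reduce functions run on separate workers and the framework shuffles the (itemset, 1) pairs by key, which is what lets the counting scale across servers as the article reports.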