Adaptation of Apriori to MapReduce to Build a Warehouse of Relations between Named Entities across the Web

  • Authors:
  • Jean-Daniel Cryans; Sylvie Ratté; Roger Champagne

  • Venue:
  • DBKDA '10 Proceedings of the 2010 Second International Conference on Advances in Databases, Knowledge, and Data Applications
  • Year:
  • 2010

Abstract

The Semantic Web has made it possible to extract useful content from the Internet, a task that may require infrastructure spanning the Web. With Hadoop, a free implementation of the MapReduce programming paradigm introduced by Google, this data can be processed reliably across hundreds of servers. This article describes how the Apriori algorithm was adapted to MapReduce to search for relations between named entities in the thousands of Web pages arriving daily from RSS feeds. First, every feed is polled five times per day and each entry is registered in a database with MapReduce. Second, the entries are read and their content is sent to the OpenCalais Web service for named-entity detection; for each Web page, the set of all itemsets found is generated and stored in the database. Third, all generated itemsets, from first to last, are counted and their support is recorded. Finally, various analytical tasks are executed to present the relationships found. Our tests show that the third step, executed over 3,000,000 itemsets, ran 4.5 times faster on five servers than on a single machine. This approach allows processing to be distributed easily and automatically over as many machines as are available, and makes it possible to handle datasets that no single server, however powerful, could manage alone. We believe this work is a step forward in processing Semantic Web data efficiently and effectively.
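The support-counting step described above maps naturally onto the MapReduce model: each page's itemsets are emitted as (itemset, 1) pairs, and a reduce phase sums the counts. The following is a minimal single-process sketch of that idea, not the authors' Hadoop implementation; the entity names and pages are invented examples.

```python
from collections import defaultdict
from itertools import combinations

def map_page(entities):
    """Map step: emit (itemset, 1) for every non-empty subset of the
    named entities detected on one page (the candidate itemsets)."""
    entities = sorted(set(entities))  # canonical order so subsets compare equal
    for size in range(1, len(entities) + 1):
        for itemset in combinations(entities, size):
            yield itemset, 1

def reduce_counts(mapped_pairs):
    """Reduce step: sum the emitted counts per itemset to get its support."""
    support = defaultdict(int)
    for itemset, count in mapped_pairs:
        support[itemset] += count
    return dict(support)

# Invented sample input: named entities found on three Web pages.
pages = [
    ["Google", "Hadoop"],
    ["Google", "Hadoop", "Yahoo"],
    ["Google"],
]
pairs = (pair for page in pages for pair in map_page(page))
support = reduce_counts(pairs)
# ("Google",) occurs on all three pages; ("Google", "Hadoop") on two.
```

In a real Hadoop job the map and reduce functions run on separate workers and the framework shuffles the (itemset, 1) pairs by key, which is what lets the counting scale across servers as the article reports.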