Many stream-based applications have plentiful resources, but in others resource consumption must be limited. MESHJOIN was proposed for one important class of stream-based joins, in which a stream is joined with a non-stream master data set. Because it uses limited memory, MESHJOIN is a candidate for a resource-aware system setup. The problem considered in this paper is that MESHJOIN is not very selective: its performance is always inversely proportional to the size of the master data table, so its resource consumption is suboptimal in some scenarios. We present an algorithm, CACHEJOIN, which performs asymptotically at least as well as MESHJOIN and performs better in realistic scenarios, particularly when parts of the master data are used with different frequencies. To quantify the performance differences, we compare both algorithms on a synthetic data set with a known skewed distribution.
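
The intuition that skewed access to master data rewards caching can be illustrated with a toy sketch. This is not the CACHEJOIN or MESHJOIN algorithm from the paper: it assumes an in-memory dictionary standing in for the disk-resident master data, a Zipf-like key distribution, and a simple frequency-based eviction rule. The names `skewed_stream` and `join_with_cache` and all parameters are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def skewed_stream(n_tuples, n_keys, exponent=1.0, seed=0):
    """Return n_tuples join keys in [0, n_keys); lower keys are drawn more often.

    Assumption: a Zipf-like weighting (1 / rank**exponent) stands in for the
    'known skewed distribution' mentioned in the abstract.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** exponent for rank in range(n_keys)]
    return rng.choices(range(n_keys), weights=weights, k=n_tuples)


def join_with_cache(stream_keys, master, cache_capacity):
    """Join every stream key with its master row and count cache misses.

    Each miss stands in for one access to the non-stream master data.
    This is an illustrative frequency-based cache, not the paper's method.
    """
    cache = {}        # key -> master row kept in memory
    seen = Counter()  # how often each key has appeared in the stream so far
    misses = 0
    output = []
    for key in stream_keys:
        seen[key] += 1
        if key in cache:
            row = cache[key]
        else:
            misses += 1
            row = master[key]  # simulated master data access
            if len(cache) < cache_capacity:
                cache[key] = row
            elif cache:
                # Replace the least frequently seen cached key, but only if
                # the new key has been seen strictly more often.
                coldest = min(cache, key=lambda k: seen[k])
                if seen[key] > seen[coldest]:
                    del cache[coldest]
                    cache[key] = row
        output.append((key, row))
    return output, misses


if __name__ == "__main__":
    master = {k: f"row-{k}" for k in range(1000)}
    keys = skewed_stream(10_000, len(master))
    for capacity in (0, 10, 100):
        _, misses = join_with_cache(keys, master, capacity)
        print(f"cache capacity {capacity}: {misses} master data accesses")
```

With a capacity of 0 every stream tuple triggers a master data access, while a small cache absorbs the frequently used keys. The sketch only counts accesses and does not model the limited-memory processing that MESHJOIN and CACHEJOIN are designed around.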