A Cached-based approach to enrich Stream data with master data
To fulfill the increasing demand of business for the latest information, current data integration approaches are moving towards real-time updates. One important element in real-time data integration is the join of a continuous incoming data stream with a disk-based relation. In this paper we investigate a stream-based join algorithm, called mesh join (MESHJOIN), and propose an improved version called reduced MESHJOIN (R-MESHJOIN). Both algorithms tune the memory, allocating parts of the memory to key components. In MESHJOIN there is a dependency between the size of partitions in an internal queue for the stream data and the number of iterations required to bring the disk-based relation into memory. This dependency hampers the optimal distribution of memory among the join components. In particular the size of the disk-buffer varies with the size of the disk-based relation which is unnecessary. On the other hand the R-MESHJOIN algorithm removes this dependency. This enables an optimal distribution of available memory among the join components. In R-MESHJOIN a change in the size of the disk-based relation does not affect the size of the disk-buffer. An experimental study is conducted in order to validate the arguments.
Unfortunately, ACM prohibits us from displaying non-influential references for this paper.
To see the full reference list, please visit http://dl.acm.org/citation.cfm?id=1871952.