TwigStack-MR: An Approach to Distributed XML Twig Query Using MapReduce

Abstract

Twig pattern querying is the core operation of XML processing and directly affects the efficiency of XML data queries. Processing massive XML data is challenging, especially on a distributed cluster, where the system must guarantee the completeness and correctness of the query results while minimizing communication costs between machines. In this paper, we present TwigStack-MR, which processes several twig pattern queries simultaneously over a massive volume of XML data using the MapReduce framework. We first split the large-scale XML data file into file splits that serve as input to the distributed storage system. We then present the distributed twig algorithm, which processes different subtrees of the document tree in parallel. Finally, we use the MapReduce framework, exploiting the characteristics of a distributed environment, to process twig queries efficiently. Experimental results show that our approach is efficient and scalable.
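
To make the split, map, and reduce flow described above concrete, the following is a minimal single-process sketch in Python. It is not the TwigStack-MR algorithm itself: it uses a naive parent-child check rather than a stack-based twig join, and it assumes each file split is already a well-formed XML fragment. The names `TWIG`, `map_split`, and `reduce_matches`, as well as the sample data, are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical twig pattern: a root tag plus child tags that must all be
# present, roughly the XPath expression book[title][author].
TWIG = ("book", ["title", "author"])


def map_split(split_id, xml_fragment, twig):
    """Map step: match the twig inside one split and emit (key, value) pairs."""
    root_tag, child_tags = twig
    fragment_root = ET.fromstring(xml_fragment)
    for node in fragment_root.iter(root_tag):
        child_names = {child.tag for child in node}
        # Naive parent-child test (an assumption of this sketch, not the
        # stack-based matching used by TwigStack-style algorithms).
        if all(tag in child_names for tag in child_tags):
            yield root_tag, (split_id, node.findtext(child_tags[0]))


def reduce_matches(pairs):
    """Reduce step: group the matches emitted by all mappers by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sorted(values) for key, values in grouped.items()}


# Two hypothetical splits, each assumed to be a well-formed subtree.
splits = [
    "<lib><book><title>A</title><author>X</author></book>"
    "<book><title>B</title></book></lib>",
    "<lib><book><title>C</title><author>Y</author></book></lib>",
]

# Simulate the mappers sequentially, then run a single reducer.
pairs = [pair for i, s in enumerate(splits) for pair in map_split(i, s, TWIG)]
result = reduce_matches(pairs)
print(result)
```

In this sketch each split is matched independently, so no data moves between mappers, and only the matches reach the reducer. A real deployment would also have to handle twig matches that span split boundaries, which this example deliberately ignores.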

DOI: 10.1109/BigDataCongress.2016.79

Cite this paper

@article{Fan2016TwigStackMRAA,
  title={TwigStack-MR: An Approach to Distributed XML Twig Query Using MapReduce},
  author={Hongjie Fan and Han Yang and Zhiyi Ma and Junfei Liu},
  journal={2016 IEEE International Congress on Big Data (BigData Congress)},
  year={2016},
  pages={133-140}
}