Corpus ID: 37665897

GeoSpark: A Cluster Computing Framework for Processing Spatial Data

Jia Yu, Jinxuan Wu
This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: the Apache Spark Layer, the Spatial RDD Layer, and the Spatial Query Processing Layer. The Apache Spark Layer provides basic Spark functionality, including loading and storing data on disk as well as regular RDD operations. The Spatial RDD Layer consists of three novel Spatial Resilient Distributed Datasets (SRDDs), which extend regular Apache Spark RDDs to support…

SparkNN: A Distributed In-Memory Data Partitioning for KNN Queries on Big Spatial Data
SparkNN, an in-memory partitioning and indexing system for answering spatial queries such as K-nearest neighbor on big spatial data, is proposed; it significantly outperforms the state-of-the-art Spark system when evaluated on the same set of queries.
Spatio-Temporal Data Streams
Spatio-Temporal Data Streams is a valuable resource for researchers studying spatio-temporal data streams and Big Data analytics, as well as data engineers and data scientists solving data management and analytics problems associated with this class of data.
Sphinx: Empowering Impala for Efficient Execution of SQL Queries on Big Spatial Data
This paper presents Sphinx, a full-fledged open-source system for big spatial data which overcomes the limitations of existing systems by adopting a standard SQL interface, and by providing a high…
HiVision: Rapid Visualization of Large-Scale Spatial Vector Data
Experiments show that the HiVision approach outperforms traditional methods in rendering speed and visual effects while dealing with large-scale spatial vector data, and can provide interactive visualization of datasets with billion-scale points/segments/edges in real-time with flexible rendering styles.
Big spatial vector data management: a review
A review that surveys recent studies and research work on data management for big spatial vector data (BSVD). It systematically summarizes the most recently published literature and gives a global view of the main spatial technologies of BSVD, including data storage and organization, spatial indexing, processing methods, and spatial analysis.


Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce
Hadoop-GIS, a scalable and high-performance spatial data warehousing system for running large-scale spatial queries on Hadoop, is presented. It is integrated into Hive to support declarative spatial queries with an integrated architecture.
A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data
This demo presents SpatialHadoop as the first full-fledged MapReduce framework with native support for spatial data and demonstrates a real system prototype of SpatialHadoop running on an Amazon EC2 cluster against two sets of real spatial data obtained from Tiger Files and OpenStreetMap.
Data Partitioning for Parallel Spatial Join Processing
It is shown that near-optimal speedup can be achieved for parallel spatial join processing using the filter-and-refine strategy for spatial operation processing, and that the key to overcoming this problem is to preserve spatial locality in task decomposition.
MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services
The design of MD-HBase is presented, a scalable data management system for LBSs that builds two standard index structures (the K-d tree and the Quad tree) over a range-partitioned key-value store and allows efficient multi-dimensional query processing.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Resilient Distributed Datasets (RDDs) are presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
R-trees: a dynamic index structure for spatial searching
A dynamic index structure called an R-tree is described which meets this need, and algorithms for searching and updating it are given. It is concluded that the structure is useful for current database systems in spatial applications.
MapReduce: Simplified Data Processing on Large Clusters
This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
The ganglia distributed monitoring system: design, implementation, and experience
The design, implementation, and evaluation of Ganglia are presented along with experience gained through real world deployments on systems of widely varying scale, configurations, and target application domains over the last two and a half years.
Parallel Secondo: Boosting Database Engines with Hadoop
  • Jiamin Lu, R. H. Güting
  • Computer Science
  • 2012 IEEE 18th International Conference on Parallel and Distributed Systems
  • 2012
This paper proposes a light and efficient coupling structure that combines Hadoop with single-computer databases at the engine level. It provides a simple, independent distributed file system to transfer data among database engines directly, without passing through HDFS, removing as much unnecessary transformation and transfer overhead as possible.
A non-blocking parallel spatial join algorithm
Results from a prototype implementation in a commercial parallel object-relational DBMS show that the proposed parallel non-blocking spatial join algorithm uses duplicate avoidance rather than duplicate elimination, and that its rate of producing answer tuples scales with the number of processors.