>>2008,ZHOU Yan, Spatial data declustering method and parallel spatial index
Abstract
Spatial database technology is always one core research area of spatial information science. With the rapid development of spatial information infrastructure and spatial data acquire technology, the scale of spatial data became larger and larger, which leading to more and more demand for fast spatial data accessing. At the same time, the continue emerged new applications fields of spatial data such as spatial data warehouse, spatial data mining, and especially network spatial information service represented by Google Earth have a growing demand for high performance spatial database system. It is difficult to improve database system performance simply rely on hardware development, so researchers try to develop parallel spatial database technology so as to enhance spatial database system performance. It has been a hot point of current research.
Spatial data declustering and parallel spatial index are two important aspects of parallel database technology. By distributed storing spatial data into different computer nodes, parallel spatial database can obtain parallel data I/O capacity and improve spatial database accessing performance. Effective parallel spatial indexing mechanism is used to achieve rapid response of spatial square and improve spatial data accessing performance. Due to the complexity of spatial data itself (such as unstructured variable-length structure, non-uniform distribution, as well as complex spatial relationships, etc.), the current methods of spatial data declustering mostly use the same methods as common parallel database. These methods did not consider the characteristics of spatial data so that hardly get a balanced spatial data distribution, which seriously hampering the parallel performance of database system. This paper concentrates on the data declustering problem of parallel spatial database, which mainly include the following contents:
(1) After comprehensively analyzing the existing distributed storage methods of spatial data and comparing the main characteristics between one-dimensional and multi-dimensional data distribution methods, the major problems existed in current data distribution are summarized and extracted: (a)The current methods did not fully consider the unstructured variable-length and non-uniform distribution characteristics of spatial data, which resulting in an unbalanced distribution of spatial data and can impact parallel performance of database system; (b)These methods did not take full account of the relationship between objects and could break up the spatial relationship between spatial objects, which can impact system performance. Aiming to these problems, the reasonable design principles of spatial data declustering method were proposed.
(2) After analyzing the impact of the spatial locality, a spatial data declustering method based on Hilbert curve hierarchical decomposition was presented In this method, the adjacent relations among objects are kept by Hilbert curve. According to the unstructured variable-length and non-uniform distribution characteristics of spatial data, smaller size grid is used in the area where spatial objects are very dense while lager size grid is used in the area where spatial objects are very sparse. The balanced distribution of spatial data among computer nodes can be achieved by hierarchical decomposition of initial partitioning grid. Experiments show that this method can effectively overcome the drawbacks of existing spatial data declustering methods and achieve a good storage balance. It is particularly suited to the uneven distributed spatial data sets. Compared to other methods, the proposed method is more efficient and can achieve a better parallelism ratio.
(3) A data skew correction method based on the minimum spatial proximity is proposed . Dynamic updating of spatial databases can easily damage the initial distribution balancing among nodes, which result in data skew. Because of lacking dynamic balancing mechanisms, the existing methods can only achieve uniform distribution on static spatial databases and have little effect on supporting dynamic spatial databases. To solve this problem, firstly the major types and the main solution of data skew are analyzed. Then the main methods to deal with data skew problem are summarized; The flaws existed in the general dynamic balancing methods is that considering only balance of data volume but not considering spatial locality so as to reduce system parallel performance. We present a new data skew correction method based on the minimum spatial adjacent degree was presented. The proposed method take spatial proximity as a standard when making the dynamic distribution adjustment strategy of spatial data, which consider the system parallelism and data distribution balance together and break the limitations of the existing methods that considering of the data volume balance only.
(4) A distributed multi-levels parallel spatial indexing structure DMP-Rtree is presented. After analyzed the inherent parallelism of spatial index tree and summarized the three main spatial index structures as well as two implementation mechanisms of parallel spatial index, a distributed multi-levels parallel spatial indexing structure named DMP-Rtree was presented, which aims to solve the problems in general parallel spatial index methods such as index hotspots, access bottlenecks and the difficulty to maintain consistency of parallel spatial index structure. DMP-Rtree take distributed indexing structure to avoid index hotspots and access bottlenecks emerged in traditional master-slave model of parallel spatial index. By multi-levels indexing structure, the entire spatial index will be divided into a sets of sub-Rtree Index, which helps to restrict the works of index maintaining in each sub-Rtree itself as much as possible and can reduces the complexity of maintaining parallel spatial index structure; The detail of the DMP-Rtree structure as well as related indexing algorithms is illustrated in this paper. Spatial query experiments also verified the validity of the DMP-Rtree structure and its rapid query response performance.
The parallel spatial database oriented distributed storage technology and parallel spatial index method presented in this paper has been tested by using a real spatial datasets and VC + MPI programming environment. The experimental results show that the method is correct and effective, which provides an efficient way for establishing large-scale parallel spatial databases.
Key Words
Parallel Spatial Databases, Spatial Data Declustering, Parallel Spatial Indexing, Spatial Data Skew, Dynamic Data Balancing