Scientific and commercial Internet users with extreme database requirements have long complained about the inadequacy of commercial Database Management System (DBMS) offerings, which are based on the standard relational model. The relational model is often inefficient for the types of data used in complex analytics.

Arrays are a natural data model for a significant subset of science users. In array-based systems, each cell inherits locality information from its dimension values. The dimensions also provide a natural index that improves query performance in array databases.

SciDB (http://scidb.org) is a new open-source DBMS organized around a multidimensional array data model, a generalization of the relational model that can provide orders-of-magnitude better performance. SciDB is designed to store petabytes of data spread over a large number of machines. High performance, high availability, fault tolerance, and scalability are the main goals of its design.

Iterative Parallel Array Processing


A large set of machine learning, web ranking, and graph processing algorithms run in an iterative fashion. Unsurprisingly, many array analysis tasks also involve iteration, across a wide range of applications including data cleaning, structure or pattern discovery, tracking of structures or patterns over time, and model fitting. In this project we investigate possible implementations of this iterative array processing. The naive approach is to use a driver program and express the body of the iterative loop as a series of queries (in SciDB, AQL queries), which can be very slow. Optimization opportunities include techniques such as incremental, asynchronous, and prioritized processing.
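The naive driver-program pattern can be sketched as follows. This is a hypothetical Python illustration, not SciDB's API: a plain in-memory list stands in for the array, and an ordinary function stands in for the query issued on each iteration; the function names are ours.

```python
# Hypothetical sketch of the naive driver-program pattern: each iteration
# issues one "query" (here, a plain function) over the whole array, and
# the loop stops when a fixed point is reached.
def iterate_until_fixed_point(array, step, max_iters=100):
    """Repeatedly apply `step` until the result stops changing."""
    for _ in range(max_iters):
        new_array = step(array)
        if new_array == array:  # convergence test costs one extra pass
            break
        array = new_array
    return array

# Toy stand-in for the loop body: smooth each interior cell toward the
# mean of its 3-cell neighborhood.
def smooth_step(cells):
    out = list(cells)
    for i in range(1, len(cells) - 1):
        out[i] = round((cells[i - 1] + cells[i] + cells[i + 1]) / 3)
    return out

result = iterate_until_fixed_point([0, 9, 0, 9, 0], smooth_step)
```

The convergence test itself re-reads the entire array each iteration, which is one of the inefficiencies that incremental processing would avoid.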

We verify the effectiveness of our innovations by evaluating them on real scientific use cases. For example, in astronomy, some sources are too faint to be detected in a single image but can be detected by stacking multiple images. The pixel-wise summation over all images is called a “co-add.” Astronomers run an iterative noise-reduction algorithm (sigma-clipping) before performing the co-add. The algorithm repeatedly compares each pixel location across the stacked images and removes extreme outliers, so that each pixel in the final image is derived from the most “stable” pixels in the stack. See the detailed description of this use case. Other iterative array examples include tracking simulated object trajectories and a “Friends-of-Friends” clustering algorithm.
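A minimal sketch of sigma-clipping followed by co-add, assuming the images are pre-aligned on the same pixel grid and represented as plain Python lists. The function name, the clipping threshold `k`, and the iteration cap are illustrative choices of ours, not the astronomers' actual pipeline.

```python
import statistics

def sigma_clip_coadd(stack, k=2.0, iters=3):
    """Sigma-clip each pixel location across the image stack, then co-add.
    `stack` is a list of images, each a flat list of pixel values."""
    coadd = []
    for p in range(len(stack[0])):
        values = [img[p] for img in stack]  # same pixel across all images
        for _ in range(iters):
            mu = statistics.mean(values)
            sigma = statistics.pstdev(values)
            kept = [v for v in values if abs(v - mu) <= k * sigma]
            if not kept or len(kept) == len(values):
                break                        # converged: nothing clipped
            values = kept                    # drop outliers and repeat
        coadd.append(sum(values))            # co-add = pixel summation
    return coadd
```

For instance, with four stacked images where one has a cosmic-ray hit of 100 at the second pixel, `sigma_clip_coadd([[1, 1], [1, 1], [1, 1], [1, 100]], k=1.0)` clips the outlier before summing.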

Exciting news! We presented AscotDB, a new tool for the analysis of telescope image data, at VLDB 2013. AscotDB results from the integration of Ascot, a web-based tool for the collaborative analysis of images and metadata from astronomical telescopes, with SciDB, a parallel array processing engine. We demonstrated the novel data exploration supported by this integrated tool on a 1 TB dataset comprising scientifically accurate, simulated telescope images. Stay tuned for more news.

  • A Demonstration of Iterative Parallel Array Processing in Support of Telescope Image Analysis
    Matthew Moyers, Emad Soroush, Spencer C. Wallace, Simon Krughoff, Jake Vanderplas, Magdalena Balazinska, and Andrew Connolly. VLDB 2013.


  • Time Travel in a Scientific Array Database


    Scientific databases such as SciDB must analyze big data quickly. In particular, they must efficiently store previous versions of data arrays so that scientists can compare earlier and current versions of their data. In this paper, we present TimeArr, a new storage manager for array databases such as SciDB. TimeArr combines techniques such as backward-delta encoding, tiling, and bit-masking to vastly improve the performance of versioning compared to SciDB's current design. We also extend the versioning system to support approximate results and non-consecutive backward deltas called “skip links.” For a detailed description, please refer to the paper.
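    The backward-delta idea can be sketched as follows. This toy `VersionedArray` class is our own illustration, not TimeArr's actual on-disk format, and it omits tiling, bit-masking, and skip links (a skip link would additionally store a delta that jumps several versions at once, so that reads of old versions need not walk every intermediate step).

    ```python
    class VersionedArray:
        """Backward-delta versioning sketch: the newest version is stored
        in full, and each older version is stored as the cell-wise delta
        that turns its successor back into it."""
        def __init__(self, initial):
            self.latest = list(initial)  # newest version, materialized
            self.back = []               # back[v] rebuilds version v from v+1

        def append(self, new_version):
            # Store the delta old - new, then overwrite the materialized copy.
            delta = [o - n for o, n in zip(self.latest, new_version)]
            self.back.append(delta)
            self.latest = list(new_version)

        def get(self, v):
            """Walk the backward-delta chain from the latest version to v."""
            cells = list(self.latest)
            for delta in reversed(self.back[v:]):
                cells = [c + d for c, d in zip(cells, delta)]
            return cells
    ```

    Because the most recent version is the one scientists query most often, it is always readable with no delta application at all; the cost of reconstruction grows only for older versions, which is exactly the access pattern skip links then shorten.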

  • Time Travel in a Scientific Array Database
    Emad Soroush and Magdalena Balazinska. ICDE 2013.


  • ArrayStore: A Storage Manager for Complex Parallel Array Processing


    ArrayStore is a new read-only storage manager for complex, parallel array processing, developed at the University of Washington. ArrayStore supports a parallel and complex workload comprising not only range queries, but also binary operations such as joins and complex user-defined functions.

    We made two contributions in this project. First, we examined several storage management and array partitioning strategies to identify which combination is best suited for the array-processing workload above. Second, we developed a new and efficient storage management mechanism that enables parallel processing of operations that must access data from adjacent partitions. The proposed mechanism is called overlap processing.

    For example, given a 2D telescope image, astronomers often run user-defined functions that extract objects such as bright stars or galaxies. Because telescope images can be large, they must be partitioned and distributed over multiple nodes, and the object (star) detection query then runs in parallel on each partition. But how can objects that cross partition boundaries be detected? Such objects can be detected multiple times, once in each partition where they appear, followed by a complex post-processing merge operation that prunes the duplicate results. A much cleaner solution, however, is to provide each partition with a margin of overlapping data along its boundaries. The challenge then is how to select the appropriate amount of data overlap when partitioning the array.
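    The basic single-layer overlap idea can be sketched as follows, here for a 1-D array with a 3-cell moving-max operator standing in for the detection query. The function names and tuple layout are our own illustration, not ArrayStore's API.

    ```python
    def partition_with_overlap(cells, chunk, margin):
        """Split a 1-D array into fixed-size chunks, each carrying `margin`
        extra cells of overlap on both sides (clipped at the array edges)."""
        parts = []
        for start in range(0, len(cells), chunk):
            lo = max(0, start - margin)
            hi = min(len(cells), start + chunk + margin)
            # (core start, local data incl. overlap, core offset in data)
            parts.append((start, cells[lo:hi], start - lo))
        return parts

    def parallel_window_max(cells, chunk, w=1):
        """(2w+1)-cell moving max: each partition computes results for its
        core cells from its local data plus overlap -- no communication."""
        n, out = len(cells), []
        for start, data, off in partition_with_overlap(cells, chunk, margin=w):
            for i in range(min(chunk, n - start)):
                j = off + i                  # core cell's local position
                out.append(max(data[max(0, j - w):j + w + 1]))
        return out
    ```

    Because each partition's margin covers the operator's full neighborhood, concatenating the per-partition results reproduces the answer computed on the whole array, with no merge phase and no duplicate detections.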

    Tuning the system to find the right overlap size is difficult: an overlap size that is optimal for one workload can incur large overhead on others. In ArrayStore, we overcome this challenge with a new technique that combines a multi-layer overlap strategy with materialized overlap views, illustrated in Figure 1. This technique significantly outperforms approaches that provide no overlap or only a single pre-defined overlap layer. Figure 1 shows the different overlap execution techniques presented in the ArrayStore project. The overall performance gain from our technique was up to a factor of two for real queries on real data from the astronomy and oceanography domains.

    Figure 1: Example of multi-layer overlap in a sparse array. Multi-layer overlap is like a set of onion-skin layers around chunks. In this array, extracting cluster C2 requires the operator to load only a small amount of overlap data, denoted L1. C3 requires an additional overlap layer, L2. None of the clusters requires the third layer, L3.

  • ArrayStore: A Storage Manager for Complex Parallel Array Processing
    Emad Soroush, Magdalena Balazinska, and Daniel Wang. SIGMOD 2011.


  • Hybrid Merge/Overlap Execution Technique


    After developing ArrayStore, we looked further at efficient strategies for parallel array processing. We examined different types of array operations and how they can be processed in parallel using two techniques. The first technique, called merge, consists of partitioning an array, processing the partitions in parallel, and then merging the results to reconcile any computations that span partition boundaries. The second technique, called overlap, consists of partitioning an array into subarrays that overlap by a given number of cells along each dimension. Thanks to this overlap, the array partitions can be processed in parallel without any merge phase.
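    As an illustration of the merge technique on an additive operation, here is a toy moving-sum sketch (our own code, not the paper's implementation): each partition computes partial results from its own cells alone, and the merge phase then reconciles windows that span partition boundaries by adding the partials elementwise.

    ```python
    def merge_moving_sum(cells, chunk, w=1):
        """'Merge' strategy for a (2w+1)-cell moving sum over a 1-D array."""
        n = len(cells)
        partials = []
        for start in range(0, n, chunk):
            part = [0] * n
            for i in range(start, min(start + chunk, n)):
                # Cell i contributes to every window position j it falls in,
                # including positions owned by neighboring partitions.
                for j in range(max(0, i - w), min(n, i + w + 1)):
                    part[j] += cells[i]
            partials.append(part)
        # Merge phase: reconcile computations spanning partition boundaries.
        return [sum(p[j] for p in partials) for j in range(n)]
    ```

    Merge needs no overlap data but pays for a reconciliation pass; overlap reads extra boundary cells up front but skips that pass. Which cost is smaller varies across regions of the same array, which is what motivates the hybrid approach described below.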

    We studied when each technique can be applied to an array operation and showed that, even for a single operation, different approaches may perform best for different regions of an array. Because the overall performance of a distributed query is determined by its most time-consuming partition, a hybrid technique can alleviate these worst-case scenarios. Following this observation, we developed a new parallel array processing technique that combines the merge and overlap approaches, enabling a system to mix and match the two within a single operation on an array. We are now working on automating the selection of the best strategy.

  • Hybrid Merge/Overlap Execution Technique for Parallel Array Processing
    Emad Soroush and Magdalena Balazinska. ArrayDB Workshop, held in conjunction with EDBT 2011.


  • Faculty
    Magdalena Balazinska

    Students/Postdocs
    Jake VanderPlas
    Emad Soroush
    Matt Moyers