Computer science
The streaming processing of SAR data in distributed environment with Apache Spark
V. P. Potapov, M. A. Kostylev, S. E. Popov
Institute of Computational Technologies of the Siberian Branch of the Russian Academy of Sciences, 6, Academician M. A. Lavrentiev pr., Novosibirsk, 630090, Russian Federation
Abstract:
This article presents a modern approach to building a distributed software system, based on massively parallel technology, for pre- and postprocessing of SAR images. The distinguishing features of the system are its ability to work in real time with large volumes of streaming data and to run existing algorithms that were not designed for distributed processing on multiple nodes without changing their implementation. Distributed processing technologies were compared, and Apache Spark was selected on that basis. The article demonstrates how automatic processing of input SAR images can be organised as a sequence of operations executed when defined conditions are met. Processing results are stored in the system as fault-tolerant distributed collections (RDDs, Resilient Distributed Datasets), which allows intermediate results to be retrieved and saved in the distributed file system HDFS as new satellite images become available and are processed by the sequence of algorithms. An implementation of specific SAR data processing tasks based on the proposed approach is described (phase estimation, coregistration, interferogram creation, and phase unwrapping with the region-growing method). A scheme of the phase unwrapping algorithm that can use GPUs through NVIDIA CUDA technology is presented, and its adaptation to massively parallel systems is shown. The algorithm implementation is focused on processing a pair of SAR images on one node. Performance growth is achieved by simultaneously processing multiple images, whose number equals the number of cluster nodes. An example implementation of methods for working with streaming binary data (BinaryRecordStream) is shown; these methods monitor the distributed file system HDFS for new SAR data and read and write this data as binary files with a fixed record size in bytes. A directory and the size of one record are used as the input parameters.
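The directory-plus-record-size interface described above can be illustrated with a minimal sketch. It is plain Python on a local file system, not the article's Spark/HDFS implementation. The names `read_fixed_records` and `scan_directory` are hypothetical. Spark Streaming provides `binaryRecordsStream(directory, recordLength)` with the same two parameters; whether the article's BinaryRecordStream wraps that call is an assumption.

```python
import os
import tempfile
from typing import Iterator, List, Set


def read_fixed_records(path: str, record_length: int) -> Iterator[bytes]:
    """Yield consecutive records of exactly record_length bytes from a file."""
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    with open(path, "rb") as f:
        while True:
            record = f.read(record_length)
            if not record:
                break  # end of file
            if len(record) != record_length:
                raise ValueError("file size is not a multiple of record_length")
            yield record


def scan_directory(directory: str, record_length: int, seen: Set[str]) -> List[bytes]:
    """Return records from files not yet in `seen` and mark those files as seen.

    Calling this repeatedly mimics one polling step of a directory monitor.
    """
    records: List[bytes] = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if path in seen or not os.path.isfile(path):
            continue
        seen.add(path)
        records.extend(read_fixed_records(path, record_length))
    return records
```

Each call to `scan_directory` returns only the records of files that appeared since the previous call, which is the behaviour a streaming monitor needs; a real deployment would delegate this bookkeeping to the streaming framework.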
The results of testing the developed algorithms on a demonstration cluster are presented. They show that processing can be up to eight times faster on an eight-node cluster than sequential processing of the same number of images on one node. The test results show that the performance of the presented algorithms can be improved without any changes to their implementation, which in turn justifies applying the distributed approach to SAR data processing. Refs 26. Figs 4. Tables 3.
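The scaling claim follows from running an unchanged per-pair algorithm on several pairs at once, one pair per node. The sketch below illustrates that idea with a thread pool standing in for cluster nodes. `process_pair` is a hypothetical placeholder, not the article's interferometric chain, and `ideal_speedup` is an upper bound that ignores scheduling, I/O, and data transfer costs; it is not a measurement from the article.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def process_pair(pair: Tuple[str, str]) -> str:
    """Hypothetical stand-in for the single-node processing of one image pair."""
    first, second = pair
    return f"{first}-{second}"


def process_all(pairs: List[Tuple[str, str]], n_workers: int) -> List[str]:
    """Run the unchanged per-pair function on up to n_workers pairs at a time."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves the input order of the pairs in its results
        return list(pool.map(process_pair, pairs))


def ideal_speedup(n_pairs: int, n_nodes: int) -> float:
    """Upper bound on speedup when every pair takes the same time on any node."""
    rounds = math.ceil(n_pairs / n_nodes)  # batches needed with n_nodes in parallel
    return n_pairs / rounds
```

Under these assumptions, eight pairs on eight nodes finish in one round instead of eight, which gives the eightfold bound quoted above; with fewer nodes than pairs, the bound drops because extra rounds are needed.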
Keywords:
Apache Spark, Apache Hadoop, distributed information systems, SAR interferometry, processing algorithms.
Received: September 15, 2016 Accepted: April 11, 2017
Citation:
V. P. Potapov, M. A. Kostylev, S. E. Popov, “The streaming processing of SAR data in distributed environment with Apache Spark”, Vestnik S.-Petersburg Univ. Ser. 10. Prikl. Mat. Inform. Prots. Upr., 13:2 (2017), 168–181
Linking options:
https://www.mathnet.ru/eng/vspui330 https://www.mathnet.ru/eng/vspui/v13/i2/p168