4/22/2013

Parallelism in Testing – To Be or Not To Be?


Today I would like to explore the topic of parallel testing in the context of test automation: its potential benefits and possible usage scenarios.

So let’s dwell on test automation as we know it.  What can we call its most common benefit?  The most obvious one is resource savings.  Let’s stop for a minute and try to define the tasks that are most frequently automated: repetitive inputs, UI navigation, performance measurements, and so on.  The next question is which product areas are eligible for automation in the first place.  When someone has to achieve 100% test automation coverage, they may claim that every bit of the program’s functionality can be automated.  Fine by me, but is it reasonable?  Is it worth spending hours of work to produce an automated test that will run only once or twice, yet could be tested manually in just a few minutes?  Don’t expect a unanimous answer, but in the majority of such cases the resources spent are not worth the results.  Another point that can be hard to overcome is the maintenance of automated test cases.  This problem mostly affects products that change frequently during development or over the product lifecycle.

4/09/2013

Speeding up MR Image Reconstruction with GeneRalized Autocalibrating Partially Parallel Acquisitions (GRAPPA)


This time we would like to share some details about one of our projects in medical imaging.
Currently, one of the bottlenecks in MR image reconstruction is speed. Improving reconstruction speed is difficult from an algorithmic point of view, but it is becoming increasingly popular to improve algorithm performance using the GPU.

Introduction

In magnetic resonance (MR) image reconstruction, the raw data measured by the scanner correspond to Fourier coefficients of the target image, so the fast Fourier transform (FFT) is used for reconstruction. An important procedure called density compensation is required at the very beginning to account for non-uniform sampling. GeneRalized Autocalibrating Partially Parallel Acquisitions (GRAPPA) is a partially parallel acquisition (PPA) method used to reduce scan times: only part of the data is acquired, and the missing data is mathematically calculated from the available data.
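To make the pipeline above more concrete, here is a minimal sketch of its two basic steps, density compensation followed by an inverse FFT, for a single 2D frame using NVidia's cuFFT API. The buffer and kernel names, sizes and in-place layout are illustrative assumptions, not code from the actual project.

#include <cufft.h>
#include <cuda_runtime.h>

// Elementwise density compensation: every k-space sample is scaled by a
// precomputed weight that accounts for the non-uniform sampling density.
__global__ void density_compensate(cufftComplex* kspace, const float* weights, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        kspace[i].x *= weights[i];
        kspace[i].y *= weights[i];
    }
}

// Reconstruct one nx-by-ny frame that is already resident in GPU memory.
// d_kspace and d_weights are hypothetical device buffers filled elsewhere.
void reconstruct_frame(cufftComplex* d_kspace, const float* d_weights, int nx, int ny)
{
    int n = nx * ny;
    density_compensate<<<(n + 255) / 256, 256>>>(d_kspace, d_weights, n);

    cufftHandle plan;
    cufftPlan2d(&plan, nx, ny, CUFFT_C2C);
    cufftExecC2C(plan, d_kspace, d_kspace, CUFFT_INVERSE);  // k-space -> image space, in place
    cufftDestroy(plan);
}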

Subject for optimization

The original implementation was based on the FFTW library for fast Fourier transforms and an adapted Singular Value Decomposition (SVD) algorithm from the ACM Collected Algorithms for GRAPPA preprocessing. These two algorithms are the most computationally intensive parts of the whole image reconstruction. The FFTW library is claimed to be the fastest CPU implementation, using all possible optimizations such as SSE2/3 and hyper-threading; however, it does not leverage the power of modern GPU cards. The SVD algorithm ran on the CPU as well. It is known to parallelize poorly for small matrices, but in the case of the GRAPPA algorithm there are many image frames of the same size which can be processed in parallel. Besides, there are many intermediate steps that consume a lot of CPU time and can be easily parallelized on the GPU.

Technical analysis

FFTW performance is comparable with the Intel MKL implementation. NVidia provides a comparison of their CUDA-based cuFFT library with the MKL implementation (Figure 1):

Figure 1 Comparison of the CUDA-based cuFFT library with the MKL implementation

According to this comparison, we should achieve up to 8-10x faster FFT processing when using the GPU-accelerated cuFFT library from NVidia. A GPU-accelerated SVD implementation is also available, for example the CULA library by EM Photonics. However, the current CULA implementation does not support batch mode, so all image frames would have to be processed as a sequence. Brief testing showed that 64 image frames (256x256) are processed even more slowly than with the CPU-based version. Since we have not found any good alternative to the CULA library, we decided to implement our own GPU-accelerated version of the SVD algorithm.

Implementation

The FFT part of image reconstruction was straightforward with the cuFFT library; however, we had to deal with data sets that do not fit into the available GPU memory. We had to write an algorithm that runs the FFT over portions of a large data set with subsequent aggregation (a sketch of this chunked approach follows Figure 2.2 below). Figure 2.1 below shows the case when all data fits into GPU memory.
Figure 2.1 FFT performance when all data fits into GPU memory
Figure 2.2 illustrates the case when a huge data set is processed. Solid lines in the figure show measured performance; dashed lines show the estimated time if the data fit into GPU memory.
Figure 2.2 FFT performance when the data set does not fit into GPU memory
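A minimal sketch of the chunked approach described above: when the whole set of frames does not fit into GPU memory, the frames are split into portions, each portion is copied to the device, transformed with one batched cuFFT plan, and copied back. The function name, fixed chunk size and synchronous copies are simplifying assumptions for illustration.

#include <algorithm>
#include <cufft.h>
#include <cuda_runtime.h>

// Run a 2D FFT over 'total_frames' frames of size nx-by-ny stored in host
// memory, processing at most 'chunk' frames per pass so the working set
// always fits into the available GPU memory.
void fft_large_dataset(cufftComplex* h_data, int nx, int ny, int total_frames, int chunk)
{
    size_t frame_elems = (size_t)nx * ny;
    cufftComplex* d_buf;
    cudaMalloc(&d_buf, chunk * frame_elems * sizeof(cufftComplex));

    int rank = 2, dims[2] = { nx, ny };
    cufftHandle plan;
    // One batched plan transforms a whole chunk of frames in a single call.
    cufftPlanMany(&plan, rank, dims,
                  NULL, 1, (int)frame_elems,   // input: contiguous frames
                  NULL, 1, (int)frame_elems,   // output: same layout
                  CUFFT_C2C, chunk);

    for (int start = 0; start < total_frames; start += chunk) {
        int frames = std::min(chunk, total_frames - start);
        cufftComplex* src = h_data + (size_t)start * frame_elems;

        cudaMemcpy(d_buf, src, frames * frame_elems * sizeof(cufftComplex),
                   cudaMemcpyHostToDevice);
        // The plan always transforms a full chunk; for a partial last chunk the
        // tail of d_buf holds junk that is simply never copied back.
        cufftExecC2C(plan, d_buf, d_buf, CUFFT_FORWARD);
        cudaMemcpy(src, d_buf, frames * frame_elems * sizeof(cufftComplex),
                   cudaMemcpyDeviceToHost);
    }
    cufftDestroy(plan);
    cudaFree(d_buf);
}

In the real library the aggregation of per-chunk results and the overlapping of copies with computation matter as well, but the splitting idea is the same.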
Much more interesting was implementing a GPU-accelerated SVD algorithm with batch mode. All implementations we had found focus on maximum parallelization of a single SVD run, hence we had to change the approach. Basically, the SVD algorithm consists of Householder reduction, QR diagonalization and back transformation steps. All are iterative processes in which each step depends on the results of the previous one. With small matrices, a single CUDA kernel cannot effectively utilize all the parallel processing units of a modern GPU, so we wrote the kernels in such a way that every iteration for all matrices is processed by a single kernel run (a simplified sketch follows the figures below). This way, in the case of 64 matrices of size 128x128, we can process 64*128 elements at a time instead of 128. Figures 3.1 and 3.2 show the performance comparison between the CULA library and our implementation.
Figure 3.1

Figure 3.2
With more than 8 frames per batch, our implementation shows much better performance compared to sequential CULA calls, although it is not as efficient for a single frame.
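To illustrate the batching idea, here is a simplified sketch of the kernel-launch structure: the host loops over the iterations of the reduction, and every launch processes the current iteration for all matrices of the batch, one matrix per thread block. Only the column-norm part of a Householder step is shown; the real kernels are more involved, and all names here are illustrative rather than taken from the library.

// One iteration (column k) of a batched Householder-style reduction.
// Each thread block handles one matrix of the batch, so a single kernel
// launch does useful work for ALL matrices at once.
__global__ void householder_step(float* matrices, int n, int k)
{
    extern __shared__ float partial[];
    float* A = matrices + (size_t)blockIdx.x * n * n;   // this block's matrix

    // Parallel reduction of the squared norm of column k from the diagonal down.
    float sum = 0.0f;
    for (int i = k + threadIdx.x; i < n; i += blockDim.x) {
        float v = A[i * n + k];
        sum += v * v;
    }
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    float norm = sqrtf(partial[0]);
    // ...form the Householder vector from this norm and apply the reflection
    // to the trailing columns of A (omitted to keep the sketch short).
    (void)norm;
}

// Host side: with a batch of 64 matrices and 128 threads per block, each
// launch keeps 64 * 128 threads busy instead of the 128 a single small
// matrix would allow.
// for (int k = 0; k < n - 1; ++k)
//     householder_step<<<batch_size, 128, 128 * sizeof(float)>>>(d_matrices, n, k);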

Results

As a result, we have developed a pure C++ library with a set of specialized functions that perform the various stages of image reconstruction. It requires only the CUDA runtime libraries and the free cuFFT library provided by NVidia. In addition, we have implemented a lightweight C# wrapper for convenient usage. We have also run a lot of benchmarks with various GPU cards and on different platforms. On the test cases provided by the customer we achieved up to a 150x speedup compared to the original single-threaded CPU implementation. However, a significant part of this speedup comes from the fact that the poorly optimized original code was completely rewritten and ported to CUDA wherever possible.
While it is usually clear what the FFT stage does in image reconstruction, the GRAPPA stage is not so obvious. Parallel acquisition of the different frames introduces distortion into the acquired data, and the GRAPPA stage effectively eliminates it. Figure 4 shows a visual comparison of images before and after reconstruction.

Figure 4 The image before reconstruction (left) and after reconstruction (right)
Additionally, you can find the case study on the ELEKS website or download it as a PDF. Stay tuned!
/by Volodymyr Kozak, Principal Developer, ELEKS R&D

ELEKS open-source projects directory

Even though ELEKS is a software development services company and we mainly develop proprietary software for our customers, we have our own open-source projects and contribute to projects maintained by other people and organizations. We are proud to be a part of the open-source community. We have recently launched a GitHub page where you can find a list of our projects and the projects we contribute to: http://eleks.github.io/.
We are going to publish more open-source projects soon. Stay tuned!

4/04/2013

20x performance for the same HW price: GPGPU-based Machine Learning at Yandex

The Russian search engine Yandex has disclosed some details about its machine-learning framework, FML. The most interesting detail is that it runs on an 80 TFLOPS cluster powered by GPGPUs. This is quite an unusual application for GPUs, as ML algorithms are usually hard to parallelize. However, Yandex has managed to adapt its decision-tree algorithm for a high level of parallelism and, as a result, has achieved more than a 20x speed-up for the same hardware price.
They are going to upgrade the cluster to 300 TFLOPS and expect it to be on the list of the 100 most powerful supercomputers in the world after that upgrade.
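For readers wondering how a decision-tree learner maps onto a GPU at all: a common approach in GPU-based gradient boosting, and only a rough analogue of whatever Yandex actually does inside FML, is to discretize every feature into a small number of bins and build per-feature gradient histograms in parallel, one thread per training sample. A minimal CUDA sketch of that idea, with illustrative names:

#define NUM_BINS 256  // features are pre-discretized into this many bins

// Each thread processes one training sample and adds its gradient to the
// matching bin of every feature; atomics resolve concurrent updates.
__global__ void build_histograms(const unsigned char* bins,  // [num_samples * num_features]
                                 const float* gradients,     // [num_samples]
                                 float* hist,                // [num_features * NUM_BINS]
                                 int num_samples, int num_features)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // sample index
    if (i >= num_samples) return;
    float g = gradients[i];
    for (int f = 0; f < num_features; ++f) {
        int b = bins[(size_t)i * num_features + f];
        atomicAdd(&hist[f * NUM_BINS + b], g);
    }
}

The best split of a tree node is then found by scanning each feature's histogram, and with millions of samples this exposes exactly the kind of massive data parallelism a GPU is built for.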

4/01/2013

The end of JavaScript's domination

The world is changed.
I feel it in the water.
I feel it in the earth.
I smell it in the air.
The Lord of the Rings: The Fellowship of the Ring


Mozilla's ASM.js release last week created a serious buzz in the programming community. Some people are excited by it, some criticize it, but almost everybody agrees that it may have a significant impact on the future of the web.
I consider it one more nail in the coffin of the mono-language web. Today there are more than 200 languages and tools that generate JavaScript, and this number is increasing every week. Of course, most of them are no more than amateur projects designed to solve some particular JavaScript problem. However, there is a clear trend, and the reason for it is obvious: the complexity of the web client side is increasing, and JavaScript is not always ready to support this complexity. Most of these tools have emerged over the last few years. Here are a few of them that I consider most important:
Language/Tool    Year of appearance
CoffeeScript     2009
Dart             2011
TypeScript       2012
Emscripten       2012
ASM.js           2013

What is really important is that the three major browser vendors seem to be supporters of this movement: Google (with its Dart), Microsoft (TypeScript) and Mozilla (ASM.js). Together they hold 90% of the browser market share and definitely have enough power and influence to change the client-side programming landscape over the next few years.
I don't mean that Dart, TypeScript or some other language will replace JavaScript completely. It will always be here, but it is hard to argue with the fact that JavaScript is no longer the only client-side programming language on the web. It still dominates, but things are going to change over the next few years.
Any thoughts?