
4/09/2013

Speeding up MR Image Reconstruction with GeneRalized Autocalibrating Partially Parallel Acquisitions (GRAPPA)


This time we would like to share some details about the implementation of one of our projects in medical imaging.
One of the current bottlenecks in MR image reconstruction is speed. Improving reconstruction speed from an algorithmic point of view is difficult, but it is becoming increasingly popular to improve algorithm performance using the GPU.

Introduction

In magnetic resonance (MR) image reconstruction, the raw data measured by the scanner correspond to the Fourier coefficients of the target image, so the fast Fourier transform (FFT) is used for reconstruction. An important procedure called density compensation is necessary at the very beginning to account for non-uniform sampling. GeneRalized Autocalibrating Partially Parallel Acquisitions (GRAPPA) is a partially parallel acquisition (PPA) method used to reduce scan times: only partial data is acquired, and the missing data is mathematically calculated from the available data.

Subject for optimization

The original implementation was based on the FFTW library for fast Fourier transforms and on a Singular Value Decomposition (SVD) algorithm adapted from the ACM Collected Algorithms for GRAPPA preprocessing. These two algorithms are the most computationally intensive parts of the whole image reconstruction. The FFTW library is claimed to be the fastest CPU implementation, using all possible optimizations such as SSE2/3 and hyper-threading; however, it does not leverage the power of modern GPU cards. The SVD algorithm ran on the CPU as well. It is known to parallelize poorly for small matrices, but in the case of the GRAPPA algorithm we have many image frames of the same size which can be processed in parallel. Besides, there are many intermediate steps which consume a lot of CPU time and can easily be parallelized on the GPU.

Technical analysis

FFTW library performance is comparable with the Intel MKL implementation. NVIDIA provides a comparison of their CUDA-based cuFFT library with the MKL implementation (Figure 1):

Figure 1 Comparison of CUDA based cuFFT library with MKL implementation
According to this, we should achieve up to 8-10x faster FFT processing when using the GPU-accelerated cuFFT library from NVIDIA. GPU-accelerated SVD implementations are also available, for example the CULA library by EM Photonics. However, the current CULA library implementation does not support batch mode, so we would need to process all image frames sequentially. Brief testing showed that 64 image frames (256×256) are processed even slower than with the CPU-based version. Since we have not found any good alternative to the CULA library, we decided to implement our own GPU-accelerated version of the SVD algorithm.

Implementation

The FFT part of image reconstruction with the cuFFT library was straightforward; however, we had to deal with image frames that do not fit into the available GPU memory. We had to write an algorithm that runs the FFT over portions of a large data frame with subsequent aggregation (a sketch follows after the figures). Figure 2.1 below shows the case when all data fits into GPU memory.
Figure 2.1 
Figure 2.2 illustrates the case when huge data is processed. Solid lines in the figure below show measured performance; dashed lines show the estimated time if the data fit into GPU memory.
Figure 2.2
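For illustration, here is a minimal sketch of how a batch of frames can be transformed with cuFFT in a single call. This is not our production code: the frame size and batch count are illustrative, and for frames that exceed GPU memory the same plan would be applied to portions of the data, with the results aggregated afterwards.

#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    // Illustrative sizes: 64 frames of 256x256 k-space data.
    const int nx = 256, ny = 256, batch = 64;

    cufftComplex* d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * nx * ny * batch);
    // ... upload k-space samples into d_data here ...

    // One plan transforms the whole batch in a single call.
    cufftHandle plan;
    int dims[2] = {nx, ny};
    cufftPlanMany(&plan, 2, dims,
                  nullptr, 1, nx * ny,  // input layout: contiguous frames
                  nullptr, 1, nx * ny,  // output layout: same
                  CUFFT_C2C, batch);

    // Inverse transform: k-space -> image domain, in place.
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}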
Much more interesting was implementing a GPU-accelerated SVD algorithm with batch mode. All implementations we had found focus on maximum parallelization of a single SVD run, hence we had to change the approach. Basically, the SVD algorithm consists of Householder reduction, QR diagonalization and back transformation steps. All are iterative processes where each step depends on the results of the previous one. In the case of small matrices, a single CUDA kernel cannot effectively utilize all the parallel processing units of a modern GPU. So we had to write the kernels in such a way that every iteration for all matrices is processed by a single kernel run. This way, in the case of 64 matrices of size 128×128 each, we can process 64*128 elements at a time instead of 128. Figures 3.1 and 3.2 show a performance comparison between the CULA library and our implementation.
Figure 3.1

Figure 3.2
With more than 8 frames per batch our implementation shows much better performance compared to sequential CULA calls, although it is not as efficient for a single frame. A minimal sketch of the batching idea follows below.
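The sketch below is a hypothetical illustration of the grid layout, not our production kernel: blockIdx.y selects the matrix, so one launch performs a single iteration step for all matrices at once, and 64 matrices of size 128×128 expose 64*128 parallel work items per step instead of 128. The actual Householder numerics are replaced by a stand-in operation.

__global__ void batchedIterationStep(float* matrices, int n, int step) {
    int m   = blockIdx.y;                             // matrix index in the batch
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column to update
    if (col <= step || col >= n) return;

    float* A = matrices + (size_t)m * n * n;
    // Stand-in for the real Householder column update of this step;
    // the point here is the grid layout, not the numerics.
    A[step * n + col] *= 0.5f;
}

// Host side: the steps stay sequential, but each launch covers the whole batch.
void runBatched(float* d_matrices, int n, int batchCount) {
    dim3 block(128);
    dim3 grid((n + block.x - 1) / block.x, batchCount);
    for (int step = 0; step < n - 1; ++step)
        batchedIterationStep<<<grid, block>>>(d_matrices, n, step);
}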

Results

As a result, we have developed a pure C++ library with a set of specialized functions which perform the various stages of image reconstruction. It requires only the CUDA runtime libraries and the free cuFFT library provided by NVIDIA. In addition, we have implemented a lightweight C# wrapper for convenient usage. We have also run a lot of benchmarks with various GPU cards and on different platforms. On test cases provided by the customer we achieved up to a 150x speedup compared to the original single-threaded CPU implementation. However, a significant part of that speedup was due to the poorly optimized original code, which was completely rewritten and ported to CUDA wherever possible.
While it is usually clear what the FFT stage does in image reconstruction, the GRAPPA stage is not so obvious. Parallel acquisition of different frames distorts the acquired data, and the GRAPPA stage effectively eliminates this distortion. Figure 4 shows a visual representation of images before and after reconstruction.

Figure 4 The image before reconstruction (left) and after reconstruction (right)
Additionally, you can find a case study on the ELEKS website or download it as a PDF. Stay tuned!
/by Volodymyr Kozak, Principal Developer, ELEKS R&D

4/04/2013

20x performance for the same HW price: GPGPU-based Machine Learning at Yandex

The Russian search engine Yandex has disclosed some details about their machine-learning framework, FML. The most interesting detail is that it runs on an 80 TFLOPS cluster powered by GPGPU. This is quite an unusual application for GPUs, as ML algorithms are usually hard to parallelize. However, they have managed to adapt their decision-tree algorithm for a high level of parallelism. As a result, Yandex has achieved more than a 20x speed-up for the same hardware price.
They are going to upgrade their cluster to 300 TFLOPS. Yandex expects its cluster to be on the list of the top 100 most powerful supercomputers in the world after that upgrade.

11/14/2012

NVIDIA Tesla K20 benchmark: facts, figures and some conclusions


The newest GPGPU flagship, Tesla K20, was announced by NVIDIA at the Supercomputing conference in Salt Lake City yesterday (BTW, you can meet Roman Pavlyuk, ELEKS' CTO, and Oleh Khoma, Head of HPC Unit, there). Thanks to our partnership with NVIDIA, we got access to the K20 a couple of months ago and ran lots of performance tests. Today we're going to tell you more about its performance in comparison with several other NVIDIA accelerators that we have here at ELEKS.

Test environment

We implemented a set of synthetic micro-benchmarks that measure the performance of the following basic GPGPU operations:
  • Host/Device kernel operations latency
  • Reduction time (SUM)
  • Dependent/Independent FLOPs
  • Memory management
  • Memory transfer speed
  • Device memory access speed
  • Pinned memory access speed


You can find more information and benchmark results below. Our set of tests is available on GitHub, so you can run them on your own hardware if you want. We ran these tests on seven different configurations:
  • GeForce GTX 580 (PCIe-2, OS Windows, physical box)
  • GeForce GTX 680 (PCIe-2, OS Windows, physical box)
  • GeForce GTX 680 (PCIe-3, OS Windows, physical box)
  • Tesla K20Xm (PCIe-3, ECC ON, OS Linux, NVIDIA EAP server)
  • Tesla K20Xm (PCIe-3, ECC OFF, OS Linux, NVIDIA EAP server)
  • Tesla M2050 (PCIe-2, ECC ON, OS Linux, Amazon EC2)
  • Tesla M2050 (PCIe-2, ECC ON, OS Linux, PEER1 HPC Cloud)

One of the goals was to determine the difference between the K20 and older hardware configurations in terms of overall system performance. Another goal was to understand the difference between virtualized and non-virtualized environments. Here is what we got:

Host/Device kernel operations latency

One of the new features of the K20 is Dynamic Parallelism (DP), which allows kernels to launch other kernels. We wrote a benchmark that measures the latency of kernel scheduling and execution with and without DP. The results without DP look like this:

Surprisingly, the new Tesla is slower than the old one and the GTX 680, probably because of the driver, which was in beta at the time we measured performance. It is also obvious that AWS GPU instances are much slower than the closer-to-hardware PEER1 ones, because of virtualization.
Then we tried to run a similar benchmark with DP on:

Obviously, we couldn't run these tests on older hardware because it doesn't support DP. Surprisingly, DP scheduling is slower than traditional scheduling, while DP execution time is pretty much the same with ECC ON, and traditional execution is faster with ECC OFF. We expected DP latency to be lower than traditional latency. It is hard to say what the reason for this slowness is. We suppose that it could be the driver, but that is just our assumption.
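For readers unfamiliar with the feature, here is a minimal hypothetical sketch of Dynamic Parallelism (not our benchmark code): a parent kernel launches a child grid directly from the device, with no host round-trip. It requires compute capability 3.5+ and relocatable device code, compiled roughly like: nvcc -arch=sm_35 -rdc=true dp.cu -lcudadevrt

#include <cstdio>

__global__ void child() {
    printf("child thread %d\n", threadIdx.x);
}

__global__ void parent() {
    if (threadIdx.x == 0) {
        child<<<1, 4>>>();  // scheduled from the device, no host involvement
    }
}

int main() {
    parent<<<1, 32>>>();
    cudaDeviceSynchronize();  // host waits for the parent and its child grids
    return 0;
}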

Reduction time (SUM)

The next thing we tried to measure was reduction execution time. Basically, we calculated an array sum. We did it with different array and grid sizes (blocks × threads × array size):



Here we got the expected results. The new Tesla K20 is slower on small data sets, probably because of its lower clock frequency and not fully fledged drivers. It becomes faster when we work with big arrays and use as many cores as possible.
Regarding virtualization, we found that the virtualized M2050 is comparable with the non-virtualized one on small data sets, but much slower on large data sets. A sketch of a typical reduction kernel of this kind follows below.
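The sketch below shows the classic shared-memory sum reduction pattern; our actual benchmark kernel may differ in details. Each block writes one partial sum, and the partial sums are then reduced again on the host or by a second launch.

__global__ void reduceSum(const float* in, float* blockSums, int n) {
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned i   = blockIdx.x * blockDim.x * 2 + tid;

    // Each thread loads and adds two elements, halving the block count.
    float v = (i < n ? in[i] : 0.0f) +
              (i + blockDim.x < n ? in[i + blockDim.x] : 0.0f);
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];
}

// Launch: reduceSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_sums, n);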

Dependent/Independent FLOPs

Peak theoretical performance is one of the most misunderstood properties of computing hardware. Some people say it means nothing; some say it is critical. The truth is somewhere in between. We tried to measure performance in FLOPs using several basic operations. We measured two types of operations, dependent and independent, in order to determine whether the GPU automatically parallelizes independent operations (see the sketch after the results below). Here's what we got:





Surprisingly, we did not get better results with independent operations. Probably there is some issue with our tests, or we misunderstood how automatic parallelization works on the GPU, but we couldn't implement a test where independent operations are automatically parallelized.
Regarding the overall results, Teslas are much faster than GeForces when you work with double-precision floating-point numbers, which is expected: consumer accelerators are optimized for single precision, because double precision is not required in computer games, the primary software they were designed for. FLOPs also depend strongly on clock speed and the number of cores, so newer cards with more cores are usually faster, except for one case with the GTX 580/680 and double precision: the 580 is faster because of its higher clock frequency.
Virtualization doesn't affect FLOPs performance at all.
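The hypothetical kernels below illustrate what we mean by dependent versus independent operations; they are a sketch, not our exact benchmark. In the dependent version each multiply-add feeds the next, so throughput is limited by instruction latency; in the independent version four separate chains give the hardware scheduler instruction-level parallelism to exploit.

__global__ void dependentFlops(float* out, int iters) {
    float a = 1.0001f;
    for (int i = 0; i < iters; ++i)
        a = a * 1.0000001f + 0.0000001f;   // each op depends on the previous one
    out[threadIdx.x] = a;
}

__global__ void independentFlops(float* out, int iters) {
    float a = 1.0f, b = 2.0f, c = 3.0f, d = 4.0f;
    for (int i = 0; i < iters; ++i) {      // four independent chains per thread
        a = a * 1.0000001f + 0.0000001f;
        b = b * 1.0000001f + 0.0000001f;
        c = c * 1.0000001f + 0.0000001f;
        d = d * 1.0000001f + 0.0000001f;
    }
    out[threadIdx.x] = a + b + c + d;      // keep results live so nothing is optimized away
}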

Memory management

Another critical thing for HPC is basic memory management speed. As there are several memory models available in CUDA, it is also important to understand the implications of using each of them. We wrote a test that allocates and releases 16 B, 10 MB and 100 MB blocks of memory in the different models. Please note: the results in this benchmark vary widely, so it makes sense to show them on charts with a logarithmic scale. Here they go:


Device memory is obviously the fastest option if you allocate a big chunk of memory. And the GTX 680 with PCIe-3 is our champion in device memory management. Teslas are slower than GeForces in all the tests. Virtualization seriously affects Host Write Combined memory management. PCIe-3 is better than PCIe-2, which is also expected.
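For reference, the allocation models we timed map to the following CUDA runtime calls; this is a minimal sketch with an illustrative 100 MB block size.

#include <cuda_runtime.h>

int main() {
    const size_t size = 100u * 1024 * 1024;  // 100 MB, as in the benchmark
    void *dev, *pinned, *wc;

    cudaMalloc(&dev, size);                                    // device memory
    cudaHostAlloc(&pinned, size, cudaHostAllocDefault);        // page-locked host memory
    cudaHostAlloc(&wc, size, cudaHostAllocWriteCombined);      // write-combined host memory

    cudaFreeHost(wc);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}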

Memory transfer speed

Another important characteristic of an accelerator is the speed of data transfer between memory models. We measured it by copying 100 MB blocks of data between host and GPU memory in both directions using the regular, page-locked and write-combined memory access models. Here's what we got:

Obviously, PCIe-3 configurations are much faster than PCIe-2 ones. Kepler devices (GTX 680 and K20) are faster than the others. Using the page-locked and write-combined models makes transfers faster. Virtualization slightly affects regular memory transfer speed and doesn't affect the others at all. We also tested internal memory transfer speed (please note, we haven't multiplied it by 2 as NVIDIA usually does in their tests):
Tesla K20s are faster than the GeForces, but the difference is not that big. The M2050s are almost two times slower than their successors. A sketch of the kind of transfer we timed follows below.
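The sketch below shows a host-to-device and device-to-host copy using page-locked memory, which enables DMA and makes the async copy genuinely asynchronous; the 100 MB size mirrors the benchmark, the rest is illustrative.

#include <cuda_runtime.h>

int main() {
    const size_t size = 100u * 1024 * 1024;
    float *host, *dev;
    cudaHostAlloc(&host, size, cudaHostAllocDefault);  // page-locked: enables DMA
    cudaMalloc(&dev, size);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(dev, host, size, cudaMemcpyHostToDevice, stream);  // H->D
    cudaMemcpyAsync(host, dev, size, cudaMemcpyDeviceToHost, stream);  // D->H
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}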

Device memory access speed

We also measured device memory access speed for each configuration we have. Here are the results:

Aligned memory access is way faster than non-aligned (almost a 10x difference). Newer accelerators are better than older ones. Double-precision reads/writes are faster than single-precision ones for all the configurations. Virtualization doesn't affect memory access speed at all.
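The hypothetical kernels below illustrate the difference between the two access patterns; they are a sketch of the technique, not our exact benchmark. In the aligned case consecutive threads touch consecutive words, so accesses coalesce into few memory transactions; shifting the starting index breaks that coalescing, especially on older GPUs.

__global__ void copyAligned(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // consecutive threads hit consecutive words
}

__global__ void copyMisaligned(const float* in, float* out, int n, int offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - offset) out[i] = in[i + offset];  // shifted start breaks coalescing
}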

Pinned memory access speed

The last metric we measured was pinned memory access speed, i.e. when the device accesses host memory directly. Unfortunately, we weren't able to run these tests on the GTX 680 with PCIe-3 due to an issue with allocating big memory blocks in Windows.

The new Tesla is faster than the old one. PCIe-3 is obviously faster. Aligned access is almost ten times faster, and if you read double-precision floats your memory access speed is two times higher than with single-precision floats. The virtualized environment is slower than the non-virtualized one.
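For context, device access to pinned host memory works roughly as in the sketch below (a minimal zero-copy example, not our benchmark code): the host buffer is allocated as mapped pinned memory, and every device read or write of it crosses the PCIe bus.

#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // each access goes over PCIe to host memory
}

int main() {
    const int n = 1 << 20;
    float *host, *devPtr;
    cudaSetDeviceFlags(cudaDeviceMapHost);               // must precede context creation
    cudaHostAlloc(&host, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&devPtr, host, 0);          // device-side view of host buffer
    scale<<<(n + 255) / 256, 256>>>(devPtr, n);
    cudaDeviceSynchronize();
    cudaFreeHost(host);
    return 0;
}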

Conclusions

All in all, the new Tesla K20 performs slightly faster than its predecessors. There is no revolution. There is evolution: we got better performance and new tools that make a programmer's life easier. There are also several things not covered in this benchmark, like better support for virtualization and, as a result, the cloud-readiness of the K20. Some results were surprising. We expect better results from the K20 in a few months, when a new, optimized version of the drivers becomes available (NVIDIA always has some issues with new drivers just after release, but usually fixes them after several updates).

You can find a spreadsheet with the complete results at Google Docs. Benchmark sources are available at our GitHub.

11/13/2012

Tesla K20 benchmark results

Recently we developed a set of synthetic tests to measure NVIDIA GPU performance. We ran them on several test environments:


  • GTX 580 (PCIe-2, OS Windows, physical box)
  • GTX 680 (PCIe-2, OS Windows, physical box)
  • GTX 680 (PCIe-3, OS Windows, physical box)
  • Tesla K20Xm (PCIe-3, ECC ON, OS Linux, NVIDIA test data center)
  • Tesla K20Xm (PCIe-3, ECC OFF, OS Linux, NVIDIA test data center)
  • Tesla M2050 (PCIe-2, ECC ON, OS Linux, Amazon EC2)


Please note that the next-generation Tesla K20 is also included in our results (many thanks to NVIDIA for their early access program).
You can find results at Google Docs. Benchmark sources are available at our GitHub account.
Stay tuned, we're going to make some updates on this.

UPD: Detailed results with charts and some conclusions: http://www.elekslabs.com/2012/11/nvidia-tesla-k20-benchmark-facts.html

5/29/2012

GTC 2012


NVidia started their business by building graphics accelerators for video games and 3D designers. In 2007 they announced CUDA, a unified parallel computing architecture. Five years ago it seemed that only crazy people would do vital calculations on hardware designed for video games. Now building High Performance Computing accelerators is becoming a core business for NVidia.
This year's GPU Technology Conference (by the way, ELEKS was a Silver Sponsor of GTC 2012) has shown us how important the HPC business is for NVidia. They put a lot of effort into making it more and more usable by developers. They make GPUs more energy-efficient, they pack them with many new features, and they make them faster and faster. Check out the outstanding keynote by Jen-Hsun Huang for more details.
Oleh Khoma, Head of the HPC unit at ELEKS, was a speaker at GTC 2012. Here is his presentation:
An effective HPC system is so much more than just GPGPU. Real-world applications often need to stream large amounts of data across system boundaries to dozens of worker nodes in the most scalable and efficient way. They usually require storing huge amounts of data, scheduling computation jobs, monitoring system health and visualizing results. Having first-hand experience in the design, development and implementation of end-to-end HPC solutions, our engineers share their experience on some of the pitfalls to avoid and things to consider when planning your next HPC system that works.

2/23/2012

DevTalks #1 presentations

Materials from our second DevTalks event (February 22, 2012).
1. Hadoop: the Big Answer to the Big Question of the Big Data (by Victor Haydin)
Video (in Ukrainian):


2. GPGPU: From 3D games to supercomputing (by Taras Shpot)
Slides: http://ge.tt/8fyqX0E/v/0 (hint: use 'n' and 'p' keys for navigation - feel the power of emacs)
Video (in Ukrainian):