NVIDIA announced its newest GPGPU flagship, the Tesla K20, yesterday at the Supercomputing conference in Salt Lake City (by the way,
you can meet Roman Pavlyuk, ELEKS' CTO, and Oleh Khoma, Head of HPC Unit, there). Thanks to our partnership with NVIDIA, we got access to the K20 a couple of months ago and ran a lot of performance tests. Today we're going to tell you more about its performance in comparison with several other NVIDIA accelerators that we have here at ELEKS.
Test environment
We implemented a set of synthetic micro-benchmarks that measure the performance of the following basic GPGPU operations:
- Host/Device kernel operations latency
- Reduction time (SUM)
- Dependent/Independent FLOPs
- Memory management
- Memory transfer speed
- Device memory access speed
- Pinned memory access speed
You can find more information and benchmark results below. Our set of tests is available on
GitHub, so you can run them on your own hardware if you want. We ran these tests on seven different configurations:
- GeForce GTX 580 (PCIe-2, OS Windows, physical box)
- GeForce GTX 680 (PCIe-2, OS Windows, physical box)
- GeForce GTX 680 (PCIe-3, OS Windows, physical box)
- Tesla K20Xm (PCIe-3, ECC ON, OS Linux, NVIDIA EAP server)
- Tesla K20Xm (PCIe-3, ECC OFF, OS Linux, NVIDIA EAP server)
- Tesla M2050 (PCIe-2, ECC ON, OS Linux, Amazon EC2)
- Tesla M2050 (PCIe-2, ECC ON, OS Linux, PEER1 HPC Cloud)
One of the goals was to determine the difference between the K20 and older hardware configurations in terms of overall system performance. Another goal was to understand the difference between virtualized and non-virtualized environments. Here is what we got:
Host/Device kernel operations latency
One of the new features of the K20 is Dynamic Parallelism (DP), which allows kernels to launch other kernels. We wrote a benchmark that measures the latency of kernel scheduling and execution with and without DP. The results without DP look like this:
Surprisingly, the new Tesla is slower than the old one and the GTX 680, probably because of the driver, which was still in beta when we measured performance. It is also obvious that AWS GPU instances are much slower than the closer-to-hardware PEER1 ones, because of virtualization.
Then we ran a similar benchmark with DP enabled:
Obviously, we couldn't run these tests on the older hardware because it doesn't support DP. Surprisingly, DP scheduling is slower than traditional scheduling; DP execution time is about the same with ECC on, and traditional execution is faster with ECC off. We expected DP latency to be lower than traditional launch latency. It is hard to say what causes this slowness. We suspect the driver, but that is just our assumption.
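To make the two launch paths concrete, here is a minimal sketch (not our benchmark code, which is on GitHub) of a traditional host-side launch next to a Dynamic Parallelism launch, where a parent kernel starts a child kernel from the device. It assumes a device of compute capability 3.5+ and compilation with `nvcc -arch=sm_35 -rdc=true -lcudadevrt`:

```cpp
#include <cstdio>

// Empty kernel: we only care about how long it takes to schedule and run it.
__global__ void child_kernel() {}

// Parent kernel: with Dynamic Parallelism the launch syntax on the device
// is the same as on the host. A parent grid is not considered finished
// until all of its child grids have finished.
__global__ void parent_kernel() {
    child_kernel<<<1, 1>>>();
}

int main() {
    // Traditional path: the host schedules the kernel directly.
    child_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    // DP path: the host launches only the parent, which launches the child.
    parent_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```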
Reduction time (SUM)
The next thing we measured was reduction execution time: essentially, we calculated the sum of an array. We did it with different array and grid sizes (blocks × threads × array size):
Here we got the expected results. The new Tesla K20 is slower on small data sets, probably because of its lower clock frequency and the not yet fully fledged drivers. It becomes faster when we work with big arrays and use as many cores as possible.
Regarding virtualization, we found that the virtualized M2050 is comparable to the non-virtualized one on small data sets, but much slower on large data sets.
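For reference, a minimal reduction kernel in the spirit of this test (the actual benchmark on GitHub varies the block, thread and array sizes) looks like this: each block folds its slice of the array into one partial sum in shared memory, and the host adds up the partial sums.

```cpp
#include <cstdio>
#include <vector>

// Each block reduces blockDim.x * 2 input elements into one partial sum.
__global__ void reduce_sum(const float* in, float* partial, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    float v = 0.0f;
    if (i < n)              v += in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

int main() {
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads * 2 - 1) / (threads * 2);

    std::vector<float> h(n, 1.0f);
    float *d_in, *d_partial;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, &h[0], n * sizeof(float), cudaMemcpyHostToDevice);

    reduce_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);

    std::vector<float> p(blocks);
    cudaMemcpy(&p[0], d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0.0f;
    for (int b = 0; b < blocks; ++b) sum += p[b];   // final pass on the host
    printf("sum = %.0f (expected %d)\n", sum, n);

    cudaFree(d_in);
    cudaFree(d_partial);
    return 0;
}
```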
Dependent/Independent FLOPs
Peak theoretical performance is one of the most misunderstood properties of computing hardware. Some people say it means nothing; others say it is critical. The truth lies somewhere in between. We tried to measure performance in FLOPs using several basic operations. We measured two types of operations, dependent and independent, in order to determine whether the GPU automatically parallelizes independent operations. Here's what we got:
Surprisingly, we didn't get better results with independent operations. Perhaps there is an issue with our tests, or we misunderstand how automatic parallelization works on the GPU, but we couldn't produce a test in which independent operations were automatically parallelized.
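To make the distinction concrete, here is a sketch of the two kinds of loops we have in mind (illustrative, not our exact benchmark code): in the dependent kernel every multiply-add consumes the previous result, while in the independent kernel four accumulators form separate chains that the hardware could, in principle, overlap.

```cpp
#include <cstdio>

// Dependent chain: each multiply-add needs the previous result, so the
// achievable rate is bounded by instruction latency.
__global__ void dependent_flops(float* out, float seed, int iters) {
    float a = seed;
    for (int i = 0; i < iters; ++i)
        a = a * 0.9999f + 0.0001f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;   // keep the result alive
}

// Independent chains: four accumulators with no dependencies between them,
// giving the scheduler instruction-level parallelism to hide latency.
__global__ void independent_flops(float* out, float seed, int iters) {
    float a = seed, b = seed + 1.0f, c = seed + 2.0f, d = seed + 3.0f;
    for (int i = 0; i < iters; ++i) {
        a = a * 0.9999f + 0.0001f;
        b = b * 0.9998f + 0.0002f;
        c = c * 0.9997f + 0.0003f;
        d = d * 0.9996f + 0.0004f;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b + c + d;
}

int main() {
    const int blocks = 64, threads = 256, iters = 1 << 20;
    float* d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dependent_flops<<<blocks, threads>>>(d_out, 1.0f, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms_dep = 0.0f;
    cudaEventElapsedTime(&ms_dep, start, stop);

    cudaEventRecord(start);
    independent_flops<<<blocks, threads>>>(d_out, 1.0f, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms_ind = 0.0f;
    cudaEventElapsedTime(&ms_ind, start, stop);

    printf("dependent: %.2f ms, independent (4x the work): %.2f ms\n",
           ms_dep, ms_ind);
    cudaFree(d_out);
    return 0;
}
```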
Regarding the overall results, the Teslas are much faster than the GeForces when you work with double-precision floating-point numbers, which is expected: consumer accelerators are optimized for single precision because double precision is not needed in computer games, the primary software they were designed for. FLOPs also depend heavily on clock speed and the number of cores, so newer cards with more cores are usually faster, with one exception: with double precision the GTX 580 beats the GTX 680 because of its higher clock frequency.
Virtualization doesn't affect FLOPs performance at all.
Memory management
Another critical thing for HPC is basic memory management speed. Since several memory models are available in CUDA, it is also important to understand the implications of using each of them. We wrote a test that allocates and releases 16 B, 10 MB and 100 MB blocks of memory in the different models. Please note: the results in this benchmark differ by orders of magnitude, so it makes sense to show them on charts with a logarithmic scale. Here they go:
Device memory is obviously the fastest option when you allocate a big chunk of memory, and the GTX 680 with PCIe-3 is our champion in device memory management. The Teslas are slower than the GeForces in all of these tests. Virtualization seriously affects Host Write Combined memory management. PCIe-3 is better than PCIe-2, which is no surprise.
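For context, these are the allocation calls behind the memory models we timed; a bare sketch with error handling stripped, using the 100 MB case:

```cpp
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t bytes = (size_t)100 << 20;   // 100 MB, the largest block we test

    // Device memory: allocated in GPU global memory.
    void* d_ptr = NULL;
    cudaMalloc(&d_ptr, bytes);
    cudaFree(d_ptr);

    // Regular (pageable) host memory: plain malloc; the driver has to
    // page-lock it behind the scenes for every transfer.
    void* h_regular = malloc(bytes);
    free(h_regular);

    // Page-locked (pinned) host memory: stays resident, enables faster DMA.
    void* h_pinned = NULL;
    cudaMallocHost(&h_pinned, bytes);
    cudaFreeHost(h_pinned);

    // Write-combined pinned memory: fast for the GPU to read over PCIe,
    // slow for the CPU to read back.
    void* h_wc = NULL;
    cudaHostAlloc(&h_wc, bytes, cudaHostAllocWriteCombined);
    cudaFreeHost(h_wc);

    printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```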
Memory transfer speed
Another important characteristic of an accelerator is the speed of data transfer between memory models. We measured it by copying 100 MB blocks of data between host and GPU memory in both directions using the regular, page-locked and write-combined memory models. Here's what we got:
Obviously, the PCIe-3 configurations are much faster than the PCIe-2 ones. Kepler devices (GTX 680 and K20) are faster than the others. Using the page-locked and write-combined models makes transfers faster. Virtualization slightly affects regular memory transfer speed and doesn't affect the others at all. We also tested internal memory transfer speed (please note that we haven't multiplied it by 2, as NVIDIA usually does in their tests):
Tesla K20s are faster than the GeForces, but the difference is not that big. The M2050s are almost two times slower than their successors.
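For reference, host-to-device transfer speed is typically measured by wrapping a `cudaMemcpy` in CUDA events; here is a rough sketch of the pageable vs. page-locked comparison (the numbers above come from the benchmark on GitHub, not from this snippet):

```cpp
#include <cstdio>
#include <cstdlib>

// Time one host-to-device copy of `bytes` from `src` and return GB/s.
static float copy_rate(void* dst, const void* src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / 1e9f) / (ms / 1e3f);
}

int main() {
    const size_t bytes = (size_t)100 << 20;   // 100 MB, as in the benchmark
    void* d_buf;
    cudaMalloc(&d_buf, bytes);

    void* h_pageable = malloc(bytes);     // regular host memory
    void* h_pinned;
    cudaMallocHost(&h_pinned, bytes);     // page-locked host memory

    printf("pageable H->D: %.2f GB/s\n", copy_rate(d_buf, h_pageable, bytes));
    printf("pinned   H->D: %.2f GB/s\n", copy_rate(d_buf, h_pinned, bytes));

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}
```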
Device memory access speed
We also measured device memory access speed for each configuration. Here are the results:
Aligned memory access is way faster than non-aligned (almost a 10x difference). Newer accelerators are better than older ones. Double-precision reads/writes are faster than single-precision ones for all configurations. Virtualization doesn't affect memory access speed at all.
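As a rough illustration of what aligned vs. non-aligned means here, consider two copy kernels (a sketch, not the benchmark itself): the first makes coalesced, naturally aligned accesses, while the second shifts every access by one element so that warp accesses straddle memory-segment boundaries. Timed with CUDA events, as in the FLOPs sketch above, the second pattern is what produces the large gap.

```cpp
// Coalesced, aligned access: consecutive threads read consecutive floats,
// so each warp touches the minimum number of memory segments.
__global__ void copy_aligned(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-aligned access: the same pattern shifted by one element, so warp
// accesses straddle segment boundaries and require extra transactions.
__global__ void copy_misaligned(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 1 < n) out[i] = in[i + 1];
}
```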
Pinned memory access speed
The last metric we measured was pinned memory access speed, i.e. the speed at which the device accesses host memory directly. Unfortunately, we weren't able to run these tests on the GTX 680 with PCIe-3 due to an issue with allocating big memory blocks on Windows.
The new Tesla is faster than the old one. PCIe-3 is obviously faster. Aligned access is almost ten times faster, and if you read double-precision floats your memory access speed is about two times higher than with single-precision floats. The virtualized environment is slower than the non-virtualized one.
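For clarity, this kind of test relies on mapped (zero-copy) pinned host memory, which a kernel reads directly over PCIe. A minimal sketch of that setup (the buffer size and the single-threaded kernel are illustrative only):

```cpp
#include <cstdio>

// Reads host memory directly over PCIe through a mapped (zero-copy) pointer.
__global__ void sum_mapped(const float* host_data, float* result, int n) {
    float acc = 0.0f;
    // Single thread for simplicity; a real benchmark uses many threads.
    for (int i = 0; i < n; ++i) acc += host_data[i];
    *result = acc;
}

int main() {
    const int n = 1 << 20;

    // Allow mapping pinned host allocations into the device address space.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float* h_data;
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // Device-visible alias of the pinned host buffer.
    float* d_alias;
    cudaHostGetDevicePointer(&d_alias, h_data, 0);

    float* d_result;
    cudaMalloc(&d_result, sizeof(float));

    sum_mapped<<<1, 1>>>(d_alias, d_result, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum over mapped host memory = %.0f\n", result);

    cudaFreeHost(h_data);
    cudaFree(d_result);
    return 0;
}
```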
Conclusions
All in all, the new Tesla K20 performs slightly faster than its predecessors. There is no revolution here, but there is evolution: we got better performance and new tools that make a programmer's life easier. There are also several things not covered by this benchmark, like better virtualization support and, as a result, the cloud-readiness of the K20. Some results were surprising. We expect better K20 results in a few months, when a new, optimized version of the drivers becomes available (NVIDIA always has some issues with new drivers right after a release, but usually fixes them after several updates).
You can find a spreadsheet with the complete results on
Google Docs. The benchmark sources are available on our
GitHub.