NVIDIA announced its newest GPGPU flagship, the Tesla K20, at the Supercomputing conference in Salt Lake City yesterday (BTW, you can meet Roman Pavlyuk, ELEKS' CTO, and Oleh Khoma, Head of HPC Unit, there). Thanks to our partnership with NVIDIA, we got access to the K20 a couple of months ago and ran lots of performance tests. Today we're going to tell you more about its performance in comparison with several other NVIDIA accelerators that we have here at ELEKS.
Test environment
We implemented a set of synthetic micro-benchmarks that measure the performance of the following basic GPGPU operations:
- Host/Device kernel operations latency
- Reduction time (SUM)
- Dependent/Independent FLOPs
- Memory management
- Memory transfer speed
- Device memory access speed
- Pinned memory access speed
You can find more information and benchmark results below. Our set of tests is available on GitHub, so that you can run them on your hardware if you want. We ran these tests on seven different test configurations:
- GeForce GTX 580 (PCIe-2, OS Windows, physical box)
- GeForce GTX 680 (PCIe-2, OS Windows, physical box)
- GeForce GTX 680 (PCIe-3, OS Windows, physical box)
- Tesla K20Xm (PCIe-3, ECC ON, OS Linux, NVIDIA EAP server)
- Tesla K20Xm (PCIe-3, ECC OFF, OS Linux, NVIDIA EAP server)
- Tesla M2050 (PCIe-2, ECC ON, OS Linux, Amazon EC2)
- Tesla M2050 (PCIe-2, ECC ON, OS Linux, PEER1 HPC Cloud)
One of the goals was to determine the difference between the K20 and older hardware configurations in terms of overall system performance. Another goal was to understand the difference between virtualized and non-virtualized environments. Here is what we got:
Host/Device kernel operations latency
One of the new features of the K20 is Dynamic Parallelism (DP), which allows a kernel to launch other kernels directly on the device. We wrote a benchmark that measures the latency of kernel scheduling and execution with and without DP. Results without DP look like this:
Surprisingly, the new Tesla is slower than both the old one and the GTX 680, probably because the driver was still in beta when we measured performance. It is also clear that the AWS GPU instances are much slower than the closer-to-hardware PEER1 ones because of virtualization.
Then we ran a similar benchmark with DP on:
Obviously we couldn't run these tests on older hardware because it doesn't support DP. Surprisingly, DP scheduling is slower than traditional scheduling. With ECC ON, execution time is pretty much the same for both, while with ECC OFF traditional execution is faster. We expected DP latency to be lower than traditional latency. It is hard to say what causes this slowness. We suspect the driver, but that is only our assumption.
Reduction time (SUM)
The next thing we measured was reduction execution time. Basically, we calculated the sum of an array. We did it with different array and grid sizes (Blocks x Threads x Array size):
Here we got the expected results. The new Tesla K20 is slower on small data sets, probably because of its lower clock frequency and immature drivers. It becomes faster when we work with big arrays and use as many cores as possible.
Regarding virtualization, we found that the virtualized M2050 is comparable to the non-virtualized one on small data sets, but much slower on large data sets.
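To make the "Blocks x Threads x Array size" notation concrete, here is a minimal Python model of how a grid-based SUM reduction is commonly organized. It is only a sketch of the general pattern, not the benchmark kernel itself: the function names `block_tree_sum` and `grid_sum` are hypothetical, and we assume each thread first accumulates a strided slice of the array, each block then combines its threads' partial sums pairwise, and the per-block results are combined at the end.

```python
def block_tree_sum(values):
    """Pairwise tree reduction, the pattern a thread block usually applies."""
    vals = list(values)
    # Pad with zeros to a power of two so each step can halve the active range.
    size = 1
    while size < len(vals):
        size *= 2
    vals += [0] * (size - len(vals))
    stride = size // 2
    while stride > 0:
        # On a GPU, these additions would run in parallel, one per thread.
        for i in range(stride):
            vals[i] += vals[i + stride]
        stride //= 2
    return vals[0]


def grid_sum(data, blocks, threads):
    """Model of a SUM reduction launched with `blocks` x `threads` threads."""
    total_threads = blocks * threads
    block_results = []
    for b in range(blocks):
        partials = []
        for t in range(threads):
            gid = b * threads + t
            # Grid-stride loop: thread gid handles gid, gid + total_threads, ...
            partials.append(sum(data[gid::total_threads]))
        block_results.append(block_tree_sum(partials))
    # Final step: combine the per-block results.
    return block_tree_sum(block_results)
```

For example, `grid_sum(list(range(1000)), blocks=4, threads=64)` returns the same value as `sum(range(1000))`. With a small array most threads get little or no work, which is one reason launch overhead and clock speed can dominate on small data sets.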
Dependent/Independent FLOPs
Peak theoretical performance is one of the most misunderstood properties of computing hardware. Some people say it means nothing, some say it is critical. The truth is somewhere between these points. We tried to measure performance in FLOPs using several basic operations. We measured two types of operations, dependent and independent, in order to determine whether the GPU automatically parallelizes independent operations. Here's what we got:
Surprisingly, we didn't get better results with independent operations. Perhaps there is an issue in our tests, or we misunderstood how automatic parallelization works on the GPU, but we couldn't implement a test in which independent operations are automatically parallelized.
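The difference between the two kinds of operations can be sketched in a few lines of Python. This is an illustration of the idea only, not the benchmark code: the function names are hypothetical, the multiply-add step is our assumed "basic operation", and counting it as two floating-point operations is a common convention rather than something the tests are known to do.

```python
def dependent_chain(x, a, b, steps):
    """Each multiply-add needs the previous result, so steps cannot overlap."""
    for _ in range(steps):
        x = x * a + b
    return x


def independent_chains(x, a, b, steps, chains=4):
    """Several separate accumulators; no chain reads another chain's result."""
    acc = [x + i for i in range(chains)]
    for _ in range(steps // chains):
        for c in range(chains):
            # These updates do not depend on each other within one iteration.
            acc[c] = acc[c] * a + b
    return acc


def flop_count(steps):
    # Assumed convention: one x * a + b step counts as two operations.
    return 2 * steps


def gflops(flops, seconds):
    """Convert an operation count and an elapsed time into GFLOPs."""
    return flops / seconds / 1e9
```

Both variants perform the same number of operations for the same `steps` (when `steps` is a multiple of `chains`), so any difference in measured GFLOPs would come from how well the hardware overlaps the independent updates.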
Regarding overall results, Teslas are much faster than GeForces when you work with double precision floating point numbers, which is expected: consumer accelerators are optimized for single precision because double precision is not required in computer games, the primary software they were designed for. FLOPs are also highly dependent on clock speed and the number of cores, so newer cards with more cores are usually faster, with one exception: in double precision the GTX 580 is faster than the GTX 680 because of its higher clock frequency.
Virtualization doesn't affect FLOPs performance at all.
Memory management
Another critical thing for HPC is basic memory management speed. As there are several memory models available in CUDA, it is also critical to understand the implications of using each of them. We wrote a test that allocates and releases 16 B, 10 MB and 100 MB blocks of memory in different models. Please note: the results in this benchmark vary widely, so it makes sense to show them on charts with a logarithmic scale. Here they are:
Device memory is clearly the fastest option when you allocate a big chunk of memory. The GTX 680 with PCIe-3 is our champion in device memory management. Teslas are slower than GeForces in all the tests. Virtualization seriously affects Host Write Combined memory management. PCIe-3 is better than PCIe-2, which is also expected.
Memory transfer speed
Another important characteristic of an accelerator is the speed of data transfer from one memory model to another. We measured it by copying 100 MB blocks of data between Host and GPU memory in both directions using regular, page-locked and write-combined memory access models. Here's what we got:
As expected, PCIe-3 configurations are much faster than PCIe-2 ones. Kepler devices (GTX 680 and K20) are faster than the others. Using the page-locked and write-combined models makes transfers faster. Virtualization slightly affects regular memory transfer speed and doesn't affect the others at all. We also tested internal memory transfer speed (please note, we haven't multiplied it by 2 as NVIDIA usually does in their tests):
Tesla K20s are faster than the GeForces, but the difference is not that big. The M2050s are almost two times slower than their successors.
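The arithmetic behind these transfer numbers can be sketched as follows. The helper names are hypothetical and the timing in the usage note is made up for illustration; we also assume an x16 link and decimal units (1 GB = 10^9 bytes). The per-lane rates and line encodings are the standard PCIe 2.0 and 3.0 figures.

```python
def effective_gb_s(num_bytes, seconds, count_read_and_write=False):
    """Bandwidth from bytes copied and elapsed time, in GB/s (1 GB = 1e9 bytes).

    A device-to-device copy both reads and writes every byte, so some reports
    count the traffic twice. The flag makes that convention explicit.
    """
    moved = num_bytes * (2 if count_read_and_write else 1)
    return moved / seconds / 1e9


def pcie_peak_gb_s(generation, lanes=16):
    """Theoretical one-direction PCIe bandwidth in GB/s, before protocol overhead."""
    # generation -> (transfer rate in GT/s, payload bits, encoded bits)
    link = {
        2: (5.0, 8, 10),      # PCIe 2.0 uses 8b/10b encoding
        3: (8.0, 128, 130),   # PCIe 3.0 uses 128b/130b encoding
    }
    rate, payload, encoded = link[generation]
    # One bit per transfer per lane; divide by 8 to convert bits to bytes.
    return rate * payload / encoded / 8 * lanes
```

Under these assumptions `pcie_peak_gb_s(2)` gives 8.0 GB/s and `pcie_peak_gb_s(3)` gives roughly 15.75 GB/s, so a host/device transfer measured above 8 GB/s cannot come from a PCIe-2 x16 link. As a purely illustrative case, copying 100 MB in 10 ms corresponds to `effective_gb_s(100 * 10**6, 0.010)`, about 10 GB/s.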
Device memory access speed
We also measured device memory access speed for each configuration we have. Here are the results:
Aligned memory access is way faster than non-aligned (almost a 10 times difference). Newer accelerators are better than older ones. Double precision read/write is faster than single for all the configurations. Virtualization doesn't affect memory access speed at all.
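One way to reason about why access patterns matter is to count how many memory segments a single warp touches. The sketch below is a deliberately simplified model, not a description of how the benchmark defines "non-aligned": the function name is hypothetical, and the 32-thread warp and 128-byte segment size are assumptions based on the cache line size of Fermi and Kepler devices. It does not by itself explain the measured difference.

```python
def segments_touched(base_offset, element_size, stride=1,
                     warp_size=32, segment_size=128):
    """Count distinct memory segments accessed by one warp.

    Thread `lane` reads `element_size` bytes starting at
    base_offset + lane * stride * element_size.
    """
    segments = set()
    for lane in range(warp_size):
        first = base_offset + lane * stride * element_size
        last = first + element_size - 1
        # An element may straddle a boundary, so record every segment it covers.
        segments.update(range(first // segment_size, last // segment_size + 1))
    return len(segments)
```

In this model a warp reading consecutive 4-byte values from an aligned address touches one segment, the same read shifted by 4 bytes touches two, and a read with a stride of 32 elements touches 32 segments. Consecutive 8-byte values touch two segments but fetch twice as many useful bytes per thread.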
Pinned memory access speed
The last metric we measured was pinned memory access speed, when the device interacts with host memory. Unfortunately, we weren't able to run these tests on the GTX 680 with PCIe-3 due to an issue with allocating big memory blocks in Windows.
I'm interested in your memory bandwidth numbers for K20. The results you see (9 to 10GB/sec) indicate PCI-E 3.0.
However, NVIDIA released the K20 as a PCI-E 2.0 product. Do you think your early-access system was running the cards as 3.0?
NVIDIA confirmed that the card was being prepared as PCIe 3.0, but shortly before the launch, for various reasons, it was downgraded to PCIe 2. Our tests are for an early K20 supporting PCIe 3.0.
Good point. The memory transfer speed is indeed on par with the GTX 680 in our tests. It might be that early engineering samples were actually working on PCIe 3.0. I have asked NVIDIA whether this is true. Will post here when I get an answer.
Crazy idea. The card name was reported on the EAP cluster as K20Xm. Should we expect a new card in the product line any time soon supporting PCIe 3.0? Err, probably not. NVIDIA says they couldn't make it work on Sandy Bridge with PCIe 3, so perhaps "m" was an early name dropped before the release.
Regarding your comments on FLOPS - consumer cards are not "optimized for SP", but rather they are limited in DP. In the case of the GTX 680, it is limited by design to 1/16 the speed of SP, in order to protect the compute market for the Tesla cards.
Let's admit it: overall, gaming users do not need DP performance. They care about frame rate and graphics goodies, and for these tasks the card must excel in single precision. In the gaming market AMD is still quite strong, and it is a huge market for NVIDIA to care deeply about. The GTX 680 and its family are very good at what they do.
That said, Tesla users are not going to use the GTX 680 in the near future, and this is not only due to non-existing DP. If you look at the Tesla K10 card, its DP FLOPS are virtually non-existent. The main reasons Tesla buyers choose Tesla are:
- warranty and support
- ECC memory
- passive cooling - designed for server racks
I agree that the Tesla cards are a hell of a lot more expensive than the GTX series, but without the features I have listed above, even if GTX had DP, the Tesla users would not switch instantaneously.
I might have been carried away, but all I wanted to say is that Tesla and GTX markets are too different to consider DP being an only segmentation point. Do you agree?
These are great benchmarks, thanks! I got the code running on my 580 with ease, and I would like to spin up an Amazon instance to try it out there. Could you point me to an image that you are already using?
We are using a custom private image based on ami-02f54a6b. You should be able to build and run the benchmarks using any HPC GPU base AMI (Amazon AMI is preferable).
Great, thanks! While looking for that image, I was just curious what region it is from?
US-East-1. Actually, ami-02f54a6b is "Cluster GPU Amazon Linux AMI 2012.09" from the Quick Start list in the launch EC2 instance wizard. It is really easy to find.