
1/29/2013

Cloud compute instances: it's not always about the horsepower

A short time ago we were consulting for one of our customers who was considering migrating their application to the cloud. The system was built around a variety of computer vision algorithms, and one of the primary purposes of the back-end services was detecting features in images and matching them against a feature database. The algorithms were both CPU- and memory-intensive, so one of our first steps was to benchmark the recognition services on different Amazon EC2 instance types and find a hardware configuration the application could utilize efficiently.


So we launched a bunch of instances with varying computing capacity and gathered the initial results. To ensure complete utilization of hardware resources, we tried running 2, 4, 8, and even 16 benchmarks simultaneously on the same virtual machine.


So far, no surprises here. Obviously, the low-cost single-core workers were no match for the Cluster Compute instances. We can also clearly observe that performance degraded significantly when the machine was running low on memory and the number of page faults was increasing (check out c1.xlarge with 4 and 8 concurrently running benchmarks).

On the other hand, to make the most out of the Cluster Compute instances, we had to load them heavily; otherwise we would be paying for idle time caused by overprovisioned capacity. In many cases, providing enough load is a real issue: after all, not many tasks can make a dual-socket, 32-virtual-CPU machine cry. In our case, the only option was to launch more and more benchmarks simultaneously, because running one benchmark wasn’t even close to reaching 100% utilization.

That got us thinking: what is the optimal configuration with respect to cloud infrastructure cost? In other words, how can we get the best bang for the buck in this situation? Taking the EC2 hourly rates into account, we built one more chart, and this time the results were much more interesting:


For our particular case, the c1.medium and m3.xlarge instances, despite not having shown the best running times, suddenly made it into the top three most cost-effective instance types, whereas powerful machines such as cc1.4xlarge and cc2.8xlarge were cost-effective only under significant load.
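
(For the curious, the metric behind that last chart is simply dollars per completed benchmark: the hourly instance price divided by benchmark throughput at a given level of load. Below is a minimal sketch of the idea; the instance names, prices, and throughputs are placeholder values for illustration, not our measured numbers or real EC2 rates.)

// Cost-effectiveness = hourly price / benchmarks completed per hour.
// All numbers below are placeholders, not measurements or actual EC2 prices.
#include <cstdio>

struct InstanceRun {
    const char* type;          // instance type and load level
    double pricePerHour;       // USD per instance-hour (placeholder)
    double benchmarksPerHour;  // throughput under that load (placeholder)
};

int main() {
    InstanceRun runs[] = {
        {"c1.medium, 2 concurrent benchmarks",    0.15,  40.0},
        {"cc2.8xlarge, 16 concurrent benchmarks", 2.40, 300.0},
    };
    for (const InstanceRun& r : runs)
        std::printf("%-40s $%.4f per benchmark\n",
                    r.type, r.pricePerHour / r.benchmarksPerHour);
    return 0;
}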

Based on this simple case, three lessons can be learned:
  • Measurability matters. If you possess concrete figures about the performance results of your application, you can choose a better deployment strategy to minimize operational expenses.
  • Avoiding idle time on powerful machines with many logical CPUs can be difficult. Not all algorithms and implementations provide the necessary degree of parallelism to ensure efficient utilization of hardware.
  • If fast processing time is not critical for your product, consider using multiple commodity-hardware nodes as an alternative to a single high-end server.

11/14/2012

NVIDIA Tesla K20 benchmark: facts, figures and some conclusions


NVIDIA's newest GPGPU flagship, the Tesla K20, was announced yesterday at the Supercomputing conference in Salt Lake City (by the way, you can meet Roman Pavlyuk, ELEKS' CTO, and Oleh Khoma, Head of our HPC Unit, there). Thanks to our partnership with NVIDIA, we got access to the K20 a couple of months ago and ran a lot of performance tests. Today we're going to tell you more about its performance in comparison with several other NVIDIA accelerators that we have here at ELEKS.

Test environment

We implemented a set of synthetic micro-benchmarks that measure the performance of the following basic GPGPU operations:
  • Host/Device kernel operations latency
  • Reduction time (SUM)
  • Dependent/Independent FLOPs
  • Memory management
  • Memory transfer speed
  • Device memory access speed
  • Pinned memory access speed


You can find more information and benchmark results below. Our test suite is available on GitHub, so you can run it on your own hardware if you want. We ran the tests on seven different configurations:
  • GeForce GTX 580 (PCIe-2, OS Windows, physical box)
  • GeForce GTX 680 (PCIe-2, OS Windows, physical box)
  • GeForce GTX 680 (PCIe-3, OS Windows, physical box)
  • Tesla K20Xm (PCIe-3, ECC ON, OS Linux, NVIDIA EAP server)
  • Tesla K20Xm (PCIe-3, ECC OFF, OS Linux, NVIDIA EAP server)
  • Tesla M2050 (PCIe-2, ECC ON, OS Linux, Amazon EC2)
  • Tesla M2050 (PCIe-2, ECC ON, OS Linux, PEER1 HPC Cloud)

One of the goals was to determine the difference between K20 and older hardware configurations in terms of overall system performance. Another goal: to understand the difference between virtualized and non-virtualized environments. Here is what we got:

Host/Device kernel operations latency

One of the new features of the K20 is Dynamic Parallelism (DP), which allows kernels to launch other kernels. We wrote a benchmark that measures the latency of kernel scheduling and execution with and without DP. The results without DP look like this:

Surprisingly, the new Tesla is slower than the old one and the GTX 680, probably because the driver was still in beta when we measured performance. It is also clear that the AWS GPU instances are much slower than the closer-to-hardware PEER1 ones, because of virtualization.
Then we ran a similar benchmark with DP enabled:

Obviously, we couldn't run these tests on older hardware because it doesn't support DP. Surprisingly, DP scheduling is slower than traditional scheduling, while DP execution time is roughly the same with ECC on and traditional execution is faster with ECC off. We expected DP latency to be lower than traditional launch latency. It is hard to say what causes this slowness; we suspect the driver, but that is only an assumption.
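
For reference, the DP case boils down to launching a kernel from inside another kernel. The snippet below is a simplified sketch of that pattern, not the exact code from our benchmark suite; it assumes a device with compute capability 3.5+ and relocatable device code enabled.

// Simplified Dynamic Parallelism sketch; build with:
//   nvcc -arch=sm_35 -rdc=true dp_latency.cu -lcudadevrt
#include <cstdio>

__global__ void childKernel() { /* empty: we only care about launch overhead */ }

__global__ void parentKernel() {
    childKernel<<<1, 1>>>();      // device-side launch (Dynamic Parallelism)
    cudaDeviceSynchronize();      // wait for the child grid to complete
}

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    parentKernel<<<1, 1>>>();     // host-side launch of the parent
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("parent + child time: %.3f ms\n", ms);
    return 0;
}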

Reduction time (SUM)

The next thing we measured was reduction execution time; basically, we calculated an array sum. We ran it with different array and grid sizes (Blocks x Threads x Array size):



Here we got the expected results. The new Tesla K20 is slower on small data sets, probably because of its lower clock frequency and immature drivers. It becomes faster when we work with big arrays and use as many cores as possible.
Regarding virtualization, we found that the virtualized M2050 is comparable with the non-virtualized one on small data sets, but much slower on large data sets.
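
For readers who want to reproduce this, the core of such a benchmark is an ordinary parallel sum; a minimal sketch is below (our actual kernels on GitHub differ, this is just the general shape).

// Minimal sum reduction: grid-stride accumulation + shared-memory tree reduction.
__global__ void sumKernel(const float* data, int n, float* result) {
    extern __shared__ float cache[];
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        local += data[i];                       // each thread sums a strided slice
    cache[threadIdx.x] = local;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction within the block
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(result, cache[0]);  // one atomic per block
}

// Launch example: sumKernel<<<blocks, threads, threads * sizeof(float)>>>(d_data, n, d_sum);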

Dependent/Independent FLOPs

Peak theoretical performance is one of the most misunderstood properties of computing hardware. Some people say it means nothing, others say it is critical; the truth lies somewhere in between. We measured performance in FLOPs using several basic operations, in two variants, dependent and independent, to determine whether the GPU automatically parallelizes independent operations. Here's what we got:





Surprisingly, we did not get better results with independent operations. Perhaps there is an issue with our tests, or we misunderstand how automatic parallelization works on the GPU, but we could not produce a test in which independent operations were automatically parallelized.
Regarding the overall results, the Teslas are much faster than the GeForces for double-precision floating-point work, which is expected: consumer accelerators are optimized for single precision because double precision is not required in computer games, the primary software they were designed for. FLOPs also depend heavily on clock speed and core count, so newer cards with more cores are usually faster, except for one case with the GTX 580/680 and double precision: the 580 is faster because of its higher clock frequency.
Virtualization doesn't affect FLOPs performance at all.
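
To clarify what we mean by dependent vs. independent operations, here is a rough sketch of the two kinds of kernels (simplified, not the exact code from the repository): in the dependent version every multiply-add consumes the previous result, while the independent version keeps several accumulators whose chains could in principle be overlapped.

__global__ void dependentFlops(float* out, int iters) {
    float a = 1.0f;
    const float b = 1.000001f, c = 0.5f;
    for (int i = 0; i < iters; ++i)
        a = a * b + c;                       // each FMA depends on the previous one
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}

__global__ void independentFlops(float* out, int iters) {
    float a0 = 1.0f, a1 = 2.0f, a2 = 3.0f, a3 = 4.0f;
    const float b = 1.000001f, c = 0.5f;
    for (int i = 0; i < iters; ++i) {
        a0 = a0 * b + c;                     // four independent FMA chains:
        a1 = a1 * b + c;                     // instruction-level parallelism can
        a2 = a2 * b + c;                     // hide the latency of each chain
        a3 = a3 * b + c;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a0 + a1 + a2 + a3;
}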

Memory management

Another critical thing for HPC is basic memory management speed. Since several memory models are available in CUDA, it is also important to understand the implications of using each of them. We wrote a test that allocates and releases 16 B, 10 MB, and 100 MB blocks of memory in the different models. Please note that the results in this benchmark vary widely, so it makes sense to show them on charts with a logarithmic scale. Here they are:


Device memory is clearly the fastest option when you allocate a big chunk of memory, and the GTX 680 with PCIe-3 is our champion in device memory management. The Teslas are slower than the GeForces in all of these tests. Virtualization seriously affects host write-combined memory management. PCIe-3 is better than PCIe-2, which is also expected.
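
For reference, the allocation paths compared above map to a handful of CUDA runtime calls; the sketch below shows roughly how the timing can be done (simplified relative to our actual test, which also covers the 16 B and 10 MB cases).

#include <chrono>
#include <cstdio>

int main() {
    const size_t size = 100u * 1024u * 1024u;              // 100 MB block
    void* p = nullptr;
    typedef std::chrono::steady_clock clk;

    cudaFree(0);                                            // create the context up front so it isn't timed

    clk::time_point t0 = clk::now();
    cudaMalloc(&p, size);                                   // device memory
    cudaFree(p);
    clk::time_point t1 = clk::now();
    cudaMallocHost(&p, size);                               // page-locked (pinned) host memory
    cudaFreeHost(p);
    clk::time_point t2 = clk::now();
    cudaHostAlloc(&p, size, cudaHostAllocWriteCombined);    // write-combined pinned host memory
    cudaFreeHost(p);
    clk::time_point t3 = clk::now();

    std::printf("device %.2f ms, pinned %.2f ms, write-combined %.2f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(),
                std::chrono::duration<double, std::milli>(t2 - t1).count(),
                std::chrono::duration<double, std::milli>(t3 - t2).count());
    return 0;
}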

Memory transfer speed

Another important characteristic of an accelerator is the speed of data transfer between memory models. We measured it by copying 100 MB blocks of data between host and GPU memory in both directions using the regular, page-locked, and write-combined memory access models. Here's what we got:

Obviously, the PCIe-3 configurations are much faster than PCIe-2, and Kepler devices (GTX 680 and K20) are faster than the others. Using the page-locked and write-combined models improves transfer speed. Virtualization slightly affects regular memory transfer speed and doesn't affect the others at all. We also tested internal (device-to-device) memory transfer speed (note that we haven't multiplied it by 2, as NVIDIA usually does in its tests):
The Tesla K20s are faster than the GeForces, but the difference is not that big. The M2050s are almost two times slower than their successors.
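
The host-to-device part of this measurement is essentially a timed cudaMemcpy over a 100 MB buffer allocated as pageable vs. page-locked memory; a rough sketch (not our exact code) is below.

#include <chrono>
#include <cstdio>
#include <cstdlib>

static double hostToDeviceGBps(void* host, void* dev, size_t size) {
    cudaMemcpy(dev, host, size, cudaMemcpyHostToDevice);    // warm-up copy
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    cudaMemcpy(dev, host, size, cudaMemcpyHostToDevice);    // timed copy
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    return size / std::chrono::duration<double>(t1 - t0).count() / 1e9;
}

int main() {
    const size_t size = 100u * 1024u * 1024u;               // 100 MB, as in the charts
    void* dev = nullptr;
    cudaMalloc(&dev, size);

    void* pageable = std::malloc(size);                     // regular host memory
    void* pinned = nullptr;
    cudaMallocHost(&pinned, size);                          // page-locked host memory

    std::printf("pageable: %.2f GB/s, pinned: %.2f GB/s\n",
                hostToDeviceGBps(pageable, dev, size),
                hostToDeviceGBps(pinned, dev, size));
    return 0;
}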

Device memory access speed

We also measured device memory access speed for each configuration. Here are the results:

Aligned memory access is far faster than non-aligned (almost a 10x difference). Newer accelerators are better than older ones. Double-precision read/write is faster than single precision for all configurations. Virtualization doesn't affect memory access speed at all.
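
By aligned vs. non-aligned we mean whether a warp's loads start on a properly aligned boundary; the sketch below illustrates the idea with an offset copy (the real benchmark uses its own access patterns).

__global__ void copyAligned(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                    // coalesced, aligned accesses
}

__global__ void copyMisaligned(const float* in, float* out, int n, int offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + offset < n) out[i] = in[i + offset];  // shifted start: warps straddle segment boundaries
}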

Pinned memory access speed

The last metric we measured was pinned memory access speed, i.e., the speed at which the device reads and writes host memory directly. Unfortunately, we weren't able to run these tests on the GTX 680 with PCIe-3 due to an issue with allocating large memory blocks on Windows.

The new Tesla is faster than the old one, and PCIe-3 is obviously faster. Aligned access is almost ten times faster, and reading double-precision floats gives roughly twice the access speed of single-precision floats. The virtualized environment is slower than the non-virtualized one.
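
Pinned (zero-copy) access means the kernel dereferences a device pointer that is mapped onto page-locked host memory, so every read travels across PCIe; a simplified sketch of the pattern follows.

__global__ void sumHostMemory(const float* hostMapped, int n, float* result) {
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        local += hostMapped[i];                   // each read crosses the PCIe bus
    atomicAdd(result, local);
}

int main() {
    const int n = 1 << 24;
    float *hostData = nullptr, *devAlias = nullptr, *result = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);                     // enable zero-copy; must precede context creation
    cudaHostAlloc(&hostData, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&devAlias, hostData, 0);          // device-visible alias of the host buffer
    cudaMalloc(&result, sizeof(float));
    cudaMemset(result, 0, sizeof(float));

    sumHostMemory<<<256, 256>>>(devAlias, n, result);
    cudaDeviceSynchronize();
    return 0;
}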

Conclusions

All in all, the new Tesla K20 performs slightly faster than its predecessors. There is no revolution here, but there is evolution: better performance and new tools that make a programmer's life easier. There are also several things not covered by this benchmark, such as better virtualization support and, as a result, the cloud-readiness of the K20. Some results were surprising. We expect better K20 results in a few months, when a new, optimized version of the drivers becomes available (NVIDIA always has some issues with new drivers right after release, but usually fixes them after several updates).

You can find a spreadsheet with the complete results on Google Docs. The benchmark sources are available on our GitHub.

11/13/2012

Tesla K20 benchmark results

Recently we've developed a set of synthetic tests to measure NVIDIA GPU performance. We ran it on several test environments:


  • GTX 580 (PCIe-2, OS Windows, physical box)
  • GTX 680 (PCIe-2, OS Windows, physical box)
  • GTX 680 (PCIe-3, OS Windows, physical box)
  • Tesla K20Xm (PCIe-3, ECC ON, OS Linux, NVIDIA test data center)
  • Tesla K20Xm (PCIe-3, ECC OFF, OS Linux, NVIDIA test data center)
  • Tesla M2050 (PCIe-2, ECC ON, OS Linux, Amazon EC2)


Please note that the next-generation Tesla K20 is also included in our results (many thanks to NVIDIA for their early access program).
You can find the results on Google Docs. The benchmark sources are available on our GitHub account.
Stay tuned, we're going to make some updates on this.

UPD: Detailed results with charts and some conclusions: http://www.elekslabs.com/2012/11/nvidia-tesla-k20-benchmark-facts.html

8/01/2012

MPI or not MPI - that is the question

One of the most popular questions we got at the latest GPU Technology Conference was "Why wouldn't you use MPI instead of writing your own middleware?" The short answer is "Because home-made middleware was better for the projects we did." If you are not satisfied with that explanation, the longer answer follows.

MPI vs. custom communication library


First of all, keep in mind that MPI is a universal framework built for general-purpose HPC. It is really good for, let's say, academic HPC, where you have a calculation that you need to run only once, get the results, and forget about the program. But if you have a commercial HPC cluster designed to solve one particular problem many times (say, some kind of Monte Carlo simulation), you should be able to optimize every single component of your system, just to make sure your hardware utilization rate is high enough to keep the system cost-efficient. With your own code base you can make network communications as fast as possible, without any limitations. And, very importantly, you can keep this code simple and easy to understand, which is not always possible with general-purpose frameworks like MPI.

But what about the complexity of your own network library? Well, it is not as complex as you might imagine. Some tasks (like Monte Carlo simulations) are embarrassingly parallel, so you don't need complex interactions between nodes: you have a coordinator that sends tasks to workers and then aggregates their results (see our GTC presentation for more details about that architecture). It is relatively easy to implement a lightweight messaging library with raw sockets; you just need a good enough software engineer for the task.

And last, but definitely not least: a lightweight solution written to solve one particular problem is much faster and more predictable than a universal tool like MPI.

Benchmark


Our engineers compared the performance of our network code with Open MPI on Ubuntu and Intel MPI on CentOS (for some reason Intel MPI refused to work on Ubuntu). They tested multicast performance, because it is critical for the architecture we use in our solutions. There were three benchmarks (described below in pseudocode):
1. MPI point-to-point
if rank == 0:
    # master
    for j in 0..packets_count:
        for i in 1..processes_count:
            MPI_Isend()  # async send to slave processes
        for i in 1..processes_count:
            MPI_Irecv()  # async recv from slave processes
        for i in 1..processes_count:
            MPI_Wait()   # wait for send/recv to complete
else:
    # slave
    for j in 0..packets_count:
        MPI_Recv()  # recv from master process
        MPI_Send()  # send to master process
2. MPI broadcast 
if rank == 0:
    # master
    for j in 0..packets_count:
        MPI_Bcast()  # broadcast to all slave processes
        for i in 1..processes_count:
            MPI_Irecv()  # async recv from slave processes
        for i in 1..processes_count:
            MPI_Wait()   # wait for recv to complete
else:
    # slave
    for j in 0..packets_count:
        MPI_Bcast()  # receive broadcast message from master process
        MPI_Send()   # send to master process
3. TCP point-to-point
# master
controllers = []
for i in 1..processes_count:  # wait for all slaves to connect
    socket = tcp_accept_as_blob_socket()
    controllers.append(controller_t(socket))

for j in 0..packets_count:
    for i in 1..processes_count:
        controllers[i].send()  # async send to slave processes
    for i in 1..processes_count:
        controllers[i].recv()  # wait for recv from slave processes

# slave
sock = tcp_connect_as_blob_socket()  # connect to master
for j in 0..packets_count:
    sock.read()   # recv packet from master
    sock.write()  # send packet to master

We ran the benchmarks with 10, 20, 40, 50, 100, 150, and 200 processes, sending packets of 8, 32, 64, 256, 1024, and 2048 bytes. Each test included 1000 packets.

Results


First of all, let's look at the open-source MPI implementation results:
Open MPI @ Ubuntu, cluster of 3 nodes, 10 workers:

Open MPI @ Ubuntu, cluster of 3 nodes, 50 workers:

Open MPI @ Ubuntu, cluster of 3 nodes, 200 workers:
So, Open MPI is slower than our custom TCP messaging library in all tests. Another interesting observation: Open MPI broadcast is sometimes even slower than iterative point-to-point messaging with Open MPI.

Now let's look at the proprietary MPI implementation by Intel. For some reason it didn't work on the Ubuntu 11.04 we use on our test cluster, so we ran the benchmark on another cluster with CentOS. Please keep that in mind: you can't directly compare the Open MPI and Intel MPI results, as we tested them on different hardware. Our main goal was to compare MPI with our TCP messaging library, so these results still work for us. One more thing: Intel MPI broadcast didn't work for us, so we tested only point-to-point communication performance.
Intel MPI @ CentOS, cluster of 2 nodes, 10 workers:

Intel MPI @ CentOS, cluster of 2 nodes, 50 workers:

Intel MPI @ CentOS, cluster of 2 nodes, 200 workers:

Intel MPI is a much more serious opponent for our library than Open MPI. It is 20-40% faster in the 10-worker configuration and shows comparable (sometimes faster) performance with 50 workers, but with 200 workers it is 50% slower than our messaging library.
You can also download an Excel spreadsheet with the complete results.

Conclusions


In general, Open MPI doesn't fit the middleware requirements of our projects. It is slower than our custom library and, what is even more important, it is quite unstable and unpredictable.
Intel MPI point-to-point messaging looks much more interesting on small clusters, but on large ones it becomes slow in comparison with our custom library. We had problems running it on Ubuntu, which might be an issue if you want to use Intel MPI with that Linux distribution. Broadcast is unstable and hangs.
So, sometimes the decision to write your own communication library doesn't look so bad, right?