MPI vs. custom communication library
First of all, one should always keep in mind that MPI is a universal framework built for general-purpose HPC. It is really good for, let's say, academic HPC, where you have a calculation that you need to run only once, get the results, and forget about your program. But if you have a commercial HPC cluster designed to solve one particular problem many times (say, running simulations with the Monte Carlo method), you should be able to optimize every single component of your system, just to make sure your hardware utilization rate is high enough to keep the system cost-efficient. With your own code base you can make network communications as fast as possible, without any limitations. And, very importantly, you can keep this code simple and easy to understand, which is not always possible with general-purpose frameworks like MPI.
But what about the complexity of writing your own network library? Well, it is not as complex as you might imagine. Some tasks (like Monte Carlo simulations) are embarrassingly parallel, so you don't need complex interactions between your nodes: a coordinator sends tasks to workers and then aggregates the results from them (see our GTC presentation for more details on that architecture). It is relatively easy to implement a lightweight messaging library with raw sockets; you just need a good enough software engineer for the task. A rough sketch of the core of such a library follows.
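To make that concrete, here is a minimal sketch of the kind of "blob" messaging primitive such a library is built around: length-prefixed messages over a plain TCP socket. This is an illustration written for this post in Python; the function names (send_blob, recv_blob) are ours, not the actual library's API.

import socket
import struct

def send_blob(sock: socket.socket, payload: bytes) -> None:
    # One message = a 4-byte big-endian length header followed by the payload.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_blob(sock: socket.socket) -> bytes:
    # Read the header, then exactly that many payload bytes.
    (length,) = struct.unpack("!I", _read_exactly(sock, 4))
    return _read_exactly(sock, length)

def _read_exactly(sock: socket.socket, n: int) -> bytes:
    # Loop until n bytes have arrived; a single recv() may return less.
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data += chunk
    return data

Everything else (connection setup, request/response framing for the coordinator and workers) is built on top of these two calls.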
And last, but definitely not least: a lightweight solution written to solve one particular problem is much faster and more predictable than a universal tool like MPI.
Benchmark
Our engineers compared the performance of our network code with Open MPI on Ubuntu and Intel MPI on CentOS (for some reason Intel MPI refused to work on Ubuntu). They tested multicast performance, because it is critical for the architecture we use in our solutions. There were three benchmarks (described in a kind of pseudo-code):
1. MPI point-to-point
if rank == 0: # master
    for j in 0..packets_count:
        for i in 1..processes_count:
            MPI_Isend(…)  # async send to slave processes
        for i in 1..processes_count:
            MPI_Irecv(…)  # async recv from slave processes
        for i in 1..processes_count:
            MPI_Wait(…)   # wait for send/recv to complete
else: # slave
    for j in 0..packets_count:
        MPI_Recv(…)  # recv from master process
        MPI_Send(…)  # send to master process
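For readers who want to reproduce something similar, here is a minimal runnable equivalent of this benchmark using mpi4py with NumPy buffers. It is only a sketch: the original tests were not written in Python, and PACKET_SIZE / PACKETS_COUNT are simply the parameters described below.

from mpi4py import MPI
import numpy as np

PACKETS_COUNT = 1000
PACKET_SIZE = 64  # one of the packet sizes used in the tests

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

buf = np.zeros(PACKET_SIZE, dtype=np.uint8)

if rank == 0:  # master
    replies = [np.zeros(PACKET_SIZE, dtype=np.uint8) for _ in range(size)]
    for _ in range(PACKETS_COUNT):
        requests = []
        for i in range(1, size):
            requests.append(comm.Isend(buf, dest=i))           # async send to each slave
            requests.append(comm.Irecv(replies[i], source=i))  # async recv of its reply
        MPI.Request.Waitall(requests)  # wait for all sends/recvs to complete
else:  # slave
    for _ in range(PACKETS_COUNT):
        comm.Recv(buf, source=0)  # recv packet from master
        comm.Send(buf, dest=0)    # echo it back to master

Run it with something like mpirun -np 11 python p2p_bench.py for a 10-worker configuration.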
2. MPI broadcast
if rank == 0: # master
    for j in 0..packets_count:
        MPI_Bcast(…)  # broadcast to all slave processes
        for i in 1..processes_count:
            MPI_Irecv(…)  # async recv from slave processes
        for i in 1..processes_count:
            MPI_Wait(…)   # wait for recv
else: # slave
    for j in 0..packets_count:
        MPI_Bcast(…)  # recv broadcast message from master process
        MPI_Send(…)   # send to master process
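Again, a minimal mpi4py sketch of the same pattern, with the per-slave sends replaced by a single broadcast (same assumed parameters as in the previous sketch):

from mpi4py import MPI
import numpy as np

PACKETS_COUNT = 1000
PACKET_SIZE = 64

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

buf = np.zeros(PACKET_SIZE, dtype=np.uint8)

if rank == 0:  # master
    replies = [np.zeros(PACKET_SIZE, dtype=np.uint8) for _ in range(size)]
    for _ in range(PACKETS_COUNT):
        comm.Bcast(buf, root=0)  # broadcast one packet to all slaves
        requests = [comm.Irecv(replies[i], source=i) for i in range(1, size)]
        MPI.Request.Waitall(requests)  # wait for every slave's reply
else:  # slave
    for _ in range(PACKETS_COUNT):
        comm.Bcast(buf, root=0)  # receive the broadcast from the master
        comm.Send(buf, dest=0)   # reply to the master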
3. TCP point-to-point
# master
controllers = []
for i in 1..processes_count:  # waiting for all slaves to connect
    socket = tcp_accept_as_blob_socket(…)
    controllers.append(controller_t(socket, …))
for j in 0..packets_count:
    for i in 1..processes_count:
        controllers[i].send(…)  # async send to slave processes
    for i in 1..processes_count:
        controllers[i].recv(…)  # wait for recv from slave processes

# slave
socket = tcp_connect_as_blob_socket(…)  # connecting to master
for j in 0..packets_count:
    socket.read(…)   # recv packet from master
    socket.write(…)  # send packet to master
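And a runnable sketch of the TCP benchmark using nothing but the Python standard library. The tcp_accept_as_blob_socket / controller_t helpers from our library are not shown in this post, so the names and the fixed-size framing below are our own simplifications (blocking sends instead of async ones):

import socket

PACKETS_COUNT = 1000
PACKET_SIZE = 64          # bytes
WORKERS_COUNT = 10        # assumed number of slave processes
MASTER_PORT = 5555        # assumed port

def read_exactly(sock, n):
    # Read exactly n bytes from a blocking socket.
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data += chunk
    return data

def run_master():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("0.0.0.0", MASTER_PORT))
    listener.listen(WORKERS_COUNT)
    workers = [listener.accept()[0] for _ in range(WORKERS_COUNT)]  # wait for all slaves
    packet = bytes(PACKET_SIZE)
    for _ in range(PACKETS_COUNT):
        for sock in workers:
            sock.sendall(packet)             # send packet to each slave
        for sock in workers:
            read_exactly(sock, PACKET_SIZE)  # wait for each slave's reply

def run_worker(master_host):
    sock = socket.create_connection((master_host, MASTER_PORT))  # connect to master
    for _ in range(PACKETS_COUNT):
        packet = read_exactly(sock, PACKET_SIZE)  # recv packet from master
        sock.sendall(packet)                      # echo it back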
We ran each benchmark with 10, 20, 40, 50, 100, 150 and 200 processes, sending packets of 8, 32, 64, 256, 1024 and 2048 bytes. Each test included 1000 packets.
Results
First of all, let's look at the results of the open-source MPI implementation:
Open MPI @ Ubuntu, cluster of 3 nodes, 10 workers:
Open MPI @ Ubuntu, cluster of 3 nodes, 50 workers:
Open MPI @ Ubuntu, cluster of 3 nodes, 200 workers:
So, Open MPI is slower than our custom TCP messaging library in all tests. Another interesting thing: Open MPI broadcast is sometimes even slower than iterative point-to-point messaging with Open MPI.
Now let's look at the proprietary MPI implementation by Intel. For some reason it didn't work on the Ubuntu 11.04 we use on our test cluster, so we ran the benchmark on another cluster with CentOS. Please keep that fact in mind: you can't directly compare the Open MPI and Intel MPI results, as we tested them on different hardware. Our main goal was to compare MPI with our TCP messaging library, so these results work for us. Another thing: Intel MPI broadcast didn't work for us, so we tested only point-to-point communication performance.
Intel MPI @ CentOS, cluster of 2 nodes, 10 workers:
Intel MPI @ CentOS, cluster of 2 nodes, 50 workers:
Intel MPI @ CentOS, cluster of 2 nodes, 200 workers:
Intel MPI is a much more serious opponent for our library than Open MPI. It is 20-40% faster in the 10-worker configuration and shows comparable (sometimes better) performance with 50 workers. But with 200 workers it is 50% slower than our messaging library.
You can also download an Excel spreadsheet with the complete results.
Conclusions
In general, Open MPI doesn't fit the middleware requirements of our projects. It is slower than our custom library and, even more importantly, it is quite unstable and unpredictable.
Intel MPI point-to-point messaging looks much more interesting on small clusters, but on large ones it becomes slow in comparison with our custom library. We had problems running it on Ubuntu, which might be an issue if you want to use Intel MPI with that Linux distribution. Broadcast is unstable and hangs.
So, sometimes the decision to write your own communication library doesn't look so bad, right?