8/28/2012

Amazon Glacier: why and how might it work?


Recently Amazon announced Glacier, a new service in their AWS suite. It allows you to store large amounts of data very cheaply: just $0.01/GB per month, plus some expenses for traffic and API calls. That is an incredibly cheap price; compare it with S3 at $0.125/GB for the first TB of data. So where is the trick? Well, there is one important detail: retrieval time for your data could be up to several hours.
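To get a feeling for what that retrieval delay means in practice, here is a rough sketch of the retrieval flow as seen from the client side. It assumes the boto3 Glacier client; the vault name, archive id and polling interval are placeholders, and real code would need error handling:

# Sketch of the two-step Glacier retrieval flow (placeholders for vault/archive).
import time
import boto3

glacier = boto3.client("glacier")

# Step 1: ask Glacier to prepare the archive; this only enqueues a job.
job = glacier.initiate_job(
    accountId="-",
    vaultName="my-vault",
    jobParameters={"Type": "archive-retrieval", "ArchiveId": "my-archive-id"},
)

# Step 2: wait, typically several hours, until the job completes.
while True:
    status = glacier.describe_job(
        accountId="-", vaultName="my-vault", jobId=job["jobId"]
    )
    if status["Completed"]:
        break
    time.sleep(15 * 60)  # poll every 15 minutes

# Step 3: only now can the data actually be downloaded.
output = glacier.get_job_output(
    accountId="-", vaultName="my-vault", jobId=job["jobId"]
)
data = output["body"].read()

The important part is that retrieval is an asynchronous job you have to wait for, not a plain download.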

Why do we need it?


Such a long retrieval time means you can use it only for data that needs to be stored for a long time but does not need to be accessed quickly. Consider historical financial data: in some countries, government regulations require financial institutions to store every single transaction for several years after it occurs. It turns out that most of these transactions will never be accessed; that happens only in case of an investigation or a system audit, which is not very often. Nowadays most of this data is stored on hard drives or even magnetic tape, usually not connected to the network, so the retrieval time there is also up to several hours. That is the target market for Glacier.

Perito Moreno Glacier. Patagonia, Argentina (photo taken by Luca Galuzzi)

Amazon targets customers who want to store lots of data for a very long time, do not need to access it often or quickly, but need it stored very reliably. Glacier offers 99.999999999% durability. That's right, eleven nines - impressive reliability! It is very expensive to build storage that reliable in-house, so in the past only really big corporations had access to it. There are several services that address the same need, but to be honest they don't look serious enough to be enterprise vendors. Amazon is the first enterprise-level vendor in this market.
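To put eleven nines in perspective, here is a quick back-of-envelope reading of that durability figure (my interpretation of the number, not an official Amazon calculation):

# Expected number of archives lost per year with eleven nines of annual durability.
annual_durability = 0.99999999999
loss_probability_per_archive = 1 - annual_durability   # about 1e-11 per archive per year

archives_stored = 10000
expected_losses_per_year = archives_stored * loss_probability_per_archive
print(expected_losses_per_year)   # ~1e-07, i.e. one lost archive in roughly 10,000,000 years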

How might it work?


As a disclaimer: I am not an Amazon employee, and there is no information about Glacier's architecture available in any public sources. So I can only imagine and speculate about how it may actually work.

Let's imagine that we want to build a service like Glacier. First of all, we would need lots of storage hardware, and it must be pretty cheap hardware (in terms of cost per gigabyte), because we want to sell the storage for so little money. Only two types of hardware fit these requirements: hard disk drives and magnetic tape. Tape is much cheaper, but less reliable because of magnetic layer degradation, which means periodic data refreshes are needed to prevent data loss. Amazon may also use special custom hard drives with big capacity and slow access times, simply because speed is not critical here; that would make the overall solution even cheaper. I don't know what kind of storage hardware Amazon actually uses, but hard drives look like the slightly more likely option.
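A rough back-of-envelope calculation also suggests why commodity hard drives fit that price point; the drive price and replication factor below are my own assumptions, not Amazon's numbers:

# Back-of-envelope check that commodity drives can support $0.01/GB/month.
drive_capacity_gb = 3000          # a 3 TB drive
drive_cost_usd = 150.0            # rough 2012 street price (assumption)
replicas = 3                      # assume each byte is stored three times

monthly_revenue = drive_capacity_gb * 0.01             # $30 per drive per month
months_to_pay_off = drive_cost_usd * replicas / monthly_revenue
print(months_to_pay_off)                                # ~15 months, before infrastructure costs

Even with triple replication, a drive would pay for itself in about 15 months, leaving the rest for infrastructure and margin.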

The second component of a big data warehouse is the infrastructure that connects users with their data and makes it available within the timeframe described in the SLA: networking, power supplies, cooling, and lots of other things you can find in a modern datacenter. If you were building a service like S3, the infrastructure cost would be even bigger than the storage cost. But there is one important difference between S3 and Glacier: you don't have to provide access to the data quickly. That means you don't have to keep the hard drive powered on, which means reduced power consumption. It also means you don't even have to keep the hard drive plugged into a server case! It could be stored in a simple locker, and all you need is an employee responsible for finding the drive and plugging it into a server when a user asks for the data. Several hours is definitely enough time to do that, even for a human being. Or for a cute little orange robot:



Sounds crazy? Well, let's look at this solution from the other side. What is Amazon, first of all? A cloud vendor? Nope. It is a retail company, one of the biggest in the world, with probably the best logistics and warehouse infrastructure in the world. Let's imagine you order a hard drive on the Amazon web site. How much time does it usually take for Amazon to find it in their warehouse, pack it and ship it to you? Several hours? Now imagine that instead of shipping the drive, they plug it into a server and turn it on. Sounds like a pretty similar task, doesn't it?
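To make this speculation a bit more concrete, the user-facing side of such a design could be little more than a job queue drained by an operator, human or robotic. A toy sketch, with the drive-handling steps left as hypothetical callbacks:

# Toy sketch of the "drives in a locker" idea (pure speculation, like the rest of this post).
import queue

retrieval_jobs = queue.Queue()

def request_archive(archive_id, shelf_location):
    # called when a user asks for data; nothing is online yet
    retrieval_jobs.put((archive_id, shelf_location))

def operator_loop(mount_drive, copy_archive_to_staging):
    # mount_drive and copy_archive_to_staging are hypothetical callbacks:
    # "walk to the shelf and plug the drive in" and "copy data to online storage"
    while True:
        archive_id, shelf_location = retrieval_jobs.get()
        drive = mount_drive(shelf_location)          # may take minutes or hours
        copy_archive_to_staging(drive, archive_id)   # data becomes downloadable
        retrieval_jobs.task_done()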

It is amazing how Amazon integrates its businesses with each other. AWS was a side product of the main retail business, a product they started to sell just because they realized it had value not only for their own business but also for other people. And now we can see how AWS could use Amazon's offline infrastructure to provide an absolutely new kind of service. A fantastic fusion of online and offline infrastructures working together to create something new!

8/23/2012

EventStore Launch by Greg Young

We're proud to announce that one of the projects we've been working on for the last couple of months is about to be presented to the public. Greg Young will launch EventStore, a distributed open-source storage for events, at SkillsMatter eXchange London on September 17. Join Greg and his team from ELEKS there.

8/01/2012

MPI or not MPI - that is the question

One of the most popular questions we got at the latest GPU Technology Conference was "Why wouldn't you use MPI instead of writing your own middleware?" The short answer is "Because home-made middleware worked better for the projects we did." If you are not satisfied with that explanation, the longer answer follows.

MPI vs. custom communication library


First of all, one should always keep in mind that MPI is a universal framework built for general-purpose HPC. It is really good for, let's say, academic HPC, where you have a calculation that you need to run only once, get the results, and forget about your program. But if you have a commercial HPC cluster designed to solve one particular problem many times (say, running simulations with the Monte-Carlo method), you should be able to optimize every single component of your system, just to make sure that your hardware utilization rate is high enough for the system to be cost-efficient. With your own code base you can make network communication as fast as possible, without any limitations. And, what is very important, you can keep this code simple and easy to understand, which is not always possible with general-purpose frameworks like MPI.

But what about the complexity of writing your own network library? Well, it is not as complex as you might imagine. Some tasks (like Monte-Carlo simulations) are embarrassingly parallel, so you don't need complex interactions between your nodes: a coordinator sends tasks to workers and then aggregates the results from them (see our GTC presentation for more details about that architecture). It is relatively easy to implement a lightweight messaging library on top of raw sockets; you just need a good enough software engineer for the task.
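To illustrate how small such a library can be, here is a minimal sketch of that coordinator/worker pattern on raw sockets. It is illustrative only and far simpler than our production code; messages are just length-prefixed byte blobs:

# Minimal coordinator/worker messaging on raw TCP sockets.
import socket
import struct

def send_blob(sock, payload):
    # prefix each message with its length so the receiver knows how much to read
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_blob(sock):
    (length,) = struct.unpack("!I", _recv_exactly(sock, 4))
    return _recv_exactly(sock, length)

def _recv_exactly(sock, n):
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data += chunk
    return data

def coordinator(port, worker_count, tasks):
    # accept all workers, send one task to each, then collect the results
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("", port))
    server.listen(worker_count)
    workers = [server.accept()[0] for _ in range(worker_count)]
    for sock, task in zip(workers, tasks):
        send_blob(sock, task)
    return [recv_blob(sock) for sock in workers]

def worker(host, port, handle_task):
    # connect, receive a task, send back the result; handle_task is the
    # application-specific computation (e.g. one Monte-Carlo batch)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((host, port))
    send_blob(sock, handle_task(recv_blob(sock)))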

And last, but definitely not least: a lightweight solution written to solve one particular problem is much faster and more predictable than a universal tool like MPI.

Benchmark


Our engineers compared the performance of our network code with Open MPI on Ubuntu and Intel MPI on CentOS (for some reason Intel MPI refused to work on Ubuntu). They tested multicast performance, because it is critical for the architecture we use in our solutions. There were three benchmarks (described below in a kind of pseudo-code):
1. MPI point-to-point
if rank == 0:
 #master
 for j in 0..packets_count:
  for i in 1..processes_count:
   MPI_Isend() #async send to slave processes
  for i in 1..processes_count:
   MPI_Irecv() #async recv from slave processes
  for i in 1..processes_count:
   MPI_Wait() #wait for send/recv to complete
else:
 #slave
 for j in 0..packets_count:
  MPI_Recv() #recv from master process
  MPI_Send() #send to master process
2. MPI broadcast
if rank == 0:
 #master
 for j in 0..packets_count:
  MPI_Bcast() #broadcast to all slave processes
  for i in 1..processes_count:
   MPI_Irecv() #async recv from slave processes
  for i in 1..processes_count:
   MPI_Wait() #wait for recv to complete
else:
 #slave
 for j in 0..packets_count:
  MPI_Bcast() #recv broadcast message from master process
  MPI_Send() #send to master process
3. TCP point-to-point
#master
controllers = []
for i in 1..processes_count: #wait for all slaves to connect
 socket = tcp_accept_as_blob_socket()
 controllers.append(controller_t(socket))

for j in 0..packets_count:
 for i in 1..processes_count:
  controllers[i].send() #async send to slave processes
 for i in 1..processes_count:
  controllers[i].recv() #wait for recv from slave processes

#slave
socket = tcp_connect_as_blob_socket() #connect to master
for j in 0..packets_count:
 socket.read() #recv packet from master
 socket.write() #send packet back to master

We ran the benchmarks with 10, 20, 40, 50, 100, 150 and 200 processes, sending packets of 8, 32, 64, 256, 1024 and 2048 bytes. Each test included 1000 packets.
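For reference, here is roughly what benchmark 1 looks like as runnable code, using mpi4py as a stand-in for the MPI calls in the pseudo-code above (run with something like mpiexec -n 11 python bench_p2p.py for the 10-worker case):

# Runnable approximation of benchmark 1 (MPI point-to-point echo) with mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

PACKETS = 1000
PACKET_SIZE = 64                      # one of the packet sizes under test
payload = bytearray(PACKET_SIZE)

if rank == 0:
    # master: async send to every worker, async recv from every worker, wait for all
    start = MPI.Wtime()
    for _ in range(PACKETS):
        recv_bufs = [bytearray(PACKET_SIZE) for _ in range(size - 1)]
        requests = [comm.Isend([payload, MPI.BYTE], dest=w) for w in range(1, size)]
        requests += [comm.Irecv([buf, MPI.BYTE], source=w)
                     for w, buf in zip(range(1, size), recv_bufs)]
        MPI.Request.Waitall(requests)
    print("elapsed: %.3f s" % (MPI.Wtime() - start))
else:
    # worker: echo every packet straight back to the master
    buf = bytearray(PACKET_SIZE)
    for _ in range(PACKETS):
        comm.Recv([buf, MPI.BYTE], source=0)
        comm.Send([buf, MPI.BYTE], dest=0)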

Results


First of all, let's look at the open-source MPI implementation results:
Open MPI @ Ubuntu, cluster of 3 nodes, 10 workers:

Open MPI @ Ubuntu, cluster of 3 nodes, 50 workers:

Open MPI @ Ubuntu, cluster of 3 nodes, 200 workers:
So, Open MPI is slower than our custom TCP messaging library in all tests. Another interesting observation: Open MPI broadcast is sometimes even slower than iterative point-to-point messaging with Open MPI.

Now let's look at the proprietary MPI implementation by Intel. For some reason it didn't work on the Ubuntu 11.04 we use on our test cluster, so we ran this benchmark on another cluster with CentOS. Please keep that in mind: you can't directly compare the Open MPI and Intel MPI results, as we tested them on different hardware. Our main goal was to compare MPI with our TCP messaging library, so these results still work for us. Another thing: Intel MPI broadcast didn't work for us, so we tested only point-to-point communication performance.
Intel MPI @ CentOS, cluster of 2 nodes, 10 workers:

Intel MPI @ CentOS, cluster of 2 nodes, 50 workers:

Intel MPI @ CentOS, cluster of 2 nodes, 200 workers:

Intel MPI is a much more serious opponent for our library than Open MPI. It is 20-40% faster in the 10-worker configuration and shows comparable (sometimes better) performance with 50 workers. But with 200 workers it is 50% slower than our messaging library.
You can also download an Excel spreadsheet with the complete results.

Conclusions


In general, Open MPI doesn't fit the middleware requirements of our projects. It is slower than our custom library and, what is even more important, it is quite unstable and unpredictable.
Intel MPI point-to-point messaging looks much more interesting on small clusters, but on large ones it becomes slow compared to our custom library. We had problems running it on Ubuntu, which might be an issue if you want to use Intel MPI with that Linux distribution. Its broadcast was unstable and hung.
So, sometimes the decision to write your own communication library doesn't look so bad, right?