1/29/2013

Cloud compute instances: it's not always about the horsepower

A short time ago we were consulting for one of our customers who was considering migrating their application to the cloud. The system incorporated a variety of computer vision algorithms, and one of the primary purposes of the back-end services was detecting features in images and matching them against a feature database. The algorithms were both CPU- and memory-intensive, so one of our first steps was benchmarking the recognition services on different Amazon EC2 instance types to find an optimal hardware configuration that the application could utilize efficiently.


So we launched a bunch of instances with varying computing capacity and gathered the initial results. To ensure complete utilization of hardware resources, we tried running 2, 4, 8, and even 16 benchmarks simultaneously on the same virtual machine.


So far, no surprises here. Obviously, the low-cost single-core workers were no match for the Cluster Compute instances. We can also clearly see that performance degraded significantly when a machine was running low on memory and the number of page faults was climbing (check out c1.xlarge with 4 and 8 concurrently running benchmarks).

On the other hand, to make the most of the Cluster Compute instances, we have to load them heavily; otherwise we would be paying for idle time caused by overprovisioned capacity. In many cases, providing enough load is a real issue: after all, not many tasks can make a dual-socket, 32-virtual-CPU machine cry. In our case, the only option was to launch more and more benchmarks simultaneously, because a single running benchmark didn't come anywhere close to 100% utilization.

That got us thinking: what is the optimal configuration with respect to cloud infrastructure cost? In other words, how can we get the best bang for the buck in this situation? Taking the EC2 hourly rates into account, we built one more chart, and this time the results were much more interesting:


For our particular case, the c1.medium and m3.xlarge instances, despite not having shown the best running times, suddenly made it into the top 3 most cost-effective instance types, whereas powerful machines such as cc1.4xlarge and cc2.8xlarge were cost-effective only under significant load.
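The metric behind that chart boils down to simple arithmetic: the cost of completing one benchmark run on a given instance type. A minimal sketch of the calculation (the names and structure here are illustrative, not our actual analysis script):

    // Approximate cost of completing a single benchmark run.
    // hourlyRate     - on-demand price of the instance type, in USD per hour
    // runSeconds     - wall-clock time of one benchmark run
    // concurrentRuns - how many benchmarks were running in parallel on the instance
    static double CostPerBenchmark(double hourlyRate, double runSeconds, int concurrentRuns)
    {
        double hoursBusy = runSeconds / 3600.0;      // time the instance was occupied
        double totalCost = hourlyRate * hoursBusy;   // money spent for that time
        return totalCost / concurrentRuns;           // shared across the parallel runs
    }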

Based on this simple case, three lessons can be learned:
  • Measurability matters. If you have concrete figures on your application's performance, you can choose a deployment strategy that minimizes operational expenses.
  • Avoiding idle time on powerful machines with many logical CPUs can be difficult. Not all algorithms and implementations provide the necessary degree of parallelism to ensure efficient utilization of hardware.
  • If fast processing time is not critical for your product, consider using multiple commodity-hardware nodes as an alternative to a single high-end server.

Disruptor.NET


Disruptor is a framework for building high-performance applications that lets you process large volumes of messages concurrently without locks. It was originally created by members of the LMAX software development team: Martin Thompson, Mike Barker, and Dave Farley. It is said that with Disruptor you can achieve 100K transactions per second with less than 1 ms latency. I won't describe all of the Disruptor concepts here, as there are plenty of great articles about it on the Internet. My favorites are Martin Fowler's article, the LMAX presentation at InfoQ, and an absolutely awesome series of blog posts by Ruslan Cheremin (in Russian). Disruptor was originally written in Java and released under the Apache license. You can find the source code, along with some useful information and links, on GitHub. There is also a pretty good port for the .NET CLR, disruptor-net, and that is what I'm going to talk about today.

At first glance, Disruptor is just another name for the good old ring buffer, but if you look deeper, there are several important features that make the difference. Well-written Disruptor-based applications can be incredibly fast. This speed comes from a great mix of high-level architectural choices and low-level optimizations based on deep knowledge of modern CPU and memory architecture. Almost every decision the Disruptor authors made was aimed at keeping applications cache-efficient, lock-free, and GC-friendly.

Today I'm going to show you how to write a very simple Disruptor-based application. And just to set it apart from other introductory articles, I'll use disruptor-net, so it will be a Disruptor application written in C#. First of all, let's define the scope. Imagine there are lots of moving objects (e.g. cars). Each object has a GPS unit installed on board and sends its location over the mobile network at regular intervals. We want to track the path of every single object. It is very important to log every single message to disk so that the complete movement history can be restored. We also want to aggregate the distance traveled by each object, as well as the total distance traveled by all objects. Sounds like a very simple application, right?

Our moving object will have three properties: latitude, longitude, and a unique ID. Something like this:
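(A minimal sketch; I'll call the class PositionEvent here, though the names in the demo code on GitHub may differ.)

    public class PositionEvent
    {
        public int Id { get; set; }            // unique object identifier
        public double Latitude { get; set; }   // degrees
        public double Longitude { get; set; }  // degrees
    }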

As I don't have a fleet of thousands of moving objects with GPS sensors, we will simulate their movement with a very simple random simulator (see the code on GitHub).
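The simulator in the repository is the reference; just to give the idea, a toy random-walk step could be as simple as this:

    using System;

    static class RandomWalk
    {
        private static readonly Random Rng = new Random();

        // Nudge an object's position by a small random delta on every "GPS tick".
        public static void Step(PositionEvent obj)
        {
            obj.Latitude  += (Rng.NextDouble() - 0.5) * 0.001;
            obj.Longitude += (Rng.NextDouble() - 0.5) * 0.001;
        }
    }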

The simplest way to use Disruptor in your .NET application is to install it via the NuGet package manager:
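From the Package Manager Console (at the time of writing, the package id is simply Disruptor):

    PM> Install-Package Disruptor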

Let's create the Disruptor instance:
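Roughly like this (the exact constructor overloads differ slightly between disruptor-net versions, so treat this as a sketch; PositionEvent is the event class defined above):

    // Namespaces: Disruptor.Dsl for the Disruptor<T> DSL, System.Threading.Tasks for TaskScheduler.
    var disruptor = new Disruptor<PositionEvent>(
        () => new PositionEvent(),   // factory that preallocates every slot of the ring buffer
        1024,                        // ring buffer size; must be a power of 2
        TaskScheduler.Default);      // event handlers will run on the default task scheduler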

Here we pass a factory method that preallocates the ring buffer with messages (remember that one of the main ideas behind Disruptor is to reduce memory allocation?), set its size (1024; it must be a power of 2), and tell it to use the default task scheduler. Now we need some handlers. Our handler graph will look like this: the persist handler goes first, then the distance and console log handlers run in parallel, and finally the acknowledgement handler.

Let's start with a simple persist handler that writes objects to a text file, line by line:
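A sketch of what ObjectPersistHandler could look like (the file name and the serialization format here are just for illustration):

    using System.IO;
    using Disruptor;

    public class ObjectPersistHandler : IEventHandler<PositionEvent>
    {
        // Kept open for the lifetime of the handler; a real application would dispose it on shutdown.
        private readonly StreamWriter _writer = new StreamWriter("positions.log");

        public void OnNext(PositionEvent data, long sequence, bool endOfBatch)
        {
            // Naive string serialization, good enough for the demo.
            _writer.WriteLine("{0};{1};{2}", data.Id, data.Latitude, data.Longitude);

            // Flushing to disk is the expensive part, so we do it only once per batch.
            if (endOfBatch)
                _writer.Flush();
        }
    }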

As you can see, it implements the IEventHandler<T> interface. This interface has only one method: void OnNext(T data, long sequence, bool endOfBatch). Disruptor calls this method for each item in the ring buffer, passing the item as the first parameter. The second parameter is the item's sequence number. The third parameter is the most interesting one: since Disruptor supports batch processing, it indicates whether the current item is the last one in the batch. We use it to reduce the number of flush calls: we flush only at the end of a batch. Leaping ahead, since ObjectPersistHandler is the slowest handler in our pipeline, batch processing gave us a 2x performance boost in our benchmark. It may look like cheating, but in fact everything is fine: the other handlers won't start processing a batch before the persist handler has finished it, so no message is ever processed before it is stored to disk. Please note that we use simple string serialization here; a real-world application would use a more sophisticated serialization method.
We have a few other handlers: DistanceHandler, ConsoleLogHandler, and AknowledgementHandler. The first one is the business logic: it performs the distance calculations for the objects:
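Again, the version in the demo repository is the one to look at; a rough sketch of the idea (with a naive flat-earth distance approximation) might be:

    using System;
    using System.Collections.Generic;
    using Disruptor;

    public class DistanceHandler : IEventHandler<PositionEvent>
    {
        // No locks or other synchronization primitives: this handler always runs on a single
        // thread and shares no mutable state with the other handlers.
        private readonly Dictionary<int, PositionEvent> _lastPositions = new Dictionary<int, PositionEvent>();
        private readonly Dictionary<int, double> _distancePerObject = new Dictionary<int, double>();

        public double TotalDistance { get; private set; }

        public void OnNext(PositionEvent data, long sequence, bool endOfBatch)
        {
            PositionEvent last;
            if (_lastPositions.TryGetValue(data.Id, out last))
            {
                // Crude flat-earth approximation, good enough for a demo.
                double dLat = data.Latitude - last.Latitude;
                double dLon = data.Longitude - last.Longitude;
                double step = Math.Sqrt(dLat * dLat + dLon * dLon);

                double travelled;
                _distancePerObject.TryGetValue(data.Id, out travelled);
                _distancePerObject[data.Id] = travelled + step;
                TotalDistance += step;
            }

            // Keep a private copy of the current position; the ring buffer entry itself
            // will be overwritten when its slot is reused.
            _lastPositions[data.Id] = new PositionEvent { Id = data.Id, Latitude = data.Latitude, Longitude = data.Longitude };
        }
    }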

I think this code is pretty straightforward, so no additional explanation is needed. The only thing you might notice is that it doesn't use any locks or other synchronization primitives, because each handler is single-threaded and doesn't share any mutable data with the other handlers.

The AknowledgementHandler tells the simulator that we have finished processing and it is time to make the next object move. In real-world code this would be some kind of network message telling the client that its message has been handled and it can send the next one. After we move an object, it is published into the Disruptor using the following code:
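The essence of it is the claim/copy/publish pattern below (a sketch; ringBuffer is the buffer returned by disruptor.Start(), shown further down, and movedObject holds the simulator's new position):

    // Claim the next slot, copy the new position into the preallocated entry,
    // then publish the sequence so the handlers can process it.
    long sequence = ringBuffer.Next();
    try
    {
        PositionEvent entry = ringBuffer[sequence];
        entry.Id = movedObject.Id;
        entry.Latitude = movedObject.Latitude;
        entry.Longitude = movedObject.Longitude;
    }
    finally
    {
        ringBuffer.Publish(sequence);
    }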

Please note that we do not store a new object into the ring buffer; instead we copy its properties into the existing entry. This guarantees that no data is shared between different threads and keeps the GC load low.
The final thing we need to do is configure the Disruptor to use our handlers. It can be done with fluent syntax like this:
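Something along these lines (the handler variables are instances of the handler classes described above):

    disruptor.HandleEventsWith(persistHandler)           // 1. persist every message to disk first
             .Then(distanceHandler, consoleLogHandler)   // 2. these two run on parallel threads
             .Then(acknowledgementHandler);              // 3. finally, acknowledge to the simulator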

Please note that the Then() method creates a synchronization barrier, so we can guarantee that the handlers will be executed strictly in the required order. The console log handler and the distance handler are included in one handler group, so they will be called on parallel threads.
The only thing left is to start execution:
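Which, in disruptor-net, should be a single call (Start() also hands back the ring buffer used for publishing above):

    var ringBuffer = disruptor.Start();   // spins up the handler threads and returns the ring buffer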

Let's measure the performance. On my work machine (Core i5 @ 3.2 GHz, 8 GB RAM, and a regular spinning HDD) I got the following results:

Of course, in a real-world app there would be some additional latency introduced by network communication, but still, 340K events per second is a pretty impressive throughput. Moreover, I think it could be much better if we set out to achieve the maximum possible performance here.

Disruptor gives you a flexible way to create high-throughput, low-latency event processing applications. It is a very simple yet powerful tool with lots of great ideas inside and a really good implementation. And you can use the .NET CLR port if for some reason you don't want to use the original Java implementation.
As usual, you can find the demo code on GitHub. Stay tuned!