
Thursday, June 11, 2009

Challenges in throughput scaling in a partitioned design

Here is an interesting article from Billy Newport posted on highscalability.com

The gist of this article is a point that hits home for me - you cannot assume better scaling (especially in throughput) just by adding more partitions to manage your data.

The article correctly points out that one slow partition can throttle overall system throughput for scatter-gather style queries that are parallelized across partitions. But this assumes that all nodes are (and remain) equal from a load standpoint. I think this ability to detect and handle "hotspots" deserves a lot more attention than it gets. I suspect there is a general myth that increasing the number of partitions will always result in higher throughput.
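To make that concrete, here is a minimal, hypothetical Java sketch (plain ExecutorService code, not any GemFire API) of a scatter-gather query fanned out across partitions. The gather step has to wait for the slowest reply, so a single hot or paused partition sets the latency for every query:

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch: a scatter-gather query fanned out to N partitions.
// The gather step cannot complete until the slowest partition replies,
// so one "hot" or paused partition caps the latency of every query.
public class ScatterGather {
    public static void main(String[] args) throws Exception {
        int partitions = 8;
        ExecutorService pool = Executors.newFixedThreadPool(partitions);

        long start = System.nanoTime();
        List<Future<Long>> replies = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            final int partition = p;
            replies.add(pool.submit(() -> {
                // Pretend partition 3 is busy (hotspot or GC pause).
                long serviceTimeMs = (partition == 3) ? 1000 : 1;
                Thread.sleep(serviceTimeMs);
                return serviceTimeMs;
            }));
        }

        for (Future<Long> f : replies) {
            f.get();   // gather: blocks until *every* partition has answered
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Query latency ~" + elapsedMs
                + " ms - dominated by the slowest partition");
        pool.shutdown();
    }
}
```

Seven of the eight partitions answer in about a millisecond, yet the query still takes about a second, which is exactly the throttling effect the article describes.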

For instance, we at GemStone have seen the following interesting cases (+ more):
(1) 80-20 rule: Often, 20% of the data is far more popular than the rest at any given time. So you end up with too many clients converging on a few partitions, creating an imbalance, while a whole bunch of your partitions sit idle and heavily underutilized. You can mitigate this to some extent by creating more replicas of the "hot" data, but it can be difficult to predict when you will need the extra copies and for how long.

(2) GC pauses:
Consider a case where a large number of clients are updating partitioned data deployed in a JVM-based data grid/fabric. Sooner or later you are exposed to the dreaded "full GC" cycle, which pauses any incoming client request. Say a normal update request takes a fraction of a millisecond, and assume that client requests are uniformly balanced across all the partitions. It is then more or less guaranteed that every client will touch each partition at least once every second. Now, a full GC in any one partition causes every single client to pause. If that GC takes a whole minute, you have all clients waiting for a minute. Worse, the moment the paused partition recovers, the next partition runs into the same situation, and so on.
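To put rough numbers on this, here is a back-of-the-envelope sketch (the partition count, pause length, and GC frequency below are made-up illustrations, not measurements from any real deployment):

```java
// Back-of-the-envelope: impact of serialized full-GC pauses on a partitioned grid.
// All numbers below are illustrative assumptions, not measurements.
public class GcPauseMath {
    public static void main(String[] args) {
        int partitions = 20;               // number of data partitions (JVMs)
        double fullGcSeconds = 60.0;       // length of one full GC pause
        double fullGcsPerHourPerJvm = 1.0; // how often each JVM hits a full GC

        // Because every client touches every partition within ~1 second,
        // a full GC on ANY partition effectively stalls ALL clients.
        double stalledSecondsPerHour = partitions * fullGcsPerHourPerJvm * fullGcSeconds;
        double fractionStalled = stalledSecondsPerHour / 3600.0;

        System.out.printf("Clients stalled ~%.0f s per hour (%.0f%% of the time)%n",
                stalledSecondsPerHour, fractionStalled * 100);
        // With these numbers: 20 * 60 s = 1200 s/hour, i.e. roughly a third of
        // the hour spent waiting, even though each JVM only pauses for 60 s.
    }
}
```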

So, what do you do?

We at GemStone Systems think some of the problems can be mitigated by providing two key capabilities:
1) Dynamic rebalancing: This is the ability for the system as a whole to adjust itself by relocating buckets (subsets of the data within a partition) to less loaded nodes, and to do so in a fashion that introduces no pauses.

2) Enough instrumentation within the system to proactively avoid hotspots:
GemFire captures a wide range of statistics: the query and update rate on any given partition, average response times, CPU utilization, GC pause times, overall heap utilization relative to a configurable threshold (to reduce the probability of full GCs), and so on. Applications get access to these statistics through a simple API and can then use an API call to trigger rebalancing, offloading some of the data from "hot" partitions to less loaded ones. Of course, there is no guarantee that the throughput characteristics observed on any given partition in the past will be representative of the future, but the explicit control gives application developers and administrators a way to dictate what happens. Our sense is that they know best. We even provide a way for the application to simulate a rebalance to see the effect it would have, rather than actually doing it.
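As a rough illustration of the rebalance-on-demand and simulate capabilities mentioned above, the sketch below uses the GemFire ResourceManager API as I understand it; exact package, class, and method names may differ across releases, so treat it as an approximation rather than copy-paste code:

```java
import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.control.RebalanceFactory;
import com.gemstone.gemfire.cache.control.RebalanceOperation;
import com.gemstone.gemfire.cache.control.RebalanceResults;
import com.gemstone.gemfire.cache.control.ResourceManager;

// Rough sketch of triggering (or simulating) a rebalance from application code.
public class RebalanceExample {
    public static void main(String[] args) throws Exception {
        Cache cache = CacheFactory.getAnyInstance(); // assumes a Cache is already up

        ResourceManager manager = cache.getResourceManager();
        RebalanceFactory factory = manager.createRebalanceFactory();

        // Dry run first: simulate() reports what would move without moving anything.
        RebalanceResults preview = factory.simulate().getResults();
        System.out.println("Would transfer ~" + preview.getTotalBucketTransferBytes() + " bytes");

        // If the preview looks reasonable, kick off the real thing.
        RebalanceOperation op = factory.start();
        RebalanceResults results = op.getResults(); // blocks until the rebalance completes
        System.out.println("Moved " + results.getTotalBucketTransfersCompleted() + " buckets");
    }
}
```

Simulating first lets an administrator see how many buckets (and bytes) would move before committing to the actual data transfer.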

Explore more on GemFire data fabric