Technical rants on distributed computing, high performance data management, etc. You are warned! A lot will be shameless promotion for VMWare products

Thursday, October 07, 2010

What is new in VMWare's vFabric GemFire 6.5?

Given the breadth of new capabilities, GemFire 6.5 might as well have been called 7.0. One of the important themes for us has been to make sure all stateful services can be partitioned for near linear scaling. With this release we go beyond partitioning in-memory data and application behavior. We can now manage data on disk in a highly partitioned manner and even process subscriptions with linear scaling. Our pursuit is simple - no matter which features of the product are being used, there is a level of assurance that the application will scale with increasing capacity.

In a nutshell(it is a big nut), here are some of the capabilities we introduced:

  • Database class reliability through Shared-nothing parallel persistence: A unique high performance design to pool disks across the cluster for storing and recovering data. GemFire always supported disk persistence and recovery for replicated data regions but now this capability has been extended for partitioned data also. The design principles adopted are fundamentally different than ones in typical clustered databases. For one, disk storage is shared nothing - each cache process owns its disk store eliminating process level contentions. Second, the design is tilted to favor memory i.e. there are no complex B-Tree data structures on disk; instead we assume complex query navigations will always be done through in-memory indexes. Third, the design uses rolling append-only log files to avoid disk seeks completely. Finally, the design preserves the rebalancing model in GemFire when capacity is increased or decreased - the disk data also relocates itself. Well, there is more to the story which I will cover once we walk through what else is new in 6.5..
  • Simplified and Intuitive programming model: First, we simplified the programming model by reducing some of the boiler plate bootstrapping code that was required in the past and introduced pre-packaged templates for common configurations and topologies. Second, we launched the new Spring-Gemfire project to introduce a clean programming model for the Spring developer. Note that Spring-GemFire is not bundled with GemFire 6.5. If you are already familiar with the GemFire APIs or just getting started, I would recommend going through the new tutorial that walks through some of the new  simplified programming APIs. And, not to worry - your existing application will continue to just run fine. The old APIs are fully supported.
  • Improved scale-out capabilities: Application deployments using the tiered model (client process embedding a local cache talking to a server farm) could see a 2X or more performance gain when accessing partitioned data. With 6.5, client processes gain knowledge about server side partitions and use it on the fly to direct traffic directly to the server with the required data set. Application clients subscribing to events using key based register interest or "continuous queries" now have their subscriptions registered on all the partitioned nodes. This allows each partition to process the subcription filters on the nodes where the data changes are applied dramatically reducing messaging traffic between peers unlike in the previous releases. The net effect is that more subscriptions can be processed and the event delivery latency to clients is also reduced.
  • Colocated transactions: If each logical partition were to own the entire transactional working set then highly applications can scale linearly if the concurrent transaction load is uniformly spread across the data set and hence across all the partitions. Each partition can coordinate its transaction without any impact to other partitions with no locking requirements across partitions. GemFire 6.5 introduces a change to the distributed transaction design to detect cases where the data is colocated and avoids engaging the built-in distribued lock service.
  • C++, C# client improvements: The clients can now receive and send object change "deltas", CQ processing on the client has improved, Parallel Function execution on the Grid is automatially HA, SSL and user level security support has been added and the client cache sizes can be significantly larger with support for 64-bit architectures.

For a complete list of features added in 6.5, click here

Next, I will rant and rave our disk persistence story.

Factors contributing to very high disk throughput:

Pooling: Like mentioned before, each cache instance manages its own disk store and there is no disk contention between processes. Each partition can locally manage its data on local disk(s). Assuming application "write" load can be uniformly balanced across the cluster, the aggregate disk throughput will be (Disk transfer rate * NumOfPartitions), assuming a single disk per partition. Disk transfer rates can be upto 100MB/sec on commodity machines today compared to just 2MB/sec in the 1980s. 


Avoid seeks: By managing most (or all) of the data in cluster memory all reads are served without navigating through BTree based indexes and data files on disk, which as you know will result in continuous seeking on disk. The average disk seek times today are still 2ms or higher. 

Buffered logging: When writes do occur, these operations are simply logged to disk in "append-only" log/data files (See figure above). Appending implies we can continuously write to consecutive sectors on disk without requiring disk head movement. Probably the most controversial decision we had to make was to allow all writes to only be flushed to the OS buffer rather than 'fsync' all the way to disk. The writes are buffered by the IO subsystem in the kernel allowing the IO scheduler to merge and sort disk writes to achieve the highest possible disk throughput. The write requests need not initiate any disk I/O until some time in the future. Thus, from the perspective of a user application, write requests stream at much higher speeds, unencumbered by the performance of the disk. Any data loss risks due to sudden failures at a hardware level is mitigated by having multiple nodes write in parallel to disk. In fact, it is assumed that hardware will fail, especially in large clusters and data centers and software needs to take into account such failures.  The system is designed to recover in parallel from disk and to guarantee data consistency when data copies on disk don't agree with each other. Each member of the distributed system logs membership changes to its persistent files and uses this during recovery to figure out the replica that has the latest changes and automatically synchronizes these changes at startup.

Motivations for native GemFire persistence instead of an RDB

Most data grid deployments today use a RDB as the backing store. A synchronous design where every change is reflected in the database first has obvious challenges for "write heavy" applications. You are only as fast as the RDB and the database could become the single point of failure (SPOF). Transaction execution complexity also increases involving the data grid and the database in a 2-phase commit protocol. 
So, one remedy is to execute all writes on the data grid and asynchronously propagate the changes to the RDB. This pattern also has the same SPOF challenges and not well suited for cases with sustained high write rates. 
Designs that go with Database "shards" - each cache instance writing to its own independent DB instance for scale is interesting but tough to actually implement with good HA characteristics. 

We believe that the GemFire parallel persistence model gets rid of all these limitations. Now, I am not advocating that you shouldn't move the data in the grid back to your classic RDB. You probably need to do this for all sorts of reasons as you pipe the information to upstream and downstream applications. But, think twice if your RDB is merely meant to be a backing store.
Even with customers that remain apprehensive with GemFire acting as the backing store, some want to still manage data on Gemfire disk stores just so that the cluster recovery can be fast. 

Handling failure conditions

The choice to use buffered logs meant we had to make sure the data is written to disk on multiple nodes for reliability. When partitions failed their disk stores can become stale. If later the entire cluster were to be bounced, the recovery logic has to make sure the state of the data in-memory and on disk reflected the latest state. We spent a lot of our energy in making sure the design for recovery always gaurantees consistency and freshness of data. 

In summary, the design is fundamentally very different from the traditional DB approach - data is stored on disk and memory is used to optimize disk IO (manage raw disk blocks instead of app data objects). Instead, in our thinking, data is primarily managed in memory and disk is primarily used for recovery and to address safety concerns. Disk capacity has increased more than 10000-fold over the last 25 years and seems likely to continue increasing in the future. Unfortunately, though, the access rate to information on disk has improved much more slowly: the transfer rate for large blocks has improved “only” 50-fold, and seek time and rotational latency have only improved by a factor of two.

You can read the product documentation on persistence here.

Spring-GemFire integration

We outlined a high level technology integration strategy to our customers and the press when we merged with SpringSource/VMWare on how GemFire will integrate with the spring framework (Java and .NET). The core Spring engineers have already delivered on part of the promise through the first milestone release making it natural for the Spring developer to access GemFire.
Among other things, the integration provides:

  1. Simpler ways to configure a cache and data regions. You can then inject the region into your app POJOs just like any spring dependency.
  2. Allow the Spring developer to use Spring transaction model - declarative configuration and doing transactions consistently across a variety of providers. So, basically, the application doesn't have to explicitly invoke the GFE transaction APIs. Of course, our txn caveats do apply.
  3. Wiring dependencies for callbacks easily: So, for instance, if you are using a CacheWriter or loader that needs DB URL, connection properties, etc you can now use the conventional Spring way of configuring DataSources and can be auto injected into your callback.
  4. ETC

I encourage you to read Costin Leau's blog for specific details, download and give it a try. Your feedback will be very valuable and much appreciated.

I hope to amend this blog post with further details on the various other "scale out" features in 6.5 soon. 
If there is enough interest, go through our community site on 6.5, download and try out the new tutorial. 


Anonymous said...

"Probably the most controversial decision we had to make was to allow all writes to only be flushed to the OS buffer rather than 'fsync' all the way to disk."

That would mean in case of a sudden loss of power, odds are that data will be lost in a highly used system. In other words, the data cannot be considered durable. Is there a way to change persistence to fsync mode when, say, my UPS signals danger?

Jags Ramnarayan said...

There is a configuration option for fsync'ing to disk. We are considering adding support so that optionally the persistence model can automatically become conservative when there is only one copy being written to disk due.