Technical rants on distributed computing, high performance data management, etc. You are warned! A lot will be shameless promotion for VMWare products

Sunday, November 14, 2010

Announcing vFabric GemFire HTTP session management module for Tomcat/TCServer

Clustering of HTTP sessions has been around for a while. I suspect many of you are already wondering how this is better than or unique compared to the built-in clustering mechanisms in application servers or third-party solutions like EHCache, Oracle Coherence, etc. Several of our customers already manage sessions in GemFire explicitly using our API. The need to effectively manage large quantities of session state is too great not to use a distributed data grid solution such as GemFire, which can partition session state in memory across the cluster for scale and maintain synchronous copies of session state in memory on multiple nodes to provide high availability (HA).

So, what is new? This announcement brings clean, pluggable integration of GemFire within Tomcat and SpringSource tc Server environments. HTTP session management is enabled through very simple configuration steps. Unlike competitive products, there is no vendor-specific configuration of "cache" XML in most cases. If you understand your application volume and scale requirements, it might just be a matter of configuring one of the pre-packaged templates when you start up tc Server instances.
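For Tomcat, enabling the module boils down to dropping the GemFire jars into Tomcat's lib directory and declaring a lifecycle listener plus a session manager. The snippet below is an illustrative sketch only - the exact class names and attributes may differ by module version, so check the module documentation on the community site:

```xml
<!-- server.xml: bootstrap the embedded GemFire cache (illustrative class names) -->
<Listener className="com.gemstone.gemfire.modules.session.catalina.PeerToPeerCacheLifecycleListener"/>

<!-- context.xml: replace Tomcat's default session manager -->
<Context>
  <Manager className="com.gemstone.gemfire.modules.session.catalina.Tomcat6DeltaSessionManager"/>
</Context>
```

Once the manager is in place, the application code keeps using the standard `HttpSession` API; nothing changes at the servlet level.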

You can read about the features, download and give it a try from our community site.

But, there are several interesting patterns that go beyond this basic value proposition and I will attempt to cover these below ...

Support for multi-site clustering: Today, HA for sessions implies copies maintained on two or more servers that are part of the same cluster. If the entire cluster fails or becomes unreachable (say, a network split occurs), your applications are likely going to fail over to some DR cluster, but without access to the sessions. Wouldn't it be nice if your session state survived such failure conditions? With GemFire, you can configure the session state to be replicated over what we call "WAN gateways" (see the section "multi site setup") to your DR cluster. The replication is asynchronous, with support for batching and conflation.
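In GemFire 6.x terms, the gateway wiring lives in cache.xml. The fragment below is a sketch from memory of the 6.x element names (region name, ports and hosts are made up); treat it as illustrative rather than copy-paste ready:

```xml
<!-- primary site: forward session-region events to the DR site -->
<cache>
  <gateway-hub id="primary-hub" port="11111">
    <gateway id="dr-site">
      <gateway-endpoint id="dr-1" host="dr-host-1" port="22222"/>
      <gateway-queue batch-size="100" batch-time-interval="1000"
                     enable-conflation="true"/>
    </gateway>
  </gateway-hub>
  <region name="gemfire_modules_sessions">
    <region-attributes enable-gateway="true"/>
  </region>
</cache>
```

Conflation matters for sessions: if the same session is updated several times before a batch ships, only the latest state needs to cross the WAN.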

Sessions are getting increasingly obese and may never die: Session state often references personalization information - things like preferences, past transactions and even past chat sessions with customer support. Traditionally, all this data was managed only in the RDB and fetched every single time the user logged in - one of the main causes of database overload. Increasingly, I see sessions that range in size from a few KB to several MB each. They are represented by convoluted object graphs, and when changes occur, they touch only a minuscule fraction of the entire session state. Managing complex, constantly changing and ever-growing session state requires special consideration:
  1. when sessions change the session replication layer needs to be smart about only replicating the changes and not the entire session every single time. 
  2. User sessions may come and go, but the associated state may last forever. This implies you cannot maintain everything in memory. You need to offload to disk, and that requires a super-efficient mechanism to persist the state to disk across the cluster. This persistence mechanism cannot be expensive and shouldn't carry the administration overhead common with RDBs.
With GemFire, when sessions are updated, only the updated attributes are sent over the wire - to servers and replicas. Session state can overflow to disk across the cluster and can even be made persistent. The state will recover in parallel even if the entire cluster is restarted.
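To make the delta idea concrete, here is a minimal, self-contained sketch (my own illustration, not GemFire's actual implementation): a session that remembers which attributes were touched, so replication can ship only the changed entries instead of the whole object graph:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative sketch of delta propagation for session state: only
 * attributes modified since the last replication cross the wire.
 */
public class DeltaSessionSketch {
    private final Map<String, Object> attributes = new HashMap<>();
    private final Set<String> dirty = new HashSet<>();

    public void setAttribute(String name, Object value) {
        attributes.put(name, value);
        dirty.add(name); // remember what changed since the last replication
    }

    /** Returns only the changed attributes and clears the dirty set. */
    public Map<String, Object> extractDelta() {
        Map<String, Object> delta = new HashMap<>();
        for (String name : dirty) {
            delta.put(name, attributes.get(name));
        }
        dirty.clear();
        return delta;
    }

    /** A replica applies the delta without receiving the whole session. */
    public void applyDelta(Map<String, Object> delta) {
        attributes.putAll(delta);
    }

    public Object getAttribute(String name) {
        return attributes.get(name);
    }
}
```

If a multi-megabyte session has one attribute change, the replica receives a map with one entry rather than the full session - that is the whole trick.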

Burst into the cloud: There is increasing interest in the ability to go beyond what the cluster can handle when load spikes. Ideally, applications can burst into the "cloud" - most likely a private cloud that supports on-demand provisioning of resources using virtualization. Bursting basically means new Tomcat instances get spun up on demand in some remote data center. This may also mean migrating users to the new cluster. Now, wouldn't it be nice if the session state were to magically appear or be accessible from the primary cluster?
Alright! In reality, magic is just an illusion. Nothing happens magically in GemFire. Let me spare you a long narration. With GemFire, one can achieve this by configuring WAN gateways to potential remote clusters. If and when a remote "cloud" cluster is launched, the sessions automatically end up getting replicated.

Sessions span heterogeneous apps: Session state published by one clustered application may need to be accessible by other distributed applications. These applications won't necessarily be deployed as part of the same application server cluster. You need session state that can outlast your application or your cluster, stored in a format that is accessible from other languages. Though we don't support pluggable session management for other languages like .NET (yet), the application developer can still use GemFire native serialization and access the session state from other environments like C# and C++. We intend to support seamless session access across many environments in the future.

So, again, You can read about the features, download and give it a try from our community site.

Thursday, October 07, 2010

What is new in VMWare's vFabric GemFire 6.5?

Given the breadth of new capabilities, GemFire 6.5 might as well have been called 7.0. One of the important themes for us has been to make sure all stateful services can be partitioned for near linear scaling. With this release we go beyond partitioning in-memory data and application behavior. We can now manage data on disk in a highly partitioned manner and even process subscriptions with linear scaling. Our pursuit is simple - no matter which features of the product are being used, there is a level of assurance that the application will scale with increasing capacity.

In a nutshell (it is a big nut), here are some of the capabilities we introduced:

  • Database-class reliability through shared-nothing parallel persistence: A unique high-performance design to pool disks across the cluster for storing and recovering data. GemFire has always supported disk persistence and recovery for replicated data regions, but now this capability has been extended to partitioned data as well. The design principles adopted are fundamentally different from those in typical clustered databases. For one, disk storage is shared-nothing - each cache process owns its disk store, eliminating process-level contention. Second, the design is tilted to favor memory, i.e. there are no complex B-Tree data structures on disk; instead we assume complex query navigation will always be done through in-memory indexes. Third, the design uses rolling append-only log files to avoid disk seeks completely. Finally, the design preserves the rebalancing model in GemFire: when capacity is increased or decreased, the disk data also relocates itself. Well, there is more to the story, which I will cover once we walk through what else is new in 6.5.
  • Simplified and intuitive programming model: First, we simplified the programming model by reducing some of the boilerplate bootstrapping code that was required in the past and introduced pre-packaged templates for common configurations and topologies. Second, we launched the new Spring-GemFire project to introduce a clean programming model for the Spring developer. Note that Spring-GemFire is not bundled with GemFire 6.5. If you are already familiar with the GemFire APIs or just getting started, I would recommend going through the new tutorial that walks through some of the new simplified programming APIs. And, not to worry - your existing application will continue to run just fine. The old APIs are fully supported.
  • Improved scale-out capabilities: Application deployments using the tiered model (a client process embedding a local cache talking to a server farm) could see a 2X or more performance gain when accessing partitioned data. With 6.5, client processes gain knowledge about server-side partitions and use it on the fly to route traffic directly to the server with the required data set. Application clients subscribing to events using key-based register interest or "continuous queries" now have their subscriptions registered on all the partitioned nodes. This allows each partition to process the subscription filters on the nodes where the data changes are applied, dramatically reducing messaging traffic between peers compared to previous releases. The net effect is that more subscriptions can be processed and the event delivery latency to clients is reduced.
  • Colocated transactions: If each logical partition owns the entire transactional working set, applications can scale linearly, provided the concurrent transaction load is uniformly spread across the data set and hence across all the partitions. Each partition can coordinate its transactions without any impact on other partitions and with no locking requirements across partitions. GemFire 6.5 introduces a change to the distributed transaction design to detect cases where the data is colocated and avoid engaging the built-in distributed lock service.
  • C++ and C# client improvements: The clients can now receive and send object change "deltas", CQ processing on the client has improved, parallel function execution on the grid is automatically HA, SSL and user-level security support has been added, and client cache sizes can be significantly larger with support for 64-bit architectures.
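To illustrate the single-hop routing mentioned in the scale-out bullet, here is a toy model (the names and structure are mine, not the GemFire client API): the client shares the servers' hash function and a bucket-to-server map, so a lookup is routed straight to the owning server instead of bouncing through an arbitrary one:

```java
import java.util.List;

/**
 * Toy model of partition-aware ("single hop") client routing.
 * The client holds a bucket-to-server map learned from the cluster
 * and computes the same bucket hash the servers use.
 */
public class SingleHopRouter {
    private final List<String> bucketOwners; // index = bucket id, value = server name

    public SingleHopRouter(List<String> bucketOwners) {
        this.bucketOwners = bucketOwners;
    }

    /** Same hash on client and servers keeps the mapping consistent. */
    public int bucketFor(Object key, int totalBuckets) {
        return Math.floorMod(key.hashCode(), totalBuckets);
    }

    /** One network hop: go directly to the server owning the key's bucket. */
    public String serverFor(Object key) {
        return bucketOwners.get(bucketFor(key, bucketOwners.size()));
    }
}
```

The 2X gain in the tiered model comes from exactly this: the extra hop through a non-owning server, and its forwarding work, disappear from the hot path.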

For a complete list of features added in 6.5, click here

Next, I will rant and rave about our disk persistence story.

Factors contributing to very high disk throughput:

Pooling: As mentioned before, each cache instance manages its own disk store, so there is no disk contention between processes. Each partition can manage its data on local disk(s). Assuming the application "write" load can be uniformly balanced across the cluster, the aggregate disk throughput will be (disk transfer rate * number of partitions), assuming a single disk per partition - for example, 10 partitions each writing to a 100MB/sec disk yield roughly 1GB/sec of aggregate throughput. Disk transfer rates can be up to 100MB/sec on commodity machines today, compared to just 2MB/sec in the 1980s.


Avoid seeks: By managing most (or all) of the data in cluster memory, all reads are served without navigating through B-Tree indexes and data files on disk, which, as you know, would result in continuous seeking on disk. Average disk seek times today are still 2ms or higher.

Buffered logging: When writes do occur, these operations are simply logged to disk in "append-only" log/data files (see figure above). Appending implies we can continuously write to consecutive sectors on disk without requiring disk head movement. Probably the most controversial decision we had to make was to allow all writes to be flushed only to the OS buffer rather than 'fsync'ed all the way to disk. The writes are buffered by the IO subsystem in the kernel, allowing the IO scheduler to merge and sort disk writes to achieve the highest possible disk throughput. The write requests need not initiate any disk I/O until some time in the future. Thus, from the perspective of a user application, write requests stream at much higher speeds, unencumbered by the performance of the disk. Any data loss risks due to sudden failures at the hardware level are mitigated by having multiple nodes write in parallel to disk. In fact, it is assumed that hardware will fail, especially in large clusters and data centers, and the software needs to take such failures into account. The system is designed to recover in parallel from disk and to guarantee data consistency when data copies on disk don't agree with each other. Each member of the distributed system logs membership changes to its persistent files and uses this during recovery to figure out which replica has the latest changes, automatically synchronizing those changes at startup.
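A stripped-down sketch of the append-only, buffered logging approach (illustrative only - GemFire's actual oplog format is more involved): records are appended sequentially, and close() flushes only to the OS buffer rather than fsync'ing, mirroring the trade-off described above where durability comes from replicas, not from forcing every write to the platter:

```java
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Minimal append-only operation log: sequential writes, no fsync. */
public class AppendOnlyLog implements AutoCloseable {
    private final DataOutputStream out;

    public AppendOnlyLog(Path file) throws IOException {
        // append mode: the write position only moves forward, no seeks
        this.out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file.toFile(), true)));
    }

    public void append(String record) throws IOException {
        out.writeUTF(record); // buffered; the kernel schedules the physical write
    }

    @Override
    public void close() throws IOException {
        out.flush(); // hands data to the OS buffer; deliberately no fsync here
        out.close();
    }

    /** Replay the log from the beginning, e.g. during recovery. */
    public static List<String> readAll(Path file) throws IOException {
        List<String> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(file.toFile()))) {
            while (in.available() > 0) {
                records.add(in.readUTF());
            }
        }
        return records;
    }
}
```

Recovery is then just a sequential replay of the log - again all streaming reads, no seeks.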

Motivations for native GemFire persistence instead of an RDB

Most data grid deployments today use an RDB as the backing store. A synchronous design, where every change is reflected in the database first, has obvious challenges for "write heavy" applications. You are only as fast as the RDB, and the database can become the single point of failure (SPOF). Transaction execution complexity also increases, involving the data grid and the database in a two-phase commit protocol.
So, one remedy is to execute all writes on the data grid and asynchronously propagate the changes to the RDB. This pattern has the same SPOF challenges and is not well suited for cases with sustained high write rates.
Designs that use database "shards" - each cache instance writing to its own independent DB instance for scale - are interesting but tough to actually implement with good HA characteristics.

We believe the GemFire parallel persistence model gets rid of all these limitations. Now, I am not advocating that you shouldn't move the data in the grid back to your classic RDB. You probably need to do this for all sorts of reasons as you pipe information to upstream and downstream applications. But think twice if your RDB is merely meant to be a backing store.
Even among customers that remain apprehensive about GemFire acting as the backing store, some still want to manage data in GemFire disk stores just so that cluster recovery can be fast.

Handling failure conditions

The choice to use buffered logs meant we had to make sure the data is written to disk on multiple nodes for reliability. When partitions fail, their disk stores can become stale. If the entire cluster is later bounced, the recovery logic has to make sure the state of the data in memory and on disk reflects the latest state. We spent a lot of energy making sure the recovery design always guarantees consistency and freshness of data.

In summary, the design is fundamentally different from the traditional DB approach, where data is stored on disk and memory is used to optimize disk IO (managing raw disk blocks instead of application data objects). In our thinking, data is primarily managed in memory, and disk is used mainly for recovery and to address safety concerns. Disk capacity has increased more than 10,000-fold over the last 25 years and seems likely to continue increasing. Unfortunately, the access rate to information on disk has improved much more slowly: the transfer rate for large blocks has improved "only" 50-fold, and seek time and rotational latency have only improved by a factor of two.

You can read the product documentation on persistence here.

Spring-GemFire integration

When we merged with SpringSource/VMWare, we outlined to our customers and the press a high-level technology integration strategy for how GemFire will integrate with the Spring framework (Java and .NET). The core Spring engineers have already delivered on part of that promise through the first milestone release, making it natural for the Spring developer to access GemFire.
Among other things, the integration provides:

  1. Simpler ways to configure a cache and data regions. You can then inject the region into your app POJOs just like any other Spring dependency.
  2. Allowing the Spring developer to use the Spring transaction model - declarative configuration and transactions done consistently across a variety of providers. So, basically, the application doesn't have to explicitly invoke the GFE transaction APIs. Of course, our transaction caveats do apply.
  3. Wiring dependencies for callbacks easily: for instance, if you are using a CacheWriter or loader that needs a DB URL, connection properties, etc., you can now use the conventional Spring way of configuring DataSources, which can be auto-injected into your callback.
  4. And more.
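As a sketch of what item 1 could look like, here is a hypothetical Spring XML fragment in the style of the Spring-GemFire namespace (element names may differ across milestones, and `com.example.CustomerService` is a made-up application class):

```xml
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:gfe="http://www.springframework.org/schema/gemfire"
       xsi:schemaLocation="...">

  <!-- bootstrap the cache and a region as ordinary Spring beans -->
  <gfe:cache id="gemfireCache"/>
  <gfe:replicated-region id="customers" cache-ref="gemfireCache"/>

  <!-- inject the region into application code like any other dependency -->
  <bean id="customerService" class="com.example.CustomerService">
    <property name="region" ref="customers"/>
  </bean>
</beans>
```

The point is that the application class never touches GemFire bootstrap code; it just receives a Region through standard dependency injection.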

I encourage you to read Costin Leau's blog for specific details, download and give it a try. Your feedback will be very valuable and much appreciated.

I hope to amend this blog post with further details on the various other "scale out" features in 6.5 soon. 
If there is enough interest, go through our community site on 6.5, download and try out the new tutorial. 

Thursday, May 06, 2010

SpringSource/VMWare acquires GemStone

Here are some specific details on the potential technology synergies ....

Synergy with SpringSource

What makes the integration synergistic and complementary are the areas of focus of the two companies. GemStone/GemFire has primarily focused on making the infrastructure for clustering and managing data in the middle tier extremely reliable while offering tremendous flexibility. SpringSource's focus has been to offer a ubiquitous programming model (through the Spring framework) that is simple to adopt and quite open, where a myriad of products and technologies can be seamlessly integrated, along with a lightweight runtime environment through the enterprise-class tc Server and a management environment enabled by Hyperic HQ. In a sense, the GemFire integration brings first-class data management and clustering to the runtime - a key component enabling "extreme scale" application deployment.

GemStone, as well as the merged entity, is totally committed to supporting heterogeneous access to the GemFire data fabric, with continued support for Java, C++ and .NET. In fact, we are exploring opportunities to integrate with Spring.NET and continue our commitment to simplifying the programming models for non-Java environments.

GemStone management and SpringSource/VMWare management remain committed to delivering on the roadmap and the future extensions to the platform that have already been discussed with customers. In fact, we will go well beyond our current commitments by leveraging the Spring framework and integration with a multitude of Spring modules to provide a much simpler configuration and development model for our customers.

How might we leverage the Spring framework?

The Spring framework is all about choice for enterprise Java applications. We now enable more choice by offering a clustered data management solution that can be used in multiple ways:

1) As a transparent L2 cache: Spring applications using Hibernate will be able to plug in an L2 cache that can intercept traffic to/from a backend database for scaling and performance reasons. Note that GemFire already supports Hibernate-based L2 cache plugins, but this effort will now take increased focus. On the .NET side of things, we will accelerate support for Spring.NET applications using NHibernate.
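Wiring an L2 cache into Hibernate is typically just a couple of configuration properties. The region factory class name below is illustrative of how a GemFire plugin would be wired, not a guaranteed identifier - check the plugin documentation for the exact class:

```xml
<!-- hibernate.cfg.xml sketch: switch on the second-level cache
     and point Hibernate at a GemFire-backed region factory -->
<property name="hibernate.cache.use_second_level_cache">true</property>
<property name="hibernate.cache.region.factory_class">
    com.gemstone.gemfire.modules.hibernate.GemFireRegionFactory
</property>
```

Because the L2 cache sits behind Hibernate's own API, application entity code does not change at all; the cache is transparent, which is exactly the appeal.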

2) AOP cache: We will be able to offer sophisticated AOP caching interceptors that can transparently use a highly scalable cache.

3) Parallel data-aware method invocations: Spring bean service invocations could transparently make use of GemFire's data-aware parallel function execution capabilities, where behavior execution can happen in parallel, operate on localized data sets and go through an "aggregation" phase to produce a final result.
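The scatter-gather shape of such an invocation can be modeled in a few lines of plain Java (this stands in for, but is not, the GemFire FunctionService API): the work runs against each partition's local data, and the partial results are combined in an aggregation step:

```java
import java.util.List;

/**
 * Toy model of data-aware parallel execution: each inner list stands for
 * the data co-located on one partition; the function runs per partition
 * and the partial sums go through an aggregation phase.
 */
public class ScatterGatherSketch {
    public static long totalOrderValue(List<List<Long>> partitions) {
        return partitions.parallelStream()                                  // scatter: one task per partition
                .mapToLong(p -> p.stream().mapToLong(Long::longValue).sum()) // local work on local data
                .sum();                                                      // gather: aggregation phase
    }
}
```

In the real product the "scatter" step is routed to the cluster members actually holding each partition, so the computation moves to the data rather than the data to the computation.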

4) Session state management: We already provide an abstraction layer for session state management on top of GemFire today. The intent is to include this capability as an integral part of the platform. The highlights of this add-on include the ability to handle very large object graphs efficiently, dynamic partitioning and load balancing, visibility of session state across multiple clusters (WAN), HA, and the ability to maintain an "edge" cache that maximizes memory utilization through a heap-LRU-based eviction algorithm.

We are exploring options to enable quick integration of the data fabric into existing Spring applications through enhancements to the Eclipse plugins - for instance, the configuration of cache regions and deployment topologies will be simplified. We will also investigate how we can leverage Spring Roo (a next-generation rapid application development tool), where a developer will be able to build a Java application with integrated caching in minutes.

Integration with Spring Modules

Unlike some of the other distributed caching options available in the market, GemFire natively supports memory based, reliable event notifications. Applications can subscribe to data in the fabric through "continuous queries" and can receive in-order, reliable notifications when the data of interest changes. Spring Integration extends the Spring framework into the messaging domain and provides a higher level of abstraction so business components are further isolated from the infrastructure and developers are relieved of complex integration responsibilities. We are exploring techniques to leverage this simple and intuitive API so applications are abstracted away from dealing with GemFire specific notification APIs.

Spring Batch enables extremely high-volume, high-performance batch jobs through optimization and partitioning techniques. Simple as well as complex, high-volume batch jobs can leverage the Spring framework in a highly scalable manner to process significant volumes of information. Integration with the GemFire function service will mean developers can write Spring Batch applications that, behind the scenes, leverage the parallelism and "data aware" routing capabilities built into GemFire.

There are many integration possibilities, all of which will make the job of the developer integrating with a data fabric/grid significantly simpler.

GemFire in the cloud

Our offering will expand the VMWare strategy to deliver 'Platform as a Service' solutions over time. The already powerful arsenal includes VMWare vSphere as the cloud operating system, the Spring framework as the enterprise application development model and SpringSource tc Server as a lightweight application platform. Now, GemFire extends these capabilities as the scalable, elastic data platform.

While VMWare makes a good case for the current VMWare products positioned for the cloud, I will focus on the rationale for GemFire as a data platform for the cloud.

One of the big advantages of applications deployed in a cloud environment is the ability to tap into additional capacity 'just in time' as demand changes. The underlying platform needs to detect and respond by exhibiting elastic characteristics. The additional capacity could be provisioned through virtual or physical machines spread across subnets, or perhaps even across data centers. This means application behavior, as well as the associated state (data), needs to scale horizontally through migration. Behavior migration is well taken care of by modern clustered application servers, but it is a lot more challenging with data. The traditional relational database is rooted in a design focused on optimizing disk IO and maintaining ACID properties. The frame of reference was quite different - data was centralized, and extensive locking/latching techniques were used to preserve the ACID transaction properties. The design did not account for data being managed across a large cluster of heterogeneous machines. Several clustered databases today offer a shared-nothing architecture at the database engine layer but fall short at the storage layer (where it is shared-everything). For elasticity, we believe the design has to change along the following two important dimensions:

1) Distribution orientation - efficient distribution methods to move data around a large network without loss of consistency

2) Memory orientation - when demand spikes, it is a lot faster and easier to move data between memory segments across machines. GemFire manages data by horizontally partitioning it across any number of machines and initiates automatic data rebalancing when additional capacity is added to the cluster. Unlike common database architectures, GemFire primarily manages data in memory and leaves persistence to disk as a choice for the developer. Besides it being more efficient to manage ephemeral state such as session state only in memory, regulations in some environments may mandate that nothing be stored on disk.
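A toy model of the rebalancing idea (purely illustrative, not the product's algorithm): keys hash into a fixed set of buckets, buckets are spread evenly across members, and when a member is added, ownership is recomputed so some buckets migrate to the newcomer while the key-to-bucket hash never changes:

```java
/**
 * Toy illustration of bucket-based rebalancing: ownership of a fixed
 * bucket space is recomputed as membership changes, so load stays even.
 */
public class RebalanceSketch {
    /** Assign each bucket an owning member, round-robin for evenness. */
    public static int[] assignBuckets(int buckets, int members) {
        int[] owner = new int[buckets];
        for (int b = 0; b < buckets; b++) {
            owner[b] = b % members;
        }
        return owner;
    }

    /** Count how many buckets each member owns. */
    public static int[] load(int[] owner, int members) {
        int[] load = new int[members];
        for (int o : owner) {
            load[o]++;
        }
        return load;
    }
}
```

With, say, 113 buckets, going from 3 to 4 members leaves every member within one bucket of the average - the elasticity the paragraph above describes, reduced to arithmetic.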

We also know that applications deployed in the cloud will be subject to stricter SLAs - continuous availability and very predictable response times. Data stored in GemFire is typically synchronously copied to one or more machines, often also redundantly stored on disk, and even asynchronously copied across data centers in case an entire data center goes down. Again, by allowing the data to be primarily managed in memory, the cost of maintaining data redundantly is considerably reduced. GemFire's built-in mechanism for continuous instrumentation, and potential integration with cloud provisioning environments (the vCloud API), permit a priori detection of changing load patterns to initiate data rebalancing without any operator intervention.

Highly parallelizable, computationally intensive applications such as large-scale web applications or analytic applications (such as risk analysis in finance) are ideal candidates for cloud deployments. A lot of these computationally intensive tasks tend to be data intensive as well. Spring-enabled services can now be parallelized through the use of GemFire's data-aware parallel function service, where the underlying application behavior is parallelized and executed on the nodes carrying the data sets required by each parallel activity. This data localization dramatically increases throughput and reduces cost, with little out-of-process data access. Instead of developing applications against the custom GemFire APIs, application development could rely on open, Spring-community-driven modules such as Spring Batch, simplifying the development task.

We also believe the combination of SpringSource/VMWare infrastructure and GemFire will enable customers to implement "cloud bursting" strategies where on-premises stateful services can much more easily expand to an in-house private or even a public cloud.

Vision of a first class middle tier data management platform

GemFire started off as a pure main-memory distributed cache where transactional updates were immediately persisted to the database of record. In recent years, with the addition of capabilities such as reliable asynchronous writes to the database, data partitioning with dynamic rebalancing and wide-area-network data management, the product is now often used as the high-performance store, with data written to the database in batches, at the end of the business day or during a batch run. The database is being relegated to an archival store for integration purposes.

We are now adding support for a parallel, shared-nothing disk storage layer in the product. Unlike other clustered database management system designs, where disks are shared across many distributed processes, each member of the cluster can persist in parallel to its local disks. Transactions do not result in disk flushes but rather allow the IO scheduler to time the block writes to disk, dramatically increasing disk write throughput. Any data loss risks due to sudden failures at the hardware level are mitigated by having multiple nodes write in parallel to disk. In fact, it is assumed that hardware will fail, especially in large clusters and data centers, and the software needs to take such failures into account. The system is designed to recover in parallel from disk and to guarantee data consistency when data copies on disk don't agree with each other.

It is our strong belief that with the proliferation of highly distributed middle-tier architectures, a data tier that is colocated with the application tier, is primarily clustered and is used to manage ephemeral data will substantially change the architecture of all enterprise applications that demand scaling and predictability.

It is with this belief that we built the new SQLFabric product - the same underpinnings as GemFire, but with SQL as the interface. We took this step in spite of the recent push by the "NoSQL" database alternatives for the cloud. Our decades of database implementation and research experience do not implicate SQL as a query language when it comes to the desired characteristics of elasticity and performance. This is simply a matter of design choices made by SQL database vendors based on a frame of reference that is no longer valid. In other words, there is nothing wrong with SQL as a language; rather, it is the design of disk-oriented, centralized databases that is sub-optimal. The premise behind SQLFabric is to capitalize on the power of SQL as an expressive, flexible, very well understood query language, but alter the design underpinnings of common databases for scalability and high performance. By providing ubiquitous interfaces like JDBC and ADO.NET, the product becomes significantly easier to adopt and integrates well into existing ecosystems.