JagsLog - Technical rants on distributed computing, high performance data management, etc. You are warned! A lot will be shameless promotion for VMware products. By Jags Ramnarayan.

2011-12-13: SQLFire 1.0 - the Data Fabric Sequel<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">This week we finally reached GA status for <a href="http://www.vmware.com/go/sqlfire">VMware vFabric SQLFire</a> - a memory-optimized distributed SQL database delivering dynamic scalability and high performance for data-intensive modern applications.</span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">In this post, I will highlight some important elements in our design and draw out some of our core values.</span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">The current breed of popular NoSQL stores promote different approaches to data modelling, storage architectures and consistency models to solve the scalability and performance problems in relational databases. The overarching message in all of them seems to be that the core of the problem with traditional relational databases is SQL. </span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">But, ironically, the core of the scalability problem has little to do with SQL itself - the challenge is the manner in which the traditional DB manages disk buffers, locks and latches through a centralized architecture to preserve strict ACID properties. Here is a slide from <a href="http://www.cs.brown.edu/~pavlo/presentations/hstore-hpts-oct2009.pdf">research at MIT and Brown</a> universities on where the time is spent in OLTP databases. </span><br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-jQ8H3oWv9oM/TueZpPmZ0VI/AAAAAAAAC6A/mSl2WnmaOQE/s1600/WhereDoesTheTimeGo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="297" src="http://4.bp.blogspot.com/-jQ8H3oWv9oM/TueZpPmZ0VI/AAAAAAAAC6A/mSl2WnmaOQE/s400/WhereDoesTheTimeGo.png" width="400" /></a></div>
<span class="Apple-style-span" style="font-family: Verdana, sans-serif; font-size: large;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif; font-size: large;">Design center</span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">With </span><a href="http://www.vmware.com/go/sqlfire" style="font-family: Verdana, sans-serif;">SQLFire</a><span class="Apple-style-span" style="font-family: Verdana, sans-serif;"> we change the design center in a few interesting ways:</span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><b>1) Optimize for main memory</b>: we assume memory is abundant across a cluster of servers and optimize the design through highly concurrent data structures that are all resident in memory. The design is not concerned with buffering contiguous disk blocks in memory; rather, it manages application rows in in-memory hashmaps, in a form that clients can consume directly. Changes are synchronously propagated to redundant copies in the cluster for HA. </span><br />
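To make point 1 concrete, here is a toy sketch of the idea (not SQLFire's actual internals, and the class and field names are made up): rows live in concurrent hashmaps keyed by primary key, and each write is synchronously pushed to redundant copies.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy in-memory row store: rows live in a concurrent hashmap keyed by
// primary key, in a form clients can consume directly -- there is no
// disk-block buffering layer in between.
public class InMemoryRowStore {
    private final Map<Integer, Map<String, Object>> rows = new ConcurrentHashMap<>();
    // Stand-ins for redundant copies hosted on other members.
    private final List<Map<Integer, Map<String, Object>>> replicas;

    public InMemoryRowStore(List<Map<Integer, Map<String, Object>>> replicas) {
        this.replicas = replicas;
    }

    public void put(int key, Map<String, Object> row) {
        rows.put(key, row);
        for (Map<Integer, Map<String, Object>> replica : replicas) {
            replica.put(key, row); // synchronous propagation for HA
        }
    }

    public Map<String, Object> get(int key) {
        return rows.get(key);
    }
}
```

In the real product the replicas live in other processes and the propagation happens over the network, but the shape of the write path is the same: update memory, push to redundant copies, acknowledge.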
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><b>2) Rethink ACID transactions</b>: There is no support for strictly serializable transactions; we assume that most applications can get by with simpler "read committed" and "repeatable read" semantics. Instead of worrying about write-ahead transaction logs on disk, all transactional state resides in distributed memory, and commits use a non-2PC algorithm optimized for short-duration, non-overlapping transactions. The central theme is to avoid any single point of contention, such as a distributed lock service. See some details <a href="http://pubs.vmware.com/vfabric5/topic/com.vmware.vfabric.sqlfire.1.0/developers_guide/topics/queries/transactions.html?resultof=%22%64%69%73%74%72%69%62%75%74%65%64%22%20%22%64%69%73%74%72%69%62%75%74%22%20%22%74%72%61%6e%73%61%63%74%69%6f%6e%22%20%22%74%72%61%6e%73%61%63%74%22%20">here</a>.</span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<b style="font-family: Verdana, sans-serif;">3) "Partition aware DB design"</b><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">: Almost every high scale DB solution offers a way to scale linearly by hashing keys to a set of partitions. But how do you make SQL queries and DML scale when they involve joins or complex conditions? Given that distributed joins inherently don't scale, we promote the idea that the designer should think about common data access patterns and choose the partitioning strategy accordingly. To make things relatively simple for the designer, we extended the DDL (Data Definition Language in SQL) so the designer can specify how related data should be colocated (for instance, '</span><i><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">create table Orders (...) colocate with Customer</span></i><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">' tells us that the order records for a customer should always be colocated on the same partition). The colocation makes join processing and query optimization a local partition problem (avoiding large transfers of intermediate data sets). The design assumes classic OLTP workload patterns, where the vast majority of individual requests can be pruned to a few nodes and the concurrent workload from all users is spread across the entire data set (and hence across all the partitions). Look </span><a href="http://pubs.vmware.com/vfabric5/topic/com.vmware.vfabric.sqlfire.1.0/data_management/database_design_chapter.html?resultof=%22%64%65%73%69%67%6e%69%6e%67%22%20%22%64%65%73%69%67%6e%22%20%22%56%46%61%62%72%69%63%22%20%22%76%66%61%62%72%69%63%22%20%22%53%51%4c%66%69%72%65%22%20%22%73%71%6c%66%69%72%65%22%20" style="font-family: Verdana, sans-serif;">here</a><span class="Apple-style-span" style="font-family: Verdana, sans-serif;"> for some details.</span><br />
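Spelled out a little more, the colocation DDL looks roughly like this (a sketch only; the table and column names are made up, so check the SQLFire DDL reference for the exact grammar):

```sql
CREATE TABLE customers (
    cust_id  INT NOT NULL PRIMARY KEY,
    name     VARCHAR(100)
) PARTITION BY COLUMN (cust_id);

-- Order rows hash to the same partition as their parent customer row,
-- so a customer/orders join never has to leave a single member.
CREATE TABLE orders (
    order_id INT NOT NULL PRIMARY KEY,
    cust_id  INT NOT NULL,
    amount   DECIMAL(10, 2),
    FOREIGN KEY (cust_id) REFERENCES customers (cust_id)
) PARTITION BY COLUMN (cust_id)
  COLOCATE WITH (customers);
```

A query such as "all orders for customer 42 joined with the customer row" can then be pruned to exactly one partition, which is what makes the join a local problem.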
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><b>4) Shared nothing logs on disk</b>: Disk stores are merely "append only" logs, designed so that application writes are never exposed to disk seek latencies. Writes are synchronously streamed to disk on all replicas. A lot of the disk store design looks similar to other NoSQL systems - rolling logs, background/offline compaction, memory tables pointing to disk offsets, etc. But the one aspect that represents core IP is managing consistent copies on disk in the face of failures. Given that distributed members can come and go, how do we make sure that the disk state a member recovers with is the state it should be working with? I cover our "shared nothing disk architecture" in a lot more detail <a href="http://jagslog.blogspot.com/2010/10/what-is-new-in-vmware-vfabric-gemfire.html">here</a>.</span><br />
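A toy sketch of the append-only idea (again, not the actual SQLFire disk store code): writes only ever go to the tail of the log, while an in-memory table remembers the offset of each key's latest value. A byte array stands in for the on-disk log file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy append-only store: every write is appended at the tail (no seeks,
// no in-place updates); an in-memory table maps each key to the offset
// and length of its latest value in the log.
public class AppendOnlyLog {
    private final ByteArrayOutputStream log = new ByteArrayOutputStream();
    private final Map<String, int[]> index = new HashMap<>(); // key -> {offset, length}

    public void put(String key, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        index.put(key, new int[] { log.size(), bytes.length }); // point at the new tail
        log.write(bytes, 0, bytes.length); // append only
    }

    public String get(String key) {
        int[] loc = index.get(key);
        if (loc == null) return null;
        byte[] all = log.toByteArray();
        return new String(all, loc[0], loc[1], StandardCharsets.UTF_8);
    }
}
```

Overwriting a key leaves the old bytes as dead space in the log; that is exactly the garbage the rolling-log compaction mentioned above reclaims in the background.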
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><b>5) Parallelize data access and application behavior</b>: We extend the classic stored procedure model by allowing applications to parallelize a procedure across the cluster, or just a subset of nodes, by hinting at the data the procedure depends on. This application hinting is done by supplying a "where clause" that is used to determine where to route and parallelize the execution. Unlike traditional databases, procedures can be arbitrary application Java code (you can in fact embed the cluster members in your Spring container) and run colocated with the data. Yes, literally in the same process space where the data is stored. Controversial, yes, but now your application code can do a scan as efficiently as the database engine.</span><br />
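Schematically, a data-aware procedure call looks like the following (the procedure, class and table names are made up; check the SQLFire procedure docs for the exact routing grammar):

```sql
-- Register a procedure whose body is ordinary application Java code.
CREATE PROCEDURE review_orders(IN cust_id INT)
  LANGUAGE JAVA PARAMETER STYLE JAVA READS SQL DATA DYNAMIC RESULT SETS 1
  EXTERNAL NAME 'com.acme.OrderProcs.reviewOrders';

-- The ON TABLE ... WHERE clause hints which data the procedure touches,
-- so execution is routed to (and parallelized across) only the members
-- hosting the matching partitions.
CALL review_orders(42) ON TABLE orders WHERE cust_id = 42;
```

Without the hint the procedure runs on one node; with it, the engine prunes execution to the partitions the where clause selects, and the Java body runs in the same process as that data.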
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><b>6) Dynamic rebalancing of data and behavior</b>: This is the act of figuring out which data buckets should be migrated when capacity is added (the cluster grows) or removed, and how to do this without causing consistency issues or introducing contention points for concurrent readers and writers. Here is the <a href="http://brucesch.blogspot.com/2011/09/patent-granted.html">patent</a> that describes some aspects of the design. </span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif; font-size: large;">Embedded or a client-server topology</span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">SQLFire supports switching from the classic client-server topology (your DB runs in its own processes) to an embedded mode where the DB cluster and the application cluster are one and the same (for Java apps). </span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">We believe the embedded model will be very useful in scenarios where the data sets are relatively small. It simplifies deployment concerns and at the same time provides a significant boost in performance when replicated tables are in use.</span><br />
<br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">All you do is change the DB URL from</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">'jdbc:sqlfire://server_Host:port' to 'jdbc:sqlfire:;mcast-port=portNum'</span><span class="Apple-style-span" style="font-family: Verdana, sans-serif;"> and all your application processes that use the same DB URL will become part of a single distributed system. Essentially, the mcast-port identifies a broadcast channel for membership gossiping. New servers will automatically join the cluster once authenticated. Any replicated tables will automatically be hosted in the new process, and partitioned tables could get rebalanced to share some of their data with the new process. All this is abstracted away from the developer. </span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">As far as the application is concerned, you just create connections and execute SQL like with any other DB.</span><br />
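In code, the switch is nothing more than the URL string handed to JDBC. A tiny sketch (the host, port and mcast-port values are placeholders):

```java
// The only difference between client-server and embedded (peer) mode
// is the JDBC URL the application builds.
public class SqlFireUrls {
    public static String clientServerUrl(String host, int port) {
        // Classic topology: connect to a DB server running elsewhere.
        return "jdbc:sqlfire://" + host + ":" + port;
    }

    public static String embeddedUrl(int mcastPort) {
        // Embedded topology: this JVM becomes a peer; all processes
        // using the same mcast-port form one distributed system.
        return "jdbc:sqlfire:;mcast-port=" + mcastPort;
    }
}
```

Either URL is then passed to `DriverManager.getConnection(...)` exactly as with any other JDBC driver.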
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-FhYbmax3vt8/TufCVtx6RMI/AAAAAAAAC6I/1NNX866Qy_g/s1600/topology1-whitebg.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="263" src="http://2.bp.blogspot.com/-FhYbmax3vt8/TufCVtx6RMI/AAAAAAAAC6I/1NNX866Qy_g/s400/topology1-whitebg.png" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-SH_f1m5JXjM/TufCWRjB8HI/AAAAAAAAC6Q/gq7eFOuP40c/s1600/topology2-whitebg.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="236" src="http://1.bp.blogspot.com/-SH_f1m5JXjM/TufCWRjB8HI/AAAAAAAAC6Q/gq7eFOuP40c/s400/topology2-whitebg.png" width="400" /></a></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif; font-size: large;">How well does it perform and scale? </span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">Here are the results of a simple benchmark done internally using commodity (2 CPU) machines showcasing linear scaling with concurrent user load. I will soon augment this with more interesting workload characterization. The details are </span><a href="http://communities.vmware.com/docs/DOC-15958" style="font-family: Verdana, sans-serif;">here</a><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">.</span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://communities.vmware.com/servlet/JiveServlet/showImage/102-15958-4-15560/throughput.GIF" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="250" src="http://communities.vmware.com/servlet/JiveServlet/showImage/102-15958-4-15560/throughput.GIF" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://communities.vmware.com/servlet/JiveServlet/showImage/102-15958-4-15562/response.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="250" src="http://communities.vmware.com/servlet/JiveServlet/showImage/102-15958-4-15562/response.gif" width="400" /></a></div>
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><span class="Apple-style-span" style="font-size: large;">Comparing SQLFire and GemFire</span></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">Here is a high level view into how the two products compare. I hope to add a blog post that provides specific details on the differences and use cases where one might apply better than the other.</span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-BOssomduhDc/TugRuXEIzBI/AAAAAAAAC6Y/1nmZ3_mpGng/s1600/gemfire_sqlfire_positioning.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="217" src="http://4.bp.blogspot.com/-BOssomduhDc/TugRuXEIzBI/AAAAAAAAC6Y/1nmZ3_mpGng/s400/gemfire_sqlfire_positioning.JPG" width="400" /></a></div>
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">SQLFire benefits from the years of commercially deployed production code found in GemFire. SQLFire adds a rich SQL engine with the idea that folks can now manage operational data primarily in memory, partitioned across any number of nodes, and with a disk architecture that avoids disk seeks. Note that the two offerings, SQLFire and GemFire, are distinct products and are deployed separately.</span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">As always, I would love to get your candid feedback (<a href="http://communities.vmware.com/community/vmtn/appplatform/vfabric_sqlfire?view=discussions">link to our forum</a>). I assure you that trying it out is very simple - just like using Apache Derby or H2. </span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;">Get the download, docs and quickstart all from <a href="http://communities.vmware.com/community/vmtn/appplatform/vfabric_sqlfire">here</a>. The developer license is perpetual and works on up to 3 server nodes.</span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br /></span><br />
<br /></div>
Jags Ramnarayan

2011-11-20: HPTS 2011 talk on 'Flexible OLTP in the future'<div dir="ltr" style="text-align: left;" trbidi="on">
I recently spoke at HPTS 2011 (<a href="http://hpts.ws/agenda.html">High Performance Transaction Systems</a>). If you haven't already, you should check out some of the very interesting content on the NoSQL ecosystem, the future of core density, big data experiences and scars, etc.<br />
<br />
Here is the abstract:<br />
<br />
<span class="Apple-style-span" style="font-size: large;">Flexible OLTP data models in the future</span><br />
=================================<br />
<br />
There has been a flurry of highly scalable data stores and a dramatic spike in the interest level. The solutions with the most mindshare seem to be inspired by <a href="http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCsQFjAA&url=http%3A%2F%2Fwww.allthingsdistributed.com%2Ffiles%2Famazon-dynamo-sosp2007.pdf&ei=L97JTueNJqnYiQLG9ay_Dw&usg=AFQjCNHhJccl0_0I9x7tkWizMx6NjcuUkQ">Dynamo's</a> (Amazon) eventual consistency model, or by a data model that promotes nested, self-describing data structures like <a href="http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CB4QFjAA&url=http%3A%2F%2Flabs.google.com%2Fpapers%2Fbigtable-osdi06.pdf&ei=Xd7JTsXWKcqWiAKS_czWDw&usg=AFQjCNHN6UZKpTHsquOhtbcjMa06GqDPQQ">BigTable</a> from Google. At the same time, you see projects within these corporations evolving to architectures like <a href="http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CDIQFjAD&url=http%3A%2F%2Fresearch.google.com%2Fpubs%2Farchive%2F36971.pdf&ei=dt7JTrn6Ga7OiAKJ7I0H&usg=AFQjCNE4SCcFM_Uh2bnyyWU6sm8Kbp7uvQ">MegaStore</a> and Dremel (Google), where features from the column-oriented data model are blended together with the relational model.<br />
<br />
The shift from just highly structured data to unstructured and semi-structured content is evident. New applications are being developed, or existing applications modified, at breakneck speed. Developers want data model evolution to be extremely simple and want support for nested structures so they can map to representations like JSON with ease, so there is little impedance between the application programming model and the database. Next generation enterprise applications will increasingly work with structured and semi-structured data from a multitude of data sources. A pure relational model is too rigid, and a pure BigTable-like model has too many shortcomings and cannot be integrated with existing relational database systems.<br />
<br />
In this talk, I present an alternative. We prefer the familiar "row oriented" approach over the "column oriented" one, but still tilt the relational model - mostly the schema definition - to support partitioning and colocation, redundancy levels, and dynamic and nested columns.<br />
Each of these extensions supports a different desired attribute: partitioning and colocation primitives cover horizontal scaling; availability primitives allow explicit control over the replication model and placement policies (local vs across data centers); dynamic columns address flexibility for schema evolution (different rows have different columns, added with no DDL requirements); and nested columns support organizing data in a hierarchy.<br />
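Purely as a strawman, the extensions just listed might read something like this (this is hypothetical syntax invented for illustration, not the grammar of any shipping product):

```sql
-- Hypothetical DDL sketch: entity group with colocation, explicit
-- redundancy/placement, and dynamic columns.
CREATE ENTITY GROUP customer_group (
    customers PRIMARY,                 -- root entity of the group
    orders    CONTAINED IN customers   -- colocated by containment
)
PARTITION BY (cust_id)
REDUNDANCY 1 PLACEMENT (LOCAL_DATACENTER);

-- Dynamic columns: a row may carry extra, undeclared attributes with
-- no prior ALTER TABLE.
INSERT INTO customers (cust_id, name, DYNAMIC loyalty_tier)
VALUES (7, 'Ada', 'gold');
```

The point of the sketch is only to show where each primitive would attach to an otherwise familiar relational schema.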
<br />
We draw inspiration for the data model from <a href="http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CBwQFjAA&url=http%3A%2F%2Fwww.ics.uci.edu%2F~cs223%2Fpapers%2Fcidr07p15.pdf&ei=pt7JTum_IMqhiQKYreTcDw&usg=AFQjCNFDghTdIhZ-z8bPCYa3CAs0IBGZqw">Pat Helland's 'Life beyond distributed transactions'</a> by adopting entity groups as a first class artifact designers start with, defining relationships between entities within the group (associations based on reference as well as containment). Rationalizing the design around entity groups forces the designer to think about data access patterns and how the data will be colocated in partitions. We then cover why ACID properties and sophisticated querying become significantly less challenging to accomplish. There are many ideas around partitioning policies and tradeoffs in supporting transactions and joins across entity groups that are worth discussing.<br />
<br />
The idea is to present a model and generate discussion on how to achieve the best of both worlds: flexible schemas without losing referential integrity, support for associations, and the power of SQL. It is ironic that NoSQL databases like MongoDB are getting more popular as they begin to add SQL-like querying capabilities.<br />
<div>
<br /></div>
<div>
<div>
<div id="__ss_10248211" style="width: 425px;">
<strong style="display: block; margin: 12px 0 4px;"><a href="http://www.slideshare.net/jagsr123/hpts-2011-flexibleoltp" title="Hpts 2011 flexible_oltp">Hpts 2011 flexible_oltp</a></strong><object height="355" id="__sse10248211" width="425"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=hpts2011flexibleoltp-111120230610-phpapp02&stripped_title=hpts-2011-flexibleoltp&userName=jagsr123" />
<param name="allowFullScreen" value="true"/>
<param name="allowScriptAccess" value="always"/>
<param name="wmode" value="transparent"/>
<embed name="__sse10248211" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=hpts2011flexibleoltp-111120230610-phpapp02&stripped_title=hpts-2011-flexibleoltp&userName=jagsr123" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" wmode="transparent" width="425" height="355"></embed></object><br />
<div style="padding: 5px 0 12px;">
View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/jagsr123">Jags Ramnarayan</a>.</div>
</div>
</div>
</div>
<div>
<br />
Finally, <a href="http://hpts.ws/summaries.html">this summarizes</a> all the different views shared at HPTS.</div>
</div>
Jags Ramnarayan

2011-09-19: What is new in vFabric GemFire 6.6?<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<a href="https://www.vmware.com/support/pubs/vfabric-gemfire.html">GemFire 6.6</a> was released (Sept 2011) as part of the new <a href="http://blog.springsource.com/2011/06/14/announcing-vfabric-5/">vFabric 5.0</a> product suite, and it represents a big step along the following important dimensions:<br />
<ol style="text-align: left;">
<li><b><span class="Apple-style-span" style="font-family: 'Trebuchet MS', sans-serif;">developer productivity</span></b></li>
<li><b><span class="Apple-style-span" style="font-family: 'Trebuchet MS', sans-serif;">more DBMS like features </span></b></li>
<li><b><span class="Apple-style-span" style="font-family: 'Trebuchet MS', sans-serif;">better scaling features</span></b></li>
</ol>
<br />
Here are some highlights on each dimension:<br />
<br />
<b><span class="Apple-style-span" style="font-family: 'Trebuchet MS', sans-serif;">Developer productivity</span></b>: We introduced a new <a href="http://pubs.vmware.com/vfabric5/index.jsp?topic=/com.vmware.vfabric.gemfire.6.6/developing/data_serialization/gemfire_pdx_serialization.html">serialization framework called PDX</a> (it stands for Portable Data eXchange, not my favorite airport).<br />
PDX is a framework that provides a portable, compact, language neutral and versionable format for representing object data in GemFire. It is proprietary but designed for high efficiency, and is comparable to other serialization frameworks like <a href="http://avro.apache.org/">Apache Avro</a>, <a href="http://code.google.com/p/protobuf/">Google protobuf</a>, etc.<br />
Alright, I realize the above definition is a mouthful :-)<br />
<br />
Simply put, the framework supports versioning (allowing apps using older versions of the domain classes to work with apps using newer versions, and vice versa), provides a format and type system for interop between the various languages, and offers an API so server side application code can operate on objects without requiring the domain classes (i.e. no deserialization).<br />
Type evolution has to be incremental - this is the only way to avoid data loss or exceptions.<br />
The raw serialization performance is comparable to Avro and protobuf, but is much more optimized for distribution and operating in a GemFire cluster. The chart below is the result of an open source benchmark on popular serialization frameworks. The details are available <a href="https://github.com/eishay/jvm-serializers/wiki">here</a>. 'Total' represents the total time required to create, serialize and then deserialize. See the benchmark description for details.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-c5SyjfkwUvQ/TnfBn4hBHhI/AAAAAAAAC5U/lfwGaL3lvwM/s1600/serializationBenchmark.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="131" src="http://1.bp.blogspot.com/-c5SyjfkwUvQ/TnfBn4hBHhI/AAAAAAAAC5U/lfwGaL3lvwM/s400/serializationBenchmark.JPG" width="400" /></a></div>
<br />
You can either implement serialization callbacks (for optimal performance) or simply use the <a href="http://pubs.vmware.com/vfabric5/topic/com.vmware.vfabric.gemfire.6.6/developing/data_serialization/auto_serialization.html">built in PdxSerializer</a> (reflection based today). Arguably, the best part of the framework is its support for object access in server side functions, or in callbacks like listeners, without requiring the application classes. You can dynamically discover the fields and nested objects and operate on these using the <a href="http://www.vmware.com/support/developer/vfabric-gemfire/66-api/com/gemstone/gemfire/pdx/package-frame.html">PDX API</a>. On an application client that has the domain classes, the same PdxInstance is automatically turned back into the domain object.<br />
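The PdxInstance idea can be illustrated with a simplified stand-in (this is not the real GemFire API, just the shape of it): a serialized object is modeled as a field map plus a type version, so server-side code can probe for fields instead of depending on a compiled domain class.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for PDX-style field access. Code compiled against
// an older schema version simply doesn't find the fields it doesn't
// know about -- no deserialization into a domain class is required.
public class DynamicRecord {
    private final int typeVersion;
    private final Map<String, Object> fields = new HashMap<>();

    public DynamicRecord(int typeVersion) {
        this.typeVersion = typeVersion;
    }

    public DynamicRecord with(String name, Object value) {
        fields.put(name, value);
        return this;
    }

    public boolean hasField(String name) { return fields.containsKey(name); }

    public Object getField(String name) { return fields.get(name); }

    public int getTypeVersion() { return typeVersion; }
}
```

A listener or server side function written before an "email" field existed keeps working against newer records; it just never asks for the field, or checks `hasField` first.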
<br />
We introduced a new <a href="http://pubs.vmware.com/vfabric5/topic/com.vmware.vfabric.gemfire.6.6/deploying/gfsh/chapter_overview.html?resultof=%22%67%66%73%68%22%20">command shell called gfsh</a> (pronounced "gee-fish") - a command line tool for browsing and editing data stored in GemFire. Its rich set of Unix-flavored commands allows you to easily access data, monitor peers, redirect output to files, and run batch scripts. This is an initial step towards a more complete tool that can provision, monitor, debug, tune and administer a cluster as a whole. Ultimately, we hope to advance the gfsh scripting language, making integration of GemFire deployments into cloud-like virtualized environments a "breeze".<br />
<br />
<span class="Apple-style-span" style="font-family: 'Trebuchet MS', sans-serif;"><b>More DBMS like: </b></span><br />
<b>Querying and Indexing</b><br />
We added several features to our query engine: query/index on hashmaps, bind parameters from edge clients, OrderBy support for partitioned data regions, full support for LIKE predicates, and the ability to index regions that overflow to disk.<br />
<br />
Increasingly, we see developers wanting to decouple the data model in GemFire from the class schema used within their applications. Even though PDX offers an excellent option, we also see developers mapping their data into "self describing" hashmaps in GemFire. The data store is basically "<a href="http://stackoverflow.com/questions/2117372/what-are-the-advantages-of-using-a-schema-free-database-like-mongodb-compared-to">schema free</a>" and allows many application teams to change the object model without impacting each other. Given the simple KV storage model in GemFire this has never been an issue, except for querying. Now, not only can you store maps, you can index keys within these hashmaps and execute highly performant queries.<br />
Do take note that the query engine now natively understands the PDX data structures, with no need for application classes on servers.<br />
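As a rough illustration (the region and field names are made up; see the GemFire 6.6 querying docs for the exact OQL and index syntax), a query over a region of hashmap values could look like:

```sql
-- OQL over a region whose values are HashMaps: filter on a key
-- inside each map value.
SELECT * FROM /orders o WHERE o['status'] = 'shipped'
```

Creating an index on the `o['status']` expression (through the QueryService API or cache.xml) is what makes such a query performant at scale.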
<br />
We expanded distributed transactions by allowing edge clients to initiate or terminate transactions. There is no need to invoke a server side function for transactions. We also added a new JCA resource adapter that supports participation in externally coordinated transactions as a <a href="http://download.oracle.com/docs/cd/E12840_01/wls/docs103/jta/llr.html">"last resource"</a>.<br />
<br />
<span class="Apple-style-span" style="font-family: 'Trebuchet MS', sans-serif;"><b>Finally, on the scaling dimension:</b></span><br />
You are probably aware that <a href="http://jagslog.blogspot.com/2010/10/what-is-new-in-vmware-vfabric-gemfire.html">GemFire's shared nothing persistence</a> relies on append-only operation logs to provide very high write throughput. There are no additional Btree data files to maintain as in a traditional database system. The tradeoff with this design is cluster recovery speed: one has to walk through the logs to recover the data back into memory, and the time for the entire cluster to bootstrap from disk is proportional to the volume of data (and inversely proportional to the cluster size). And this can be long (to put it mildly) with large data volumes, even though the recovery can be parallelized across the cluster. To minimize this recovery delay, the 6.6 persistence layer now also manages "key files" on disk. We simply recover the keys back into memory and lazily recover the data, giving recovery in general a significant performance boost.<br />
<br />
Prior to 6.6, GemFire randomly picked a different host to manage redundant copies for partitioned data regions. Often, customers provision multiple racks and want their redundant copy to always be stored on a different physical rack. Occasionally, we also see customers wanting to store their redundant data on a different site. We added support for a "redundancy zone" in the partitioned region configuration, allowing users to identify one or more redundancy zones (racks, sites, etc). GemFire will automatically enforce managing redundant copies in different zones.<br />
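Configuration-wise, this amounts to tagging each member with the zone it lives in (the property name below is my reading of the feature; the zone names themselves are arbitrary labels, so verify against the 6.6 release notes):

```properties
# gemfire.properties on every member installed in rack 1
redundancy-zone=rack-1
```

Members in a different rack would set `redundancy-zone=rack-2`, and GemFire then avoids placing a bucket's primary and its redundant copy in the same zone.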
<br />
Everything mentioned above is merely a prelude. The list of enhancements is much longer and is documented <a href="http://www.vmware.com.pr/support/vfabric-gemfire/doc/vfabric-gemfire-rn-6.6.0.html">here</a>.<br />
<br />
The product documentation is available <a href="http://www.vmware.com/support/pubs/vfabric-gemfire.html">here</a>.<br />
<div>
You can start discussions <a href="http://forums.gemstone.com/viewforum.php?f=3&sid=53ef625e9124d5ba27ceeaa18687effa">here</a>.</div>
<div>
<br /></div>
<div>
Would love to hear your thoughts. </div>
<div>
<br /></div>
</div>
Jags Ramnarayanhttp://www.blogger.com/profile/01088990426487508645noreply@blogger.com2tag:blogger.com,1999:blog-31429685.post-37678759153548076222010-11-14T21:42:00.000-08:002010-12-06T18:53:34.348-08:00Announcing vFabric GemFire HTTP session management module for Tomcat/TCServer<a href="http://en.wikipedia.org/wiki/Session_management">Clustering of HTTP Sessions</a> has been around for a while. I suspect many of you are already wondering how this is better than, or unique compared to, the built-in clustering mechanisms in application servers or third-party solutions like EHCache, Oracle Coherence, etc. Several of our customers already manage sessions in <a href="http://community.gemstone.com/display/gemfire/GemFire+Enterprise">GemFire</a> explicitly using our API. The need to effectively manage large quantities of session state all but demands a distributed data grid solution such as GemFire, which can partition session state in memory across the cluster for scale and maintain synchronous copies of session state in memory on multiple nodes to provide high availability (HA).<br />
<br />
<b>So, what is new</b>? This announcement brings clean, pluggable integration of GemFire within Tomcat and <a href="http://www.springsource.com/products/tcserver">SpringSource tc Server</a> environments. HTTP session management is enabled through very simple configuration steps. Unlike other competitive products, there is no vendor-specific configuration of "cache" XML in most cases. If you understand your application's volume and scale requirements, it might just be a matter of configuring one of the pre-packaged templates when you start up tc Server instances.<br />
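To give a flavor of those configuration steps, the peer-to-peer Tomcat setup boils down to a lifecycle listener plus a session manager. The class names below follow the module's documentation of this era and should be verified against the release you download:

```xml
<!-- server.xml: bootstrap an embedded GemFire peer cache with Tomcat -->
<Listener className="com.gemstone.gemfire.modules.session.catalina.PeerToPeerCacheLifecycleListener" />

<!-- context.xml: swap in the GemFire delta session manager -->
<Manager className="com.gemstone.gemfire.modules.session.catalina.Tomcat6DeltaSessionManager" />
```

No application code changes are required; the standard HttpSession API is backed by a GemFire region behind the scenes.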
<br />
You can read about the features, download and give it a try from our <a href="http://community.gemstone.com/display/gemfire/HTTP+Session+Management+Module">community site</a>.<br />
<br />
But, there are several interesting patterns that go beyond this basic value proposition and I will attempt to cover these below ...<br />
<br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><b>Support for multi-site clustering</b></span>: Today, HA for sessions means copies maintained on two or more servers that are part of the same cluster. If the entire cluster fails or becomes unreachable (say, a network split occurs), your application will likely fail over to some DR cluster, but without access to the sessions. Wouldn't it be nice if your session state survived such failure conditions? With GemFire, you can configure session state to be replicated over what we call "<a href="http://community.gemstone.com/display/gemfire/Setting+Up+GemFire+HTTP+Session+Management+for+tc+Server">WAN gateways</a>" (see the section "multi site setup") to your DR cluster. The replication is asynchronous, with support for batching and conflation.<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/_o_zMfh3Veig/TODFVx5G7eI/AAAAAAAACz0/RrxfX8Mht-Y/s1600/gateway.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="210" src="http://1.bp.blogspot.com/_o_zMfh3Veig/TODFVx5G7eI/AAAAAAAACz0/RrxfX8Mht-Y/s320/gateway.png" width="320" /></a></div><br />
<br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><b>Sessions are getting increasingly obese and may never die</b></span>: Session state often references personalization information - things like preferences, past transactions, and even past chat sessions with customer support. Traditionally, all this data was managed only in the RDB and fetched every single time the user logged in - one of the main causes of database overload. Increasingly, I see sessions that range in size from a few KB to several MB each, represented by convoluted object graphs; when changes occur, they touch only a minuscule fraction of the entire session state. Managing complex, constantly changing, ever-growing session state requires special consideration:<br />
<ol><li>When sessions change, the session replication layer needs to be smart about replicating only the changes, not the entire session, every single time. </li>
<li>User sessions may come and go, but the associated state may last forever. This implies you cannot maintain everything in memory: you need to offload to disk, with a super-efficient mechanism to persist the state across the cluster. This persistence mechanism cannot be expensive and shouldn't carry the administration overhead common to RDBs.</li>
</ol>With GemFire, when a session is updated, only the changed attributes are sent over the wire - to servers and replicas. Session state can overflow to disks across the cluster and can even be made persistent. The state recovers in parallel even if the entire cluster is restarted.<br />
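The "only ship the changed attributes" idea can be sketched as follows. This is a toy model of delta replication, not the module's actual wire protocol:

```python
# Toy sketch of delta session replication: only attributes touched
# since the last sync are shipped to the replica.
class DeltaSession:
    def __init__(self):
        self.attrs = {}
        self.dirty = set()

    def set_attribute(self, name, value):
        self.attrs[name] = value
        self.dirty.add(name)      # record the change, not the session

    def extract_delta(self):
        delta = {name: self.attrs[name] for name in self.dirty}
        self.dirty.clear()
        return delta              # tiny payload vs. the whole session

def apply_delta(replica, delta):
    replica.update(delta)

primary, replica = DeltaSession(), {}
primary.set_attribute("cart", ["book"])
primary.set_attribute("user", "alice")
apply_delta(replica, primary.extract_delta())    # ships 2 attributes
primary.set_attribute("cart", ["book", "pen"])
apply_delta(replica, primary.extract_delta())    # ships only "cart"
```

For a multi-megabyte session where one attribute changed, the payload shrinks from the full object graph to a single entry.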
<br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><b>Burst into the cloud</b></span>: There is increasing interest in the ability to go beyond what the cluster can handle when load crosses a threshold. Ideally, applications can burst into the "cloud" – most likely a private cloud that supports on-demand provisioning of resources using virtualization. Bursting basically means new Tomcat instances get spun up on demand in some remote data center, and it may also mean migrating users to the new cluster. Now, wouldn't it be nice if the session state were to magically appear or be accessible from the primary cluster?<br />
Alright! In reality, magic is just an illusion; nothing happens magically in GemFire. Let me spare you a long narration: with GemFire, one can achieve this by configuring WAN gateways to potential remote clusters. If and when a remote "cloud" cluster is launched, the sessions automatically end up getting replicated there.<br />
<br />
<b><span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">Sessions span heterogeneous apps</span></b>: Session state published by one clustered application may need to be accessible by other distributed applications. These applications won't necessarily be deployed as part of the same application server cluster. <span class="Apple-style-span" style="font-family: Times, 'Times New Roman', serif;">You need session state that can outlast your application or your cluster, with a storage format that is accessible from other languages. </span>Though we don't support pluggable session management for other languages like .NET (yet), the application developer can still use GemFire native serialization and access the session state from other environments like C# and C++. We intend to support seamless session access across many environments in the future.<br />
<br />
So, again: you can read about the features, download the module, and give it a try on our <a href="http://community.gemstone.com/display/gemfire/HTTP+Session+Management+Module">community site</a>.Jags Ramnarayanhttp://www.blogger.com/profile/01088990426487508645noreply@blogger.com4tag:blogger.com,1999:blog-31429685.post-57228618303197493962010-10-07T12:49:00.000-07:002010-10-07T23:12:07.171-07:00What is new in VMWare's vFabric GemFire 6.5?Given the breadth of new capabilities, <a href="http://community.gemstone.com/display/gemfire/GemFire+Enterprise"><span class="Apple-style-span" style="color: orange;">GemFire 6.5</span></a> might as well have been called 7.0. One of the important themes for us has been to make sure all stateful services can be partitioned for near linear scaling. With this release we go beyond partitioning in-memory data and application behavior. We can now manage data on disk in a highly partitioned manner and even process subscriptions with linear scaling. Our pursuit is simple - no matter which features of the product are being used, there is a level of assurance that the application will scale with increasing capacity.<br />
<br />
<br />
In a nutshell (it is a big nut), here are some of the capabilities we introduced:<br />
<br />
<ul><li><b>Database-class reliability through shared-nothing parallel persistence</b>: A unique high-performance design that pools disks across the cluster for storing and recovering data. GemFire has always supported disk persistence and recovery for replicated data regions; now this capability extends to partitioned data as well. The design principles adopted are fundamentally different from those in typical clustered databases. For one, disk storage is shared nothing - each cache process owns its disk store, eliminating process-level contention. Second, the design is tilted to favor memory, i.e. there are no complex B-tree data structures on disk; instead we assume complex query navigation will always be done through in-memory indexes. Third, the design uses rolling append-only log files to avoid disk seeks completely. Finally, the design preserves GemFire's rebalancing model: when capacity is increased or decreased, the disk data also relocates itself. There is more to this story, which I will cover once we walk through what else is new in 6.5.</li>
<li><b>Simplified and intuitive programming model</b>: First, we simplified the programming model by reducing some of the boilerplate bootstrapping code that was required in the past and introduced pre-packaged templates for common configurations and topologies. Second, we launched the new <a href="http://www.springsource.org/spring-gemfire"><span class="Apple-style-span" style="color: orange;">Spring-GemFire</span></a> project to introduce a clean programming model for the Spring developer. Note that Spring-GemFire is not bundled with GemFire 6.5. Whether you are already familiar with the GemFire APIs or just getting started, I recommend going through the new tutorial that walks through the <a href="https://community.gemstone.com/display/gemfire/GemFire+Tutorial"><span class="Apple-style-span" style="color: orange;">simplified programming API</span>s</a>. And not to worry - your existing application will continue to run just fine; the old APIs are fully supported.</li>
<li><b>Improved scale-out capabilities</b>: Application deployments using the tiered model (client processes embedding a local cache talking to a server farm) can see a 2X or more performance gain when accessing partitioned data. With 6.5, client processes gain knowledge of the server-side partitions and use it on the fly to route traffic directly to the server holding the required data set. Application clients subscribing to events using key-based register-interest or "continuous queries" now have their subscriptions registered on all the partitioned nodes. Each partition processes the subscription filters on the nodes where the data changes are applied, dramatically reducing messaging traffic between peers compared to previous releases. The net effect is that more subscriptions can be processed, and event delivery latency to clients is reduced.</li>
<li><b>Colocated transactions</b>: If each logical partition owns its entire transactional working set, applications can scale linearly provided the concurrent transaction load is uniformly spread across the data set, and hence across all the partitions. Each partition can coordinate its transactions without any impact on other partitions and with no locking across partitions. GemFire 6.5 changes the distributed transaction design to detect cases where the data is colocated and avoid engaging the built-in distributed lock service.</li>
<li><b>C++ and C# client improvements</b>: Clients can now receive and send object change "deltas", CQ processing on the client has improved, parallel function execution on the grid is automatically HA, SSL and user-level security support has been added, and client caches can be significantly larger with support for 64-bit architectures.</li>
</ul><br />
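The single-hop client optimization above amounts to the client holding a copy of the partition metadata and hashing keys itself. A minimal sketch (the bucket count and hashing scheme are illustrative, not GemFire's actual algorithm):

```python
# Toy single-hop routing: the client hashes a key to a bucket and
# sends the request straight to the server owning that bucket,
# instead of bouncing through an arbitrary server.
NUM_BUCKETS = 113  # fixed bucket count, as in most data grids

def bucket_for(key):
    return hash(key) % NUM_BUCKETS

class PartitionAwareClient:
    def __init__(self, bucket_to_server):
        # Metadata fetched from the cluster: bucket id -> server name
        self.bucket_to_server = bucket_to_server

    def route(self, key):
        return self.bucket_to_server[bucket_for(key)]

# Buckets striped across three servers
servers = ["server-%d" % (b % 3) for b in range(NUM_BUCKETS)]
client = PartitionAwareClient(dict(enumerate(servers)))
assert client.route("order-42") == servers[bucket_for("order-42")]
```

Removing the extra network hop per operation is where the roughly 2X gain for partitioned data access comes from.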
<div><div>For a complete list of features added in 6.5, click <a href="http://community.gemstone.com/pages/viewpage.action?pageId=6032141"><span class="Apple-style-span" style="color: orange;">here</span></a>. </div><div><br />
</div><div>Next, I will rant and rave our disk persistence story.</div><div><br />
</div><div><span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><b>Factors contributing to very high disk throughput:</b></span></div><div><br />
</div><div><b><i>Pooling</i></b>: As mentioned before, each cache instance manages its own disk store, so there is no disk contention between processes. Each partition manages its data on local disk(s). Assuming the application "write" load can be uniformly balanced across the cluster, the aggregate disk throughput will be roughly (disk transfer rate * number of partitions), assuming a single disk per partition. Disk transfer rates can be up to 100MB/sec on commodity machines today, compared to just 2MB/sec in the 1980s. </div><div><br />
</div><div><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/_o_zMfh3Veig/TK1i58lzD4I/AAAAAAAACs4/x_Pbdyuo_RI/s1600/diskRates.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="81" src="http://1.bp.blogspot.com/_o_zMfh3Veig/TK1i58lzD4I/AAAAAAAACs4/x_Pbdyuo_RI/s400/diskRates.gif" width="400" /></a> </div><div class="separator" style="clear: both; text-align: center;"><span class="Apple-style-span" style="font-size: small;">reference : </span><a href="http://www.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf"><span class="Apple-style-span" style="font-size: small;">http://www.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf</span></a></div><div><br />
</div><div><b><i>Avoid seeks</i></b>: By managing most (or all) of the data in cluster memory, all reads are served without navigating B-tree indexes and data files on disk, which would otherwise result in continuous seeking. Average disk seek times today are still 2ms or higher. </div></div><div><br />
</div><div><br />
</div><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/_o_zMfh3Veig/TK1kSuQnycI/AAAAAAAACs8/xPqI0p8MImY/s1600/diskPersist1.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="404" src="http://4.bp.blogspot.com/_o_zMfh3Veig/TK1kSuQnycI/AAAAAAAACs8/xPqI0p8MImY/s640/diskPersist1.gif" width="640" /></a></div><div><br />
</div><div><b><i>Buffered logging</i></b>: When writes do occur, the operations are simply logged to "append-only" log/data files on disk (see figure above). Appending means we can continuously write to consecutive sectors without disk head movement. Probably the most controversial decision we had to make was to flush all writes only to the OS buffer rather than 'fsync' them all the way to disk. The writes are buffered by the IO subsystem in the kernel, allowing the IO scheduler to merge and sort disk writes to achieve the highest possible throughput; a write request need not initiate any disk I/O until some time in the future. Thus, from the perspective of a user application, write requests stream at much higher speeds, unencumbered by the performance of the disk. The data loss risk from sudden hardware failure is mitigated by having multiple nodes write in parallel to disk. In fact, it is assumed that hardware will fail, especially in large clusters and data centers, and the software takes such failures into account. The system is designed to recover in parallel from disk and to guarantee data consistency when data copies on disk disagree: each member of the distributed system logs membership changes to its persistent files and uses them during recovery to figure out which replica has the latest changes, automatically synchronizing those changes at startup.</div><div><br />
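The buffered-append decision can be illustrated with plain file I/O (a sketch, not GemFire's oplog format): write() returns once the bytes reach the OS page cache, and os.fsync() is the expensive call that forces them to the platters, which GemFire chooses to skip in favor of redundant copies on other nodes.

```python
import os
import tempfile

# Sketch of buffered append-only logging: flush() moves data from the
# user-space buffer to the kernel page cache; os.fsync() forces it to
# disk (paying seek and rotational cost).
def append_record(f, payload, sync=False):
    f.write(payload)
    f.flush()                     # user buffer -> kernel page cache
    if sync:
        os.fsync(f.fileno())      # page cache -> disk (slow, durable)

with tempfile.NamedTemporaryFile("ab", delete=False) as f:
    append_record(f, b"put key1 val1\n")             # fast, buffered
    append_record(f, b"put key2 val2\n", sync=True)  # durable, slow
path = f.name
with open(path, "rb") as f:
    assert f.read().count(b"put") == 2
os.remove(path)
```

The kernel is then free to merge and reorder the buffered writes, which is exactly the scheduling freedom the paragraph above describes.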
</div><div><br />
</div><div><span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><b>Motivations for native GemFire persistence instead of an RDB</b></span></div><div><br />
</div><div>Most data grid deployments today use an RDB as the backing store. A synchronous design, where every change is reflected in the database first, has obvious challenges for "write heavy" applications: you are only as fast as the RDB, and the database can become a single point of failure (SPOF). Transaction execution complexity also increases, involving the data grid and the database in a 2-phase commit protocol. </div><div>One remedy is to execute all writes on the data grid and asynchronously propagate the changes to the RDB. This pattern has the same SPOF challenges and is not well suited to sustained high write rates. </div><div>Designs based on database "shards" - each cache instance writing to its own independent DB instance for scale - are interesting but tough to implement with good HA characteristics. </div><div><br />
</div><div>We believe the GemFire parallel persistence model removes all these limitations. Now, I am not advocating that you shouldn't move the data in the grid back to your classic RDB; you probably need to do this for all sorts of reasons as you pipe information to upstream and downstream applications. But think twice if your RDB is merely meant to be a backing store.</div><div>Even customers that remain apprehensive about GemFire acting as the backing store sometimes want to manage data in GemFire disk stores just so that cluster recovery is fast. </div><div><br />
</div><div><br />
</div><div><b><span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">Handling failure conditions</span></b></div><div><br />
</div><div>The choice to use buffered logs meant we had to make sure the data is written to disk on multiple nodes for reliability. When a partition fails, its disk store can become stale; if the entire cluster is later bounced, the recovery logic has to ensure the state of the data in memory and on disk reflects the latest changes. We spent a lot of energy making sure the recovery design always guarantees consistency and freshness of data. </div><div><br />
</div><div><b><span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">Summary</span></b></div><div>In summary, the design is fundamentally different from the traditional DB approach, where data is stored on disk and memory is used to optimize disk IO (managing raw disk blocks instead of application data objects). In our thinking, data is primarily managed in memory, and disk is primarily used for recovery and to address safety concerns. <i>Disk capacity has increased more than 10000-fold over the last 25 years and seems likely to continue increasing in the future. Unfortunately, though, the access rate to information on disk has improved much more slowly: the transfer rate for large blocks has improved “only” 50-fold, and seek time and rotational latency have only improved by a factor of two</i>.</div><div><br />
</div></div><div>You can read the product documentation on persistence<span class="Apple-style-span" style="color: orange;"> </span><a href="http://www.gemstone.com/docs/6.5.0/product/docs/html/Manuals/SystemAdministratorsGuide/wwhelp/wwhimpl/common/html/wwhelp.htm#context=SystemAdministratorsGuide&file=diskStores.7.2.html"><span class="Apple-style-span" style="color: orange;">here</span></a>.</div><div><span class="Apple-style-span" style="font-family: inherit;"><br />
</span></div><div><a href="http://www.springsource.org/spring-gemfire"><span class="Apple-style-span" style="font-size: x-large;"><span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">Spring-GemFire</span></span></a><span class="Apple-style-span" style="font-size: x-large;"><span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"> integration</span></span></div><div><span class="Apple-style-span" style="font-family: inherit;"><br />
</span></div><div><div style="color: black; line-height: 1.4em; margin-bottom: 10px; margin-left: 15px; margin-right: 1px; margin-top: 3px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><span class="Apple-style-span" style="font-family: inherit;">When we merged with SpringSource/VMWare, we outlined to our customers and the press a high-level strategy for how GemFire will integrate with the Spring framework (Java and .NET). The core Spring engineers have already delivered on part of that promise with the first milestone release, making it natural for the Spring developer to access GemFire.</span></div><div style="color: black; line-height: 1.4em; margin-bottom: 10px; margin-left: 15px; margin-right: 1px; margin-top: 1px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><span class="Apple-style-span" style="font-family: inherit;">Among other things, the integration provides:</span></div><span class="Apple-style-span" style="font-family: inherit;"></span><br />
<span class="Apple-style-span" style="font-family: inherit;"><ol><li>Simpler ways to configure a cache and data regions. You can then inject the region into your app POJOs just like any Spring dependency.</li>
<li>Use of the Spring transaction model - declarative configuration and transactions handled consistently across a variety of providers. The application doesn't have to explicitly invoke the GemFire transaction APIs. Of course, our transaction caveats still apply.</li>
<li>Easy wiring of dependencies for callbacks: for instance, if you are using a CacheWriter or loader that needs a DB URL, connection properties, etc., you can now configure DataSources the conventional Spring way and have them auto-injected into your callback.</li>
<li>And more.</li>
</ol></span><br />
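For a flavor of item 1, the milestone release lets you define the cache and a region as ordinary Spring beans. The factory-bean class names below follow the Spring GemFire project and may differ slightly between milestones, and com.example.CustomerService is a hypothetical application bean:

```xml
<!-- Sketch: configure a GemFire cache and region as Spring beans and
     inject the region like any other dependency. -->
<bean id="gemfireCache" class="org.springframework.data.gemfire.CacheFactoryBean"/>

<bean id="customers" class="org.springframework.data.gemfire.RegionFactoryBean">
  <property name="cache" ref="gemfireCache"/>
  <property name="name" value="customers"/>
</bean>

<bean id="customerService" class="com.example.CustomerService">
  <property name="region" ref="customers"/>
</bean>
```

The application POJO just sees a Region reference; no GemFire bootstrapping code appears in it.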
<div style="color: black; line-height: 1.4em; margin-bottom: 10px; margin-left: 15px; margin-right: 1px; margin-top: 1px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><span class="Apple-style-span" style="font-family: inherit;">I encourage you to read </span><a href="http://blog.springsource.com/2010/08/03/spring-gemfire-1-0-0-m1-released-for-java-and-net/" rel="nofollow" style="color: #ff6600;"><span class="Apple-style-span" style="font-family: inherit;">Costin Leau's blog </span></a><span class="Apple-style-span" style="font-family: inherit;">for specific details, download and give it a try. Your feedback will be very valuable and much appreciated.</span></div><div style="color: black; line-height: 1.4em; margin-bottom: 10px; margin-left: 15px; margin-right: 1px; margin-top: 1px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><span class="Apple-style-span" style="font-family: inherit;"><br />
</span></div><div style="color: black; line-height: 1.4em; margin-bottom: 10px; margin-left: 15px; margin-right: 1px; margin-top: 1px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><span class="Apple-style-span" style="font-family: inherit;">I hope to amend this blog post with further details on the various other "scale out" features in 6.5 soon. </span></div><div style="color: black; line-height: 1.4em; margin-bottom: 10px; margin-left: 15px; margin-right: 1px; margin-top: 1px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><span class="Apple-style-span" style="font-family: inherit;">If there is enough interest, go through our community site on 6.5, download and try out the new tutorial. </span></div><div style="color: black; line-height: 1.4em; margin-bottom: 10px; margin-left: 15px; margin-right: 1px; margin-top: 1px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><a href="http://community.gemstone.com/display/gemfire/GemFire+Enterprise"><span class="Apple-style-span" style="font-family: inherit;">http://community.gemstone.com/display/gemfire/GemFire+Enterprise</span></a></div><div style="color: black; line-height: 1.4em; margin-bottom: 10px; margin-left: 15px; margin-right: 1px; margin-top: 1px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><a href="http://community.gemstone.com/display/gemfire/Getting+Started"><span class="Apple-style-span" style="font-family: inherit;">http://community.gemstone.com/display/gemfire/Getting+Started</span></a></div><div style="color: black; line-height: 1.4em; margin-bottom: 10px; margin-left: 15px; margin-right: 1px; margin-top: 1px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><span class="Apple-style-span" style="font-family: inherit;">----</span></div><div style="color: black; line-height: 1.4em; margin-bottom: 10px; margin-left: 15px; margin-right: 1px; 
margin-top: 1px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><span class="Apple-style-span" style="font-family: inherit;">Cheers!</span></div><div style="color: black; font-family: 'Gill Sans', Futura, sans-serif; font-size: 14px; font-weight: normal; line-height: 1.4em; margin-bottom: 10px; margin-left: 15px; margin-right: 1px; margin-top: 1px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><br />
</div></div>Jags Ramnarayanhttp://www.blogger.com/profile/01088990426487508645noreply@blogger.com2tag:blogger.com,1999:blog-31429685.post-4702669858177993922010-05-06T13:20:00.000-07:002010-05-06T23:01:43.692-07:00SpringSource/VMWare acquires GemStoneHere are some specific details on the potential technology synergies...<br />
<br />
<span style="font-size: large;">Synergy with SpringSource</span><br />
<span style="font-size: large;"></span><br />
<br />
What makes the integration synergistic and complementary are the two companies' areas of focus. GemStone/GemFire has primarily focused on making the infrastructure for clustering and managing data in the middle tier extremely reliable and flexible. SpringSource's focus has been a ubiquitous programming model (the Spring framework) that is simple to adopt and open enough that a myriad of products and technologies can be seamlessly integrated, plus a lightweight runtime in the enterprise-class tc Server and a management environment enabled by Hyperic HQ. In a sense, the GemFire integration brings first-class data management and clustering to the runtime - a key component enabling "extreme scale" application deployment. <br />
<br />
<br />
GemStone, as well as the merged entity, is fully committed to supporting heterogeneous access to the GemFire data fabric, with continued support for Java, C++ and .NET. In fact, we are exploring opportunities to integrate with Spring.NET and will continue our commitment to simplifying the programming models for non-Java environments. <br />
<br />
GemStone management and SpringSource/VMWare management remain committed to delivering on the roadmap and the platform extensions that have already been discussed with customers. In fact, we will go well beyond our current commitments by leveraging the Spring framework and integrating with a multitude of Spring modules to provide a much simpler configuration and development model for our customers.<br />
<br />
<br />
<span style="font-size: large;">How might we leverage the Spring framework?</span><br />
<br />
The Spring framework is all about choice for enterprise Java applications. We now enable more choice by offering a clustered data management solution that can be used in multiple ways:<br />
<br />
<br />
1) <strong>As a transparent L2 cache</strong>: Spring applications using Hibernate will be able to plug in an L2 cache that intercepts traffic to/from a backend database for scaling and performance reasons. Note that GemFire already supports Hibernate L2 cache plugins, but this effort will now take increased focus. On the .NET side of things, we will accelerate support for Spring.NET applications using NHibernate.<br />
<br />
2) <strong>AOP cache</strong>: We will be able to offer sophisticated AOP caching interceptors that can transparently use a highly scalable cache. <br />
<br />
3) <strong>Parallel data-aware method invocations</strong>: Spring bean service invocations could transparently make use of GemFire's data-aware parallel function execution capabilities, where behavior executes in parallel, operates on localized data sets, and goes through an "aggregation" phase to produce a final result. <br />
<br />
4) <strong>Session state management</strong>: We already provide an abstraction layer for session state management on top of GemFire today. The intent is to include this capability as an integral part of the platform. Highlights of this add-on include the ability to handle very large object graphs efficiently, dynamic partitioning and load balancing, visibility of session state across multiple clusters (WAN), HA, and the ability to maintain an "edge" cache that maximizes memory utilization through a heap-LRU-based eviction algorithm. <br />
<br />
We are exploring options to enable quick integration of the data fabric into existing Spring applications through enhancements to the Eclipse plugins - for instance, the configuration of the cache regions and deployment topologies will be simplified. We will also investigate how we can leverage Spring Roo (next generation rapid application development tool) where a developer will be able to build a Java application with integrated caching in minutes. <br />
<br />
<span style="font-size: large;">Integration with Spring Modules</span> <br />
<br />
Unlike some of the other distributed caching options available in the market, GemFire natively supports memory-based, reliable event notifications. Applications can subscribe to data in the fabric through "continuous queries" and receive in-order, reliable notifications when the data of interest changes. Spring Integration extends the Spring framework into the messaging domain and provides a higher level of abstraction, so business components are further isolated from the infrastructure and developers are relieved of complex integration responsibilities. We are exploring techniques to leverage this simple and intuitive API so applications are abstracted away from GemFire-specific notification APIs. <br />
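A continuous query is essentially a standing predicate evaluated against every change. Conceptually (a toy model, not the GemFire CQ API):

```python
# Toy model of a continuous query: subscribers register a predicate
# once and receive every subsequent matching change, in order.
class ContinuousQueryRegion:
    def __init__(self):
        self.data = {}
        self.subscriptions = []   # (predicate, ordered event list)

    def register_cq(self, predicate):
        events = []
        self.subscriptions.append((predicate, events))
        return events

    def put(self, key, value):
        self.data[key] = value
        for predicate, events in self.subscriptions:
            if predicate(key, value):
                events.append((key, value))   # reliable, in-order

region = ContinuousQueryRegion()
big_trades = region.register_cq(lambda k, v: v > 100)
region.put("t1", 50)
region.put("t2", 500)
region.put("t3", 101)
assert big_trades == [("t2", 500), ("t3", 101)]
```

A Spring Integration adapter would sit between the event list and the application, so the subscriber sees plain messages rather than grid callbacks.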
<br />
<br />
Spring Batch enables extremely high-volume, high-performance batch jobs through optimization and partitioning techniques. Simple as well as complex, high-volume batch jobs can leverage the Spring framework in a highly scalable manner to process significant volumes of information. Integration with the GemFire function service means developers can write Spring Batch applications that, behind the scenes, leverage the parallelism and "data aware" routing capabilities built into GemFire. <br />
<br />
There are many integration possibilities, all of which will make the developer's job of integrating with a data fabric/grid significantly simpler. <br />
<br />
<br />
<span style="font-size: large;">GemFire in the cloud</span><br />
<br />
Our offering will expand the VMWare strategy to deliver 'Platform as a Service' solutions over time. The already powerful arsenal includes VMWare vSphere as the cloud operating system, the Spring framework as the enterprise application development model, and SpringSource tc Server as a lightweight application platform. GemFire now extends these capabilities as the scalable, elastic data platform. <br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/_o_zMfh3Veig/S-MetBiIV5I/AAAAAAAACVs/--tAjC48v_g/s1600/SpringSourceGemFire.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/_o_zMfh3Veig/S-MetBiIV5I/AAAAAAAACVs/--tAjC48v_g/s320/SpringSourceGemFire.gif" tt="true" /></a></div><br />
<br />
While VMWare makes a good case for its current products positioned for the cloud, I will focus on the rationale for GemFire as a data platform for the cloud. <br />
<br />
One of the big advantages of applications deployed in a cloud environment is the ability to tap into additional capacity 'just in time' as demand changes. The underlying platform needs to detect this and respond with elastic behavior. The additional capacity could be provisioned through virtual or physical machines spread across subnets, or perhaps even across data centers. This means application behavior as well as the associated state (data) needs to scale horizontally through migration. Behavior migration is well taken care of by modern clustered application servers, but it is a lot more challenging with data. The traditional relational database is rooted in a design focused on optimizing disk IO and maintaining ACID properties. The frame of reference was quite different - data was centralized, and extensive locking/latching techniques were used to preserve the ACID transaction properties. The design did not account for data being managed across a large cluster of heterogeneous machines. Several clustered databases today offer a shared-nothing architecture at the database engine layer but fall short at the storage layer (which is shared everything). For elasticity, we believe the design has to change along the following two important dimensions: <br />
<br />
1) <strong>distribution orientation </strong>- distribution methods that can move data around a large network efficiently without loss of consistency <br />
<br />
2) <strong>memory orientation </strong>- when demand spikes, it is a lot faster and easier to move data between memory segments across machines. GemFire horizontally partitions data across any number of machines and initiates automatic data rebalancing when additional capacity is added to its cluster. Unlike common database architectures, GemFire primarily manages data in memory and leaves persistence to disk as a choice for the developer. Besides it being more efficient to manage ephemeral state such as session state only in memory, regulations in some environments may mandate that nothing be stored on disk. <br />
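To illustrate the rebalancing dimension, here is a sketch using rendezvous (highest-random-weight) hashing, one well-known way to ensure that adding capacity relocates only whole buckets and leaves the rest in place. This is an assumed illustration of the design idea, not GemFire's actual placement algorithm.

```java
public class RebalanceDemo {
    static final int BUCKETS = 12;

    // Deterministic per-(bucket, member) weight, splitmix64-style finalizer.
    static long weight(int bucket, int member) {
        long z = bucket * 31L + member;
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }

    // Rendezvous placement: every member "bids" for the bucket and the highest
    // weight wins, so ownership needs no central table.
    static int ownerOf(int bucket, int members) {
        int best = 0;
        for (int m = 1; m < members; m++)
            if (weight(bucket, m) > weight(bucket, best)) best = m;
        return best;
    }

    public static void main(String[] args) {
        int moved = 0;
        for (int b = 0; b < BUCKETS; b++)
            if (ownerOf(b, 3) != ownerOf(b, 4)) moved++;
        // Growing the cluster can only hand buckets to the new member (#3);
        // existing members never shuffle buckets among themselves.
        for (int b = 0; b < BUCKETS; b++)
            if (ownerOf(b, 3) != ownerOf(b, 4) && ownerOf(b, 4) != 3)
                throw new AssertionError("unexpected bucket movement");
        System.out.println(moved + " of " + BUCKETS + " buckets relocate");
    }
}
```

The property checked in `main` is the elasticity point: capacity growth moves a bounded subset of the data rather than rehashing everything.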
<br />
We also know that applications deployed on the cloud will be subject to stricter SLAs - continuous availability and very predictable response times. Data stored in GemFire is typically copied synchronously to one or more machines, often also stored redundantly on disk, and even copied asynchronously across data centers in case an entire data center goes down. Again, by allowing the data to be managed primarily in memory, the cost of maintaining redundant copies is considerably reduced. GemFire's built-in continuous instrumentation, and potential integration with cloud provisioning environments (the vCloud API), permits a priori detection of changing load patterns so that data rebalancing can be initiated without any operator intervention. <br />
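The synchronous-redundancy guarantee above can be sketched as: a write completes only after the primary and its redundant copies have applied it, so losing any single member loses no acknowledged data. The names and the adjacent-member copy placement are my simplifications, not the product's API.

```java
import java.util.*;

public class RedundancyDemo {
    final List<Map<String, String>> members = new ArrayList<>();
    final int redundancy;

    RedundancyDemo(int memberCount, int redundancy) {
        this.redundancy = redundancy;
        for (int i = 0; i < memberCount; i++) members.add(new HashMap<>());
    }

    // A put is applied to the primary and `redundancy` secondaries before it
    // would be acknowledged to the caller.
    void put(String key, String value) {
        int primary = Math.floorMod(key.hashCode(), members.size());
        for (int i = 0; i <= redundancy; i++)
            members.get((primary + i) % members.size()).put(key, value);
    }

    // Read while pretending one member has failed: an acknowledged write is
    // still served by a surviving copy.
    String get(String key, int failedMember) {
        for (int i = 0; i < members.size(); i++)
            if (i != failedMember && members.get(i).containsKey(key))
                return members.get(i).get(key);
        return null;
    }

    public static void main(String[] args) {
        RedundancyDemo fabric = new RedundancyDemo(3, 1);
        fabric.put("order-42", "ACK");
        int primary = Math.floorMod("order-42".hashCode(), 3);
        System.out.println(fabric.get("order-42", primary)); // prints ACK
    }
}
```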
<br />
Highly parallelizable, computationally intensive applications such as large scale web applications or analytic applications (such as risk analysis in finance) are ideal candidates for cloud deployments. Many of these computationally intensive tasks also tend to be data intensive. Spring-enabled services can now be parallelized through GemFire's data-aware parallel function service, where the application behavior is executed on the nodes carrying the data sets required by each parallel activity. This data localization dramatically increases throughput and reduces cost, with little out-of-process data access. Instead of developing applications against the custom GemFire APIs, application development can rely on open, Spring community-driven modules such as Spring Batch, simplifying the development task. <br />
<br />
We also believe the combination of SpringSource/VMWare infrastructure and GemFire will enable customers to implement "cloud bursting" strategies where on-premises stateful services can much more easily expand to an in-house private cloud or even a public cloud. <br />
<br />
<br />
<span style="font-size: large;">Vision of a first class middle tier data management platform</span><br />
<br />
GemFire started off as a pure main-memory distributed cache in which transactional updates were immediately persisted to the database of record. In recent years, with the addition of capabilities such as reliable asynchronous writes to the database, data partitioning with dynamic rebalancing, and wide area network data management, the product is now often used as the high performance store itself, with data written to the database in batches - at the end of the business day or during a batch run. The database is being relegated to an archival store for integration purposes. <br />
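The write-behind pattern described above can be sketched as an in-memory put plus a queued update that a background job later drains to the database of record in batches. This sketch is illustrative only: a production write-behind queue also needs persistence and conflation, and none of these names come from the product.

```java
import java.util.*;

public class WriteBehindDemo {
    final Map<String, String> cache = new HashMap<>();
    final Deque<String[]> queue = new ArrayDeque<>();
    final List<List<String[]>> dbBatches = new ArrayList<>(); // stand-in for JDBC batch writes

    // The caller sees in-memory speed; the database write is deferred.
    void put(String key, String value) {
        cache.put(key, value);
        queue.add(new String[]{key, value});
    }

    // Run by a background writer or an end-of-day job: drain up to batchSize
    // queued updates and write them to the database as one batch.
    void flush(int batchSize) {
        List<String[]> batch = new ArrayList<>();
        while (!queue.isEmpty() && batch.size() < batchSize) batch.add(queue.poll());
        if (!batch.isEmpty()) dbBatches.add(batch);
    }

    public static void main(String[] args) {
        WriteBehindDemo fabric = new WriteBehindDemo();
        for (int i = 0; i < 5; i++) fabric.put("k" + i, "v" + i);
        fabric.flush(3); // first batch of 3
        fabric.flush(3); // remaining 2
        System.out.println(fabric.dbBatches.size() + " batches, cache size " + fabric.cache.size());
        // prints: 2 batches, cache size 5
    }
}
```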
<br />
We are now adding support for a parallel, shared-nothing disk storage layer in the product. Unlike other clustered database management system designs where the disks are shared across many distributed processes, each member of the cluster persists in parallel to its local disks. Transactions do not force disk flushes; instead the IO scheduler is allowed to time the block writes to disk, dramatically increasing disk write throughput. Any risk of data loss due to sudden hardware failures is mitigated by having multiple nodes write in parallel to disk. In fact, it is assumed that hardware will fail, especially in large clusters and data centers, and the software must take such failures into account. The system is designed to recover in parallel from disk and to guarantee data consistency when data copies on disk don't agree with each other.<br />
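One general way to reconcile disk copies that don't agree is per-entry versioning: each member persists a monotonically increasing version with every entry, and recovery merges the surviving copies keeping the highest version of each key. This is a hedged sketch of that general technique; it is not a description of GemFire's actual recovery protocol.

```java
import java.util.*;

public class DiskRecoveryDemo {
    static final class Versioned {
        final String value;
        final long version;
        Versioned(String value, long version) { this.value = value; this.version = version; }
    }

    // Merge the disk copies of several members: for each key keep the copy with
    // the highest version, so a member that missed the last write cannot win
    // over a member holding a completed, newer write.
    static Map<String, Versioned> recover(List<Map<String, Versioned>> diskCopies) {
        Map<String, Versioned> merged = new HashMap<>();
        for (Map<String, Versioned> copy : diskCopies)
            copy.forEach((k, v) -> merged.merge(k, v, (a, b) -> a.version >= b.version ? a : b));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Versioned> memberA = new HashMap<>();
        Map<String, Versioned> memberB = new HashMap<>();
        memberA.put("acct", new Versioned("balance=100", 7)); // A missed the last write
        memberB.put("acct", new Versioned("balance=90", 8));  // B holds the newer version
        System.out.println(recover(List.of(memberA, memberB)).get("acct").value); // prints balance=90
    }
}
```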
<br />
It is our strong belief that with the proliferation of highly distributed middle tier architectures, a data tier that is colocated with the application tier, is primarily clustered, and is used to manage ephemeral data will substantially change the architecture of every enterprise application that demands scaling and predictability.<br />
<br />
It is with this belief that we built the new <a href="http://community.gemstone.com/display/sqlfabric/SQLFabric">SQLFabric product</a> - the same underpinnings as GemFire, but with SQL as the interface. We took this step in spite of the recent push for "no sql" database alternatives for the cloud. Our decades of database implementation and research experience do not implicate SQL as a query language when it comes to the desired characteristics of elasticity and performance. These are simply design choices made by SQL database vendors based on a frame of reference that is no longer valid. In other words, there is nothing wrong with SQL as a language; rather, it is the design of disk-oriented, centralized databases that is sub-optimal. The premise behind SQLFabric is to capitalize on the power of SQL as an expressive, flexible, very well understood query language, but alter the design underpinnings common in databases for scalability and high performance. By providing ubiquitous interfaces like JDBC and ADO.NET, the product becomes significantly easier to adopt and integrates well into existing ecosystems.Jags Ramnarayanhttp://www.blogger.com/profile/01088990426487508645noreply@blogger.com7tag:blogger.com,1999:blog-31429685.post-71001021488678535262009-11-22T10:08:00.000-08:002009-11-22T10:44:19.651-08:00New Era for OLTP databasesComing out of hibernation to blog, made exciting by our new launch of the <a href="http://community.gemstone.com/display/sqlfabric/SQLFabric">SQL data fabric</a>. No wonder that when you start talking about a "first class" SQL interface to our distributed data grid/fabric platform, it gets compared with the relational database and its modern clustering extensions.<br />
<br />
Here is a zooming presentation that attempts to position the product as a horizontally partitioned, memory-oriented database (unlike the typical parallel DBMS that is vertically partitioned for OLAP-class applications). I think you will find the use of <a href="http://prezi.com/">prezi</a> fascinating. I would love to get some feedback on the presentation. All this is still very preliminary and we will be sharing a lot more details in the days ahead.<br />
<br />
(Click MORE --> full screen to get a better view) <br />
<br />
<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" height="400" id="prezi_gkuneyar15kv" name="prezi_gkuneyar15kv" width="550"> <param name="movie" value="http://prezi.com/bin/preziloader.swf"><param name="allowfullscreen" value="true"><param name="allowscriptaccess" value="always"><param name="bgcolor" value="#ffffff"><param name="flashvars" value="prezi_id=gkuneyar15kv&lock_to_path=1&color=ffffff&autoplay=no"><embed id="preziEmbed_gkuneyar15kv" name="preziEmbed_gkuneyar15kv" src="http://prezi.com/bin/preziloader.swf" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="550" height="400" bgcolor="#ffffff" flashvars="prezi_id=gkuneyar15kv&lock_to_path=1&color=ffffff&autoplay=no"></embed> </object><br />
<br />
Here is a <a href="http://community.gemstone.com/display/community/2009/10/28/Entering+a+New+Architectural+Era+-+The+Elastic%2C+SQL%2C+OLTP+Database">related webcast from our principal architect, Dave Brown</a><br />
<br />
<br />
<br />
<br />
Jags Ramnarayanhttp://www.blogger.com/profile/01088990426487508645noreply@blogger.com0tag:blogger.com,1999:blog-31429685.post-36229866158511302952009-06-11T14:12:00.000-07:002009-06-11T14:23:26.592-07:00Challenges in throughput scaling in a partitioned designHere is an interesting article from <a href="http://highscalability.com/dealing-multi-partition-transactions-distributed-kv-solution">Billy Newport posted on highscalability.com<br /><br /></a>The gist of this article is a point that hits home for me - <span style="font-style: italic;">you cannot assume higher scaling (especially throughput) just by adding more partitions for managing your data.</span><br /><br />The article correctly points out that one slow partition can throttle the system throughput for scatter-gather type queries that are parallelized on each partition. But this assumes that all nodes are equal (or remain equal) from a load standpoint. I think this ability to detect and handle "hotspots" deserves a lot more attention than it gets. I suspect there is a general myth that an increase in the number of partitions will always result in higher throughput.<br /><br />For instance, we at GemStone have seen the following interesting cases (+ more):<br /><span style="font-weight: bold;">(1) 80-20 rule: </span>Often, it is 20% of the data that is a lot more popular at any given time. So, you have a situation where too many clients converge onto a few partitions, creating imbalance. A whole bunch of your partitions are just idle and heavily underutilized. 
You can mitigate this to some extent by creating more replicas of the "hot" data, but it can be difficult to predict when you need the extra copy and for how long.<br /><span style="font-weight: bold;"><br />(2) GC pauses: </span>Consider a case where a large number of clients are updating partitioned data deployed in a JVM based data grid/fabric. Sooner or later, you are exposed to the dreaded "full GC" cycle, causing the incoming client to be paused. Say your normal update request takes a fraction of a millisecond, and let us also assume that client requests are uniformly balanced across all the partitions. Essentially, it is more or less guaranteed that every client will hit each partition at least once every second. Now, a full GC in any one partition causes every single client to pause. Have a GC that takes a whole minute, and you have all clients waiting. Worse, the moment this paused partition gets out of it, the next partition gets into the same situation, and so on.<br /><br />So, what do you do?<br /><br />We at <a href="https://community.gemstone.com/display/gemfire/GemFire+Enterprise">GemStone Systems</a> think some of these problems can be mitigated by providing two key capabilities:<br /><span style="font-weight: bold;">1) dynamic rebalancing: </span>This is the ability for the system as a whole to adjust itself by relocating buckets (subsets of data within a partition) to less loaded nodes - and doing this in such a fashion that no pauses are introduced.<br /><span style="font-weight: bold;"><br />2) Enough instrumentation within the system to proactively avoid hotspots: </span>GemFire captures a wide range of statistics on the query and update rate on any given partition, the average response times, CPU utilization, GC pause times, overall heap utilization with respect to a configurable threshold to reduce the probability of full GCs, etc. 
Applications get access to these statistics through a simple API and can then use an API to trigger rebalancing, offloading some of the data from "hot" partitions to less loaded ones. Of course, there is no guarantee that the throughput characteristics observed on any given partition in the past are representative of the future, but the explicit control gives application developers and administrators a way to dictate what happens. Our sense is that they know best. We even provide a way for the application to simulate a rebalance to see the effect it would have, rather than actually doing it.<br /><br />Explore more on the <a href="https://community.gemstone.com/display/gemfire/GemFire+Enterprise">GemFire data fabric</a>Jags Ramnarayanhttp://www.blogger.com/profile/01088990426487508645noreply@blogger.com0tag:blogger.com,1999:blog-31429685.post-67456413369196587582007-07-19T23:21:00.000-07:002007-07-24T23:54:02.402-07:00High performance data sharing between C#, Java and C++<span style="font-size:130%;"><strong>SOAP and High Performance: an oxymoron? </strong></span><div><div><div><br /><p>XML messaging using the SOAP protocol has become the lingua franca for interoperability. Though loose coupling and simplicity through a standard text based protocol have their appeal, it is no secret that SOAP isn't suitable for high performance messaging. The simplicity and extensibility of XML/SOAP has generated a great deal of interest and resulted in pervasive support across many languages and scripting environments.<br /><br />This <a href="http://www2003.org/cdrom/papers/alternate/P872/p872-kohlhoff.html">paper </a>evaluates the performance of SOAP for real-time trading and compares it to a native binary protocol such as <a href="http://www.fixprotocol.org/what-is-fix.shtml">FIX </a>(Financial Information eXchange).<br /><br />Here is a simple example comparing two text based protocols, one that uses simple 'tag=value' pairs and the other a SOAP envelope. 
The price you pay for XML and SOAP is obvious.<br /><br /><span style="color:#336666;">A FIX message<br /></span><span style="font-family:courier new;font-size:85%;"> 8=FIX.4.3 9=00000098 35=X 49=ABC 56=XYZ 34=1 </span><br /><span style="font-family:courier new;font-size:85%;"> 52=20021116-10:15:28 262=MYREQ 268=1 279=1 </span><br /><span style="font-family:courier new;font-size:85%;"> 278=FOO.last 270=13.42 271=1200 10=185</span><br /><br /><span style="color:#336666;">Equivalent message using FIXML (FIX Markup Language) </span><br /></p><br /><a href="http://bp0.blogger.com/_o_zMfh3Veig/Rqbxu0pttwI/AAAAAAAAAAk/EjRJWkuqxbQ/s1600-h/XMLexample.gif"><img id="BLOGGER_PHOTO_ID_5091022215498807042" style="FLOAT: left; MARGIN: 0px 10px 10px 0px; WIDTH: 586px; CURSOR: hand; HEIGHT: 452px" height="280" alt="" src="http://bp0.blogger.com/_o_zMfh3Veig/Rqbxu0pttwI/AAAAAAAAAAk/EjRJWkuqxbQ/s320/XMLexample.gif" width="509" border="0" /></a><br /><p><em><strong>The benchmark concludes that SOAP messages are 3.5-4.5 times larger than FIX, latency is 2-3 times worse, and encoding/decoding costs increase by up to nearly 9 times.<br /></strong></em><br />Web services based on <a href="http://developers.sun.com/docs/web/swdp/r2/tutorial/doc/p34.html">REST </a>offer an efficient alternative to SOAP-based web services. It is a simple HTTP-based protocol that allows access to resources through CRUD (Create, Read, Update and Delete) operations. 
However, this is mostly geared towards simple request-response type synchronous messaging between services. Products that offer reliable, asynchronous messaging semantics on top of REST could be something to watch out for.<br /><br /><br /><br /><span style="font-size:130%;"><strong>Traditional messaging might be popular, but it has its limitations<br /></strong></span><br />High performance applications commonly use message oriented middleware products such as IBM MQ or Tibco Rendezvous for asynchronous sharing of events and data across heterogeneous applications.<br /><br />However, messaging solutions, as the name implies, package, move and deliver messages, one at a time. The receiving applications often need sufficient context to act on the incoming message. Typically the publisher provides this contextual information through message headers, increasing the overhead for each message sent. So, for instance, an incoming Order may have to contain sufficient information to identify the associated customer. With the related contextual data arriving in encoded form, such as the Customer ID, the application can only process the Order after it fetches all the related data (the entire customer record with credit information, for instance). This might require a round trip to the database. Therefore, if all the related data required for processing an incoming message requires external data source access, the processing speed can only go as fast as the weakest link - the database in most circumstances.<br /><br />Application environments with a high sustained message rate are much more exposed to this "weakest link" problem because enqueuing in the messaging system doesn't really help. 
At some point, either messages have to be discarded or the publisher rate throttled.<br /><br />To support heterogeneous applications, most messaging solutions require the application to construct text-based, self-describing messages. It becomes the responsibility of the application to use an appropriate encoding format for data such that it can be decoded by the receiving application. Should you use XML for the data format, you will experience performance problems similar to those outlined above.<br /><br />In addition, a highly concurrent application environment with multiple publishers will have an increased probability of race conditions, causing data integrity issues. Messaging systems have no inherent capabilities to deal with relationships between messages or ordering of messages across multiple messaging destinations (queues or topics). So, for instance, if Orders (parent) and corresponding LineItems (children) are delivered on two different queues, it is possible for the child data item to be delivered before the parent. Applications would normally be aware of these constraints and route related messages on the same messaging endpoint. It would be advantageous, though, if the infrastructure used for transferring messages was aware of this relationship.<br /><br />Most messaging solutions are really designed to support diverse platforms, multiple language bindings, multiple protocols, flexibility in terms of message reliability, and more. 
But when it comes to object-oriented C++, Java or .NET applications that want to share data, it can be cumbersome to construct or interpret message headers, encode/decode text-based payloads, configure message delivery options, correlate messages, look up related data from backend databases before taking action, etc.<br /><br />Take this MQ example (connection arguments and the message payload are elided placeholders in the original sample; the variable names here stand in for them):<br /><br /><span style="font-size:85%;"><span style="font-family:courier new;">qMgr = new MQQueueManager(queueManagerName); // connection arguments elided<br /><br />/* Open the queue, build the message, and put the message. */<br />int openOptions = MQC.MQOO_OUTPUT;<br />MQQueue myQueue = qMgr.accessQueue(args[0], openOptions,<br />null, null, null);<br /><br />MQMessage myMessage = new MQMessage();<br />myMessage.writeString(encodedPayload); // the text-encoded application data<br />myMessage.format = MQC.MQFMT_STRING;<br /><br />MQPutMessageOptions pmo = new MQPutMessageOptions();<br />myQueue.put(myMessage, pmo);<br /><br />/* Close the queue and disconnect from the queue manager. */<br />myQueue.close();<br />qMgr.disconnect();</span><br /></span><br /><em>The application developer has to create a QueueManager, fetch queues, define the options for writing into the queue, encode the application objects into some text-based format, construct Message objects, configure the message format, publish into the queue, close the queue and disconnect from the QueueManager. 
Quite cumbersome, right?<br /></em><br /><br /><span style="font-size:130%;"><strong>Introducing a Data Fabric<br /></strong></span><br /><strong><em>Shared objects/events across heterogeneous applications through main memory caching<br /></em></strong><br />The basic idea is as follows:<br /><br />Instead of requiring application developers to think about messages as the mechanism for sharing information, why not let them use the same paradigm they use when communicating between components within the application: simply share domain objects in common data structures like a Map and use native thread based notification services. Extend this concept so that these data structures are distributed and visible to disparate applications. Essentially, applications share data and events with each other through a shared object database that is distributed in nature. Applications perform CRUD operations on the database and receive notifications when the data they are interested in changes.<br /><br />There are some important characteristics that differentiate this database from a regular disk-based relational database:<br /><br />1) It is distributed in nature - applications publish data objects and the database copies/moves them to multiple nodes. This sounds like database replication, except that the location of the data, the number of copies made and the way the data is transported differ.<br />2) It is primarily memory based and hence fast.<br />3) It is active in nature - it pays attention to complex expressions of interest from subscribing applications and, when the data changes, instantly pushes the change to the application.<br /><br />This kind of data management system is referred to as a data fabric or data grid. 
A Google search on “Enterprise Data Fabric” will provide an idea of various vendor offerings.<br /><br />A data fabric or data grid combines the important features and semantics seen in database technology and messaging for high performance applications.</p><p>A true data fabric includes the following comprehensive capabilities:<br /><br />• <strong><em>Designed to offer in-memory data access speeds</em></strong>: Data is managed in concurrent data structures, primarily in memory, with minimal contention issues.<br /><br />• <strong><em>Flexibility in data storage</em></strong>: Data can be stored locally in process, replicated to multiple nodes, partitioned across multiple nodes, maintained both in-memory and on-disk, or simply fetched lazily from a back-end data store. Where the data is located or how many copies are maintained becomes a configuration issue, based on the requirements around performance, high availability and the volume of data being managed, yet completely abstracted away from the developer. Data locality is virtualized in a data fabric.<br /><br />• <strong><em>Simple development model</em></strong>: A data fabric offers a simple Map-like interface to applications. Applications can simply fetch by key or put domain objects directly into the cache. There is no need to worry about headers, encoding, decoding to some intermediate format, etc.<br /><br />• <strong><em>Scalable</em></strong>: By distributing the data, the data fabric uses resources across multiple nodes. Deployments can simply add additional capacity on the fly, automatically get the data rebalanced, and handle increasing load (concurrent activity or data volume).<br /><br />• <strong><em>Transactional</em></strong>: The fabric can automatically participate in any ongoing transactions and ensure consistency of data across all the applications sharing the data. 
For instance, if two applications concurrently decide to create the same customer order and publish this to others, the conflict will be detected and handled appropriately. This is a big difference compared to messaging - with no notion of identity associated with data within the messaging system, it is non-trivial to detect conflicts when two applications decide to make the same change at the same time.<br /><br />• <strong><em>Reliable Publish-Subscribe semantics</em></strong>: Applications perform CRUD operations on a local cache and the corresponding event is routed to nodes that subscribe to the data. Data objects can be either synchronously or asynchronously pushed to subscribing applications. Events pushed to subscribers contain the new, changed or deleted object(s). The data fabric is intelligent enough to only propagate changes to data objects or their relationships, keeping the underlying network traffic to a minimum.<br /><br />• <strong><em>Querying</em></strong>: Similar to a regular database, the in-memory data can be indexed and queried using SQL-like syntax.<br /><br />• <a href="http://java.sys-con.com/read/260054.htm"><strong><em>Continuous querying</em></strong></a>: Applications register complex queries - queries with complex predicates, joins, etc. - which, unlike a regular query, are not just executed once. They become resident in the database and give the impression that the query is continuously running. 
As the data changes, the continuous query engine calculates how the result set has changed, pushes the "delta" to the application and merges it with the cached result set on the application node.<br /><br />• <strong><em>Heterogeneous language support</em></strong>: Application objects are automatically serialized in a neutral format within the fabric such that the same object can be de-serialized into an instance of a class in another application written using a different language (with a similar class structure).<br /><br /><a href="http://bp3.blogger.com/_o_zMfh3Veig/RqbwskpttuI/AAAAAAAAAAU/hvLARJwHfmI/s1600-h/GemFire.gif"><img id="BLOGGER_PHOTO_ID_5091021077332473570" style="FLOAT: left; MARGIN: 0px 10px 10px 0px; WIDTH: 479px; CURSOR: hand; HEIGHT: 276px" height="169" alt="" src="http://bp3.blogger.com/_o_zMfh3Veig/RqbwskpttuI/AAAAAAAAAAU/hvLARJwHfmI/s320/GemFire.gif" width="371" border="0" /></a><br /><br /><br /></p><p>For additional information, read more on the <a href="http://www.gemstone.com/pdf/GemFire_Architecture.pdf">architecture of the GemFire Data Fabric</a><br /><br /></p><p><span style="font-size:130%;"><strong>Understanding the data fabric through a use case<br /></strong></span><br />Let us look at a financial Trade order processing system that can route orders to an exchange offering the best liquidity.<br /><br />Here is how an order flows through the system:<br /><br />- Orders arrive as FIX messages from trading partners and clients to an Order processing application<br />- The incoming FIX message is validated (authentication and authorization), normalized (combined with other related information such as Trader information, etc) and written into the data fabric as an Order object<br />- If the incoming order rate is very high, the validation and 
normalization process itself can be load balanced across multiple nodes<br />- Orders are then routed to a Trading Strategy engine that uses different algorithms to determine the best time, quantity and exchange to route the order. Execution of the algorithms requires current market data from multiple exchanges (these are the prices for the different securities traded in the exchanges) and also uses various risk metrics in its calculations. The incoming market data is being pushed typically at a very high rate (20,000 or more ticks per second)<br />- Orders are batched and then routed to an Exchange router, the application responsible for reliably getting the trade order executed on a market exchange<br />- Finally, there is the entire process of post trade information exchange that we will ignore as it is not relevant for this discussion.<br /><br /><br />Let us assume that the order processing application and the exchange router application are written in Java and the computation intensive strategy engine is written in C++. These three applications are sharing data and events in real-time.<br /><br />While it is beyond the scope of this article to explain all the aspects of the data fabric, we will focus on how the fabric enables objects and event sharing.<br /><br /><br /><strong><em>Shared Object model</em></strong>: With the data fabric, the domain model classes are designed such that they are more or less equivalent for each language in use. One might use an Object modeling tool and generate these classes for both Java and C++.<br /><br /><strong><em>Serialization framework</em></strong>: Developers implement callbacks similar to Java Externalizable where object fields are written to a stream managed by the serialization framework provided by the data fabric. 
The framework serializes data into a language-neutral wire format that can be consumed by any application that uses the serialization framework at the consuming end.<br /><br />In this case, the order processing application knows that an order is a complex object composed of several fields and sub-fields. Using the serialization framework, the application writes the relevant fields out to the stream and puts them on the wire.<br /><br />The framework handles translation of primitives across languages and across processor architectures, removing that burden from the end user application. It also preserves complex object relationships across the serialization boundary.<br />At the receiving end, the incoming bytes are identified as an order object (because the type Id for the payload would be the same across all languages).<br />Once the object type is identified, the rest of the payload is streamed into the order object on the receiving end (let's say that this is the strategy engine).<br /><br />Another advantage of this mechanism is that the de-serialization of the payload from the wire (the other half of the serialization framework) results in a ready-to-use object in the language of the consuming application, which receives a notification about the change and can act on it.<br /><br />For our strategy engine, the data fabric then fires a notification and hands off the order object (now represented in C++) to the application, which then acts on it.<br /><br />Much of this work can be automated using tools that make it almost trivial to define a data model that is inherently faster and more efficient than traditional serialization mechanisms using XML marshalling, Java or C# serialization.<br /><br /><strong><em>“Delta” Propagation</em></strong>: When a new incoming FIX message is a “change request” to a pending order, the application merely fetches the Order from the fabric and applies the change to one or more fields – for example, the customer wants to 
change the “buy” or “sell” volume of the trade, which would be updating a single field in the object. Given that the fabric maintains identity for all objects in the distributed system, much like a database, it is able to calculate the exact “delta” and transmit just that to the strategy engine or to any number of connected applications listening for incoming Order requests. By dramatically cutting down on network and serialization overhead, “delta” propagation allows the system to scale much better and push much more data through than would otherwise be possible.<br /><br /><br /><a style="mso-comment-reference: at_1; mso-comment-date: 20070724T1512"><span style="font-size:130%;"><strong>Conclusion</strong></span></a><span style="font-size:130%;"><strong><br /></strong></span>XML-based interoperability works well for a wide class of applications and is very well suited to loosely coupled applications, but it isn’t well suited for <a href="http://www.gartner.com/DisplayDocument?doc_cd=131036">eXtreme Transaction processing </a>applications. Messaging systems lack enough context and hence are prone to data integrity and consistency issues.<br /><br />Distributed main-memory-based architectures, such as distributed caching or data fabric (grid) technologies, that combine the functions and semantics of database technology and reliable messaging technology may be a better fit.<br /><br />Heterogeneous applications share a common object domain model - objects published or altered by one application are shared with other applications at memory speeds. 
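To make the “delta” idea concrete, here is a minimal Java sketch in which only the changed fields travel on the wire, prefixed by a one-byte field mask. The encoding and names are hypothetical, purely for illustration - not the product's actual delta format:

```java
import java.io.*;

// Sketch of field-level "delta" propagation: only the fields that changed
// travel on the wire, prefixed by a one-byte mask saying which fields are
// present. The encoding is hypothetical, purely for illustration.
public class OrderDelta {
    static final int F_QUANTITY = 1; // bit 0: quantity changed
    static final int F_PRICE    = 2; // bit 1: limit price changed

    // Producer side: compare old and new state, emit only the difference.
    static byte[] encode(int oldQty, double oldPx, int newQty, double newPx)
            throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        int mask = 0;
        if (newQty != oldQty) mask |= F_QUANTITY;
        if (newPx != oldPx)   mask |= F_PRICE;
        out.writeByte(mask);
        if ((mask & F_QUANTITY) != 0) out.writeInt(newQty);
        if ((mask & F_PRICE) != 0)    out.writeDouble(newPx);
        return bos.toByteArray();
    }

    // Consumer side: apply the delta to a (possibly stale) local copy.
    // Returns {quantity, price} after the delta is applied.
    static double[] apply(int qty, double px, byte[] delta) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(delta));
        int mask = in.readByte();
        if ((mask & F_QUANTITY) != 0) qty = in.readInt();
        if ((mask & F_PRICE) != 0)    px = in.readDouble();
        return new double[] { qty, px };
    }

    public static void main(String[] args) throws IOException {
        // Customer changes only the volume: 1 mask byte + 4 payload bytes
        // travel instead of the whole serialized order.
        byte[] delta = encode(100, 25.50, 500, 25.50);
        System.out.println(delta.length + " bytes on the wire");
    }
}
```

Note that `apply` merges the delta into the receiver's cached copy - which is exactly why the fabric must maintain object identity: it needs the existing copy to apply the change to.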
The key to fast object sharing is the use of domain classes that are more or less equivalent across these heterogeneous applications, a native serialization protocol that can detect and dispatch object change "deltas", and an optimized, language-neutral object wire format.<br /><br /><a name="_msocom_1"></a>--------</p><p><br /> </p></div></div></div>Jags Ramnarayanhttp://www.blogger.com/profile/01088990426487508645noreply@blogger.com0tag:blogger.com,1999:blog-31429685.post-19448142159357130732007-04-09T22:15:00.000-07:002007-04-09T22:23:27.760-07:00Reliable Pub-Sub is an integral part of a Data FabricCan the data fabric (or a distributed cache with strong reliability semantics in message distribution) be used as a replacement for traditional messaging? This is a question we frequently get asked, and this has been a sweet spot for us at GemStone. Here is some rant on the rationale:<br /><br /><p>Instead of a distributed main-memory based data management solution that merely manages key-value pairs, imagine a solution that can manage objects along with relationships. In other words, one that ensures data integrity through knowledge about the entire data model. Ahh! like a database.<br />Now, combine this attribute with the ability to distribute data (replicate, partition, whatever) to many nodes reliably and provide notifications to subscribing applications. What you get is an ACTIVE, distributed data management system - a system that inherently provides reliable pub-sub along with key semantics of the traditional data management system. Applications simply update a database (objects and relationships), express interest through complex query expressions on the data model and get delivered notifications based on their interest.<br /></p><p>In a traditional messaging solution:<br /></p><ol><li>The publishing application has to explicitly construct messages, add header information for message identification, and include enough contextual information for subscribers to make sense of the message. 
Often, this contextual information takes the form of identifiers (keys that point to the real data in some database).</li><li>Most messaging solutions use hub-and-spoke mechanisms to queue and relay messages to subscribers, i.e., messages hop from the application process to some server process managed by the messaging provider and then to the receiving application. Look at many practical applications and you will find that the cost of messaging is not necessarily in the network, but rather in the CPU costs associated with (de)serialization and all the byte[] copying that goes on in each process space and in the kernel layers.</li><li>The receiving application again has to do the reverse - parse headers, deserialize the message body and then make sense of the message. To make any decision, the application often has to look up related data from other enterprise repositories such as a relational database, slowing the entire flow down to the rate of the slowest link - often the relational database query.<br /></li></ol>Now, with a data fabric, all applications share a single data model and express interest on the data model through simple, intuitive queries. The underlying fabric is constantly detecting what data and relationships are changing, and how, and simply sends notifications of the changes to the consuming application.<br /><br />Note the following advantages with a data fabric:<br /><br /><ol><li>Application processes are connected to each other in a p2p fashion with direct connections between them. This allows the fabric to avoid unnecessary network hops, dramatically reducing the latency and CPU costs associated with message transfer.</li><li>As the data is typically held in multiple locations and often replicated to the process space of the consuming application, the publisher doesn't have to send obese messages - applications merely change the data fabric and, under the hood, the right "delta" event gets propagated to the consuming applications. 
</li><li>The receiving application, when notified, can make immediate business decisions, as the contextual information it needs is cached right there.</li></ol><p>More later ....</p><p> </p>Jags Ramnarayanhttp://www.blogger.com/profile/01088990426487508645noreply@blogger.com0tag:blogger.com,1999:blog-31429685.post-1158621053835324692006-09-18T16:08:00.000-07:002006-09-18T17:53:57.543-07:00Continuous Querying Article in the August JDJ issue<blockquote>Gideon and I put together a rather simple use case to illustrate the power of continuous querying technology and reached out to Java Developer Journal. They apparently found it powerful enough - it became the front-page feature article for their August issue.<br /><br /><a href="http://java.sys-con.com/read/260054.htm">Java Feature — Building Real-Time Applications with Continuous Query Technology</a><br />— The client/server development model prevalent in the mid-1990's resulted in extremely easy-to-build rich GUI applications that interacted directly with a relational database. 4GL tools such as Visual Basic and PowerBuilder let even junior developers visually compose both the presentation and most of the backend data binding. While this made for impressive Rapid Application Development (RAD) productivity, the client/server architecture was severely challenged when dealing with real-time environments where the data changes rapidly and applications require visibility to the correct data at all times. 
As a result, client applications were forced to poll the database continuously to check for changes.<br />......<br /><br /></blockquote>Jags Ramnarayanhttp://www.blogger.com/profile/01088990426487508645noreply@blogger.com0tag:blogger.com,1999:blog-31429685.post-1154405502819649802006-07-31T21:02:00.000-07:002006-07-31T21:53:53.526-07:00SOA n GRID synergistic?<p class="MsoNormal">A colleague of mine brought this SOA vs GRID <a href="http://www.it-analysis.com/business/innovation/content.php?cid=8638">article </a>to my attention .... ' provides some interesting data on this topic and debunks certain popularly held beliefs..'<br /><br /><br /><br />---------<br />Here is my take on the SOA-GRID synergy, or lack thereof, in practice, followed by a discussion on where and whether a distributed cache can fit in such architectures …..<br /><br />Clearly, the whole business of service orientation is based on the fact that one discovers a service based on the desired operation and other QoS considerations before binding. Service clients are loosely coupled to service providers by definition.<br />In the case where service clients come and go, say with a portal service that aggregates information from 10 other services, each request could be routed through an intermediary that isolates the client from directly connecting to the server. This could be the UDDI registry, for instance.<br /><br />You agree with this, ya? Now, the question is, in what percentage of such SOA architectures will one use a dynamic provisioning service (loosely, a GRID)? i.e. every time I use the discovery service it may point me to a different server to get the request fulfilled. In many cases, the expected load might be very predictable, or the service provider has to run on a specific platform that is my legacy and cannot be deployed into a utility computing center. Folks talk about high availability and hence the need for the Grid, but, I think, that is just bull. 
All SOA architectures will use a JEE or .Net façade and inherently be HA. So, basically, there is no case for a Grid in this situation.<br /><br />But the more you look into the future, the more likely it is that I want my services deployed on low-cost commodity hardware and bound by contracts on QoS (availability and performance), forcing me to think about how my services can constantly be monitored and dynamically reprovisioned on the fly. The power of the Grid - well, actually a sophisticated provisioning and virtualization engine - would become a necessity. I am not talking about a typical Grid solution, aka a compute grid - a scheduling engine.<br /><br />In any case, as the who's who within the OGSA community would say, there is natural synergy between SOA and GRID. SOA is about how to connect services together to realize higher value (integration) and GRID is merely a deployment strategy for such services.<br /><br />I thought IBM was already on track to deliver on this promise within their WebSphere platform, circumventing the need for DS or EGO. Ahh! Who am I kidding?<br />Check out Oracle's fusion strategy and how automatic provisioning and virtualization is integrated into 10G. Quite impressive.<br /><br />Now, let me shift gears and see if and where a distributed caching solution fits in a SOA architecture .....<br /><br />Where do databases and messaging solutions fit in? Just SOA architectures? Only in Compute Grid apps? Duh! everywhere. And, that is my position.<br />OK, here is the hiccup - again, SOA, by definition, is all about getting apps (what do we call these now … yeah! Services) talking to one another in a loosely coupled fashion. The service provider can change the underlying technology and behaviour any way it wants and remain isolated from other services. I just have to make sure my contract is still intact. So, this cache thing breaks this, one could argue. 
The cynic would say 'I want an ESB that uses XML messaging, not a bloody cache'.<br />Well, gentlemen, I got news for you. I think all this loose coupling stuff is baloney. Well, it holds water to a point, but no more.<br /><br />The fact of the matter is, in real life, application models and, yes, data models change and undergo significant enhancements. XML or no XML, your apps, if they need to talk to one another, have to change.<br /><br />You want to solve real problems and create a highly scalable, performant SOA architecture - use a memory-based data fabric - an ESB designed for data-intensive environments. You still want to use XML - shove data into the Enterprise Service Fabric (ESF) as XML. You don't like using APIs, but rather prefer the slippery SOAP (JAXM) route - use the web services interface.<br /><br />And, now, fire storms have erupted in California ….</p>Jags Ramnarayanhttp://www.blogger.com/profile/01088990426487508645noreply@blogger.com0tag:blogger.com,1999:blog-31429685.post-1153442554531878792006-07-20T17:37:00.000-07:002006-07-20T21:02:18.273-07:00Introducing Distributed Data Fabric for the middle tier<p class="MsoNormal" style=""><span style=";font-family:Verdana;font-size:100%;" >You should <a href="http://jagslog.blogspot.com/2006/07/one-size-fits-all-relational-database.html">read the prior post</a> first, tell me what you agree with and where you disagree .....<br /></span></p><p class="MsoNormal" style=""><span style=";font-family:Verdana;font-size:100%;" ><br /></span></p><p class="MsoNormal" style="font-weight: bold;"><span style=";font-family:Verdana;font-size:100%;" >What is the GemFire Enterprise Data Fabric?<o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >GemFire Enterprise Data Fabric is a high performance, distributed operational data management infrastructure that sits 
between your clustered application processes and back-end data sources to provide very low latency, predictable, high throughput data sharing and event distribution.<o:p></o:p></span></p><p class="MsoNormal"><span class="Heading2Char" style="font-size:100%;"><span style="font-family:Verdana;"><br /></span></span></p><p class="MsoNormal"><span class="Heading2Char" style="font-size:100%;"><span style="font-family:Verdana;">It is about operational data management </span></span><span style=";font-family:Verdana;font-size:100%;" >– </span><span style=";font-family:Verdana;font-size:100%;" >Unlike a Data warehousing system where terabytes (or petabytes) of data is consolidated from multiple databases for offline data analysis, the EDF is a real-time data sharing facility specifically optimized for working with operational data needed by real-time applications – it is the “now” data, the fast moving data shared across many processes and applications. It is a layer of abstraction in the middle tier that collocates frequently used data with the application and works with backend databases behind the scenes.<o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" ><o:p> </o:p></span></p> <p class="MsoNormal"><span style="font-weight: bold;font-size:100%;" class="Heading2Char" ><span style="font-family:Verdana;">Distributed Data Caching </span></span><span style=";font-family:Verdana;font-size:100%;" >– </span><span style=";font-family:Verdana;font-size:100%;" >the most important characteristic of the GemFire Data Fabric is that it is fast – many times faster than the traditional disk based database management system, because it is primarily main-memory based. Its engine harnesses the memory and disk across many clustered machines for unprecedented data access rates and scalability. 
It utilizes highly concurrent main-memory data structures to avoid lock contention, and a data distribution layer that avoids redundant message copying and uses native serialization and smart buffering to ensure messages move from node to node faster than traditional messaging would provide. <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >It does this without compromising the availability or consistency of data – a configurable policy dictates the number of redundant memory copies for various data types and whether data is stored synchronously or asynchronously on disk, and a variety of failure detection models built into the distribution system ensure data correctness.<o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" ><o:p> </o:p></span></p> <p class="MsoNormal"><span style="font-weight: bold;font-size:100%;" class="Heading2Char" ><span style="font-family:Verdana;">Key Database semantics are retained</span></span><span style=";font-family:Verdana;font-size:100%;" ><span style="font-weight: bold;"> </span>– </span><span style=";font-family:Verdana;font-size:100%;" >simple distributed caching solutions provide caching of serialized objects – simple key-value pairs managed in Hashmaps that can be replicated to your cluster nodes. GemFire provides support for multiple data models across multiple popular languages – data can be managed as Java or C++ objects natively, as native XML documents or in SQL tables. <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >Similar to a Database management system, distributed data in GemFire can be managed in transactions, queried upon, persistently stored and recovered from disk. 
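Purely as an illustration of this database-style access, here is a sketch of ad hoc queries over objects held in a concurrent in-memory map, the way a fabric "region" might be filtered. The product's own query interface is SQL/OQL-style; the plain Java predicates below are stand-ins:

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative sketch: ad hoc queries over objects held in a concurrent
// in-memory map, the way a fabric "region" might be filtered. The product's
// own query interface is SQL/OQL-style; plain Java predicates stand in here.
public class RegionQuery {
    static class Position {
        final String symbol; final int shares; final double price;
        Position(String symbol, int shares, double price) {
            this.symbol = symbol; this.shares = shares; this.price = price;
        }
    }

    static final Map<String, Position> region = new ConcurrentHashMap<>();

    // Roughly: SELECT * FROM region WHERE <predicate>
    static List<Position> select(Predicate<Position> where) {
        return region.values().stream().filter(where).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        region.put("p1", new Position("VMW", 500, 98.0));
        region.put("p2", new Position("IBM", 200, 180.0));
        region.put("p3", new Position("VMW", 50, 97.5));
        List<Position> hits = select(p -> p.symbol.equals("VMW") && p.shares > 100);
        System.out.println(hits.size() + " matching position(s)");
    }
}
```

Because the data lives in memory, the filter runs at memory speed with no disk buffer management in the path.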
<o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >Unlike a relational database management system, where all updates are persisted and transactional in nature (ACID), GemFire relaxes the constraints, allowing applications to control when and for what kind of data they need full ACID characteristics. <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >For instance, for a very high performance financial services application distributing price updates, what matters most is distribution latency – there is no need for transactional isolation.<o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >The end result is a data management system that spends fewer CPU cycles managing data and offers higher performance.<o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" ><o:p> </o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" ><o:p> </o:p></span></p> <h2 style="margin-left: 0in; text-indent: 0in;"><span style=";font-family:Verdana;font-size:100%;" >Continuous Analytics<o:p></o:p></span></h2> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >With data in the fabric changing rapidly as it is updated by many processes and external data sources, it is important for real-time applications to be notified when events of interest are generated in the fabric. Something a messaging platform is quite suited to do. GemFire data fabric takes this to the next level – applications can now register complex patterns of interest, expressed through SQL queries; queries that run continuously. Unlike a database system where queries have to be executed on resident data, here data (or events) is continuously evaluated by a query engine that is aware of the interest expressed by hundreds of distributed client processes. 
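A toy version of such a standing-query engine can be sketched as follows. It captures only the inversion - queries are registered once and every event is evaluated against them - and uses Java predicates as stand-ins for the SQL-like query language the product actually exposes:

```java
import java.util.*;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Toy continuous-query engine: each registered query is a standing predicate;
// every incoming event is evaluated against all registered interests and
// matching subscribers are notified. Illustrative only - the real product
// expresses interest through an SQL-like query language, not Java predicates.
public class ContinuousQueryEngine<E> {
    static final class CQ<T> {
        final Predicate<T> where;
        final Consumer<T> onMatch;
        CQ(Predicate<T> where, Consumer<T> onMatch) {
            this.where = where; this.onMatch = onMatch;
        }
    }

    private final List<CQ<E>> queries = new CopyOnWriteArrayList<>();

    // Register a standing query: the predicate stays active for all
    // future events, unlike a one-shot query over resident data.
    public void register(Predicate<E> where, Consumer<E> onMatch) {
        queries.add(new CQ<>(where, onMatch));
    }

    // Called once per incoming event; evaluates it against every standing query.
    public void publish(E event) {
        for (CQ<E> q : queries)
            if (q.where.test(event)) q.onMatch.accept(event);
    }
}
```

The inversion is the point: instead of applications polling with repeated queries, the event stream flows through the registered interests.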
<o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" ><o:p> </o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" ><o:p> </o:p></span></p> <h2 style="margin-left: 0in; text-indent: 0in;"><span style=";font-family:Verdana;font-size:100%;" >Reliable messaging and routing <o:p></o:p></span></h2> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >When using a messaging platform, application developers expect reliable and guaranteed Publish-Subscribe semantics. The system has knowledge about active or durable subscribers and provides different levels of message delivery guarantees to subscribers. GemFire EDF incorporates these messaging features on top of what looks like a database to the developer. <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >Unlike traditional messaging where applications have to deal with piecemeal messages, message construction, incorporating contextual information in messages, managing data consistency across publishers and subscribers, GemFire enables a more intuitive approach - one where applications simply deal with a data model (Object or SQL), subscribe to portions of the data model and publishers make updates to the business objects or relationships. Subscribers are simply notified on the changes to the underlying distributed database. 
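The "update the data model, notification is implicit" contract can be sketched as a keyed region with change listeners. The names here are hypothetical, not the product's API, and real delivery guarantees (durable subscribers, ordering, redundancy) are omitted:

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

// Sketch of the "subscribe to the data model" idea: publishers just put or
// update entries in a keyed region; subscribers receive change notifications.
// Names are illustrative, not the product's API; reliability semantics omitted.
public class DataRegion<K, V> {
    private final Map<K, V> data = new ConcurrentHashMap<>();
    private final List<BiConsumer<K, V>> subscribers = new CopyOnWriteArrayList<>();

    public void subscribe(BiConsumer<K, V> onChange) {
        subscribers.add(onChange);
    }

    // A publisher's only job is to update the data model; no message
    // construction, no headers - notification happens underneath.
    public void put(K key, V value) {
        data.put(key, value);
        for (BiConsumer<K, V> s : subscribers) s.accept(key, value);
    }

    public V get(K key) {
        return data.get(key);
    }
}
```

Because the subscriber is handed the key and new value directly, the contextual state it needs is already in the region - no side trip to a database to interpret the message.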
<o:p></o:p></span></p> <div style="border-style: none none solid; padding: 0in 0in 1pt;"> <p class="MsoNormal" style="border: medium none ; padding: 0in;"><span style=";font-family:Verdana;font-size:100%;" ><o:p> </o:p></span></p> </div> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" ><o:p> </o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" ><o:p> </o:p></span></p> <h1 style="margin-left: 0in; text-indent: 0in;"><span style=";font-family:Verdana;font-size:100%;" >What makes GemFire EDF unique?<o:p></o:p></span></h1> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" ><o:p> </o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >For the last two decades or so, relational database management systems have taken a "kitchen sink" approach trying to solve any problem associated with data management by bundling this as part of the database engine.<o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" > <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >Relational databases are centralized and passive in nature. They do a good job of managing data securely, correctly and persistently, but do not actively push the data to applications that might be interested, now. Second, databases are designed to optimize access to disk and to guarantee the transactional properties at all times. This limits the speed and scalability of a database engine in a highly distributed environment. <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >Compare this to a data environment where data storage structures are highly optimized for management in memory and concurrent access. 
To notify applications instantaneously, GemFire immediately routes data to the right node through a data distribution layer that is designed to reduce contention points and avoid unnecessary copies of messages before being transported.<o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" > <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >Messaging solutions are most suited for very loosely coupled applications. Though this has its benefits, applications are left with the tough job of managing contextual information to make decisions, often requiring round trips to a database. This eliminates any performance advantages that applications can derive from messaging. <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >Besides this, often, the asynchronous nature of messages can also result in inconsistencies – the contextual information in the database may not reflect the correct state when the message is received.<o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" > <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" > <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >GemFire provides an operational data infrastructure that brings data and events into one distributed platform – applications can focus on what matters most – operate on business objects and relationships. Interested applications are immediately notified as and when the data model changes. 
Data is co-located and accessible at memory speeds and data correctness is always ensured.<o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" > <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" > <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >Modern day Event Driven Architectures require applications to react to events being pushed at very high rates from multiple streaming data sources, aggregate this data with other slow moving data managed in databases and distribute data and events to many application processes. <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >Traditional centralized databases simply are not designed to handle this mounting onslaught – what you need is a distributed memory based architecture that can analyze the incoming stream data, combine this with related information and present a consistent and correct data model to the application. <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" >What makes GemFire unique, is this ability to not just analyze fast moving data, but the ability to present a data model (like a database) and route data/events to applications with guaranteed reliability (the semantics of reliable messaging).<o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" > <o:p></o:p></span></p> <p class="MsoNormal"><span style=";font-family:Verdana;font-size:100%;" ><span style="font-weight: bold;">Bottomline</span>: Time has come for a middle tier data management layer to manage your operational data and events to enable a new generation of real-time applications with QoS guarantees on performance, continuous availability and scalability. 
You want to be able to do this while retaining your investments in existing databases.<o:p></o:p></span></p> <span style=";font-family:Verdana;font-size:100%;" ></span>Jags Ramnarayanhttp://www.blogger.com/profile/01088990426487508645noreply@blogger.com0tag:blogger.com,1999:blog-31429685.post-1153442110739620842006-07-20T17:27:00.000-07:002006-07-20T17:35:10.743-07:00"One size fits all" Relational Database is dead. What say u?OK, this is my first blog post.<br />I will mostly blog on my professional life, and it has a lot to do with high performance distributed computing. Especially main-memory based distributed data management systems.<br /><br />Let me begin by taking a look at the traditional relational database.<br /><br />For the last two decades or so, major Database vendors have taken a "kitchen sink" approach, trying to solve any problem associated with data management by bundling it as part of the database engine. Don't take my word on this. Here is how Jim Gray put it: "We live in a time of extreme change, much of it precipitated by an avalanche of information that otherwise threatens to swallow us whole. Under the mounting onslaught, our traditional relational database constructs—always cumbersome at best—are now clearly at risk of collapsing altogether." Check out this <a href="http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=293&page=1.">article </a>. Adam Bosworth also has some interesting comments in his blog "<a href="http://www.adambosworth.net/archives/000038.html">Where have all the good databases gone</a>"<br /><br />Alright! What specifically am I talking about?<br /><br />Consider this for starters:<br /><br />Take a portal - today you want to build scalable systems using a clustered architecture where you can keep adding processing nodes in the middle tier so you can linearly scale as the number of users keeps going up.<br /><br />Now, I don't have to tell you how important availability is for folks like these. 
Always up, always predictable in terms of performance and responsiveness. Enormous attention has been paid to making the middle tier highly available. But, alas, when it comes to the backend data sources, it is left up to the DB vendor.<br /><br />The traditional database is built to do one thing very well - do a darn good job of managing data securely, correctly and persistently on DISK. It is centralized by design. Everything is ACID. Ensuring high availability when you are constrained by the strong consistency rules (everything to disk) is very tough to manage. Replicate a database so you can fail over to the replica during failure conditions, and you are all of a sudden left with random inconsistencies to deal with. Try to replicate synchronously, and you pay a huge price in terms of performance. Want to provide dramatic scalability through many replicated databases? You had better be ready to live with a very compromised data consistency model.<br /><br />Ahh! so, something like Oracle RAC is the answer, one might argue? Yes, for a number of use cases. But, here is what one has to consider:<br />1) You are still dealing with a disk-centric architecture where all disks are shared and each DB process instance has equal access to all disks used to manage the tablespaces.<br />Here are a few important points worth noting:<br /><br /><ul><li> The unit of data movement in the shared buffer cache across the cluster happens to be a logical data block, which is typically a multiple of 8KB. The design is primarily optimized to make disk IO very efficient. AFAIK, even when requesting or updating a single row, the complete block has to be transferred. Compare this to an alternative offering (a distributed main-memory object management system) where the typical unit of data transfer is an object entry. 
In an update-heavy and latency-sensitive environment, the unit could actually be just the "delta" - the exact change.</li></ul><br /><ul><li>In RAC, if accessing a database block of any class does not locate a buffered copy in the local cache, a global cache operation is initiated. Before reading a block from disk, an attempt is made to find the block in the buffer cache of another instance. If the block is in another instance, a version of the block may be shipped. Again, there is no notion of a node being responsible for a block; there could be many copies of the block depending on how the query engine parallelized the data processing. This will be a problem if the update rate is high - imagine distributed locks on coarse-grained data blocks in a cluster of 100 nodes, for every single update.</li></ul><ul><li> At the end of the day, RAC is designed for everything to be durable to disk, and it requires careful thought and planning around a high-speed private interconnect for buffer cache management and a very efficient cluster file system. </li></ul><br />Yes, there have been tremendous improvements to the relational database, but it may not be the answer for all our data management needs.<br /><br />Consider this - how many enterprise-class apps are being built that depend on just one database? Amazon's CTO, Werner Vogels, talks about how 100 different backend services are engaged just to construct a single page that you and I see. Talk to an investment bank: events streaming in at mind-boggling speeds have to be acted on not by one process but by potentially hundreds of processes, instantaneously. Everything needs to talk to everything else in real time. Consistent, correct data has to be shared across many processes built using heterogeneous languages and connected to heterogeneous data sources, all in real time. Ahh! 
what do we call this? SOA, ESB, etc.<br /><br />You can come up with a massive loosely coupled architecture for connecting everything with everything, but you have to pay extra attention to how you will manage data efficiently in this massively distributed environment.<br /><br />Isn't it time for a middle-tier, high performance data management layer that can capitalize on the abundant main memory available on cheap commodity hardware today? - a piece of middleware technology that can draw data from a multitude of data sources and move data between different applications with tight control on correctness and availability, allowing all real-time operations to occur with a high and predictable Quality of Service.<br /><br />Is this a distributed cache? A messaging solution?<br /><br />OK, let me sell you what we do ....Jags Ramnarayanhttp://www.blogger.com/profile/01088990426487508645noreply@blogger.com0tag:blogger.com,1999:blog-31429685.post-1153441579939907772006-07-20T17:19:00.000-07:002007-04-09T22:29:27.854-07:00Finally, I am bloggingBoy! this blog thing is new to me. Honestly, I have never blogged, nor have I closely watched any blog. This nearly-middle-aged boy is "old school". Well, about me .... I do a lot of architecture for <a href="http://www.gemstone.com">GemStone </a> on paper :-) as their Chief Architect (<a href="http://www.gemstone.com">www.gemstone.com</a>). But, to my credit, I have been doing distributed data management stuff for, well... too long. I guess <a href="http://www.google.com/search?hl=en&q=jags+ramnarayan&btnG=Google+Search">googling me</a> might give you a better idea ...<br /><br />I guess I am a laggard when it comes to blogging. Better now than never. It is all about the mindshare.<br /><br />Serious bloggers, any advice to share?