Technical rants on distributed computing, high performance data management, etc. You are warned! A lot will be shameless promotion for VMware products

Monday, April 09, 2007

Reliable Pub-Sub is an integral part of a Data Fabric

Can the data fabric (or a distributed cache with strong reliability semantics in message distribution) be used as a replacement for traditional messaging? This is a question we frequently get asked, and it has been a sweet spot for us at GemStone. Here is a rant on the rationale:

Instead of a distributed, main-memory based data management solution that merely manages key-value pairs, imagine one that can manage objects along with their relationships - in other words, one that ensures data integrity through knowledge of the entire data model. Ahh! Like a database.
Now, combine this attribute with the ability to distribute data (replicate, partition, whatever) to many nodes reliably, and to provide notifications to subscribing applications. What you get is an ACTIVE, distributed data management system - a system that inherently provides reliable pub-sub along with the key semantics of a traditional data management system. Applications simply update the database (objects and relationships), express interest through complex query expressions on the data model, and get delivered notifications based on that interest.
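That flow - update the data, express interest as a query, receive notifications - can be sketched in a few lines of Java. This is a minimal, illustrative in-memory model; the `Fabric` class and its `subscribe`/`put` methods are hypothetical names, not the actual GemStone API:

```java
import java.util.*;
import java.util.function.*;

// Minimal in-memory sketch of query-based subscription on a data fabric.
// All names here are illustrative, not a real product's API.
class Fabric<K, V> {
    private final Map<K, V> data = new HashMap<>();
    private final List<Map.Entry<Predicate<V>, BiConsumer<K, V>>> subs = new ArrayList<>();

    // Applications register interest as a predicate over the data model.
    void subscribe(Predicate<V> interest, BiConsumer<K, V> onChange) {
        subs.add(Map.entry(interest, onChange));
    }

    // Publishers just update the data; matching subscribers get notified.
    void put(K key, V value) {
        data.put(key, value);
        for (var s : subs)
            if (s.getKey().test(value)) s.getValue().accept(key, value);
    }
}

record Trade(String symbol, double price) {}

public class FabricDemo {
    public static void main(String[] args) {
        Fabric<String, Trade> fabric = new Fabric<>();
        List<Trade> seen = new ArrayList<>();
        // "all trades with price > 100" expressed as a predicate (a stand-in
        // for a richer query expression on the data model).
        fabric.subscribe(t -> t.price() > 100, (k, t) -> seen.add(t));
        fabric.put("t1", new Trade("AAA", 95.0));   // below threshold: no event
        fabric.put("t2", new Trade("BBB", 120.0));  // matches: subscriber notified
        System.out.println(seen.size());            // prints 1
    }
}
```

Note that the publisher never constructs a message: it just changes the data, and the fabric derives the events.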

In a traditional messaging solution:

  1. The publishing application has to explicitly construct messages, add header information for message identification, and include enough contextual information for subscribers to make sense of the message. Often, this contextual information takes the form of identifiers (keys that point to the real data in some database).
  2. Most messaging solutions use hub-and-spoke mechanisms to queue and relay messages to subscribers, i.e. messages hop from the application process to some server process managed by the messaging provider and then to the receiving application. Look at many practical applications and you will find that the cost of messaging is not necessarily in the network but rather in the CPU costs associated with (de)serialization and all the byte[] copying that goes on in each process space and in the kernel layers.
  3. The receiving application has to do the reverse - parse headers, deserialize the message body, and then make sense of the message. To make any decision, the application often has to look up related data from other enterprise repositories such as a relational database, slowing the entire flow down to the rate of its slowest link - often the relational database query.
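Points 1 and 3 above can be made concrete with a caricature of the per-message work in a traditional pipeline. Everything here (the wire format, the `database` map) is hypothetical, just to show the pattern: the message carries only a key, so every receiver must deserialize it and then re-fetch the real data:

```java
import java.util.*;

// Caricature of the traditional flow: the message carries only an ID,
// so every receiver must re-fetch the real data from a shared store.
public class TraditionalFlow {
    // Stand-in for the enterprise database the message keys point into.
    static Map<String, String> database = Map.of("order:42", "widget x100");

    // Publisher: build a header plus key-only payload and serialize it.
    static byte[] publish(String orderId) {
        String wire = "type=OrderUpdated;id=" + orderId;   // header + context key
        return wire.getBytes();                            // serialization copy
    }

    // Receiver: parse the header, extract the key, then hit the database.
    static String receive(byte[] msg) {
        String wire = new String(msg);                     // deserialization copy
        String id = wire.substring(wire.indexOf("id=") + 3);
        return database.get(id);                           // the slow hop: the DB lookup
    }

    public static void main(String[] args) {
        System.out.println(receive(publish("order:42")));  // prints widget x100
    }
}
```

Every message pays the serialize/copy/deserialize tax twice, plus a database round trip, before the receiver can act.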
Now, with a data fabric, all applications share a single data model and express interest in it through simple, intuitive queries. The underlying fabric constantly detects how data and relationships are changing and simply sends notifications of those changes to the consuming applications.

Note the following advantages with a data fabric:

  1. Application processes are connected to each other in a p2p fashion, with direct connections between them. This allows the fabric to avoid unnecessary network hops, dramatically reducing the latency and CPU costs associated with message transfer.
  2. As the data is typically held in multiple locations, and often replicated into the process space of the consuming application, the publisher doesn't have to send obese messages - applications merely change the data fabric, and under the hood the right "delta" event gets propagated to the consuming applications.
  3. When notified, the receiving application can make business decisions immediately, as the contextual information it needs is cached right there.
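The "delta" idea in point 2 can be sketched as follows. This is an illustrative model, not any product's actual wire format or API: only the changed field travels, and the consumer applies it to its locally replicated copy, so the rest of the object never crosses the network again:

```java
import java.util.*;

// Sketch of delta propagation: the fabric ships only the changed field,
// and each consumer applies it to a locally replicated copy.
public class DeltaDemo {
    // A consumer's local cache, replicated from the fabric; each entry is
    // an object modeled here as a field-name -> value map.
    static Map<String, Map<String, Object>> localCache = new HashMap<>();

    // Apply an incoming delta event: (key, changed field, new value).
    static void applyDelta(String key, String field, Object value) {
        localCache.computeIfAbsent(key, k -> new HashMap<>()).put(field, value);
    }

    public static void main(String[] args) {
        // Initial replicated state: a full position object.
        localCache.put("pos:7", new HashMap<>(Map.of("symbol", "ABC", "qty", 100)));
        // The publisher changed only qty; the fabric propagates just that field.
        applyDelta("pos:7", "qty", 150);
        System.out.println(localCache.get("pos:7")); // qty updated in place, symbol untouched
    }
}
```

Compare this with the key-only message in the earlier caricature: here the receiver never goes back to a database, because the context was already replicated to it.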

More later ....