Technical rants on distributed computing, high performance data management, etc. You are warned! A lot will be shameless promotion for VMWare products

Monday, September 18, 2006

Continuous Querying Article on August JDJ issue

Myself and gideon puttogether a rather simple use case to illustrae the power of the continuous querying technology and reached out to Java Developer Journal. They apparantly found this to be powerful enough - it became the front page feature articel for their August issue.

Java Feature — Building Real-Time Applications with Continuous Query Technology
— The client/server development model prevalent in the mid-1990's resulted in extremely easy-to-build rich GUI applications that interacted directly with a relational database. 4GL tools such as Visual Basic and PowerBuilder let even junior developers visually compose both the presentation and most of the backend data binding. While this made for impressive Rapid Application Development (RAD) productivity, the client/server architecture was severely challenged when dealing with real-time environments where the data changes rapidly and applications require visibility to the correct data at all times. As a result, client applications were forced to poll the database continuously to check for changes.
......

Monday, July 31, 2006

SOA n GRID synergistic?

A colleague of mine brought this SOA vs GRID aricle to my attention .... ' provides some interesting data on this topic and debunks certain popularly held beliefs..'



---------
Here is my take on the SOA-GRID synergy or lack thereof in practice followed by a discussion on where and if a distributed cache can fit in such architectures …..

Clearly the whole business of service orientation is based on the fact that one discovers a service based on desired operation and other QoS considerations, before binding. Service clients are loosely coupled to service providers by definition.
In the case where service clients come and go, say, like with a portal service that aggregates information from 10 other services, each request could be routed through a intermediary that isolates client from directly connecting to the server. This could be the UDDI registry for instance.

You agree with this, ya? Now, the question is, in what percentage of such SOA architectures will one use a dynamic provisioning service (loosely a GRID)? i.e. every time I use the discovery service it may point me to a different server to get the request fulfilled. In many cases, the expected load might be very predictable or the service provider has to run on specific platform that is my legacy and cannot be deployed into a utility computing center. Folks talk about high availability and hence the need for the Grid, but, I think, that is just bull. All SOA architectures will use a JEE or .Net fa├žade and inherently be HA. So, basically, there is no case for a Grid in this situation.

But, more you look into the future, more likely I want my services to deployed using low cost commodity hardware, be bound by contracts on the QoS (availability and performance), forcing me to think about how my services can constantly be monitored and dynamically reprovisioned on the fly. The power of the Grid - well, actually a sophisticated provisioning and virtualization engine would become a necessity. I am not talking about a typical Grid solution, aka a compute grid - a scheduling engine.

In any case, like the who's who within the OGSA community would say, there is natural synergy between SOA and GRID. SOA is about how to connect services together to realize higher value (integration) and GRID is merely a deployment strategy for such services.

I thought, IBM was already on track to deliver on this promise within their WebSphere platform, circumventing the need for DS or EGO. Ahh! Who am I kidding?
Checkout Oracle's fusion strategy and how automatic provisioning and virtualization is integrated into 10G. Quite impressive.

Now, let me shift gear and see if and where a distributed caching solution fits in a SOA architecture .....

Where do databases and messaging solutions fit in? Just SOA architectures? Only in Compute Grid apps? Duh! everywhere. And, that is my position.
OK, here is the hiccup - again, SOA, by definition is all about getting apps (what do call this now … yeah! Services) talking to one another in a loosely coupled fashion. The service provider can change the underlying technology and behaviour any way it want and remains isolated from other services. I just have to make sure my contract is still intact. So, this cache thing breaks this, one could argue. The cynic would say 'I want a ESB bus that uses XML messaging not a bloody cache'.
Well, gentlemen, I got news for you. I think, all this loosely coupling stuff is baloney. Well, it holds water to a point, but no more.

The fact of the matter is, in real life, application models and yes, data models change and undergo significant enhancements. XML or no XML, your apps, if they need to talk to one another have to change.

You want to solve real problems, create a highly scalable, performant SOA architecture - use a memory based data fabric - An ESB designed for data intensive environments. You still want to use XML - shove data into the Enterprise Service Fabric (ESF) as XML. You don't like using APIs, but rather prefer the slippery SOAP (JAXM) route, use the web services interface.

And, now, fire storms have erupted in california ….

Thursday, July 20, 2006

Introducing Distributed Data Fabric for the middle tier

You should read the prior post first, tell me what you agree with and where you disagree .....


What is the GemFire Enterprise Data Fabric?

GemFire Enterprise Data Fabric is a high performance, distributed operational data management infrastructure that sits between your clustered application processes and back-end data sources to provide very low latency, predictable, high throughput data sharing and event distribution.


It is about operational data management Unlike a Data warehousing system where terabytes (or petabytes) of data is consolidated from multiple databases for offline data analysis, the EDF is a real-time data sharing facility specifically optimized for working with operational data needed by real-time applications – it is the “now” data, the fast moving data shared across many processes and applications. It is a layer of abstraction in the middle tier that collocates frequently used data with the application and works with backend databases behind the scenes.

Distributed Data Caching the most important characteristic of the GemFire Data Fabric is that it is fast – many times faster than the traditional disk based database management system, because it is primarily main-memory based. Its engine harnesses the memory and disk across many clustered machines for unprecedented data access rates and scalability. It utilizes highly concurrent main-memory data structures to avoid lock contention and a data distribution layer that avoids redundant message copying, native serialization and smart buffering to ensure messages move from node to node faster than what traditional messaging would provide.

It does this without compromising the availability or consistency of data – a configurable policy dictate the number of redundant memory copies for various data types, storing data synchronously or asynchronously on disk and uses a variety of failure detection models built into the distribution system to ensure data correctness.

Key Database semantics are retained simple distributed caching solutions provide caching of serialized objects – simple key-value pairs managed in Hashmaps that can be replicated to your cluster nodes. GemFire, provides support for multiple data models across multiple popular languages – data can be managed as Java or C++ objects natively, native XML documents or in SQL tables.

Similar to a Database management system, distributed data in GemFire can be managed in transactions, queried upon, persistently stored and recovered from disk.

Unlike a relational database management system, where all updates are persisted and transactional in nature (ACID), GemFire relaxes the constraints allowing applications to control when and for what kind of data you need total ACID (provide link) characteristics.

For instance, a very high performance financial services application trying to get price updates distributed what is most important is the distribution latency – there is no need for transactional isolation.

The end result is a data management system that spends fewer CPU cycles for managing data and offering higher performance.

Continuous Analytics

With data in the fabric changing rapidly as it is updated by many processes and external data sources it is important for real-time applications to be notified when events of interest are being generated in the fabric. Something a messaging platform is quite suited to do. GemFire data fabric takes this to the next level – applications can now register complex patterns of interest, expressed through SQL queries; Queries that are continuously running. Unlike a database system where queries have to be executed on resident data, here data (or events) is continuously evaluated by a query engine that is aware of the interest expressed by hundreds of distributed client processes.

Reliable messaging and routing

When using a messaging platform, application developers expect reliable and guaranteed Publish-Subscribe semantics. The system has knowledge about active or durable subscribers and provides different levels of message delivery guarantees to subscribers. GemFire EDF incorporates these messaging features on top of what looks like a database to the developer.

Unlike traditional messaging where applications have to deal with piecemeal messages, message construction, incorporating contextual information in messages, managing data consistency across publishers and subscribers, GemFire enables a more intuitive approach - one where applications simply deal with a data model (Object or SQL), subscribe to portions of the data model and publishers make updates to the business objects or relationships. Subscribers are simply notified on the changes to the underlying distributed database.

What makes GemFire EDF unique ?

For the last two decades or so, relational database management systems have taken a "kitchen sink" approach trying to solve any problem associated with data management by bundling this as part of the database engine.

Relational databases are centralized and passive in nature. It does a good job in managing data securely, correctly and persistently, but does not actively push the data to applications that might be interested, now. Second, databases are designed to optimize access to disk and to guarantee the transactional properties at all times. This limits the speed and scalability of a database engine in a highly distributed environment.

Compare this to a data environment where data storage structures are highly optimized for management in memory and concurrent access. To notify applications instantaneously, GemFire immediately routes data to the right node through a data distribution layer that is designed to reduce contention points and avoid unnecessary copies of messages before being transported.

Messaging solutions are most suited for very loosely coupled applications. Though this has its benefits, applications are left with the tough job of managing contextual information to make decisions, often requiring round trips to a database. This eliminates any performance advantages that applications can derive from messaging.

Besides this, often, the asynchronous nature of messages can also result in inconsistencies – the contextual information in the database may not reflect the correct state when the message is received.

GemFire provides an operational data infrastructure that brings data and events into one distributed platform – applications can focus on what matters most – operate on business objects and relationships. Interested applications are immediately notified as and when the data model changes. Data is co-located and accessible at memory speeds and data correctness is always ensured.

Modern day Event Driven Architectures require applications to react to events being pushed at very high rates from multiple streaming data sources, aggregate this data with other slow moving data managed in databases and distribute data and events to many application processes.

Traditional centralized databases simply are not designed to handle this mounting onslaught – what you need is a distributed memory based architecture that can analyze the incoming stream data, combine this with related information and present a consistent and correct data model to the application.

What makes GemFire unique, is this ability to not just analyze fast moving data, but the ability to present a data model (like a database) and route data/events to applications with guaranteed reliability (the semantics of reliable messaging).

Bottomline: Time has come for a middle tier data management layer to manage your operational data and events to enable a new generation of real-time applications with QoS guarantees on performance, continuous availability and scalability. You want to be able to do this, while retaining your investments in existing databases.

"One size fits all" Relational Database is dead. What say u?

OK, this is my first blog.
I will mostly blog on my professional life and it has a lot to do with high performance distributed computing. Especially, main-memory based distributed data management systems.

Let me begin by taking a look at the traditional relational database.

For the last two decades or so, major Database vendors have taken a "kitchen sink" approach trying to solve any problem associated with data management by bundling this as part of the database engine. Don't take my word on this. Here is how Jim gray's put this "We live in a time of extreme change, much of it precipitated by an avalanche of information that otherwise threatens to swallow us whole. Under the mounting onslaught, our traditional relational database constructs—always cumbersome at best—are now clearly at risk of collapsing altogether" Checkout this article . Adam Bosworth also has some interesting comments in his blog "Where have all the good databases gone"

Alright! What specifically I am talking about?

Consider this for starters:

Take a portal - today you want to build scalable systems using a clustered architecture where you can keep adding processing nodes in the middle tier so you can linearly scale as the number of users keeps going up.

Now, I don't have to tell you how important availability is for folks like these. Always up, always predictable in terms of performance and responsiveness. Enormous attention has been paid to make the middle tier highly available. But, alas, when it comes to the backend data sources, it is left upto the DB vendor.

The traditional database is built to do one thing very well - do a darn a good job in managing data securely, correctly and persistently on DISK. It is centralized by design. Everything is ACID. Ensuring high availability when you are constrained by the strong consistency rules (everything to disk) is very tough to manage. Replicate a database so you can failover to the replica during failure conditions, you are all of sudden left with random inconsistencies to deal with. Try to replicate synchronously, you pay a huge price in terms of performance. You want to provide dramatic scalability, through many replicated databases, you better be ready to live with a very compromised data consistency model.

Ahh! so, something like Oracle RAC is the answer, one might argue? Yes, for a number of use cases. But, here is what one has to consider:
1) You are still dealing with a disk centric architecture where all disks are shared and each DB process instance has equal access to all disks used to manage the tablespaces.
Here are a few important points worth noting:

  • The unit of data movement in the shared buffer cache across the cluster happens to be a logical data block, which is typically a multiple of 8KB. The design is primarily optimized to make disk IO very efficient. AFAIK, even when requesting or updating a single row the complete block has to be transferred. Compare this to an alternative offering (a distributed main memory object management system) where the typical unit of data transfer is an object entry. In a update heavy and latency sensitive environment, the unit could actually be just the "delta" - the exact change.

  • In RAC, If accessing a database block of any class does not locate a buffered copy in the local cache, a global cache operation is initiated. Before reading a block from disk, an attempt is made to find the block in the buffer cache of another instance. If the block is in another instance, a version of the block may be shipped. Again, there is no notion of a node being responsible for a block; there could be many copies of the block depending on how the query engine parallelized the data processing. This will be a problem if the update rate is high - imagine distributed locks on coarse grained data blocks in a cluster of 100 nodes; for every single update?
  • At the end of the day, RAC is designed for everything to be durable to disk and requires careful thought and planning around a high speed private interconnect for buffer cache management and a very efficient cluster file system

Yes, there have been tremendous improvements to the relational database, but, it may not the answer for all our data management needs.

Consider this - How many enterprise class apps are being built that just depend on one database. Amazon's CTO, werner wogel talks about how 100 different backend services are engaged to just construct a single page that you and me see. Talk to a investment bank, events streaming in at mind boggling speeds have to be acted on not by one process, but, by potentially 100's of processes, instantenously. Everything needs to talk to one another in real-time. Consistent, correct data has to be shared across many processes built using heterogenous languages, and connected to heterogenous data sources all in real-time. Ahh! what do call this SOA, ESB, etc.

You can come up with a massive loosely coupled architecture for the connecting everything with everything, but, you have to pay extra attention to how you will manage data efficiently in this massively distributed environment.

Isn't it time for a middle-tier high performance data management layer that can capitalize on the abundant main-memory available on cheap commodity hardware today ? - a piece of middleware technology that can draw data from a multitude of data sources, move data between different applications with tight control on correctness and availability allowing all real-time operations to occur with a high and predictable Quality of Service.

Is this a distributed cache? A messaging solution?

OK, let me sell you what we do ....

Finally, I am blogging

Boy! this blog thing is new to me. Honestly, I have never blogged nor have I closely watched any blog. This nearly middle aged boy is "old school". Well, about me .... I do a lot of architecture for GemStone on paper :-) as their Chief Architect (www.gemstone.com). But, to my credit, I have been doing distributed data management stuff for, well... too long. I guess, googling me might give you a better idea ...

I guess, I am laggard when it comes to blogging. Better now, than never. It is all about the mindshare.

Serious bloggers, any advise to share ?