Technical rants on distributed computing, high performance data management, etc. You are warned! A lot will be shameless promotion for VMWare products

Thursday, July 20, 2006

Introducing Distributed Data Fabric for the middle tier

You should read the prior post first, then tell me where you agree and where you disagree.

What is the GemFire Enterprise Data Fabric?

GemFire Enterprise Data Fabric is a high performance, distributed operational data management infrastructure that sits between your clustered application processes and back-end data sources to provide very low latency, predictable, high throughput data sharing and event distribution.

It is about operational data management

Unlike a data warehousing system, where terabytes (or petabytes) of data are consolidated from multiple databases for offline analysis, the EDF is a real-time data sharing facility optimized for the operational data that real-time applications need: the “now” data, the fast-moving data shared across many processes and applications. It is a layer of abstraction in the middle tier that collocates frequently used data with the application and works with back-end databases behind the scenes.
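To make the "works with back-end databases behind the scenes" idea concrete, here is a minimal read-through cache sketch in plain Java. It is not GemFire's API: the class names and the loader function are assumptions made up for illustration, and the "database" is just a function that counts how often it is called.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch of a middle-tier read-through cache; not GemFire's API.
class ReadThroughCache<K, V> {
    private final Map<K, V> memory = new ConcurrentHashMap<>();
    private final Function<K, V> backendLoader; // stands in for a database lookup

    ReadThroughCache(Function<K, V> backendLoader) {
        this.backendLoader = backendLoader;
    }

    // Serve from memory when possible; otherwise load from the back end once and keep the result.
    V get(K key) {
        return memory.computeIfAbsent(key, backendLoader);
    }

    // Update the in-memory copy. A real fabric would also propagate this to the database.
    void put(K key, V value) {
        memory.put(key, value);
    }
}

class ReadThroughCacheDemo {
    public static void main(String[] args) {
        AtomicInteger backendCalls = new AtomicInteger();
        ReadThroughCache<String, Double> prices = new ReadThroughCache<>(symbol -> {
            backendCalls.incrementAndGet(); // pretend this is a slow database query
            return 100.0;
        });
        prices.get("ACME"); // miss: goes to the back end
        prices.get("ACME"); // hit: served from memory
        System.out.println("backend calls: " + backendCalls.get());
    }
}
```

The application only ever talks to the cache; whether a read touched the database is invisible to it, which is the point of putting the layer in the middle tier.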

Distributed Data Caching

The most important characteristic of the GemFire Data Fabric is that it is fast, many times faster than a traditional disk-based database management system, because it is primarily main-memory based. Its engine harnesses the memory and disk of many clustered machines for unprecedented data access rates and scalability. It uses highly concurrent main-memory data structures to avoid lock contention, and a data distribution layer that avoids redundant message copying and uses native serialization and smart buffering to move messages from node to node faster than traditional messaging would.

It does this without compromising the availability or consistency of data: a configurable policy dictates the number of redundant in-memory copies for each type of data and whether data is stored on disk synchronously or asynchronously, and a variety of failure detection models built into the distribution system ensure data correctness.
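A rough sketch of what such a policy controls, assuming a toy single-process model where each "node" is just a map. Every name here (DataPolicy, DiskWrite, ReplicatedStore) is hypothetical and is not a GemFire configuration attribute; failure detection is not modeled, a node failure is simply simulated by clearing its map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// All names are hypothetical; this is not GemFire's configuration API.
enum DiskWrite { NONE, SYNCHRONOUS, ASYNCHRONOUS }

final class DataPolicy {
    final int redundantCopies; // extra in-memory copies beyond the primary
    final DiskWrite diskWrite; // when (or whether) an update reaches disk

    DataPolicy(int redundantCopies, DiskWrite diskWrite) {
        this.redundantCopies = redundantCopies;
        this.diskWrite = diskWrite;
    }
}

// Single-threaded toy: each "node" is a map living in this process.
class ReplicatedStore {
    private final List<Map<String, String>> nodes = new ArrayList<>();
    private final List<String> disk = new ArrayList<>();        // stand-in for a disk file
    private final List<String> pendingDisk = new ArrayList<>(); // queued asynchronous writes
    private final DataPolicy policy;

    ReplicatedStore(int nodeCount, DataPolicy policy) {
        if (policy.redundantCopies < 0 || policy.redundantCopies >= nodeCount) {
            throw new IllegalArgumentException("need more nodes than redundant copies");
        }
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
        this.policy = policy;
    }

    int primaryOf(String key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    // Write the primary copy plus the configured number of redundant copies.
    void put(String key, String value) {
        int primary = primaryOf(key);
        for (int i = 0; i <= policy.redundantCopies; i++) {
            nodes.get((primary + i) % nodes.size()).put(key, value);
        }
        String record = key + "=" + value;
        switch (policy.diskWrite) {
            case SYNCHRONOUS:  disk.add(record); break;        // on disk before put returns
            case ASYNCHRONOUS: pendingDisk.add(record); break; // on disk at the next flush
            default: break;                                    // memory only
        }
    }

    // Read from any node that still holds a copy.
    String get(String key) {
        for (Map<String, String> node : nodes) {
            String value = node.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    void failNode(int index) { nodes.get(index).clear(); }

    void flushDisk() { disk.addAll(pendingDisk); pendingDisk.clear(); }

    int diskRecords() { return disk.size(); }
}
```

With one redundant copy, losing the primary node for a key still leaves a readable copy; with asynchronous disk writes, the update returns before anything reaches "disk", which is the latency-versus-durability trade the policy exposes.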

Key Database semantics are retained

Simple distributed caching solutions provide caching of serialized objects: simple key-value pairs managed in hash maps that can be replicated to your cluster nodes. GemFire provides support for multiple data models across multiple popular languages: data can be managed natively as Java or C++ objects, as native XML documents, or in SQL tables.

As in a database management system, distributed data in GemFire can be managed in transactions, queried, persistently stored and recovered from disk.

Unlike a relational database management system, where all updates are persisted and transactional in nature (ACID), GemFire relaxes these constraints, allowing applications to control when, and for what kind of data, they need full ACID characteristics.

For instance, for a very high performance financial services application that distributes price updates, what matters most is the distribution latency; there is no need for transactional isolation.

The end result is a data management system that spends fewer CPU cycles managing data and offers higher performance.

Continuous Analytics

With data in the fabric changing rapidly as it is updated by many processes and external data sources, it is important for real-time applications to be notified when events of interest are generated in the fabric, something a messaging platform is well suited to do. The GemFire data fabric takes this to the next level: applications can register complex patterns of interest, expressed as SQL queries that run continuously. Unlike a database system, where queries have to be executed on resident data, here data (or events) is continuously evaluated by a query engine that is aware of the interest expressed by hundreds of distributed client processes.
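The inversion described above (the data flows past standing queries, rather than queries being run over stored data) can be sketched in a few lines of Java. Note the swap: the post talks about SQL queries, while this sketch uses a plain Java Predicate as the "where" clause to stay self-contained. Trade, ContinuousQueryEngine and their methods are hypothetical names, not GemFire classes.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical business object used only for this illustration.
final class Trade {
    final String symbol;
    final double price;

    Trade(String symbol, double price) {
        this.symbol = symbol;
        this.price = price;
    }
}

// Hypothetical engine: standing queries are evaluated against every incoming update.
class ContinuousQueryEngine {
    private static final class Registration {
        final Predicate<Trade> where;    // stands in for the WHERE clause of a SQL query
        final Consumer<Trade> listener;  // the client that expressed the interest

        Registration(Predicate<Trade> where, Consumer<Trade> listener) {
            this.where = where;
            this.listener = listener;
        }
    }

    private final Map<String, Trade> latestBySymbol = new ConcurrentHashMap<>();
    private final List<Registration> queries = new CopyOnWriteArrayList<>();

    // Register a query once; it keeps "running" for every later update.
    void register(Predicate<Trade> where, Consumer<Trade> listener) {
        queries.add(new Registration(where, listener));
    }

    // Store the update, then push it to every client whose query it satisfies.
    void update(Trade trade) {
        latestBySymbol.put(trade.symbol, trade);
        for (Registration query : queries) {
            if (query.where.test(trade)) {
                query.listener.accept(trade);
            }
        }
    }

    Trade latest(String symbol) {
        return latestBySymbol.get(symbol);
    }
}
```

A client would register something like `engine.register(t -> t.symbol.equals("ACME") && t.price > 100.0, t -> alert(t))` once (with `alert` being whatever the client does on a match) and then never poll. A real engine would also have to index the registered queries so that each update is not tested against every one of them; this sketch does the naive linear scan.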

Reliable messaging and routing

When using a messaging platform, application developers expect reliable and guaranteed Publish-Subscribe semantics. The system has knowledge about active or durable subscribers and provides different levels of message delivery guarantees to subscribers. GemFire EDF incorporates these messaging features on top of what looks like a database to the developer.

Unlike traditional messaging, where applications have to deal with piecemeal messages, message construction, incorporating contextual information in messages, and managing data consistency across publishers and subscribers, GemFire enables a more intuitive approach: applications simply deal with a data model (object or SQL) and subscribe to portions of it, while publishers make updates to the business objects or relationships. Subscribers are simply notified of changes to the underlying distributed database.
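Here is a small sketch of that difference from the publisher's side: the publisher never builds a message, it just updates an entry, and the change event (with the old and new state as context) is derived from the update. The names ChangeEvent and ObservableRegion are hypothetical, and subscribing by key prefix is just one assumed way to express "a portion of the data model".

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical event type: carries the context a hand-built message would need.
final class ChangeEvent<V> {
    final String key;
    final V oldValue; // null when the entry is new
    final V newValue;

    ChangeEvent(String key, V oldValue, V newValue) {
        this.key = key;
        this.oldValue = oldValue;
        this.newValue = newValue;
    }
}

// Hypothetical region: a map whose updates double as the publish operation.
class ObservableRegion<V> {
    private final Map<String, V> entries = new ConcurrentHashMap<>();
    private final Map<String, List<Consumer<ChangeEvent<V>>>> subscribersByPrefix =
            new ConcurrentHashMap<>();

    // Subscribe to the portion of the data model whose keys start with the prefix.
    void subscribe(String keyPrefix, Consumer<ChangeEvent<V>> subscriber) {
        subscribersByPrefix
                .computeIfAbsent(keyPrefix, p -> new CopyOnWriteArrayList<>())
                .add(subscriber);
    }

    // The publisher just updates the data; notification is a side effect of the update.
    void put(String key, V newValue) {
        V oldValue = entries.put(key, newValue);
        ChangeEvent<V> event = new ChangeEvent<>(key, oldValue, newValue);
        subscribersByPrefix.forEach((prefix, subscribers) -> {
            if (key.startsWith(prefix)) {
                subscribers.forEach(s -> s.accept(event));
            }
        });
    }

    V get(String key) {
        return entries.get(key);
    }
}
```

Because the event is produced from the same update that changed the data, a subscriber that reads the region after being notified sees state at least as new as the event, instead of having to reconcile a message with a separate database lookup.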

What makes GemFire EDF unique?

For the last two decades or so, relational database management systems have taken a "kitchen sink" approach, trying to solve every problem associated with data management by bundling the solution into the database engine.

First, relational databases are centralized and passive in nature. They do a good job of managing data securely, correctly and persistently, but they do not actively push data to the applications that might be interested in it right now. Second, databases are designed to optimize access to disk and to guarantee transactional properties at all times. This limits the speed and scalability of a database engine in a highly distributed environment.

Compare this to a data environment where data storage structures are highly optimized for management in memory and for concurrent access. To notify applications instantaneously, GemFire immediately routes data to the right node through a data distribution layer that is designed to reduce contention points and to avoid unnecessary copies of messages before they are transported.

Messaging solutions are most suited for very loosely coupled applications. Though this has its benefits, applications are left with the tough job of managing contextual information to make decisions, often requiring round trips to a database. This eliminates any performance advantages that applications can derive from messaging.

In addition, the asynchronous nature of messaging can often result in inconsistencies: the contextual information in the database may not reflect the correct state when the message is received.

GemFire provides an operational data infrastructure that brings data and events into one distributed platform, so applications can focus on what matters most: operating on business objects and relationships. Interested applications are immediately notified as and when the data model changes. Data is co-located and accessible at memory speeds, and data correctness is always ensured.

Modern day Event Driven Architectures require applications to react to events being pushed at very high rates from multiple streaming data sources, aggregate this data with other slow moving data managed in databases and distribute data and events to many application processes.

Traditional centralized databases simply are not designed to handle this mounting onslaught – what you need is a distributed memory based architecture that can analyze the incoming stream data, combine this with related information and present a consistent and correct data model to the application.

What makes GemFire unique is its ability not just to analyze fast-moving data, but also to present a data model (like a database) and to route data and events to applications with guaranteed reliability (the semantics of reliable messaging).

Bottom line: The time has come for a middle-tier data management layer that manages your operational data and events and enables a new generation of real-time applications with QoS guarantees on performance, continuous availability and scalability. You want to be able to do this while retaining your investments in existing databases.
