Technical rants on distributed computing, high performance data management, etc. You are warned! A lot will be shameless promotion for VMWare products


Tuesday, December 13, 2011

SQLFire 1.0 - the Data Fabric Sequel


This week we finally reached GA status for VMWare vFabric SQLFire  - a memory-optimized distributed SQL database delivering dynamic scalability and high performance for data-intensive modern applications. 


In this post, I will highlight some important elements in our design and draw out some of our core values.


The current breed of popular NoSQL stores promotes different approaches to data modeling, storage architectures and consistency models to solve the scalability and performance problems of relational databases. The overarching message in all of them seems to be that the core of the problem with traditional relational databases is SQL.
But, ironically, the core of the scalability problem has little to do with SQL itself - it is the manner in which the traditional database manages disk buffers, locks and latches through a centralized architecture to preserve strict ACID properties that presents the challenge. Here is a slide from research at MIT and Brown University on where the time goes in OLTP databases.




Design center
With SQLFire we change the design center in a few interesting ways:
1) Optimize for main memory: we assume memory is abundant across a cluster of servers and optimize the design with highly concurrent data structures that are entirely resident in memory. The design is not concerned with buffering contiguous disk blocks in memory; instead it manages application rows in in-memory hashmaps, in a form that clients can consume directly. Changes are synchronously propagated to redundant copies in the cluster for HA.
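To make the idea concrete, here is a minimal sketch in plain Java (not SQLFire's actual internals; all names are hypothetical) of rows held in a concurrent in-memory map, with every write applied synchronously to a redundant copy before the call returns:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy row store: the unit of storage is an application row keyed by
// primary key, not a disk block. Writes go to the primary's map and
// are applied synchronously to the redundant copy for HA.
class RowStore {
    private final Map<Object, List<Object>> rows = new ConcurrentHashMap<>();
    private final RowStore redundant;  // null on the replica itself

    RowStore(RowStore redundant) { this.redundant = redundant; }

    void put(Object key, List<Object> row) {
        rows.put(key, row);                              // in-memory update
        if (redundant != null) redundant.put(key, row);  // synchronous replication
    }

    List<Object> get(Object key) { return rows.get(key); }
}
```

The real product, of course, versions, batches and acknowledges these updates; the point is simply that rows live in concurrent hashmaps ready for direct consumption, with no disk-buffer layer in between.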


2) Rethink ACID transactions: there is no support for strictly serializable transactions; we assume instead that most applications can get by with simpler "read committed" and "repeatable read" semantics. Instead of worrying about write-ahead transaction logs on disk, all transactional state resides in distributed memory, and commits use a non-2PC algorithm optimized for short, non-overlapping transactions. The central theme is to avoid any single point of contention, such as a distributed lock service. See some details here.


3) "Partition aware DB design": almost every high-scale DB solution offers a way to scale linearly by hashing keys to a set of partitions. But how do you make SQL queries and DML scale when they involve joins or complex conditions? Given that distributed joins inherently don't scale, we promote the idea that the designer should think about the common data access patterns and choose the partitioning strategy accordingly. To make this relatively simple, we extended the DDL (data definition language) so the designer can specify how related data should be colocated (for instance, 'create table Orders (...) colocate with Customer' tells us that the order records for a customer should always live on the same partition as that customer). Colocation makes join processing and query optimization a local partition problem (avoiding large transfers of intermediate data sets). The design assumes classic OLTP workload patterns, where the vast majority of individual requests can be pruned to a few nodes and the concurrent workload from all users is spread across the entire data set (and hence across all the partitions). Look here for some details.
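A toy illustration of the colocation idea (hypothetical class and method names, not SQLFire's routing code): orders are routed by their customerId rather than their own key, so a customer's orders always land on that customer's partition and the join stays partition-local:

```java
// Sketch of partition-aware colocation. An Order picks its partition
// from its customerId (the colocation key), not its own orderId, so
// Customer ⋈ Orders never crosses partition boundaries.
class Partitioner {
    private final int partitions;

    Partitioner(int partitions) { this.partitions = partitions; }

    int partitionForCustomer(int customerId) {
        // floorMod keeps the bucket non-negative for any hash value
        return Math.floorMod(Integer.hashCode(customerId), partitions);
    }

    // The orderId is deliberately ignored for routing: colocation means
    // the parent's key decides where the child row lives.
    int partitionForOrder(int orderId, int customerId) {
        return partitionForCustomer(customerId);
    }
}
```

This is exactly what the 'colocate with Customer' DDL clause expresses declaratively: the engine derives the child table's routing key from the parent's.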


4) Shared nothing logs on disk: disk stores are merely "append only" logs, designed so that application writes are never exposed to disk seek latencies. Writes are synchronously streamed to disk on all replicas. Much of the disk store design looks similar to other NoSQL systems - rolling logs, background/offline compaction, memory tables pointing to disk offsets, etc. But the one aspect that represents core IP is managing consistent copies on disk in the face of failures. Given that distributed members can come and go, how do we make sure that the disk state a member recovers from is the state it should be working with? I cover our "shared nothing disk architecture" in a lot more detail here.
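The append-only idea can be sketched in a few lines (a toy log, not the SQLFire disk store format): every write is a sequential append, and recovery is a replay of the log where the last write for a key wins:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Toy append-only operation log: writes only ever append (no seeks),
// and recovery replays the file front to back, last write wins.
class OpLog {
    private final Path file;

    OpLog(Path file) { this.file = file; }

    static OpLog inTempFile() {
        try { return new OpLog(Files.createTempFile("oplog", ".log")); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    void append(String key, String value) {
        try {
            Files.writeString(file, key + "=" + value + "\n",
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    Map<String, String> recover() {
        Map<String, String> state = new HashMap<>();
        try {
            for (String line : Files.readAllLines(file)) {
                int eq = line.indexOf('=');
                state.put(line.substring(0, eq), line.substring(eq + 1));
            }
        } catch (IOException e) { throw new UncheckedIOException(e); }
        return state;
    }
}
```

Rolling to a new log segment and compacting away overwritten entries in the background are the obvious next steps, and are what keep replay time bounded in a real store.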


5) Parallelize data access and application behavior: we extend the classic stored procedure model by allowing applications to parallelize a procedure across the cluster, or just a subset of nodes, by hinting at the data the procedure depends on. This application hinting is done by supplying a "where clause" that is used to determine where to route and parallelize the execution. Unlike traditional databases, procedures can be arbitrary application Java code (you can in fact embed the cluster members in your Spring container) and run colocated with the data. Yes, literally in the same process space where the data is stored. Controversial, yes - but now your application code can do a scan as efficiently as the database engine.
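Here is a rough sketch of the routing side of this (all names are hypothetical; the "where clause" is modeled as a plain predicate over routing keys): the procedure is parallelized only onto the nodes that host matching data:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

// Toy data-aware routing: given which keys each node hosts and a
// "where clause" (predicate), compute the set of nodes the procedure
// should be parallelized across, so it runs colocated with its data.
class Cluster {
    private final Map<String, Set<Integer>> keysByNode = new HashMap<>();

    void host(String node, Integer... keys) {
        keysByNode.put(node, new HashSet<>(Arrays.asList(keys)));
    }

    Set<String> route(Predicate<Integer> whereClause) {
        Set<String> targets = new TreeSet<>();
        keysByNode.forEach((node, keys) -> {
            if (keys.stream().anyMatch(whereClause)) targets.add(node);
        });
        return targets;
    }
}
```

Pruning the execution to a couple of nodes like this, rather than broadcasting to all of them, is what makes procedure execution scale with the cluster.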


6) Dynamic rebalancing of data and behavior: this is the act of figuring out which data buckets should be migrated when capacity is added (the cluster grows) or removed, and how to do this without causing consistency issues or introducing contention points for concurrent readers and writers. Here is the patent that describes some aspects of the design.
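A simplified way to picture rebalancing (the real algorithm is far more careful about consistency, redundancy and concurrent access): recompute the bucket-to-node assignment when membership changes, and migrate only the buckets whose owner changed:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy rebalancer: buckets are assigned round-robin over the member
// list; on membership change, only buckets whose owner differs
// between the old and new assignment need to move.
class Rebalancer {
    static Map<Integer, String> assign(List<String> nodes, int buckets) {
        Map<Integer, String> map = new HashMap<>();
        for (int b = 0; b < buckets; b++) {
            map.put(b, nodes.get(b % nodes.size()));
        }
        return map;
    }

    static Set<Integer> movedBuckets(Map<Integer, String> before,
                                     Map<Integer, String> after) {
        Set<Integer> moved = new TreeSet<>();
        before.forEach((bucket, owner) -> {
            if (!owner.equals(after.get(bucket))) moved.add(bucket);
        });
        return moved;
    }
}
```

Keeping the moved set small relative to the total bucket count is the whole game; migrating everything on every topology change would defeat the purpose.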




Embedded or a client-server topology
SQLFire supports switching from the classic client-server topology (your DB runs in its own processes) to an embedded mode where the DB cluster and the application cluster are one and the same (for Java apps).
We believe the embedded model will be very useful in scenarios where the data sets are relatively small. It simplifies deployment concerns and at the same time provides a significant boost in performance when replicated tables are in use.

All you do is change the DB URL from
'jdbc:sqlfire://server_Host:port' to 'jdbc:sqlfire:;mcast-port=portNum', and now all your application processes that use the same DB URL become part of a single distributed system. Essentially, mcast-port identifies a broadcast channel for membership gossiping. New servers automatically join the cluster once authenticated. Any replicated tables are automatically hosted in the new process, and partitioned tables may be rebalanced to share some of their data with the new process. All this is abstracted away from the developer.
As far as the application is concerned, you just create connections and execute SQL like with any other DB.
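The topology switch really is just the URL, as in this sketch (host, port and mcast-port values are placeholders); either string is then passed to DriverManager.getConnection as with any JDBC driver:

```java
// Configuration sketch: the same application code runs against either
// topology, only the JDBC URL changes. Values below are placeholders.
class SqlfUrls {
    // Client-server: the DB runs in its own server processes.
    static String clientServer(String host, int port) {
        return "jdbc:sqlfire://" + host + ":" + port;
    }

    // Embedded: this JVM joins the distributed system identified by
    // the mcast-port membership channel.
    static String embedded(int mcastPort) {
        return "jdbc:sqlfire:;mcast-port=" + mcastPort;
    }
}
```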





How well does it perform and scale? 
Here are the results of a simple benchmark done internally using commodity (2 CPU) machines, showcasing linear scaling with concurrent user load. I will soon augment this with a more interesting workload characterization. The details are here.


Comparing SQLFire and GemFire


Here is a high level view into how the two products compare. I hope to add a blog post that provides specific details on the differences and use cases where one might apply better than the other.




SQLFire benefits from the years of commercially deployed production code found in GemFire. SQLFire adds a rich SQL engine with the idea that folks can now manage operational data primarily in memory, partitioned across any number of nodes, with a disk architecture that avoids disk seeks. Note that the two offerings, SQLFire and GemFire, are distinct products and are deployed separately.




As always, I would love to get your candid feedback (link to our forum). I assure you that trying it out is very simple - just like using Apache Derby or H2. 


Get the download, docs and quickstart all from here. The developer license is perpetual and works on up to 3 server nodes.





Monday, September 19, 2011

What is new in vFabric GemFire 6.6?


GemFire 6.6 was released (Sept 2011) as part of the new vFabric 5.0 product suite, and it represents a big step along the following important dimensions:
  1. developer productivity
  2. more DBMS-like features
  3. better scaling features

Here are some highlights on each dimension:

Developer productivity: Introduced a new serialization framework called PDX (stands for Portable Data eXchange and not my favorite airport).
PDX is a framework that provides a portable, compact, language-neutral and versionable format for representing object data in GemFire. It is proprietary but designed for high efficiency, and is comparable to other serialization frameworks like Apache Avro, Google protobuf, etc.
Alright. I realize the above definition is a mouthful :-)

Simply put, the framework supports versioning, allowing apps using older class versions to work with apps using newer versions of the domain classes (and vice versa); provides a format and type system for interop between the various languages; and offers an API so server-side application code can operate on objects without requiring the domain classes (i.e. no deserialization).
Type evolution has to be incremental - this is the only way to avoid data loss or exceptions.
The raw serialization performance is comparable to Avro and protobuf, but is much more optimized for distribution and operating in a GemFire cluster. The chart below shows the result of an open source benchmark of popular serialization frameworks. The details are available here. 'Total' represents the total time required to create, serialize and then deserialize. See the benchmark description for details.


You can either implement serialization callbacks (for optimal performance) or simply use the built-in PDXSerializer (reflection based today). Arguably, the best part of the framework is its support for object access in server-side functions or callbacks (like listeners) without requiring the application classes. You can dynamically discover fields and nested objects and operate on them using the PDX API. On an application client that has the domain classes, the same PDXInstance is automatically turned into the domain object.
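A drastically simplified illustration of the version-tolerance idea (this is not the PDX wire format or API; names are hypothetical): serialize objects to named fields, and let a reader ask for a field with a default, so data written by an older or newer class version never causes a failure:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy "portable record": fields travel by name, so a reader compiled
// against an older class version ignores fields it doesn't know, and
// a field the writer never had reads as a default instead of failing.
class PortableRecord {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    PortableRecord write(String name, Object value) {
        fields.put(name, value);
        return this;  // fluent, like a serialization callback would be
    }

    Object read(String name, Object defaultValue) {
        return fields.getOrDefault(name, defaultValue);
    }
}
```

This is also why server-side code can operate on the data without the domain classes: everything reduces to discovering and reading named fields.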

We introduced a new command shell called gfsh (pronounced "gee-fish") - a command line tool for browsing and editing data stored in GemFire. Its rich set of Unix-flavored commands allows you to easily access data, monitor peers, redirect output to files, and run batch scripts. This is an initial step toward a more complete tool that can provision, monitor, debug, tune and administer a cluster as a whole. Ultimately, we hope to advance the gfsh scripting language to make integrating GemFire deployments into cloud-like virtualized environments a "breeze".

More DBMS like: 
Querying and Indexing
We added several features to our query engine: query/index on hashmaps, bind parameters from edge clients, OrderBy support for partitioned data regions, full support for LIKE predicates, and the ability to index regions that overflow to disk.

Increasingly we see developers wanting to decouple the data model in GemFire from the class schema used within their applications. Even though PDX offers an excellent option, we also see developers mapping their data into "self describing" hashmaps in GemFire. The data store is basically "schema free" and allows many application teams to change the object model without impacting each other. Given GemFire's simple KV storage model, this has never been an issue except for querying. Now, not only can you store maps, you can index keys within these hashmaps and execute highly performant queries.
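The map-key indexing idea can be sketched like this (hypothetical names, not the GemFire index API): maintain an inverted map from each observed field value to the entry keys containing it, so an equality query on that field skips the full scan:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy index over "self describing" map values: for one chosen field,
// map each observed value to the set of entry keys holding it.
class MapFieldIndex {
    private final Map<Object, Set<String>> index = new HashMap<>();

    // Called on every put of a schema-free row (a map of field -> value).
    void onPut(String entryKey, Map<String, ?> row, String indexedField) {
        Object value = row.get(indexedField);
        index.computeIfAbsent(value, v -> new TreeSet<>()).add(entryKey);
    }

    // Equality lookup on the indexed field: no scan over all entries.
    Set<String> lookup(Object value) {
        return index.getOrDefault(value, Collections.emptySet());
    }
}
```

A real index also has to handle updates and removals, range predicates, and overflow-to-disk entries; the inverted-map core is the same.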
Do take note that the query engine now natively understands the PDX data structures with no need for application classes on servers.

We expanded distributed transactions by allowing edge clients to initiate or terminate transactions - no need to invoke a server-side function for transactions. We also added a new JCA resource adapter that supports participation in externally coordinated transactions as a "last resource".

Finally, on the scaling dimension:
You are probably aware that GemFire's shared nothing persistence relies on append-only operation logs to provide very high write throughput. There are no additional Btree data files to maintain as in a traditional database system. The tradeoff with this design is cluster recovery speed: one has to walk through the logs to recover the data back into memory, and the time for the entire cluster to bootstrap from disk is proportional to the volume of data (and inversely proportional to the cluster size). And this can be long (to put it mildly) with large data volumes, even though recovery is parallelized across the cluster. To minimize this recovery delay, the 6.6 persistence layer now also manages "key files" on disk. We simply recover the keys back into memory and lazily recover the values, giving recovery in general a significant performance boost.
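Conceptually, the key files enable something like the following sketch (hypothetical names): keys and their log offsets are loaded eagerly so the region is usable immediately, while each value is read from the log only on first access:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy lazy-recovery region: the key file yields key -> log offset up
// front; the (expensive) value read from the operation log is deferred
// until the entry is first touched.
class LazyRegion {
    private final Map<String, Long> keyToLogOffset;     // recovered eagerly
    private final Map<String, String> values = new HashMap<>();
    private final Function<Long, String> readFromLog;   // disk read, deferred

    LazyRegion(Map<String, Long> keyFile, Function<Long, String> readFromLog) {
        this.keyToLogOffset = keyFile;
        this.readFromLog = readFromLog;
    }

    String get(String key) {
        // Fault the value in from the log on first access only.
        return values.computeIfAbsent(key,
                k -> readFromLog.apply(keyToLogOffset.get(k)));
    }

    int keysRecovered() { return keyToLogOffset.size(); }
}
```

Startup cost then scales with the number of keys rather than the total data volume, which is the boost the 6.6 key files are after.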

Prior to 6.6, GemFire randomly picked a different host to manage redundant copies for partitioned data regions. Often, customers provision multiple racks and want their redundant copy to always be stored on a different physical rack. Occasionally, we also see customers wanting to store their redundant data on a different site. We added support for "redundancy zones" in the partitioned region configuration, allowing users to identify one or more redundancy zones (racks, sites, etc.). GemFire will automatically enforce placing redundant copies in different zones.
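In spirit, zone-aware placement is as simple as this sketch (hypothetical names): the redundant copy's host is always chosen from a zone different from the primary's, so losing one rack never loses both copies:

```java
import java.util.Map;

// Toy zone-aware placement: given each host's configured redundancy
// zone, pick a redundant-copy host from any zone other than the
// primary's (deterministically, by sorting, for the sketch).
class ZonePlacer {
    static String pickRedundantHost(Map<String, String> hostToZone, String primaryHost) {
        String primaryZone = hostToZone.get(primaryHost);
        return hostToZone.entrySet().stream()
                .filter(e -> !e.getValue().equals(primaryZone))
                .map(Map.Entry::getKey)
                .sorted()
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no other zone available"));
    }
}
```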

Everything mentioned here is more of a prelude. The list of enhancements is much longer and is documented here.

The product documentation is available here.
You can start discussions here.

Would love to hear your thoughts. 

Thursday, May 06, 2010

SpringSource/VMWare acquires GemStone

Here are some specific details on the potential technology synergies...

Synergy with SpringSource


What makes the integration very synergistic and complementary are the areas of focus of the two companies. GemStone/GemFire has primarily focused on making the infrastructure for clustering and managing data in the middle tier extremely reliable while offering tremendous flexibility. SpringSource's focus has been a ubiquitous programming model (the Spring framework) that is simple to adopt and quite open, where a myriad of products and technologies can be seamlessly integrated; a lightweight runtime environment through the enterprise-class tc Server; and a management environment enabled by Hyperic HQ. In a sense, the GemFire integration brings first class data management and clustering to the runtime - a key component enabling "extreme scale" application deployment.


GemStone, as well as the merged entity, is totally committed to supporting heterogeneous access to the GemFire data fabric, with continued support for Java, C++ and .NET. In fact, we are exploring opportunities to integrate with Spring.NET and will continue our commitment to simplifying the programming models for non-Java environments.

GemStone management and SpringSource/VMWare management remain committed to delivering on the roadmap and the future extensions to the platform that have already been discussed with customers. In fact, we will go well beyond our current commitments by leveraging the Spring framework and integrating with a multitude of Spring modules to provide a much simpler configuration and development model for our customers.


How might we leverage the Spring framework?

The Spring framework is all about choice for enterprise Java applications. We now enable more choice by offering a clustered data management solution that can be used in multiple ways:


1) As a transparent L2 cache: Spring applications using Hibernate will be able to plug in an L2 cache that intercepts traffic to/from a backend database for scaling and performance. Note that GemFire already supports Hibernate-based L2 cache plugins, but this effort will now take increased focus. On the .NET side of things, we will accelerate support for Spring.NET applications using NHibernate.

2) AOP cache: We will be able to offer sophisticated AOP caching interceptors that can transparently use a highly scalable cache.

3) Parallel data-aware method invocations: Spring bean service invocations could transparently make use of GemFire's data-aware parallel function execution capabilities, where behavior execution can happen in parallel, operate on localized data sets and go through an "aggregation" phase to produce a final result.

4) Session state management: we already provide an abstraction layer for session state management on top of GemFire today. The intent is to include this capability as an integral part of the platform. Highlights of this add-on include the ability to handle very large object graphs efficiently, dynamic partitioning and load balancing, visibility of session state across multiple clusters (WAN), HA, and an "edge" cache that maximizes memory utilization through a heap-LRU-based eviction algorithm.

We are exploring options to enable quick integration of the data fabric into existing Spring applications through enhancements to the Eclipse plugins - for instance, the configuration of cache regions and deployment topologies will be simplified. We will also investigate how we can leverage Spring Roo (a next generation rapid application development tool) so a developer will be able to build a Java application with integrated caching in minutes.

Integration with Spring Modules

Unlike some of the other distributed caching options available in the market, GemFire natively supports memory based, reliable event notifications. Applications can subscribe to data in the fabric through "continuous queries" and can receive in-order, reliable notifications when the data of interest changes. Spring Integration extends the Spring framework into the messaging domain and provides a higher level of abstraction so business components are further isolated from the infrastructure and developers are relieved of complex integration responsibilities. We are exploring techniques to leverage this simple and intuitive API so applications are abstracted away from dealing with GemFire specific notification APIs.


Spring Batch enables extremely high-volume, high performance batch jobs through optimization and partitioning techniques. Both simple and complex high-volume batch jobs can leverage the Spring framework in a highly scalable manner to process significant volumes of information. Integration with the GemFire function service will mean developers can write Spring Batch applications that behind the scenes leverage the parallelism and "data aware" routing capabilities built into GemFire.

There are many integration possibilities all of which will make the job of the developer integrating with a data fabric/grid significantly simpler.


GemFire in the cloud

Our offering will expand the VMWare strategy to deliver 'Platform as a Service' solutions over time. The already powerful arsenal includes VMWare vSphere as the cloud operating system, the Spring framework as the enterprise application development model and SpringSource tc Server as a lightweight application platform. Now, GemStone extends these capabilities with GemFire as the scalable, elastic data platform.


While VMWare makes a good case for its current products positioned for the cloud, I will focus on the rationale for GemFire as a data platform for the cloud.

One of the big advantages of applications deployed in a cloud environment is the ability to tap into additional capacity 'just in time' as demand changes. The underlying platform needs to detect this and respond with elastic characteristics. The additional capacity could be provisioned through virtual or physical machines spread across subnets, or perhaps even across data centers. This means application behavior, as well as the associated state (data), needs to scale horizontally through migration. Behavior migration is well taken care of by modern clustered application servers, but it is a lot more challenging for data. The traditional relational database is rooted in a design focused on optimizing disk IO and maintaining ACID properties. The frame of reference was quite different: data was centralized, and extensive locking/latching techniques were used to preserve the ACID transaction properties. The design did not account for data being managed across a large cluster of heterogeneous machines. Several clustered databases today offer a shared nothing architecture at the database engine layer but fall short at the storage layer (where it is shared everything). For elasticity, we believe the design has to change along two important dimensions:

1) Distribution orientation: efficient distribution methods to move data around a large network without loss of consistency.

2) Memory orientation: when demand spikes, it is a lot faster and easier to move data between memory segments across machines. GemFire manages data by horizontally partitioning it across any number of machines, and initiates automatic rebalancing when additional capacity is added to the cluster. Unlike common database architectures, GemFire primarily manages data in memory and leaves persistence to disk as a choice for the developer. Besides it being more efficient to manage ephemeral state such as session state only in memory, regulations in some environments may mandate that nothing be stored on disk.

We also know that applications deployed in the cloud will be subject to higher SLAs: continuous availability and very predictable response times. Data stored in GemFire is typically synchronously copied to one or more machines, often also redundantly stored on disk, and even asynchronously copied across data centers in case an entire data center goes down. Again, by allowing the data to be primarily managed in memory, the cost of maintaining data redundantly is considerably reduced. GemFire's built-in continuous instrumentation, and potential integration with cloud provisioning environments (the vCloud API), permits a priori detection of changing load patterns so data rebalancing can be initiated without operator intervention.

Highly parallelizable, computationally intensive applications, such as large scale web applications or analytic applications (like risk analysis in finance), are ideal candidates for cloud deployments. A lot of these computationally intensive tasks tend to be data intensive as well. Spring-enabled services can now be parallelized through GemFire's data-aware parallel function service, where the underlying application behavior is parallelized and executed on the nodes carrying the data sets required by each parallel activity. This data localization dramatically increases throughput and reduces costs, with little out-of-process data access. Instead of developing applications against the custom GemFire APIs, application development can rely on open, Spring-community-driven modules such as Spring Batch, simplifying the development task.

We also believe the combination of SpringSource/VMWare infrastructure and GemFire will enable customers to implement "cloud bursting" strategies where on-premises stateful services can much more easily expand to an in-house private or even a public cloud.


Vision of a first class middle tier data management platform

GemFire started off as a pure main-memory distributed cache where transactional updates were immediately persisted to the database of record. In recent years, with the addition of capabilities such as reliable asynchronous writes to the database, data partitioning with dynamic rebalancing, and wide area network data management, the product is now often used as the high performance store, with data written to the database in batches, at the end of the business day, or in a batch run. The database is being relegated to an archival store for integration purposes.

We are now adding support for a parallel, shared nothing disk storage layer in the product. Unlike other clustered database designs where disks are shared across many distributed processes, each member of the cluster persists in parallel to its local disks. Transactions do not force disk flushes; instead, the IO scheduler is allowed to time the block writes to disk, dramatically increasing disk write throughput. Any risk of data loss from sudden failures at the hardware level is mitigated by having multiple nodes write in parallel to disk. In fact, it is assumed that hardware will fail, especially in large clusters and data centers, and the software needs to account for such failures. The system is designed to recover in parallel from disk and to guarantee data consistency when data copies on disk don't agree with each other.

It is our strong belief that, with the proliferation of highly distributed middle tier architectures, a data tier that is colocated with the application tier, primarily clustered, and used to manage ephemeral data will substantially change the architecture of all enterprise applications that demand scaling and predictability.

It is with this belief that we built the new SQLFabric product - same underpinnings as GemFire, but with SQL as the interface. We took this step in spite of the recent push by the "NoSQL" database alternatives for the cloud. Our decades of database implementation and research experience do not implicate SQL as a query language when it comes to the desired characteristics of elasticity and performance. This is simply a matter of design choices made by SQL database vendors based on a frame of reference that is no longer valid. In other words, there is nothing wrong with SQL as a language; rather, it is the design of disk-oriented, centralized databases that is sub-optimal. The premise behind SQLFabric is to capitalize on the power of SQL as an expressive, flexible, very well understood query language, but alter the design underpinnings found in common databases for scalability and high performance. By providing ubiquitous interfaces like JDBC and ADO.NET, the product becomes significantly easier to adopt and integrates well into existing ecosystems.