Technical rants on distributed computing, high performance data management, etc. You are warned! A lot will be shameless promotion for VMware products.

Sunday, November 20, 2011

HPTS 2011 talk on 'Flexible OLTP in the future'

I recently spoke at HPTS 2011 (High Performance Transaction Systems). If you haven't already, you should check out some of the very interesting content on the NoSQL ecosystem, the future of core density, big data experiences and scars, etc.

Here is the abstract:

Flexible OLTP data models in the future

There has been a flurry of highly scalable data stores and a dramatic spike in the level of interest in them. The solutions with the most mindshare seem to be inspired either by the eventual consistency model of Amazon's Dynamo or by a data model that promotes nested, self-describing data structures, like BigTable from Google. At the same time, you see projects within these corporations evolving toward architectures like Megastore and Dremel (Google), where features from the column-oriented data model are blended with the relational model.

The shift from just highly structured data to unstructured and semi-structured content is evident. New applications are being developed, and existing applications modified, at breakneck speed. Developers want data model evolution to be extremely simple, and they want support for nested structures so they can map to representations like JSON with ease, leaving little impedance between the application programming model and the database. Next generation enterprise applications will increasingly work with structured and semi-structured data from a multitude of data sources. A pure relational model is too rigid, and a pure BigTable-like model has too many shortcomings and cannot be integrated with existing relational database systems.

In this talk, I present an alternative. We prefer the familiar "row oriented" approach over the "column oriented" one but still tilt the relational model, mostly the schema definition, to support partitioning and colocation, redundancy levels, and dynamic and nested columns.
Each of these extensions supports a different desired attribute: partitioning and colocation primitives cover horizontal scaling; availability primitives give explicit control over the replication model and the placement policies (local vs. across data centers); dynamic columns address flexibility for schema evolution (different rows can have different columns, added with no DDL required); and nested columns support organizing data in a hierarchy.
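
To make the schema extensions a little more concrete, here is a minimal Python sketch. It is not the syntax or API of any product; `TableDef`, `split_row`, and every field name are hypothetical, and hashing the partition key with CRC32 is just one assumed placement policy. It shows a table definition carrying a partitioning column, a colocation hint, and a redundancy level, plus rows that mix declared columns with dynamic and nested ones.

```python
import zlib
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class TableDef:
    """Hypothetical schema definition with the extensions described above."""
    name: str
    columns: List[str]                   # declared columns, fixed by DDL
    partition_by: str                    # column whose value selects the partition
    colocate_with: Optional[str] = None  # table whose related rows share partitions
    redundancy: int = 1                  # number of redundant copies per partition
    num_partitions: int = 8              # assumed to match for colocated tables

    def partition_of(self, row: Dict[str, Any]) -> int:
        # Assumed policy: hash the partition key. Colocated tables that use the
        # same key and partition count therefore land on the same partition.
        key = str(row[self.partition_by]).encode("utf-8")
        return zlib.crc32(key) % self.num_partitions


def split_row(table: TableDef, row: Dict[str, Any]) -> Tuple[dict, dict]:
    """Separate declared columns from dynamic ones; no DDL is needed for extras."""
    missing = [c for c in table.columns if c not in row]
    if missing:
        raise ValueError(f"missing declared columns: {missing}")
    declared = {c: row[c] for c in table.columns}
    dynamic = {k: v for k, v in row.items() if k not in table.columns}
    return declared, dynamic


customers = TableDef("customers", ["customer_id", "name"],
                     partition_by="customer_id", redundancy=2)
orders = TableDef("orders", ["order_id", "customer_id"],
                  partition_by="customer_id", colocate_with="customers",
                  redundancy=2)

order_row = {
    "order_id": 1,
    "customer_id": 42,
    "gift_note": "Happy birthday",          # dynamic column, present on this row only
    "items": [{"sku": "A1", "qty": 2}],     # nested column
}
declared, dynamic = split_row(orders, order_row)
```

Because both tables are partitioned on `customer_id`, a customer and that customer's orders map to the same partition in this sketch, which is what keeps a join between them local.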

We draw inspiration for the data model from Pat Helland's 'Life beyond Distributed Transactions' by adopting entity groups as a first-class artifact that designers start with, and by defining relationships between entities within the group (associations based on reference as well as containment). Rationalizing the design around entity groups forces the designer to think about data access patterns and how the data will be colocated in partitions. We then cover why ACID properties and sophisticated querying become significantly less challenging to accomplish. There are many ideas around partitioning policies and the tradeoffs in supporting transactions and joins across entity groups that are worth discussing.
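
As a rough illustration of why entity groups simplify transactions, the Python sketch below routes every write by the key of its entity group. The names (`partition_for`, `plan_transaction`, `is_local`), the fixed partition count, and the CRC32 hashing are all assumptions made for the example, not part of the talk. A transaction whose writes all belong to one entity group touches exactly one partition and needs no distributed coordination; one that spans groups may fan out to several.

```python
import zlib
from typing import Dict, List, Tuple

NUM_PARTITIONS = 4  # hypothetical cluster size

# A write is (entity_group_key, entity_name, payload).
Write = Tuple[str, str, dict]


def partition_for(group_key: str) -> int:
    """Assumed placement policy: hash the entity group key."""
    return zlib.crc32(group_key.encode("utf-8")) % NUM_PARTITIONS


def plan_transaction(writes: List[Write]) -> Dict[int, List[Write]]:
    """Group the writes of one transaction by the partition that owns them."""
    plan: Dict[int, List[Write]] = {}
    for write in writes:
        plan.setdefault(partition_for(write[0]), []).append(write)
    return plan


def is_local(writes: List[Write]) -> bool:
    """True when a single partition can commit the whole transaction."""
    return len(plan_transaction(writes)) <= 1


# The customer, an order, and its line item all live in the same entity group.
same_group = [
    ("customer:42", "customer", {"name": "Ann"}),
    ("customer:42", "order", {"order_id": 1}),
    ("customer:42", "line_item", {"order_id": 1, "sku": "A1"}),
]

# Writes against two different entity groups; these may need coordination
# across partitions, depending on where the two groups are placed.
cross_group = [
    ("customer:42", "order", {"order_id": 2}),
    ("customer:77", "order", {"order_id": 3}),
]

print(is_local(same_group))
print(sorted(plan_transaction(cross_group)))
```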

The idea is to present a model and generate discussion on how to achieve the best of both worlds: flexible schemas without losing referential integrity, support for associations, and the power of SQL. It is ironic that NoSQL databases like MongoDB are becoming more popular as they begin to add SQL-like querying capabilities.

Finally, this summarizes all the different views shared at HPTS.
