Our project team is moving and revising applications to run in the Amazon Web Services (AWS) cloud environment. We plan to use the DynamoDB service recently announced by AWS.
As described at http://aws.amazon.com/dynamodb/: “Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.”
Our team is currently pondering how best to store and access the data for our application within the constraints of DynamoDB.
Amazon CTO Werner Vogels wrote a recent blog entry Amazon DynamoDB – a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications at http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html. That blog entry provides an overview of the History of NoSQL at Amazon. One key statement there:
· “Each service encapsulates its own data and presents a hardened API for others to use. Most importantly, direct database access to the data from outside its respective service is not allowed. This architectural pattern was a response to the scaling challenges that had challenged Amazon.com through its first 5 years, when direct database access was one of the major bottlenecks in scaling and operating the business.”
Within the realm of NoSQL styles, DynamoDB is best described as an Entity-Attribute-Value data model rather than a Document data model (such as MongoDB). Of course, a document object can always be serialized into the Value for an Attribute column of a particular Entity. However, the design space appears to favor storing separate data elements within an Entity in their own Attribute columns rather than combining all elements into a single document for the Entity.
Werner Vogels’ blog entry describes the schema model for DynamoDB:
· “DynamoDB tables do not have a fixed schema but instead allow each data item to have any number of attributes, including multi-valued attributes.”
· “In the current release, customers will have the choice of using two types of keys for primary index querying: Simple Hash Keys and Composite Hash Key / Range Keys:
o “Simple Hash Key gives DynamoDB the Distributed Hash Table abstraction. The key is hashed over the different partitions to optimize workload distribution. For more background on this please read the original Dynamo paper.
o “Composite Hash Key with Range Key allows the developer to create a primary key that is the composite of two attributes, a “hash attribute” and a “range attribute.” When querying against a composite key, the hash attribute needs to be uniquely matched but a range operation can be specified for the range attribute: e.g. all orders from Werner in the past 24 hours, all log entries from server 16 with clients IP addresses on subnet 192.168.1.0”
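The quoted Composite Hash Key / Range Key behavior can be sketched with a small in-memory Python stand-in (this is not the DynamoDB API; the class, table, and key names are illustrative): the hash key must match exactly, while the range key supports a range condition over rows kept sorted within each partition.

```python
from bisect import bisect_left, bisect_right

class CompositeKeyTable:
    """In-memory stand-in for a DynamoDB table with a
    Composite Hash Key / Range Key primary index."""

    def __init__(self):
        # hash key -> sorted list of (range_key, item) pairs
        self._partitions = {}

    def put_item(self, hash_key, range_key, item):
        rows = self._partitions.setdefault(hash_key, [])
        keys = [r for r, _ in rows]
        i = bisect_left(keys, range_key)
        if i < len(rows) and rows[i][0] == range_key:
            rows[i] = (range_key, item)      # same primary key: overwrite
        else:
            rows.insert(i, (range_key, item))

    def query(self, hash_key, range_min, range_max):
        """Hash key matched exactly; range condition on the range key."""
        rows = self._partitions.get(hash_key, [])
        keys = [r for r, _ in rows]
        return [item for _, item in
                rows[bisect_left(keys, range_min):bisect_right(keys, range_max)]]

# e.g. "all orders from Werner in the past 24 hours":
orders = CompositeKeyTable()
orders.put_item("werner", "2012-01-18T09:00", {"order": "books"})
orders.put_item("werner", "2012-01-19T08:00", {"order": "music"})
recent = orders.query("werner", "2012-01-19T00:00", "2012-01-19T23:59")
```

Using ISO-8601 timestamps as range keys makes lexicographic order equal chronological order, which is what makes the 24-hour query above a simple range condition.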
No Alternate Indexes?
So each table in DynamoDB is equivalent to a column family in Cassandra. You cannot sub-index ranges of column names within a hash key, as Cassandra allows, but the Composite Hash Key with Range Key provides the equivalent feature. The main difference is that the “indexed columns” appear as if they reside in separate sub-rows in DynamoDB.
Notice that DynamoDB explicitly excludes Alternate Key indexes to locate sets of rows by the data values stored in an attribute column.
What is that all about?
Pat Helland wrote an interesting paper in 2007 during the 2 years he was at Amazon.com: Life Beyond Distributed Transactions: An Apostate’s Opinion at http://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf (10 pages). When I searched for this paper again to get the URL, I noticed it is referenced by several other blog writers on the web. An influential little paper indeed…
Rinat Abdullin writes one of the blogs that reference Pat Helland’s paper. The blog entry Infinitely Scalable Systems at http://abdullin.com/wiki/infinitely-scalable-system.html provides an excellent summary of the important assumptions:
· “System manages entities (aggregate roots in DDD concept) that are identified by some globally unique id and always fit to a single machine or a small cluster. Examples of entities are: customer, patient, project etc. We can always ensure transactional consistency within a single entity.
· “As system grows, these entities will be redistributed between various machines in order to accommodate the load. We can never know for sure if certain entities are on the same machine or not. Hence, it is impossible to update two logically referenced entities in a single transaction.
· “Architecture should consider and build upon these limitations from the start, using messaging to provide communication between the different entities.
· “Infrastructure can usually guarantee that messages will be delivered at least once. Due to infrastructure limitations and costs, order and single delivery are not always guaranteed. Activities (sagas), that represent partnership of the entity with some specific sender, are responsible for handling such interactions, implementing ordering and reduplication as needed. Idempotent operations generally do not need this.
· “Real-world businesses and operations have been established in the world of slow and unreliable communications (snail mail, telegrams, phones etc.). They already accommodate logic of activities, uncertainty, assumptions and eventually consistent systems in order to stay successful and profitable.”
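The at-least-once delivery point in the list above is worth making concrete. A minimal sketch (all names here are illustrative, not part of any AWS or messaging API): the receiving entity remembers the ids of messages it has already applied, so a redelivered duplicate of a non-idempotent operation is ignored.

```python
class AccountEntity:
    """Entity that handles at-least-once message delivery by
    deduplicating on message id before applying the operation."""

    def __init__(self):
        self.balance = 0
        self._seen = set()   # ids of messages already applied

    def handle(self, message_id, amount):
        if message_id in self._seen:
            return False              # duplicate delivery: ignore
        self._seen.add(message_id)
        self.balance += amount        # the non-idempotent operation
        return True

account = AccountEntity()
account.handle("msg-1", 100)
account.handle("msg-1", 100)   # redelivered duplicate has no effect
```

As the quoted text notes, truly idempotent operations would not need the `_seen` bookkeeping at all; the set (which in practice would itself be persisted with the entity) exists only to make a non-idempotent update safe under redelivery.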
Mr. Abdullin expands on these concepts in a creative piece titled Space Travel and Infinitely Scalable Solutions at http://abdullin.com/journal/2010/8/22/space-travel-and-infinitely-scalable-solutions.html.
His blog also links to other papers by Pat Helland in this entry: http://abdullin.com/wiki/pat-helland.html.
The science fiction analogies in the Space Travel… piece expand on this amusing footnote #15 from Pat Helland’s paper:
· “At the time of writing this paper, funny action at a distance has not been proven and we are limited by the speed of light. There ain’t no such thing as simultaneity at a distance…”
Anyway, the ideas in Mr. Helland’s paper and Mr. Abdullin’s blog explain why we only get one index per table in DynamoDB. We need to constrain all update transactions to data associated with a single Entity (also called an aggregate root in several Domain Driven Design and Command/Query Responsibility Segregation (CQRS) presentations). Alternate Key indexes would span several Entities and could not be updated within the same transaction as the Entity’s data.
So What’s A Poor Developer To Do?
Explicitly manage your alternate access paths within their own Entities.
Treat each Alternate Key value as the primary Hash Key of a supporting Entity whose purpose is to identify the primary keys of other Entities.
Update each alternate access path index entry in its own transaction, in keeping with Pat Helland’s observations. This requires some kind of message bus infrastructure.
We are entering the realm of the Actor programming model as well as the task-oriented command and event sourcing concepts discussed by CQRS advocates. Commands targeted to a primary entity generate additional update task events to secondary related entities.
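The pattern in the last few paragraphs can be sketched as follows (a simplified in-memory sketch; the entity, command, and event names are illustrative): a command updates the primary Entity and emits an event, and a separate handler later applies that event to the alternate-index Entity in its own step, standing in for the message-bus hop between the two transactions.

```python
# Primary entities keyed by customer id; the alternate access path
# (lookup by email) is its own entity, keyed by the email value.
customers = {}       # customer_id -> {"email": ...}
email_index = {}     # email -> set of customer ids (the "index entity")
event_queue = []     # stand-in for a message bus

def handle_set_email_command(customer_id, email):
    """First 'transaction': update the primary entity, then emit an
    event so the index entity is updated separately."""
    customers[customer_id] = {"email": email}
    event_queue.append(("email_changed", customer_id, email))

def process_index_events():
    """Second 'transaction' (per event): apply queued events
    to the index entity."""
    while event_queue:
        _, customer_id, email = event_queue.pop(0)
        email_index.setdefault(email, set()).add(customer_id)

handle_set_email_command("cust-42", "werner@example.com")
# Until the event is processed, the index is stale but the system is safe.
process_index_events()
```

Between the command and the event processing, the index lags the primary Entity; that window is exactly the staleness the rest of this note asks you to tolerate.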
Ed Anuff provides a good discussion of the indexing concepts in his blog entry Indexing in Cassandra at http://www.anuff.com/2011/02/indexing-in-cassandra.html.
As already mentioned, DynamoDB differs from Cassandra in its intentional omission of Secondary Indexes. Mr. Anuff uses the term Alternate Indexes to refer both to Cassandra’s Secondary Indexes and to manually managed indexing techniques. For DynamoDB we can skim over Mr. Anuff’s discussion of Secondary Indexes and focus on his suggestions for “Wide rows” and CF-based Indexes and Inverted Indexes Using Composite Column Names.
Apply a minor conceptual mapping/translation when reading Mr. Anuff’s discussion. When he refers to Cassandra indexes based on the Index Key / Column Name, think instead of DynamoDB’s Composite Hash Key with Range Key.
Also accept the fact that data in your self-managed index tables, like any derived cache, can be a little stale. That is the reality of these highly distributed Infinitely Scalable architectures. Remember Mr. Abdullin’s Space Travel… communication analogies. When an index access returns a set of primary keys for other Entity rows, accept that some of those rows may no longer exist, or may no longer contain the searched alternate key value. Filter those rows out as part of your post-index access logic. Also accept that you may miss other recently-updated Entity rows that now contain that alternate key value.
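That post-index filtering step might look like this sketch (in-memory stand-ins; function and key names are illustrative): the index entity returns candidate primary keys, and each candidate Entity row is re-read and verified to still carry the searched value before it is returned.

```python
def lookup_by_email(email, email_index, customers):
    """Resolve an alternate-key lookup, filtering out stale
    index entries against the current primary Entity rows."""
    results = []
    for customer_id in email_index.get(email, set()):
        entity = customers.get(customer_id)
        if entity is None:
            continue                      # Entity row no longer exists
        if entity.get("email") != email:
            continue                      # row no longer carries this value
        results.append(customer_id)
    return results

# The index is slightly stale: cust-2 was deleted and cust-3 changed email,
# but the index entity still lists both under "a@x".
customers = {"cust-1": {"email": "a@x"}, "cust-3": {"email": "b@x"}}
email_index = {"a@x": {"cust-1", "cust-2", "cust-3"}}
verified = lookup_by_email("a@x", email_index, customers)
```

Note the asymmetry the text warns about: filtering can drop stale hits, but nothing here can surface a recently-updated row the index has not caught up with yet.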
Remember that the real world of business has been operating with these kinds of constraints for centuries…