Where We Left Off

In my previous post, I explained that our project team is writing applications to run in the Amazon Web Services (AWS) cloud environment and we plan to use the DynamoDB service recently announced by AWS.

I summarized the available index mechanisms in DynamoDB and some of the motivations for the explicit omission of alternate key indexes.

As part of my own education, I chose to explore some data storage model options that might be relevant when using DynamoDB.

This exercise references the News Goggles imaginary web application described elsewhere on this blog site.

The Basis of Domain Data Modeling

Regardless of the data storage tools used by a software application, the existing business domain provides structure to the relevant data.

The third normal form concepts developed with relational database theory continue to provide a relevant method to describe the data for a business domain. The essence of third normal form is that data can be organized into Entities and that these Entities participate in various Relationships with each other. Detailed facts can be modeled as Attributes. Each Attribute can be assigned directly to an Entity; in addition, an Attribute can be assigned to a Relationship.

In grammar terms:

·        Entities are analogous to Nouns.

·        Relationships are analogous to Verbs.

·        Attributes are analogous to Adjectives (when assigned to Entities) or Adverbs (when assigned to Relationships).

Relational Entity/Relationship Model

Here is one possible relational data model for the News Goggles application. The model is presented as a set of prototype tables with representative sample data values included.

Notations:

·        PK = Primary Key

·        SA = System-Assigned

·        UA = User-Assigned

·        FK = Foreign Key

·        NN = No Nulls

Colors:

·        Black = Entity (also indicated by PK notation)

·        Red = Relationship (also indicated by FK notation)

·        Blue = Attribute (also indicated by lack of PK or FK notation)

Here are two minor Entities in our model.

Provider

Provider Code

PK, UA

$ESPN

$FOX

$CNN

… (etc.) …

Topic

Topic Code

PK, UA

#sports

#baseball

#detroit

… (etc.) …

The Article Entity has a one-to-many foreign-key relationship to the Provider of each Article, and a Subject Attribute.

Article

Article Id

Provider Code

Subject

PK, SA

FK, NN

NN

1001

$ESPN

Tigers sign Prince Fielder.

1002

$ESPN

Peyton Manning home team locker to be used by arch-nemesis Tom Brady?

2001

$FOX

Newt nabs South Carolina.

… (etc.) …

… (etc.) …

. . . (etc.) …

This table defines a many-to-many Relationship between Articles and Topics. It has a two-column primary-key consisting of the two related foreign-key values. It has no modifying Attributes of the Relationship.

Article-Topic

Article Id

Topic Code

PK, FK

PK, FK

1001

#sports

1001

#baseball

1001

#detroit

1002

#sports

1002

#indianapolis

1002

#football

… (etc.) …

… (etc.) …

This Entity represents our Subscribers. We include a Hashed Password Attribute to demonstrate a column that may contain null values for some rows.

Subscriber

Subscriber Code

Hashed Password

PK, UA

 

@Alice

Dkjaafjjkjemded

@Bob

 

@Chris

Afdjajewevwwb

@Pat

 

… (etc.) …

.  . . (etc.) …

Here are additional many-to-many Relationship tables to store Subscriber Preferences.

Subscriber-Likes-Topic

Subscriber Code

Topic Code

PK, FK

PK, FK

@Alice

#politics

@Alice

#sports

@Bob

#politics

… (etc.) …

… (etc.) …

Subscriber-Dislikes-Topic

Subscriber Code

Topic Code

PK, FK

PK, FK

@Bob

#democrats

… (etc.) …

… (etc.) …

Subscriber-Likes-Provider

Subscriber Code

Provider Code

PK, FK

PK, FK

@Alice

$CNN

@Alice

$MSNBC

@Pat

$FOX

… (etc.) …

… (etc.) …

Subscriber-Dislikes-Provider

Subscriber Code

Provider Code

PK, FK

PK, FK

@Bob

$CNN

@Chris

$FOX

… (etc.) …

… (etc.) …

These tables define three-way Relationships to store each Subscriber’s special case exception preferences.

Subscriber-Allows-Provider-Topic

Subscriber Code

Provider Code

Topic Code

PK, FK

PK, FK

PK, FK

@Bob

$CNN

#detroit

… (etc.) …

… (etc.) …

… (etc.) …

Subscriber-Blocks-Provider-Topic

Subscriber Code

Provider Code

Topic Code

PK, FK

PK, FK

PK, FK

@Pat

$FOX

#opinion

… (etc.) …

… (etc.) …

… (etc.) …

DynamoDB Data Model

With NoSQL databases like DynamoDB, our storage constraints no longer require second and third normal form rules to be followed in all situations. We still choose to start with a normalized data model since it clearly defines the essential business domain structure for our data.

Let’s explore some possible ways to adapt our relational model to a column-store data model for the News Goggles application for use with DynamoDB.

New Notations:

·        HK = Hash Key (only supports exact match value lookups)

·        RK = Range Key (supports sort-order based range and prefix lookups)

·        value1 | value2 = Composite Key Value

·        <prefix> = Literal prefix value for partitioning Range Key values

·        [ hash-value // range-value ] = Composite Foreign Key Value

·        { value1, value2, …, valueN } = Set-based multi-values stored within an attribute column.

New Colors:

·        Blue = Attribute stored at “primary source of truth”

·        Green Underlined = Derived copy of attribute stored at “secondary source of truth”

The Provider and Topic are minor entities in the example data with low cardinality of expected values. One approach is to create an artificial Minor Entity table for storing these.

Storage of the actual Provider Code and Topic Code values in the Range Key allows exact lookups for a given Provider or Topic; it also allows scanning all Provider or Topic rows without scanning the entire Minor Entities table.  (Remember that full table scans can be expensive in DynamoDB, as they require a full map/reduce operation involving at least one replica of each storage segment in the cluster.)

We can revisit this decision later and give each of these entities their own column-store tables, but let’s consider this approach for now.

Minor Entities

HK

RK

Entity Type

Entity Code

Provider

$ESPN

$FOX

$CNN

… (etc.) …

Topic

#baseball

#detroit

#football

#indianapolis

#sports

… (etc.) …

We expect the Article entity to accumulate a very large data set. In addition, we will likely need to have other entity rows hold sets of foreign key references to articles. Let’s give it its own column-store table.

We can represent a one-to-many Relationship to the Provider minor Entity in the same way as the relational model.

We can also represent a many-to-many relationship to the Topic minor Entity by using a set-based multi-value Attribute.

Articles

HK

RK

 

 

 

Article Id

(not used)

Provider Code

Topic Code Set

Subject

1001

$ESPN

{ #sports, #baseball, #detroit }

Tigers sign Prince Fielder.

1002

$ESPN

{ #sports, #indianapolis, #football }

Peyton Manning home team locker to be used by arch-nemesis Tom Brady?

2001

$FOX

{ #politics, #republicans }

Newt nabs South Carolina.

… (etc.) …

… (etc.) …

… (etc.) …

. . . (etc.) …

Managing Our Own Secondary Indexes

Given these two column-stores, how do we locate all Article rows published by a given Provider? As modeled so far, we will need to perform a full-table scan of the Articles table. The same concern applies to locating all Articles for a given Topic.

We can maintain an alternate key index mechanism by storing foreign-key values in set-based secondary attributes on the Minor Entities table.

Minor Entities

HK

RK

 

Entity Type

Entity Code

Related Articles

Provider

$ESPN

{ 1001, 1002 }

$FOX

{ 2001, 2002 }

$CNN

{ 3001, 3002 }

… (etc.) …

. . . (etc.) …

Topic

#baseball

{ 1001 }

#detroit

{ 1001, 3001 }

#football

{ 1002 }

#indianapolis

{ 1002, 4002 }

#sports

{ 1001, 1002 }

… (etc.) …

. . . (etc.) …

If we define additional Minor Entity types that do not relate to Articles, we simply omit the Related Articles attribute for those rows.

Note that DynamoDB does not provide any automatic join query mechanisms for traversing foreign-key paths. We must explicitly develop our own application logic to follow any such paths we define.

We can also get creative in our use of the Hash Key / Range Key mechanism on the Articles table. In our business domain, each Article has one and only one Provider. We can treat Article as a dependent entity and use this fact in our definition of the primary key structure.

Articles

HK

RK

 

 

 

Provider Code

Article Id

Provider Code

Topic Code Set

Subject

$ESPN

1001

$ESPN

{ #sports, #baseball, #detroit }

Tigers sign Prince Fielder.

1002

$ESPN

{ #sports, #indianapolis, #football }

Peyton Manning home team locker to be used by arch-nemesis Tom Brady?

$FOX

2001

$FOX

{ #politics, #republicans }

Newt nabs South Carolina.

2002

$FOX

{ #usa, #opinion }

Do we need to go back to the moon?

… (etc.) …

… (etc.) …

… (etc.) …

. . . (etc.) …

With this modification, the primary key value for each Article becomes a composite value. Some examples:

·        [ $ESPN // 1001 ]

·        [ $ESPN // 1002 ]

·        [ $FOX // 2001 ]

We can go further and include the original publication date for each Article in its composite key. This allows us to scan for a subset of articles for a given Provider published in a range of dates. For example, if we store the year and month of publication:

·        [ $ESPN // 2012-01 | 1001 ]

·        [ $ESPN // 2012-02 | 1002 ]

·        [ $MSNBC // 2012-01 | 4001 ]

·        [ $MSNBC // 2012-02 | 4002 ]

(This improvement will also help us purge older Articles from our active storage when that need arises.)

Choosing the Granularity of the Aggregate Root for an Entity

Based on initial review of the DynamoDB storage mechanisms, it appears that all rows in a column-store table that share the same Hash Key value must be able to fit together on a single cluster node. Each set of rows with a shared Hash Key value might be replicated to other nodes, but the set must fit completely on each such node. Some presentations use the phrase Aggregate Roots to refer to these Hash Key values shared by several rows in the storage system.

Suppose each Provider is very prolific, and that our application requirements require us to store long term archives of articles. (Perhaps we service a news history research population.) The large set of Articles from a given Provider may interfere with the load-leveling mechanisms of DynamoDB. The Aggregate Root storage footprint of some Hash Key values may become excessive.

We can reduce our single-node storage footprint by moving the publication month into the Hash Key portion of the primary key. This has the effect of partitioning each Aggregate Root into smaller subsets.

Articles

HK

RK

 

 

 

Provider Code | Publication Month

Article Id

Provider Code

Topic Code Set

Subject

$ESPN | 2012-01

1001

$ESPN

{ #sports, #baseball, #detroit }

Tigers sign Prince Fielder.

$ESPN | 2012-02

1002

$ESPN

{ #sports, #indianapolis, #football }

Peyton Manning home team locker to be used by arch-nemesis Tom Brady?

$FOX | 2012-01

2001

$FOX

{ #politics, #republicans }

Newt nabs South Carolina.

2002

$FOX

{ #usa, #opinion }

Do we need to go back to the moon?

… (etc.) …

… (etc.) …

… (etc.) …

. . . (etc.) …

With this revision, we can no longer scan for all articles for a given Provider. We must instead walk each month within a range of interest and query with exact Hash Key values such as [ $FOX | 2012-01 ]

We can recover our Provider scan ability by maintaining an alternate-key secondary Attribute in each Provider’s Minor Entities row. Remember this is derived data that we must maintain with our own application logic.

A single fetch to a Provider row yields the set of year/month values that contain Articles from that publisher. There is no need to attempt to access any other year/month values for that Provider.

Minor Entities

 

HK

RK

 

 

Entity Type

Entity Code

Related Articles

Publication Months

Provider

$ESPN

 

{ 2011-12, 2012-01, 2012-02, 2012-03 }

$FOX

 

{ 2011-12, 2012-01, 2012-02, 2012-03 }

$CNN

 

{ 2011-12, 2012-01, 2012-02, 2012-03 }

$WilsonQuarterly

 

{ 2011-01, 2011-04, 2011-07, 2011-10, 2012-01 }

… (etc.) …

 

. . . (etc.) …

Topic

#baseball

{ [ $ESPN | 2012-01 // 1001 ] }

 

#detroit

{ [ $ESPN | 2012-01 // 1001 ], [ $CNN | 2012-02 // 3001 }

 

… (etc.) …

. . . (etc.) …

 

The differing sets of attribute columns for the separate rows in the Minor Entities table should not cause much of a problem in DynamoDB according to the documentation currently available. However, this might cause issues with our application software’s data access layer. It all depends on how much we need or want relational style normalization.

Modeling our Subscribers and their Preferences

For our Subscriber entity here is one way to use the Range Key feature for storing the different preferences options.

Subscribers

 

 

 

 

HK

RK

 

 

 

 

Subscriber Code

(unnamed)

Hashed Password

Topics

Providers

Provider-Topics

@Alice

<BASE>

Dkjaafjjkjemded

 

 

 

<LIKES>

 

{ #politics, #sports }

{ $CNN, $MSNBC }

 

@Bob

<BASE>

 

 

 

 

<LIKES>

 

{ #politics, #detroit }

 

 

<DISLIKES>

 

{ #democrats }

{ $CNN }

 

<ALLOWS>

 

 

 

{ $CNN | #detroit }

@Chris

<BASE>

Afdjajewevwwb

 

 

 

<LIKES>

 

{ #politics }

 

 

<DISLIKES

 

 

{ $FOX }

 

@Pat

<BASE>

 

 

 

 

<LIKES>

 

{ #sports, #republicans }

{ $FOX }

 

<BLOCKS>

 

 

 

{ $FOX | #opinion }

Summary

This exploration has been based on conceptual analysis rather than any actual experience with NoSQL column-stores like DynamoDB.

Even in the tried and true world of relational database modeling, application developers have sometimes chosen to depart from third normal form. Their reasons may have been noble, or perhaps just expedient. In either case, some of these concepts have been used before.

Relational database systems operate on a single server or a small number of tightly-coupled cluster nodes. Within such an environment, those systems can provide automatic and consistent update of alternate key indexes. They also provide adaptive query optimizers that can perform join operations to traverse any number of foreign-key paths across several separate tables.

DynamoDB and similar systems provide highly available and scalable data storage mechanisms designed to make use of large numbers of loosely-coupled cluster nodes. In order to reach the associated performance goals, some of the automatic features of relational database systems must be omitted. The application logic must explicitly manage alternate key indexing and foreign-key joins.

Hopefully this short data modeling exercise suggests some options that can be applied to applications using NoSQL data stores.

Advertisements