March 20, 2019 - 9 min read
If you’ve got relational data, then there is a fair argument that you should use a relational database. However, there’s still a lot of good reasons to use a NoSQL store like DynamoDB:
DynamoDB requires you to set up a
Primary Index for a table, called the
HASH and optionally a sort key called the
Documents are grouped onto partitions by their
HASH key. Additionally, documents are stored in order with an optional
RANGE key. You can think of each unique
HASH as being its own conceptual table within the table. You can query for everything for a given
HASH key, and if you specified a
RANGE key, we can query for specific values or ranges of values.
Queries over any other attributes of a document require a full table scan which will become slower the larger your table gets.
DynamoDB also allows you to define additional indexes over your table.
I am creating a simple serverless application so that devs in the Scott Logic London office could order their breakfast for our weekly technical breakfasts.
I highlighted the following entities:
Breakfast - an event for which devs can order breakfast.
Order - a dev’s breakfast.
Item - An order is for an item, and an office has a list of possible items.
Dev - a hungry dev.
A Breakfast has many Orders, which has Items, which is made by a Dev. Simple.
With an SQL database, you model your tables on your relationships (which are usually fairly easy to understand). Following that you can flexibly query your data for most use cases (DBAs will be screaming here, I know it isn’t that simple).
NoSQL databases give you a lot of luxuries, but not that. To store relational data in a database you need to understand your access patterns up front. I created a requirements table for my simple domain highlighting what queries are needed. Throughout this post, we’re going to be referring back to these requirements as we make them possible.
There is also a constraint I want to enforce:
So we’ve got four entities: Breakfast, Order, _Item, and Dev. To fulfil requirements #1 and #2, we need to be able to access entities by their ID. To do this, I could put all of the entities in their table with unique columns names like so:
Since the entities have different attributes, this would require the creation of a Global Secondary Index (GSI) for every single entity.
Whilst AWS’s soft limit of 20GSIs per table means that this is a feasible approach for a lot of applications, it gets difficult as the complexity of your service increases. Also, there is additional storage and provisioned capacity costs for each index. We can be a little more efficient than that.
Instead, we can “overload” the
Primary Index by creating two more generic columns: a
HASH key named
PartitionKey and a
RANGE key called
SortKey. The name of the index attributes no longer reflect the attribute’s content. At the downside of less intuitive attribute names, we gain the ability to use a single index for all our entity access.
Look at our new
Primary Index represented by the table below.
|PartitionKey (PI-HASH)||SortKey (PI-SORT)||(…Additional Columns)|
#1 Get a Breakfast by ID - Querying the
Primary Indexby the breakfast ID will return the individual breakfast record.
The next access pattern we are going to solve is the ability to get all entities of a given type.
In a relational database, records are categorised into respective tables, but since our NoSQL table structure is a mashup of different entities we need to approach this in a different way.
At the moment, it isn’t possible to query the above
Primary Index for anything other than entity IDs. You may have noticed in the above examples that we have stored a String representing the entity type in the
SortKey column. This was intentional! By creating an additional
Global Secondary Index (
GSI) with a
HASH on the
SortKey column, it is now possible to query for specific entities by querying for their type.
|SortKey (GSI HASH)||Partition Key||(…Additional Columns)|
We’ve now completed two more of the application’s read requirements.
#2 Get all breakfast Items - Querying the
GSIwith the value
ITEMwill return all of the Item entities.
#3 Get all Breakfasts - Querying the
GSIwith the value
BREAKFASTwill return all breakfasts.
Here we’re going to use a pattern called the “Adjacency List Pattern” to model our relational data. The core idea is that we model our relations as a graph, with a document entry for each node (entity) and edge (relationship). Let’s take the relationship between the Breakfast entity and the Order entity. As defined above, we have a one to many relationships here, each Breakfast can have many Orders.
We’ve already got records for all of our entities, so let’s add records for our relationship. Let’s also add a prefix to our identifiers.
|PartitionKey (PI-HASH)||SortKey (PI-SORT, GSI-HASH)||(…Additional Fields)|
As you can see, to add an Order for a Breakfast, we’ve added a record representing the relationship under the
HASH of the Breakfast the Order belongs to.
This allows us to query the
Primary Index for all records with a
BREAKFAST-1. This will return us the Breakfast entity and all of its relationships. We can efficiently filter this by querying with a *starts with
ORDER- condition on our
RANGE). This will return us all Orders for the Breakfast, without the Breakfast entity itself.
We’ve now fulfilled our two more requirements.
4. Get all Orders for a Breakfast - Querying the
HASHfor the Breakfast ID, and filtering the
Sort Keyfor records that begin with
5. Get all Orders for a User - Querying the
HASHfor all records containing
If we extend our
GSI to use a ‘composite key’ (an index using a
HASH and a
RANGE), by indexing an additional generic
Data attribute, we can add another dimension of queries. This allows us to populate the Breakfast’s date in the
Data attribute, and query it within a range.
Here is what the table looks like from the perspective of our
|PartitionKey (PI-HASH)||SortKey (PI-SORT)||Data||(…Additional Fields)|
Here is what our table now looks like from the perspective of our
|SortKey (GSI-HASH)||Data (GSI-RANGE)||PartitionKey||(…Additional Fields)|
Notice that the two Order entries would be ignored in this index as they do not have a Data column value.
We can now query our
GSI for the
BREAKFAST to get all breakfasts and provide a date range on our
RANGE key to get breakfasts between two dates.
#6 Get all Breakfasts between a certain date - Query the
RANGEbetween two date values.
There is still a constraint that we’ve yet to apply to the table.
- There can only be one breakfast on a given day
The only constraint that DynamoDB gives us is that there can only be a single record for a given index key (
RANGE if composite). We can be clever with our entity keys to enforce unique constraints on certain data values.
For the above constraint, if we add the breakfast date to the Breakfast entities ID, we can ensure that there can never exist two breakfasts for a given date.
|PartitionKey (HASH)||SortKey (PI-SORT, GSI-HASH)||Data (GSI-RANGE)||(…Additional Fields)|
If we try to add an additional breakfast entry on
2019-04-29, we’ll just update the record in question. If we want to ensure we don’t overwrite existing records we can also use DynamoDB’s API
put a new record if no entry already exists.
This post has demonstrated a technique for modelling relational data in DynamoDB. Whilst this DynamoDB table technically has no schema, there is most definitely an implied schema. You can see that even in this simple domain that the table isn’t particularly human readable, so without proper documentation things can get quite hard to understand. Additionally, you will undoubtedly be pushing some of the relational logic into your app layer when creating entities, so make sure you abstract this from your business logic.
Using DynamoDB has some amazing benefits and can really get you up and running with a production-ready database in a short space of time. Whether this works long term for your application is really a call you’ll have to make.
As always, abstract the database code from your business code, and be pragmatic - don’t be afraid to iterate quickly and throw things away when they don’t work anymore!
Checkout this (GitHub)[https://github.com/janakerman/blog-relational-dynamo] project and follow the instructions in the README.md to create a CloudFormation stack seeded with the data structure described in this post.
Senior Full Stack Developer @ Scott Logic. London, UK.