March 20, 2019 - 9 min read
If you’ve got relational data, then there is a fair argument that you should use a relational database. However, there’s still a lot of good reasons to use a NoSQL store like DynamoDB:
DynamoDB requires you to set up a Primary Index
for a table, called the HASH
and optionally a sort key called the RANGE
.
Documents are grouped onto partitions by their HASH
key. Additionally, documents are stored in order with an optional RANGE
key. You can think of each unique HASH
as being its own conceptual table within the table. You can query for everything for a given HASH
key, and if you specified a RANGE
key, we can query for specific values or ranges of values.
Queries over any other attributes of a document require a full table scan which will become slower the larger your table gets.
DynamoDB also allows you to define additional indexes over your table.
I am creating a simple serverless application so that devs in the Scott Logic London office could order their breakfast for our weekly technical breakfasts.
I highlighted the following entities:
Breakfast - an event for which devs can order breakfast.
Order - a dev’s breakfast.
Item - An order is for an item, and an office has a list of possible items.
Dev - a hungry dev.
A Breakfast has many Orders, which has Items, which is made by a Dev. Simple.
With an SQL database, you model your tables on your relationships (which are usually fairly easy to understand). Following that you can flexibly query your data for most use cases (DBAs will be screaming here, I know it isn’t that simple).
NoSQL databases give you a lot of luxuries, but not that. To store relational data in a database you need to understand your access patterns up front. I created a requirements table for my simple domain highlighting what queries are needed. Throughout this post, we’re going to be referring back to these requirements as we make them possible.
There is also a constraint I want to enforce:
So we’ve got four entities: Breakfast, Order, _Item, and Dev. To fulfil requirements #1 and #2, we need to be able to access entities by their ID. To do this, I could put all of the entities in their table with unique columns names like so:
BreakfastId | ItemId | OrderId | UserId | Name | Date | Item | (…Additional Columns) |
---|---|---|---|---|---|---|---|
1 | 2019-04-29 | … | |||||
11 | Bacon Sandwich | … | |||||
0001 | Item-1 | … | |||||
janakerman | Jan Akerman | … |
Since the entities have different attributes, this would require the creation of a Global Secondary Index (GSI) for every single entity.
Whilst AWS’s soft limit of 20 GSIs per table means that this is a feasible approach for a lot of applications, it gets difficult as the complexity of your service increases. Also, there is additional storage and provisioned capacity costs for each index. We can be a little more efficient than that.
Instead, we can “overload” the Primary Index
by creating two more generic columns: a HASH
key named PartitionKey
and a RANGE
key called SortKey
. The name of the index attributes no longer reflect the attribute’s content. At the downside of less intuitive attribute names, we gain the ability to use a single index for all our entity access.
Look at our new Primary Index
represented by the table below.
PartitionKey (PI-HASH) | SortKey (PI-SORT) | (…Additional Columns) |
---|---|---|
1 | BREAKFAST | … |
11 | ITEM | … |
0001 | ORDER | … |
janakerman | USER | … |
#1 Get a Breakfast by ID - Querying the
Primary Index
by the breakfast ID will return the individual breakfast record.
The next access pattern we are going to solve is the ability to get all entities of a given type.
In a relational database, records are categorised into respective tables, but since our NoSQL table structure is a mash up of different entities we need to approach this in a different way.
At the moment, it isn’t possible to query the above Primary Index
for anything other than entity IDs. You may have noticed in the above examples that we have stored a String representing the entity type in the SortKey
column. This was intentional! By creating an additional Global Secondary Index
(GSI
) with a HASH
on the SortKey
column, it is now possible to query for specific entities by querying for their type.
SortKey (GSI HASH) | Partition Key | (…Additional Columns) |
---|---|---|
BREAKFAST | 1 | … |
ITEM | 11 | … |
ORDER | 001 | … |
USER | janakerman | … |
We’ve now completed two more of the application’s read requirements.
#2 Get all breakfast Items - Querying the
GSI
with the valueITEM
will return all of the Item entities.
#3 Get all Breakfasts - Querying the
GSI
with the valueBREAKFAST
will return all breakfasts.
Here we’re going to use a pattern called the “Adjacency List Pattern” to model our relational data. The core idea is that we model our relations as a graph, with a document entry for each node (entity) and edge (relationship). Let’s take the relationship between the Breakfast entity and the Order entity. As defined above, we have a one to many relationships here, each Breakfast can have many Orders.
We’ve already got records for all of our entities, so let’s add records for our relationship. Let’s also add a prefix to our identifiers.
PartitionKey (PI-HASH) | SortKey (PI-SORT, GSI-HASH) | (…Additional Fields) |
---|---|---|
BREAKFAST-1 | BREAKFAST | … |
BREAKFAST-2 | BREAKFAST | … |
BREAKFAST-1 | ORDER-0001 | … |
BREAKFAST-1 | ORDER-0002 | … |
ORDER-0001 | USER-janakerman | … |
ORDER-0002 | USER-hungrydev | … |
USER-janakerman | Jan Akerman | … |
USER-hungrydev | Hungry Dev | … |
As you can see, to add an Order for a Breakfast, we’ve added a record representing the relationship under the HASH
of the Breakfast the Order belongs to.
This allows us to query the Primary Index
for all records with a PartitionKey
(HASH
) of BREAKFAST-1
. This will return us the Breakfast entity and all of its relationships. We can efficiently filter this by querying with a *starts with ORDER-
condition on our SortKey
(RANGE
). This will return us all Orders for the Breakfast, without the Breakfast entity itself.
We’ve now fulfilled our two more requirements.
4. Get all Orders for a Breakfast - Querying the
Primary Index
HASH
for the Breakfast ID, and filtering theSort Key
for records that begin withORDER-
.
5. Get all Orders for a User - Querying the
GSI
HASH
for all records containingUSER-janakerman
.
If we extend our GSI
to use a ‘composite key’ (an index using a HASH
and a RANGE
), by indexing an additional generic Data
attribute, we can add another dimension of queries. This allows us to populate the Breakfast’s date in the Data
attribute, and query it within a range.
Here is what the table looks like from the perspective of our Primary Index
.
PartitionKey (PI-HASH) | SortKey (PI-SORT) | Data | (…Additional Fields) |
---|---|---|---|
BREAKFAST-1 | BREAKFAST | 2019-04-22 | |
BREAKFAST-2 | BREAKFAST | 2019-04-29 | |
BREAKFAST-1 | ORDER-0001 | ||
BREAKFAST-1 | ORDER-0002 | ||
… |
Here is what our table now looks like from the perspective of our GSI
:
SortKey (GSI-HASH) | Data (GSI-RANGE) | PartitionKey | (…Additional Fields) |
---|---|---|---|
BREAKFAST | 2019-04-22 | BREAKFAST-1 | |
BREAKFAST | 2019-04-29 | BREAKFAST-2 | |
… |
Notice that the two Order entries would be ignored in this index as they do not have a Data column value.
We can now query our GSI
for the HASH
of BREAKFAST
to get all breakfasts and provide a date range on our RANGE
key to get breakfasts between two dates.
#6 Get all Breakfasts between a certain date - Query the
GSI
for theHASH
ofBREAKFAST
and aRANGE
between two date values.
There is still a constraint that we’ve yet to apply to the table.
- There can only be one breakfast on a given day
The only constraint that DynamoDB gives us is that there can only be a single record for a given index key (HASH
or HASH
and RANGE
if composite). We can be clever with our entity keys to enforce unique constraints on certain data values.
For the above constraint, if we add the breakfast date to the Breakfast entities ID, we can ensure that there can never exist two breakfasts for a given date.
PartitionKey (HASH) | SortKey (PI-SORT, GSI-HASH) | Data (GSI-RANGE) | (…Additional Fields) |
---|---|---|---|
BREAKFAST-2019-04-22 | BREAKFAST | 2019-04-22 | |
BREAKFAST-2019-04-29 | BREAKFAST | 2019-04-29 | |
… |
If we try to add an additional breakfast entry on 2019-04-22
or 2019-04-29
, we’ll just update the record in question. If we want to ensure we don’t overwrite existing records we can also use DynamoDB’s API put
a new record if no entry already exists.
This post has demonstrated a technique for modelling relational data in DynamoDB. Whilst this DynamoDB table technically has no schema, there is most definitely an implied schema. You can see that even in this simple domain that the table isn’t particularly human readable, so without proper documentation things can get quite hard to understand. Additionally, you will undoubtedly be pushing some of the relational logic into your app layer when creating entities, so make sure you abstract this from your business logic.
Using DynamoDB has some amazing benefits and can really get you up and running with a production-ready database in a short space of time. Whether this works long term for your application is really a call you’ll have to make.
As always, abstract the database code from your business code, and be pragmatic - don’t be afraid to iterate quickly and throw things away when they don’t work any more!
Checkout this GitHub project and follow the instructions in the README.md to create a CloudFormation stack seeded with the data structure described in this post.
Engineer @ Form3. UK.
Github: @janakerman