MongoDb Schema Design / by James Heywood

Before we get started, if you haven't heard of MongoDb or are unfamiliar with document stores (as opposed to relational databases) have a quick read through the MongoDb site as a primer for this post here; All done? Good then as long as we're all sitting comfortably I'll begin.

So recently, (well I say recently, it was actually a few months ago) I worked on a customer engagement platform for a high end car manufacturer (Porsche) which I may or may not be at liberty to discuss, so let's just call them Car, to avoid any litigation from previous employers and third parties. The application is known as CarPlus.

As a note when I first published this article my previous employer rang me up to complain about my use of the name Porsche (as if somehow mentioning this was a sin?!) so I amended the article at their request, with hindsight I feel that I should have just stuck to my guns, after all I did the work and have a right to discuss it, so long as I don't give away trade secrets, anyway rant over!

Overview

This platform took the form of a responsive web application which selected Car owners were invited to (it is in the second phase of invites now, hopefully it will roll out to all UK owners soon) so they could gain access to knowledge and information about their car(s) and be a central place to store useful information about their vehicle(s) such as MOT, insurance and warranty details amongst other things.

The site took a mobile first design, initially the css/html framework was hand rolled by my front-end developer colleagues at Bite, for the second phase this was replaced with a customised implementation of bootstrap by the front-end developer from Brilliant Noise, one of three partner agencies in the project, the final partner being Endless a design agency who put the branding, look and feel together.

I could describe how the project was organised and the challenges we faced collaborating across three agencies, but this would be a post in its own right and for now I would rather concentrate on the technical side of things, in particular MongoDb and my experience with this technology on this project.

Technology

To summarise the technology, we decided to use a .NET platform as Bite at the time was primarily a .NET shop. Most of the sites we built used Castle Monorail and were variations on a core custom CMS built around Monorail, Windsor, Active Record & NHibernate, however for the CarPlus project we decided to try our hand with Microsoft ASP.NET MVC 4 and the Entity Framework. 

Shortly after development started we decided to swap Entity Framework for MongoDb as it seemed a good fit for the application. Read and write performance was important as was the ability to easily mutate the schema and scale out the back end given that the initial project was a bit of a proof of concept and could easily change direction in terms of data storage and scope. 

The final part of the stack was knockout.js which we decided to use as a way to provide a rich user interface that could optimise the data interface between the client and server. Further to this as the application would primarily be used on mobile devices having an optimal way to pass data back and forth and minimise the need for page refreshes was a key requirement, something that knockout deals with nicely, once you get to know it. If you haven't used knockout then give it a whirl, they have some excellent tutorials on their site.

Back to the project. One point to note is that having a .NET site between the client side knockout and MongoDb document store was a good fit for our skill set at Bite but from a technical point of view seemed a bit of an odd marriage. We store data as JSON (or more accurately BSON) in Mongo then deserialise this to concrete objects in .NET to work with in the server side code, then we serialise this back to JSON and pass it to the client for knockout to use. So we have this odd change from dynamic to static typing and then back again, which although odd actually works really well especially because the C# driver for Mongo is really nice to work with. There's probably a post in here somewhere, perhaps I'll get round to supplying some example code to show the path of data from the front to back end and vice-versa.

Schema Design

For now though let's look at MongoDb, specifically the way I designed the schema for the application and the pros and cons of my decisions.

This post is partly inspired by a MongoDb Webinar I attended last week, the 2nd in a series of 8, this last instalment was on the subject of schema design and issues to consider when designing an application, something I could have done with last year but you live and learn. It was interesting hearing from the guys at Mongo about this topic as it helped validate some of our thoughts and decisions on this subject as well as highlight some of the down sides. You can see the slide shows of the previous sessions and sign up for the rest of the series here; and I urge you to do this as they are really interesting and will help you if you plan on using MongoDb or indeed if you are already using it.

Anyway back to CarPlus, the application had several key user journeys identified with the main area of the application centring around the owners cars. This became known as the 'Vehicle Manager' and it allows users to swipe between the cars they own (oh to have more than one high end sports Car!) and see related information such as when their MOT is due, images of their vehicle and their preferred and/or nearest service centre amongst other things.

To structure this data in a traditional RDBMS I would follow the methods I was taught at University and identify the entities and relationships involved and produce an ERM to hold the data in a normalised structure, 3rd/4th normal form, no repeating groups, primary and foreign keys, referential integrity and all that jazz. This would be fine for a traditional transactional based system, however it would not be an optimal structure for the needs of this application and as we were taking data from the client's system that would remain the primary store for managing this data, we had no need to model and maintain a highly transactional database.

Enough of the background, what we ended up with was four main collections of documents,

  • Owners 
  • Vehicles 
  • Documents (Tax, MOT etc.) 
  • Centres

Owners and Vehicles have a variety of properties as well as nested sub-documents to allow for flexible access of data in the context of both owners and vehicles. Documents are related to vehicles and owners and have keys to allow us to relate the data to the relevant car and person. Centres are basically lookup data for use in the application.

A brief structure of an Owner document is shown below; 

The Owner Document

> db.owners.findOne()
{
        "_id" : 12345678,
        "Title" : "Mr",
        "FirstName" : "James",
        "LastName" : "Heywood",
        "Email" : "jdheywood@yahoo.co.uk",
        "EncryptedPassword" : "01a3089435c0394fc8653af0c65f81ef",
        "Address1" : "123 Some Street",
        "City" : "Brighton",
        "Postcode" : "BN1 1AB",
        "MobileNumber" : 07999999999,
        "VehicleCount" : 1,
        "Vehicles" : [
                {
                        "_id" : "16a1902b-1570-413e-b30b-ec734181a8e6",
                        "VIN" : "AB9AAA99AAAA99999",
                        "Model" : "Car Turbo",
                        "ModelCode" : "XXXXXX",
                        "Registration" : "ABC 123",
                        "CurrentMileage" : 50000,
                        "RegDate" : ISODate("2014-02-09T09:00:00Z"),
                        "Documents" : [
                                    {
                                            "_id" : "7da032cf-3bc7-41d8-8fd1-370a13fd11cb",
                                            "Type" : "Tax",
                                            "Name" : "Car Tax",
                                            "ValidFrom" : ISODate("2012-01-04T22:12:01.683Z"),
                                            "ValidTo" : ISODate("2013-01-04T22:12:01.683Z"),
                                            "Band" : "M"
                                    },
                                    {
                                            "_id" : "4619c928-c3e2-424e-a27e-7e3b00345136",
                                            "Type" : "Insurance",
                                            "Name" : "Car Insurance",
                                            "ValidFrom" : ISODate("2012-01-04T22:12:01.683Z"),
                                            "ValidTo" : ISODate("2013-01-04T22:12:01.683Z"),
                                            "Provider" : "Direct Line",
                                            "ReferenceNumber" : "REF-NO",
                                            "ContactNumber" : "0800 123 456",
                                            "Premium" : 450,
                                            "Excess" : 150,
                                            "YearsNoClaims" : 5
                                    }
                        ]
                }
        ],
        "LastServiceCentre" : {
                "_id" : ABC123,
                "Name" : "Car Centre Brighton",
                "Code" : "0987654321",
                "Address1" : "Address line one",
                "City" : "Brighton",
                "Postcode" : "BN1 1AB",
                "Phone" : "01273 123456",
                "Loc" : [
                        -0.14179,
                        50.822959
                ]
        },
        "VisitedCentres" : [ "0987654321", "1234567890", "2314568790", "02847462919" ]
}


As you can see an Owner has a collection of Vehicles, each Vehicle within an Owner has a collection of Documents which are populated by the user via the application, to help bring together useful information. 

Vehicles are duplicated in their own collection as I initially thought having another; flatter collection of this data would be useful to the application, more on this in a moment.

Documents are also duplicated in their own collection again for the same reason, flexibility of data access. 

The Owner has a property LastServiceCentre which is itself a nested document, again we have a separate collection of Centres for lookup purposes, the one held against the owner being the last place the owner visited.

Relation

As well as the last visited Centre we can also see that the Owner has an array of visited Centre identifiers. This is an example of a relationship, each identifier in this array relates to a Centre in the Centres collection. It should be noted that there is no referential integrity, these identifiers are managed by the application and could be any strings as far as Mongo is concerned.

My reason for storing these as identifiers is that if we kept each Centre visited by the Owner as a nested document this would result in a large amount of duplicated Centre data inside the documents of the Owner collection. I am prepared to accept a certain level of duplication in relation to the last visited Centre as this information is of particular use to the application, however information on all of the Centres ever visited is of less use so I have no need to tolerate duplication, in this case a set of identifiers will suffice and our application will just have to look these up if and when needed.

Nesting

This nested approach to the Owner data allows us to access data not just about the Owner but also related items such as Vehicles, Documents and Centres in a single request for data from the store. In a relational system we would have to make multiple requests for data from different tables to piece together the whole picture of the owner. Nesting data in this way reduces the chat between client and server, however it also means that data requests can be heavier than with a more piecemeal approach unless you are careful to project only those properties you need in your queries. 

In our case, as we access data about the Owner when they log in having this data in one document is a good thing, fewer requests makes for a more performant application which in turn keeps the user happy and aids in the adoption and use of the application. This is especially important considering that this is a mobile application first and foremost, the fewer page loads and data requests the faster and slicker the application feels and the better the user experience.

Duplication

As well as nesting Vehicles within Owners I also decided to maintain a separate duplicate collection of just Vehicles. My thinking at the time was that this would provide an alternative way to access the data as initially the requirements of the system were a little vague due to the fact it was a prototype. With hindsight this was a bad decision, the duplication of this data in itself is not really an issue as space is cheap and the size of the documents is minimal, the real issue is in the maintenance of this data. If a Vehicle needs updating it has to be changed in two places, which adds complexity to the application that it could do without. Further to this as we have had several developers on the project there is a risk that data is accessed and managed from one source or the other but that it is not kept in synch, therefore a review of the code is necessary.

With a separate Vehicles collection we do gain some advantage from simpler queries, for example if we want to identify all Vehicles of a particular model registered this year, (for an administrative purpose rather than for the use of the Owners) having the Vehicle collection allows us to write a simpler query than if we only had this data nested within Owners. I'm not convinced that this outweighs the overhead of maintaining the data in two collections though at the present time. Perhaps if the data analysis requirements expand in a later stage it will be more useful, but there is wisdom in coding for the here and now rather than for what might happen or may be required further down the line. If I had the chance to revisit this I would remove the Vehicles collection even though it may well be a bit of a tricky task at this stage.

Similarly the duplication of Documents has only really added unnecessary complexity to the system, we only ever access this data in the context of a specific Vehicle so there is no real benefit to storing this data in its own collection, again if I had the time to revisit and refactor I would definitely remove the Documents collection altogether and ensure access is only via the Owner.

Lookup

The Centre collection is of real use as a lookup, particularly due to the fact we have geolocation data (lat and long). In order to use Mongo's geospatial indexing and querying this had to be stored in a certain format, at the time (prior to version 2.4 of MongoDb) this was an array of two values [ x, y ] or more accurately [ longitude, latitude ]. This is now known as a legacy coordinate pair as the latest version of MongoDb supports GeoJSON format data which I have yet to have a play with. The geospatial query allows us to find the nearest Centres to the Owner by passing the lat and long of the user from their browser, provided their browser supports the W3C geolocation specification.

Summary

So to wrap this up this whole post is a rather long winded way of saying that duplication of data is fine as long as you are aware of the overhead(s) this presents to you when managing this data. If you would rather not have the trouble of managing more than one set of data then you can avoid this by nesting documents.

Relating collections is another approach you may wish to take, however this doesn't really play to Mongo's strengths and can feel like a safety net for those of us familiar with RDBMS who are perhaps a little nervous about making the leap to a document store.

I would strongly recommend that you consider the access paths to your data, what do you need and when? Further to this consider how many requests you want to make to read data and if appropriate write data, does your application need to worry about being chatty or not? The answers to these questions will dictate the best approach, nesting and/or relating data. For our purposes we have both although as I mention above we could and probably should do away with the duplicate collections and simply nest our documents to keep our code clean and maintain DRY principles.

And finally refactoring and revisiting code is a luxury, one which I suspect very few of us have in our day jobs. We often build up a technical debt that rarely gets paid off until the project rolls back around for an update. The more consideration we have for architecture and design up front the less debt we should incur.


Tl;dr: nesting is good, relating may be of use, duplication can cause headaches, code for access paths and requirements known now, refactor later.


Thanks for reading, next time I'll try and document some of the code used in a request/response cycle in the CarPlus application and discuss some of the challenges, wins and fails we had developing this.

Cheers,

James