Patel: Let's get started. In March 2016, Mondo set out to create a bank which is tech-driven and mobile first, and launched a £1 million crowdfunding. The offering was so successful that it brought down the third party that was running the crowdfunding. When the pitch was run in 96 seconds, we sold out the full £1 million offering. Now investments were capped at £1,000, and we had a wide variety of investment amounts from individuals. Some individuals invested the maximum of £1,000 within the 96 seconds, but there's a wide distribution.
Now, moving on to February 2017, Monzo's raising £2.5 million via crowdfunding, and this time the investment is run as a random ballot. We didn't want to repeat the same problems as last time and cause a bunch of frustration. But whilst there was a longer period of time that you had to get your ballot in, ultimately it was a random ballot so some people lost out, which was unfortunate. Now, moving on to late 2018, Monzo is raising £20 million via crowdfunding and the law says we can only offer the shares to a fixed group of people, not the general public. This time we wanted to provide a system which was equal and fair so that everyone who wanted to invest could get a chance to invest up until we hit the £20 million.
Now, we deliberated on running it on another platform. There are a few third-party platforms out there that do crowdfunding nowadays, and we had been assured that they were all resilient and reliable. But we didn't want our crowdfunding to be impacted if they had issues, and there were other reasons as well. So in the next 40 minutes, I'll be taking you through our platform architecture, and the kinds of things we did to prepare for this crowdfunding, and justify the reasons of why we ran it on our own platform.
My name is Suhail Patel, and I'm a platform engineer at Monzo. I work on the infrastructure and reliability squad at Monzo, and the way I describe what we do is we work on the fundamental base to allow other engineers to ship applications. Our philosophy is that engineers should be able to ship applications without having to worry about if there's enough compute capacity in the cluster, or whether there's enough database space available, and stuff like that.
Now, how many of the audience are familiar with Monzo? Wow, a wide range of hands. So for the few people who aren't, we are a fully licensed and regulated bank operating in the U.K. We have more than 1.5 million customers as of last week, I believe, and we're growing really fast. Our mission is to give you the most out of your money. We want to make you financially savvy, effortlessly. We also have these really nice hot coral cards lurking around, so yes, that's another reason to have a Monzo account.
A Brief Overview of our Platform
Before we dive into the challenges that we faced for crowdfunding, I'd like to give you a brief overview of our platform and our architecture. Our platform consists fully of microservices, I'm sure microservices has been the word of the day, it seems, going all the way from Sarah's talk this morning, and looking at a lot of the tracks for today as well. We have 900 microservices running in production and growing, and services spanned from operating components of the bank. So, sending and receiving money, providing a ledger, detecting financial crime, and generally providing app functionality. When you use your card in a shop, about 80 microservices get involved, directly or indirectly, in making that decision on whether you can actually use the card. Everything from checking your account to make sure that they exist and you haven't closed it down, checking to make sure you have enough money, making sure that the transaction is not fraudulent, but also determining what emoji to send when we send you the notification.
It's not all about running a bank. We also provide a platform for our engineers to ship whatever they want. We have services that provide GitHub functionality so that when engineers are pushing and reviewing code, they get notifications on Slack that their code is now ready to merge and ready to deploy. We also provide services to help make staff onboarding easier.
Now, the majority of services at Monzo are built in Go. We won't be speaking about too much Go today. But for developing small services, microservices, it provides some excellent key things like statically typed and memory managed, which makes engineers very happy to make sure that you don't cast your string into an integer and have your program crash. You get excellent concurrency right out of the box, which makes platform engineers like me very happy.
You can still have thousands of requests with very little CPU and memory. Fractions of a core and 50 megabytes of memory is perfectly sufficient for serving thousands of requests, and you get a single runnable binary. No more praying to the Python gods every time you want to install your infrastructure and the PIP server is down or your requirements.txt is no longer deterministic. Now, if you want to learn more about how we built a bank in Go, I recommend a talk from a previous QCon by Matt Heath, called "Building a Bank in Go."
Now, to actually ship these microservices, we leverage containers and Kubernetes. We want to make sure that engineers don't have to learn all of Kubernetes as a prerequisite to shipping to production. We want engineers to be productive within their first week, and learning how to write hundreds of lines of declarative Kubernetes YAML is not fun. Engineers want to ship services. They don't want to have to battle with Kubernetes. Engineers can tell our platform that their code has now been reviewed and is ready to land in production, and we built tooling to help get that rolled out safely. This allows engineers to ship 70 to 80 times a day, and growing fast.
Now, all services are shipped on top of our Kubernetes cluster running on AWS. Services have lots of replicas, or pods in Kubernetes' parlance, to allow for high availability and scaling out as requirements grow. We essentially want to pack lots of services onto a smaller amount of infrastructure. As a platform, we want engineers to build services that are self-contained, scalable, and stateless. Containers provide essentially the self-contained architecture. You pack everything that your binary or your application needs within your container, and it's fully encapsulated. You might need to download a secret from somewhere or something like that. But most of the time all the data that you need is within your container, so that's self-contained covered.
For scalable, as a platform, if we give your application more resources or if we scale out your application to add more replicas, we want you to do more work or faster. Being able to run your application as more replicas also helps with high availability. And for services to be stateless, a replica of a service doesn't need to know anything about the host that it's running out. It has no affinity. So if it's serving a particular request, it's not tied in. That can move to something else and it would be executed in the same deterministic manner. As a platform we provide the frameworks and guidance to make sure that engineers build applications to help satisfy these requirements.
Now, with services having these properties, it helps us a lot where nodes fail. No matter what platform you use, no matter what infrastructure you provide, gone are the days where you have a computer, and it lives for thousands of days, and then you decommission it in a structured manner. In clouds, like AWS and everywhere else, nodes fail. They might tell you, they might not that this node is now failing. Node failure is routine. Especially if you're running in a large infrastructure, or within an environment like AWS, a node is going to go down at 3:00 a.m. and it shouldn't be a page worthy event. No engineer should have to be woken up because a node has died, if it has zero impact on the running of your application. We recently undertook an exercise where we essentially moved all of our replicas, all of our pods in Kubernetes, to completely different nodes, essentially and effectively restarting the entire bank.
So in an environment where things are moving around so often, and things are never where you left them, how do services communicate with each other? Let's say that you're the transaction service on the left in yellow, and you want to talk to the account service on the right. Here's where the service mesh comes into play. Each of our worker nodes on Kubernetes has a service mesh running on it as a Kubernetes DaemonSet. That means it's running on every Kubernetes node locally, and they're at a defined port that service.transaction, in this scenario, is aware of. You proxy the request to the service mesh, and it handles everything needed to get that request to service that account.
Now, the service mesh is essentially what ties services together. It's responsible for service discovery and routing, keeping pace as things move about, and also handling key things like retries. For example, if there's a transient failure, and timeouts if things are taking too long, and circuit breaking, so you're not essentially denial of servicing your downstreams every time there is a minor problem.
If services want to communicate with each other on our platform, all they need to do is speak HTTP. This keeps services approachable and language agnostic. To add a new language on our platform, you don't need to re-implement all of this logic to find out where services are, the whole service discovery mechanism, timeouts, retries, circuit breaking, as a barrier to entry. Services have really good HTTP libraries and we communicate over JSON, but it's not absolutely required. Languages have great HTTP libraries, so this makes it really agnostic and really approachable for new languages. During our crowdfunding, during that time, we were using Linkerd as our service mesh, and its Namerd component to integrate it with Kubernetes. So when a pod would move in Kubernetes, Linkerd would find out about it and it would be able to proxy requests to the right pod.
We've talked a lot about direct RPCs in the previous slides, but a lot of the data flow and compute work happens asynchronously. For asynchronous messaging, we provide NSQ and Kafka. So I'm sure a lot of you would be familiar with Kafka. NSQ is a similar messaging library written in Go, and it provides a really high throughput, a very high throughput message bus essentially. Last I checked, we were sending billions of messages through NSQ every day, and this is not provisioned on any like fancy hardware.
Now, just because it's asynchronous messaging, it doesn't mean it needs to be slow. When you use your card, for example, this notification gets generated that you spent £4.99 at Starbucks. Other coffee shops are available. Yes, you spent £4.99 at Starbucks and we've generated the right emoji, and we've sent this notification to you, on your phone before the till has even printed the receipt.
Now, whilst a lot of our, well, pretty much all of our services are stateless, services do need to store data somewhere. You can't just lose your entire ledger if you delete the pod or have the node fail. To do this, we provide a highly available Cassandra cluster. Who's familiar with Cassandra? Oh, quite a few spattering of hands. Great. For those who aren't, Cassandra's a highly available and scalable database. Essentially, Cassandra nodes join together to form a ring, and reads and writes can come into any Cassandra node. There isn't a master, which has a great property in the case of node failure, which we'll talk about later. Generally, Cassandra clients, or people who are consuming from Cassandra or writing to Cassandra, will issue reads in the fastest-node, or in a low-latency, or in a round-robin fashion.
When you read or write to Cassandra, your table has a replication factor. So you can say, "How many nodes do you want to hold that data?" In this example here, service.transaction, has come to the green node here requesting some data for that transaction there. All the nodes in the Cassandra cluster are aware of which nodes hold that bit of data. Now, in this example, the green node knows, just like all the other nodes, that it doesn't hold that data. But it knows that the orange nodes have that data. The transaction table has a replication factor of three, so here we have three nodes in orange, and the query has specified local quorum. So essentially, it will reach out to all three nodes, and once the majority return with a consistent view, it will return that back to the service.
Now, Cassandra has a lot of flexibility. If you want the fastest response and you care a little less about the consistency of the data, you can, for example, use a quorum of one and get a response back from the node which was able to give you the answer the quickest. This allows for a wide variety of use cases. Having the replication and quorum mechanism means that if a node dies, or restarts, or you need to perform general maintenance, you can continue business as usual, serving all requests and all queries with zero impact to your applications. We undertook an exercise last quarter where we essentially restarted all of our Cassandra nodes to see how applications would behave in the case of failure.
Now, a distributed system, you need a way to do exclusive operations. We leverage Etcd, a highly available and distributed key-value store. Etcd has great locking primitives which allow for really high throughput and low latency distributed locking. I took this animation from this raftexample. So similar to Cassandra, reads and writes to Etcd can come to any node. But unlike Cassandra, there is a leader that's established using the Rough Consensus Algorithm. And changes have to involve the leader, and propagate to the majority of replicas before it's confirmed and returned.
Now, if a leader dies, a new election is held to elect a new leader. So in this example here, these green bubbles are essentially election of a leader, which is now S5. And S5 is now communicating messages outwards, and ensuring that every replica is aware of this new state. Having this sort of distributed consensus is an ideal and, well, pretty much required property for locking.
Finally, we have Prometheus in our stack, the all monitoring eye. Each of our services expose metrics, Prometheus metrics, at a /metrics endpoint. Prometheus provides a really nice querying language to help you expose that data. Services built on top of our Go frameworks can essentially get a ton of Prometheus metrics for free. Engineers don't have to write a single line of metrics code to get information about RPCs, and asynchronous messaging, and locking, and Cassandra queries. We encourage engineers to put their business metrics and monitor them as additional things, but it's how we're able to have 900 services with good observability. The last I checked, we were scraping about 800,000 samples into Prometheus a second, which is a lot.
Combining Prometheus with Grafana, you get a really high capability and insight into what your application is doing. We have consistent metrics on naming and labeling across all of our services, and we augment that with data from Kubernetes, so you can figure out on a per-host or per-pod basis what your application was doing, so you can detect and see if you were maybe co-located against a deviant node or a pod which was using up all your resources.
That was a brief introduction about the major components in our platform, and now I want to talk about some of the work we did for crowdfunding and building a back-end.
Building a Crowdfunding Backend
There were four key requirements that we needed to tackle before we could go ahead. The first two were hard requirements, like from an application perspective, just generally having the round. We could not raise a penny more than £20 million, which is a nice problem to have. We'd agreed with investors, institutional investors, and our board and regulators that £20 million was the absolute cap. As a result, requirement number two, in order to alleviate frustration, we wanted to make sure that users had enough money in their account to be able to invest, and so we could like correctly close the round once we'd hit the cap.
The last two were requirements for the platform in general. We expected users to use, or customers to use, Monzo more frequently. This investment was being done by our app, so naturally, you're going to want to check how much money you have in your account, and top up if need be. Yes, we expected the load to be beyond just our normal investment load, and the investment services were not the only services that were going to be here. You are going to want to check your balance, and your account, and your transaction history to make sure that it's all been completed.
All banking functions had to continue to work. We couldn't bring down the bank. While this crowdfunding was running, if you wanted to go spend money on your card, we shouldn't have been able to stop you. That was paramount. So for the first two requirements, to truly satisfy these requirements, ultimately, you need to consult a chain of record. And essentially we needed to look at the source of truth, which in our case is a ledger. Now, if we were to read from the ledger every time we wanted to ensure that users have enough money, especially in a high throughput scenario, there would be significant compute and latency costs in making that decision. And at high throughput, we would need a significant scaling in order to satisfy the demand that we were expecting.
Now, in most relational databases, you have transactions and counters for exactly this purpose. You can rely on atomic updates, and atomic locking, and stuff like that, and it can be relatively cheap depending on the database technology you're using. Cassandra has support for counters, but comes with this warning that, "Replication between nodes can cause consistency issues in certain edge cases," and that didn't quite fly with us. Also, when we've used Cassandra counters in the past, they have not been very performant. So that's that out the window.
We definitely want to write data in a fashion where it can be replicated 100% to all the nodes. In the case of node failure, which we described earlier, is very common. This combined with the fact that they're not performant essentially discarded the solution. So what we developed instead is a two-layer architecture. Your phone would contact a pre-investment service which relied on cash balances. This means that user request time, we didn't need to do all the expensive ledger calculation. What we did is essentially, we did the cash look-ups and decided on the spot whether it should go on to a queue, where we could do, essentially, a rate-limited consumption. So your pledge would end up on our queue, which is NSQ here in this case, and the actual investment service would pull things off, and do the necessary checks, and communicate with our partners to actually issue the actual shares themselves when need be.
Now, just to reiterate here, "Just because it's asynchronous doesn't mean it needs to be slow." Most people had the full flow completed within minutes, sometimes even seconds. We had the crowdfunding investment, consumer at a high throughput, but with this architecture it was able to handle bursts. So that’s the first two requirements sorted.
Now, let's go into handling traffic and not bringing down the bank. Now, we could have spun up fully dedicated infrastructure and that would have been the sensible decision probably. But we decided not to because the overhead and change that we'd need in our development and deployment practices meant that the engineers working on crowdfunding would have a major frustration every time they wanted to deploy their services. Also, a lot of the services on the crowdfunding flow, and all the services that were touched afterwards, for example, when you open the app subsequently, relied on services that lived in our main infrastructure. So essentially, we wanted to take a more coordinated approach on our platform. Yes.
More philosophically, the platform team, which is the team I work on, wanted to use this as an example, as an exercise in identifying bottlenecks rather than papering over them by spinning up completely new infrastructure. As we grow, there will be increased demands. Monzo might want to run a media campaign or whatever. Our platform needs to be ready and needs to be resilient to this sort of bursty traffic. So being able to do this exercise would give us the learnings, and the capabilities, and give confidence to other engineers.
Now, the problem we needed to solve for was twofold. We'd built the investment flow, so the actual flow that you did to get your investment completed with load in mind. But the case we were worried about is what happened afterwards. So you tap the button that says, "Take me to my Monzo account," and you'd be using Monzo app functionality. Generally we get about 10 to 20 app launches per second on a day-to-day basis. But we expected many orders of magnitude more, especially in the first few minutes, if we recall what happened in 2016, one-and-a-half minutes for the entire round to completely finish. Now, our benchmark was, "What if every one of our customers opened the app to invest?" We needed to make sure that none of the payments functionality was going to be impacted, and we couldn't take down the bank, critical requirement.
So that was a brief introduction into our platform and how we built the crowdfunding back-end. Now, we're going to talk a bit about load testing and finding bottlenecks.
Load Testing and Finding Bottlenecks
Naturally, we needed to do some load testing, and we went and searched to find some off-the-shelf components to do some load testing. My favorite is Bees with Machine Guns. What a great name. Imagine putting that into a proposal, "I want to run Bees with Machine Guns on our infrastructure." And it does exactly what the tin says. It spins up a bunch of infrastructure and essentially hammers the endpoint that you give it.
But one of the core things we wanted is we wanted to integrate with Prometheus. We wanted the metrics in our metric system so that other engineers could gain insight whenever we were running load tests. That's something that a lot of these systems did not provide. You get a really nice HTML report, but if you're running load test like hundreds of times over a period of weeks, we want engineers to not constantly contact the platform team, "Are you running a load test? I'm seeing like an order of magnitude more requests to my services." So we decided to build our own.
We use Charles proxy to capture requests from our iOS and Android apps, and we plugged the request pattern, the exact request pattern that the apps were doing, as jobs. We wrapped that in as a load test service, naturally written in Go. This would be a normal service that would be deployed on our platform, on our infrastructure, and it would fire requests at our edge just exactly like an app would. To control it and invoke it on demand, we built a small wrapper. We called it Bingo, named after our dog. Bingo loves testing resiliency. Here's the output after testing resiliency of some cardboard boxes. So yes, Bingo one, cardboard boxes nil.
A quick anecdote. We built our own load testing platform and that's all well and good, and we put it on our staging platform, and we started noticing a large number of failures. We were getting a bunch of errors in DNS name resolution. Now, we were using the AWS Route 53 DNS resolver, and our load tests were constantly resolving the Monzo.com domain. Every time we've run a load test, every request, it was resolving the Monzo.com domain, and we got severely throttled by AWS. Unfortunately, this is an undocumented limit, so we didn't know beforehand. So yes, be a good DNS citizen and respect your cached TTLs if you ever write your own load tester.
Another thing we noticed is whenever we'd start load testing, we noticed a very large latency spike with significant volume. We didn't see these spikes at our edge metrics layer, at our edge proxy layer. That got us thinking, because we needed to identify if our requests were timing out. Our mobile apps have timeouts, third parties who are contacting us will have timeouts, and we didn't want these requests to fail. It turns out that the load balancer we use, the application load balancer in AWS, if it sees a significant shift in traffic, it will provision more resources and it will hold connections for the period of time until it can spin up the new infrastructure. So that meant for six to seven seconds, every time you run a load test your connections were being held. Then we realized that we should be ramping up slowly so that AWS had enough time to provision this infrastructure. That was nice to learn.
Now, back to load testing. We have our load test worker, and we've used it to do testing on individual services and components of the bank. We've tested stuff individually like Etcd, and Linkerd, and Cassandra using artificial data. Synthetic testing is nice because you can test things in a controlled environment, and you can reproduce it. But if we were going to find the real bottlenecks, we needed to test in production. This is the environment that mattered. This is the environment that was going to get hammered when people wanted to invest in Monzo, and it was the one that was going to be under pressure.
Now, one key motivation for writing our own load tester was that it could authorize and gain read-only access to real customer data in a secure manner. And because it was running fully within our own platform, once the request was done, the data never left our infrastructure. The load tester didn't care about the data that's returned as long as it's returned successfully.
What we did is we started testing small amounts, small amounts of real app usage, and ramping up. Our target was 1,000 app launches per second. Could we sustain a thousand app launches per second? Remember, we're doing 10 to 20 app launches a second usually. Could we sustain 1,000 app launches per second for every one of our customers? If every one of our customers opened the app, a thousand at a time, could we sustain that? Essentially to simulate a thundering herd.
Now, we use this process to identify the services that were involved with running the app. With 900 services on our platform and growing, not every service is involved in serving app functionality, and that chain is constantly changing. As new engineers write new services to interact with older services, the whole architecture and everything is completely changing. Request patterns are changing. You might be calling a service more this week or less this week. The majority of our services are not directly involved with handling payments or servicing app requests, so we had to find the right subset of services. We hooked in Kubernetes' API into Slack, so we get nice notifications when things like OOM kills, or out of memory issues, or when a service dies, or is being severely throttled.
Even when we're under load, we want our app to be performant. You don't want to have to wait lots of time for your balanced load, even if you're not at all interested in investing on Monzo. So essentially, we used request timings. This is another benefit of rolling our own load tester is that we could get really fine-grained metrics. We used this to guide us in what services we should scale up, and to ensure that we had that data.
Now, in terms of scaling, we can choose to scale out or scale up. In terms of scaling out, you can just pretty much add more replicas. So, we've got the replica account. This is an example of Kubernetes YAML. You can essentially change that number from 9 to 18, or whatever number, so essentially doubling the number of instances running your application. An alternative is you can also scale up and add more CPU and memory for a particular service. Kubernetes allows you to control the amount of CPU time and memory allocation on a per-pod basis. Most of our services, because they're written in Go, don't need a lot of CPU and memory, but we could use these to provide more breathing room, essentially, on nodes. There is a balance to be struck. You want your service to have enough resources to be able to handle bursts and yield work, but you don't want really large requirements either so that you need thousands of nodes and you can't do efficient bin packing for your services.
Now, a lot of you may argue that what we're doing here has already been invented and it's called auto-scaling, and you would be right. Most auto-scalers though need historical data in order to be fully efficient, and we didn't have that kind of data at these volumes that we were pushing. We also wanted to know our requirements beforehand so we could scale up the capacity. We didn't want to be on the news saying that, "We failed to scale up adequately beforehand”, especially when the crowdfunding round opened, and that's when we expected most of our traffic. For us, the minutes that it takes to provision AWS instances or more nodes mattered, as was evident in our 2016 crowdfunding round.
We've been scaling individual services so far. But once we got to about 500 app launches, we started seeing our Cassandra cluster perform disproportionately worse. We needed to dive in to figure out what was happening with Cassandra. Now, just to provide some context, we have about 21 Cassandra nodes, and these are running on really beefy machines. They have plenty of CPU and RAM. And we believe that the query volume that we were generating shouldn't have impacted this kind of cluster capacity at the volumes we were seeing. It should have coped just fine.
This is the point where we break out the profilers. We use a mixture of two JVM profilers, Swiss Java knife and Async-profiler, both of which are fantastic projects which are open source, to get a better understanding of what Cassandra was doing. The bottlenecks we found were lying in three places, two which we sort of expected because of the way that Cassandra works, and one which was completely unexpected. Our first bottleneck, which is generating Prometheus metrics, was the unexpected one. It's nice to have insight into what your cluster is doing, but if it's using so much resources to generate that insight then you need to question the value there. If it's the cause of all of your CPU usage, that's not great.
We performed a test to validate this. At first we thought that the profiler was a bit broken, so we turned off Prometheus scraping on one of our Cassandra nodes temporarily and saw an instant 10% drop in our CPU usage. The Prometheus metrics was running as a Java agent within our main Cassandra process, so we moved it out of process and put high CPU constraints so that it could still make progress and give us some metrics that we wanted, the cadence that we wanted, without impacting the rest of the cluster.
Now, our next bottleneck was time spent in decompressing data from disc. With Cassandra, Cassandra compresses data when it writes to disc. When you do a write, it will batch those writes up, and then when it's writing to disk it will compress them using an algorithm called LZ4. While we have plenty of space, our NVMe disks are really big, compression yields significant benefits, especially if you're doing ranging of data, because it means you have to do less disk seeks to grab a range of data. So if you say, let's say, "I want every user between ID 100 and 200," you can probably do that in one or two disks seeks rather than hundreds.
Now, when you read a row using a row key in Cassandra from disk, it needs to read from a data structure called an SSTable. SSTable stands for Sorted Strings Table, and Google talks about it in its Bigtable paper. Now, generally, in SSTable, you have row keys sorted as chunks and an index telling you where certain row keys begin. Generally in our services, we're reading individual rows, so like, "Give me a particular transaction. Give me a particular user." But for most of our tables, we were specifying a 64-kilobyte chunk size. So even when uncompressed, we were reading 64 kilobytes when the majority of our rows are a few hundred bytes.
So what we did is we identified the hot tables and then lowered them to a 4-kilobyte chunk size, which significantly reduced the amount of disk I/O we were using and also the amount of CPU usage, because decompressing 4 kilobytes is more efficient than decompressing 64 kilobytes. Now, there is a cost in trade-off here. You use more memory because the index gets larger, because that’s 4 kilobytes, that's 16 times more amount of keys you need to index on. But you can see that we essentially halved our latency for this particular table.
Now, a final bottleneck which we identified was pausing of CQL statements, which is the Cassandra Query Language statements. For pretty much all of our services that use Cassandra, they had a fixed model, a fixed schema, and deterministic finite querying patterns. When you deployed a version of your application, for example, the transaction service, that transaction service would have a fixed model and schema. For the lifetime of that application process, that schema wouldn't have changed.
Now, generally, when you want to solve this problem in any database you use prepared statements. With Cassandra it's no different. Cassandra supports prepared statements. The library we were using for our Go services actively used them out of the box. It was using them for everything: select, inserts, updates, deletes, everything. And Cassandra supports all of this. We could see, even when we did stuff like TCP dumps, that we were correctly preparing statements. So why are we spending all of this time pausing CQL statements?
What we did is we actually found a bug in our library that does the query building. Generally, engineers don't write raw CQL by hand. That would not be very fun. They use an abstraction, in Go at least, which takes a struct and uses reflection to figure out how to map those fields to Cassandra columns. And what we found is that we found a bug in that library where when it was generating statements, it would generate them in a non-deterministic manner.
For our services and for Cassandra mostly, these two queries are essentially equivalent. They give you the same data. However, for Cassandra itself, these two queries need to be prepared twice. What it does, it literally takes the string representation of your query and uses that as the hash key. What we found is that Cassandra was constantly churning through its limited cache of where it stored prepared statements. And with like 350 to 400 services writing to Cassandra, some of our models were really big, had lots of fields. So essentially, we had thousands of permutations of the same table being stored in the prepared statement cache. Future versions of Cassandra allow you to tune the size of that prepared statement cache, but still, this was a big bug.
Once we'd fixed the Cassandra bottlenecks, we were able to press forward with our goal of 1,000 app launches per second. And we got to about 800 before we started seeing a bottleneck in our service mesh. Now, a bottleneck in our service mesh is bad. It's bad because RPCs would take much longer than usual, and we have a lot of work loads which are RPC latency sensitive. So, for example, the AT services that are directly or indirectly involved in making a transaction decision, that needs to happen within a low threshold of time, usually one to two seconds.
If we start creeping beyond that threshold, you don't want to be waiting there at your merchant forever to buy your chocolate or whatever confectionery that you really want. We need to make that decision in a small period of time. So we scrape CPU and throttling metrics for all of our pods across our Kubernetes cluster. And when we looked into our service mesh, we could see really high CPU usage and really heavy throttling, which was really odd considering how much resources that we had allocated to the service mesh.
When we investigated deeper, we actually found that our thinking behind how much resources we'd allocated to the service mesh was not entirely correct. The service mesh deployment in Kubernetes hadn't been updated. The actual deployment was saying it was allocated this amount, of course, but the actual pods themselves hadn't picked up that change. This led to some interesting behavior, because the throttling would lead to less amount of CPU time. That would lead to increased connection build-ups, because more things now want the service mesh to do the routing and everything like that, and that led to even more CPU demand. So essentially it's a chicken-and-egg problem.
After all of this, we reached our target goal of 1,000 app launches per second. We did also some tuning to do the number of our RPCs in our platform, just some services were making some redundant RPCs. But what we ended up with is a comprehensive spreadsheet and historical data that we could use the tune our auto-scheduler in the future.
Now, no matter how much we prepared, we needed a plan to recover if things went wrong. One common thing you can switch to is feature toggles. We could degrade functionality in the app or we could degrade functionality in the back-end, so that we could turn things off without impacting our functionality. We also implemented a method to shed traffic even before it entered our platform. This was key, because the moment it entered our platform meant that it was impacting our services. And given that we provide a giant platform, it was very important that we could shed load if needed. Luckily, we didn't have to resort to that on the day, but it was good to know that we had the lever should things be completely on fail.
With it, really, we spent a lot of time thinking about the scenarios where things could go wrong. And writing runbooks is something that Sarah touched upon on this morning. This extended upon our existing catalog of runbooks, and was a way to give us a clear, predefined, and level-headed, tested step-by-step process for when things went bad. We built specialized dashboards to get a single pane of glass, all the services and metrics that we actually wanted observed in a single pane of glass rather than looking through hundreds of dashboards. Also, the other key thing, beyond the service-level dashboard is we weren't just monitoring for like individual compute components or things like that. We're also monitoring for business metrics. Are people actually investing? How much have we raised? What is the average investment amount? Is the investment platform actually working correctly, and are people actually coming?
I'm happy to report that things went well. Fortunately, this is not a failure scenario. Welcome to our 36,000 investors. We saw £6.8 million raised in the first five minutes, and we sold out the entire £20 million round within three hours. What we learned as a platform team were three key things. Scaling has its limits. Even in a distributed system, all of our components are distributed and horizontally scalable. It's not going to get you out of every load related problem. You do need to actually test and make sure that you're not being bottlenecked elsewhere.
Treating software as just software is super key. The Cassandra bottlenecks we found, that's not stuff that you could just Google, or Stack Overflow give you the answer for. Actually figuring out where your bottlenecks lie is key, and sometimes you do need to dive into the Cassandra code, or the code of your database engine, to figure that out. That should be completely routine and normal.
And last but not least, continuously load testing. Even better if it's in the environment that you actually expect the load to go to, rather than a synthetic environment. Thank you very much.
Questions & Answers
Participant 1: You were mentioning that you guys reached 900 services, right? It's a huge number. But then I'm really wondering, as the number of services increase, you get to have different priorities. Which service to maintain? Which to build? This is really challenging when you have component-based teams versus feature teams. How does Monzo handle that?
Patel: That's a really good question. Each of our services, even before it's created as part of our change management policy, each service needs to have a code owner. And the code owner maps to teams. We also believe in the fundamental philosophy that services should map to teams. And this code owner tagging doesn't just lie in the development of the service. It also lies in the monitoring and the alerting of the service.
What we have is we've set up Prometheus and Alertmanager, our alerting system which hooks into Prometheus, to look at that data and route alerts which are not really high urgency. For example, “We're seeing a small increase in the error rate in Cassandra queries”, for example, to the right team, so that they get the visibility that their application might be continuously, or has started to behave, deviously.
I think that's super important, that each service is mapped to a team. Subsequent to that, we have a monorepo. Any engineer can contribute to any service. It's quite common that engineers will contribute to services outside of their team domain. Having this code-owner relationship means that you have a consistent team which understands the philosophy and the context of that service, and can provide appropriate guidance in requirements and feedback as well. So, having this code owner label tie into the alerting, the reviewing, the merging policies, and everything like that for services is super paramount to that.
Participant 1: Does this mean that when you go ahead with trying to implement all those new services, you build new teams? Because, eventually, some services will just be at maturity level where it's okay, it's fine. You just maybe need to maintain it. You need the ops part. And then you will be really struggling with new stuff that you want to build. Other than that, you will just have unlimited number of teams, 900 teams.
Patel: Absolutely, yes. I think it's unsustainable that all services map to new teams. What we do is we have a team structure that is based on like priorities, company priorities, but all services need to be mapped to an eventual team. So, for example, a lot of teams like the financial crime team is going to be a given. There is going to be a financial crime team or people who are familiar with that financial crime domain. And then if they form individual squads, it's up to them to distribute their services amongst the squad. So yes, it is up to the teams. We're not going to end up in a scenario where the platform team manages everything that's left behind as a dumpster file. Yes, that's not sustainable.
Participant 2: Thank you for the talk. My name is Ben. I just have a couple of questions. How many releases do you do per day, production releases do you do per day?
Patel: Production, yes, about 70 to 80 and growing fast. There is no limit. When I checked this morning, after the keynote had started, there had already been 10 releases into production.
Participant 2: And you guys use Kubernetes, which frees up time for the developers. So apart from the infrastructure team which looks after the Kubernetes, would you expect your developers to also have a good understanding of how they're functioning? Let's say, I don't know, there's an incident overnight and you want to ...
Patel: Yes. It's a mixed bag. As a platform team, we don't believe that engineers need to know all the fundamentals of Kubernetes. In fact, we built tooling so engineers don't have to interact with Kubernetes in order to ship on a day to day basis. Stuff like rolling updates, and monitoring, and stuff like that, is all part of the tooling that available. If engineers do want to learn more about Kubernetes, that option is definitely available. As a platform team, we essentially maintain the Kubernetes infrastructure at Monzo. A lot of engineers actually don't have direct access to Kubernetes from a security standpoint, which is perfectly fine and fair. And yes, so essentially, not tying ourselves too far into the... Kubernetes is just a piece of software that we use to provide infrastructure for the rest of Monzo. You don't need to know all about the intricacies of Kubernetes or rely on any one specific part of Kubernetes for your service you. You shouldn't have affinity to Kubernetes.
Participant 3: Brilliant talk. Thank you, first of all. A question on load testing. You mentioned that when you’ve come up with your own framework, and you didn't care much about the data itself that's returned as long as it's returned. I imagine you relied on response codes from these 100 or 1K. But then if you're not asserting on that data, how do you know if it's not corrupted, if it all came back as expected? Then you're getting good TPS, good response times, but then the data might be ...
Patel: Yes. That is a good point. Unless we had a human verify that your feed or your transaction history is correct, and we haven't returned zero balances for everyone, it's very difficult to assert that. We have continuous integration testing and acceptance testing to catch the software side of things, and we have good error handling to ensure that things are either returned correctly or not. But, yes, we did not test for like data integrity. So, if we had an off-by-one error, we didn't ask for that.
See more presentations with transcripts