When you’re operating large-scale industrial software, there are a lot of things that can go wrong. The tricky part is pinpointing where, in the complex maze of code and infrastructure, the issues actually sit.
Such was the case with one of our clients: Hemiko. They noticed the following problems with their application:
We were tasked with finding solutions to these challenges, ultimately lowering our client’s infrastructure costs by 40%. We started with code optimization, then moved deeper to optimize the application’s underlying cloud infrastructure.
But how exactly did we solve their problems? Read on to find out!
First things first, a little background about our client.
Hemiko is an industrial company in the United Kingdom that designs, builds, and operates bespoke district energy networks for new and existing developments.
Throughout our cooperation, we’ve provided backend development, QA, and DevOps services to them. Keep reading to learn the details.
We started out by analyzing the code of Hemiko’s application and optimizing as much as we could. Surprisingly, that didn’t solve our app performance issue.
Code is the first place you look when something goes wrong with software. The codebase is crucial. However, it’s not the only source of possible issues. There is a whole world of potential problems lurking in the infrastructural maze behind any modern application.
Software with pristine code, but deployed on infrastructure that hasn’t been scaled to match, will never perform as well as it should. That would be like putting a high-end motorcycle engine in the heavy body of an old car: physics simply won’t let it go fast.
Even when you’re working on cloud infrastructure, you’re still running on machines—and those machines can be reconfigured to provide optimal performance for your specific application. Even developers who are masters at implementing business logic might fail when they need to optimize infrastructure.
There’s a huge body of knowledge about infrastructure optimization, and when it comes to the cloud, the tools are relatively new. Many developers simply don’t know them.
So, a new philosophy and specialization emerged to solve these issues in IT projects: DevOps.
If you’re not a developer, it can be quite difficult to pinpoint exactly what DevOps specialists do. Let’s clarify that before we move on—but if you don’t need an explanation, go ahead and skip to the next point!
The term “DevOps” originated around 2008. It’s closely related to Agile, because it’s about taking Agile principles and applying them beyond writing code, in deployment and long-term maintenance.
Traditionally, the IT department used to be divided. There were:
For many people, it was clear that this was dysfunctional. They rallied around the DevOps cause, promoting it as a key addition to IT departments.
DevOps is the bridge between software development and IT operations.
The job of DevOps specialists is to create and/or implement tools and workflows that reduce the number of manual tasks necessary to deploy software, make changes, add new features, and ensure smooth performance for all use cases. This is called continuous integration and continuous deployment (CI/CD).
And that’s just one of the possible jobs, because DevOps specialists can also take care of:
It’s kind of like a special ops agent in the military. The job is delicate, requires broad expertise, and tiny mistakes can have big consequences.
Now that we have this covered, let’s get back to Hemiko.
Ultimately, Hemiko’s case is a great example of what can happen when you add DevOps to your project.
As you recall, we started out by optimizing the code of their application. That didn’t generate the results we wanted. So, the next step was to inspect the software from a DevOps perspective. Pretty soon, we received confirmation that this was the right direction.
The first things we took care of were several critical configuration items in the infrastructure. This led to noticeable improvements, so we continued along this path.
Had Hemiko decided to keep looking for improvements in the codebase, they would have wasted a lot of time. Plus, the DevOps issues would have kept eating away at their budget, because cloud computing is by no means cheap—especially when it’s not optimized for your unique case.
Hemiko’s infrastructure consists of several individual nodes/machines linked together in a single cluster that provides the computational and storage power needed to run the app.
We knew that the cluster, along with the database, needed to be optimized. But we still had some detective work to do. Now that we knew we were on the right path, we started deploying our complete DevOps toolset.
One of the main things we did was to create a robust monitoring solution to precisely analyze the dialogue between the application and the infrastructure. Why did we do that? Because one of the pillars of DevOps is observability.
Observability is a principle that says: everything you implement should be properly monitored. At least that’s the simplified version of it.
The concept comes from control theory, which is the study of controlling systems in engineered processes and machines. Therein, observability is a measure of how well you can describe the internal workings of a system based on the outputs that the system generates.
And that’s why we had to implement complex monitoring—to see exactly what was going on with Hemiko’s system. This allowed us to collect metrics from the infrastructure and pinpoint the issues. What did we find out?
We noticed that the Relational Database Service (RDS) wasn’t configured correctly. Here’s what was going on:
You can see this happening in these charts. The first one shows when burst credits run out, and the second one shows how it influences the database:
This caused unintentional throttling, limiting the amount of resources available to do what the application wanted to do. In these situations, the application would stop responding.
We solved this with several changes to the RDS configuration. After the optimization, the new database has a baseline performance of minimum 6,000 IOPS (and not 2,000), and doesn’t use IOPS bursts. This way, operations are spread more evenly and performance stays stable.
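To make the burst-credit mechanism concrete, here is a minimal sketch of a credit bucket that drains whenever demand exceeds the baseline. The 2,000 and 6,000 IOPS baselines come from the case study. The bucket size and the demand level are hypothetical values chosen for illustration, since the article doesn't state them, and the function name is ours.

```python
def seconds_until_credits_run_out(bucket_credits, baseline_iops, demand_iops):
    """Seconds of sustained demand before a burst bucket empties.

    Returns None when demand never exceeds the baseline, because the
    bucket then never drains and no throttling occurs.
    """
    drain_per_second = demand_iops - baseline_iops
    if drain_per_second <= 0:
        return None
    return bucket_credits / drain_per_second


# Assumption: bucket size and demand are illustrative, not Hemiko's real values.
BUCKET = 5_400_000
DEMAND = 3_000

before = seconds_until_credits_run_out(BUCKET, baseline_iops=2_000, demand_iops=DEMAND)
after = seconds_until_credits_run_out(BUCKET, baseline_iops=6_000, demand_iops=DEMAND)

print(before)  # 5400.0 seconds, i.e. 90 minutes until throttling starts
print(after)   # None: demand stays under the baseline, so nothing is throttled
```

Under these assumed numbers, the lower baseline survives only 90 minutes of sustained load before throttling, while the higher baseline never touches the bucket at all.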
After optimizing the RDS, we found out that worker nodes responsible for storage couldn’t keep up with the amount of data coming in and out of the database.
Hemiko’s application is sensitive to IOPS and latency, so the underlying infrastructure can’t stall. Here, because of suboptimal configuration, the worker nodes (literally the machines and drives that handle data reads and writes) were queuing operations for much too long. The application had to wait until the drives had worked through all the queued requests.
To be more specific, it took 6.6 minutes on average for data to be written to the drives, which dropped to 1.04 minutes after our optimizations. That makes writes roughly 6.35 times faster (the 635% figure), along with an 807% improvement in queuing.
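The arithmetic behind these figures can be checked with a short sketch. It uses only the two write times quoted above; the helper names are ours. It shows the same change expressed two ways: as a speedup factor and as a percentage reduction in time.

```python
def speedup(before, after):
    """How many times faster the 'after' measurement is."""
    return before / after


def percent_reduction(before, after):
    """How much of the original time was removed, in percent."""
    return (before - after) / before * 100


write_before_min = 6.6   # average write time before optimization (from the text)
write_after_min = 1.04   # average write time after optimization (from the text)

print(round(speedup(write_before_min, write_after_min), 2))            # 6.35
print(round(percent_reduction(write_before_min, write_after_min), 1))  # 84.2
```

Both numbers describe the same result: a 6.35-fold speedup means about 84% of the original write time was eliminated.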
Ultimately, the solution was rather simple to implement, but very hard to find—we had to scale up worker node storage.
With the storage optimized, it was time to look at the processors. Our monitoring tools showed that there was a lot of throttling going on here, as well.
The problem was that process limits weren’t set properly. When the limits are set too low, the application can’t take advantage of the cluster’s whole power, so it stalls. It’s like trying to fit a whole loaf of bread into a tiny panini maker—not a great way to make a grilled cheese sandwich.
There’s a thing called CFS (Completely Fair Scheduler), which automatically does some of the work of assigning CPU time to a process. But its main goal is to protect the infrastructure from a crash. Regardless of the CFS, when the limits are set too low, the CPU’s power goes to waste.
After our optimization, the CFS started distributing compute time more evenly. After that, we saw only one CPU throttling instance, which we quickly resolved, and there has been no throttling since.
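As a rough illustration of how this kind of throttling can be measured, the sketch below parses the text of a Linux cgroup `cpu.stat` file, which reports `nr_periods` and `nr_throttled` when a CPU limit is enforced. This is an assumption about tooling: the article doesn't say how the team collected its throttling metrics, and the sample values here are invented.

```python
def throttled_ratio(cpu_stat_text):
    """Fraction of scheduler periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no enforcement periods recorded, so nothing was throttled
    return stats.get("nr_throttled", 0) / periods


# Invented sample in the cgroup v1 layout; real files live under /sys/fs/cgroup.
SAMPLE = """nr_periods 1000
nr_throttled 250
throttled_time 4200000000
"""

print(throttled_ratio(SAMPLE))  # 0.25
```

A ratio that stays well above zero suggests the CPU limit is too low for the workload, which matches the symptom described above.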
We had already solved a problem with storage read/write queues, so now we had to check whether SQS (Simple Queue Service) processing time had improved after our updates.
These queues enable the application to deal with requests in an asynchronous way, and queue requests in a smart way to keep the system running at optimal performance.
When you see a large increase in queue processing time, it means something isn’t right. This makes SQS queue processing time a good target for end-to-end tests of your infrastructure.
Here, before our improvements, the oldest queued message was 8 hours and 51 minutes old. Total processing time was about 22 hours, with 15 hours until the last queued message.
In the month after our optimizations, processing time dropped to as low as 2 hours, an improvement of 1,100% compared with the original 22 hours.
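A check like this can be turned into a simple automated test. The sketch below flags samples where the age of the oldest queued message exceeds a limit. CloudWatch exposes this value for SQS as the `ApproximateAgeOfOldestMessage` metric, but fetching it is left out here, and the 4-hour threshold and sample series are hypothetical.

```python
HOUR = 3600


def stale_sample_indexes(age_samples_seconds, max_age_seconds):
    """Indexes of samples whose oldest-message age exceeds the limit."""
    return [
        index
        for index, age in enumerate(age_samples_seconds)
        if age > max_age_seconds
    ]


# Hypothetical series; the last value is the 8 h 51 min figure from the text.
samples = [1 * HOUR, 2 * HOUR, 8 * HOUR + 51 * 60]

print(stale_sample_indexes(samples, max_age_seconds=4 * HOUR))  # [2]
```

Only the pre-optimization value crosses the assumed limit, which is the kind of signal that would have pointed at the infrastructure rather than the code.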
As you can see, DevOps really is a kind of special agent role that requires broad knowledge.
During this engagement, we were able to optimize Hemiko’s application code, build robust monitoring for the underlying infrastructure, and remove several critical bottlenecks.
Ultimately, we improved the performance of our client’s application by over 1,000% in some areas and reduced monthly costs for their cloud cluster by 40%.
We’re still working with Hemiko, continuously looking for ways to optimize and improve their system.
To wrap up this case study, here’s a quick recap:
We can leverage the skills of our DevOps experts to optimize your infrastructure and lower your costs, too.
If you notice laggy performance in your app, we’d be more than happy to help you out! All you have to do is tell us about your project.