Charity calls herself an Operations Engineer. As she has moved from one job to the next, her career has revolved around taking care of computers, and she usually ends up in charge of the databases.
She has significant leadership experience, since she ran operations at Linden Lab as Systems Engineering Manager and was Infrastructure Tech Lead at Parse, a mobile backend-as-a-service company eventually acquired by Facebook. Currently, Charity is the Chief Technology Officer and Co-Founder of Honeycomb—a six-year-old startup promoting observability in software engineering.
Charity’s challenges at Parse inspired Honeycomb’s focus on observability. Parse provided mobile backend services to developers—a novelty on the market at that time—and their team had to offer this service with fewer tools and languages.
Because of that, the platform was frequently down, and it was always a struggle to locate the source of the problem. That changed when they discovered a tool that drastically reduced the time it took to identify new problems, from hours and days to minutes and seconds, and helped the team understand their systems.
As a result, Charity realized that the previous generation of tools had been designed for predictable infrastructure and required constant monitoring, whereas modern infrastructure is ephemeral, dynamic, and transient and needs tools built for those conditions.
This inspired Honeycomb’s focus on observability: a switch from tools built for monitoring systems and dealing with problems reactively to tools that let teams understand and assess a system from the outside with little monitoring.
This article covers our session with Charity extensively. However, you can watch the full interview via the link below if you’d rather hear her discuss the subject directly:
Software standards have changed over the years. Users and developers are now drawn toward tools that are easy to use and less monolithic, as downtimes and system malfunctions are inevitable. However, many teams still struggle with building better software that is easy to fix.
If you’re wondering how to make your tools easy to fix and use, Charity recommends the following strategies:
According to Charity, “From a software engineering context, observability is the ability to understand any internal system externally.” The term comes from control theory, where it is used in mechanical engineering to describe how well the internal state of a control system can be inferred from its external outputs.
Previously, tech teams relied on static dashboards and metrics-based monitoring tools to understand systems, but these tools could not help them analyze a system once it malfunctioned.
As a result, they had to supplement the metrics with data they collected and recorded by hand whenever a system malfunctioned, and then use it to predict the source of the failure. However, those predictions were unreliable, since they were guesswork at best.
As systems have become more complex, and increasingly serverless, there is a growing need for the more systematic approach to understanding systems that observability provides.
Charity also submits that “observability is important because it connects engineers with their users’ experience.” Developers previously had to consult the operations team to understand monitoring tools. However, observability speaks to the engineer in a language of endpoints and variables that they understand.
In addition, although infrastructures have become more resilient and reliable these days, developers still need to understand how their code works and how their users are experiencing it to fix bugs swiftly. Your team may even need to log more information to understand their code. With metrics-based monitoring, adding more information comes with extra cost, but with observability, it is effectively free.
Observability relies on three pillars—cardinality, dimensionality, and explorability—to make systems understandable.
For better context, cardinality refers to the number of unique values a dimension can take within a set. For instance, if your platform has 100 users, its highest-cardinality dimension could be a unique ID, such as a user’s social security number, which has 100 distinct values. The lowest-cardinality dimension could be species, since all users belong to the human species and the dimension has a single value.
Previously, tools for understanding systems could only handle low-cardinality dimensions, making it difficult to identify and fix bugs swiftly. That is because the most valuable information when debugging will almost always be in the high-cardinality dimensions. Hence, your team can record high-cardinality dimensions when instrumenting their code to make their system easy to debug.
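To make the idea concrete, here is a minimal Python sketch that counts the distinct values of each dimension in a handful of events. The field names (`user_id`, `species`, `status`) and the sample data are assumptions made up for illustration; they do not come from Honeycomb or from the interview.

```python
from collections import defaultdict


def cardinality(events):
    """Return the number of distinct values seen for each dimension."""
    values = defaultdict(set)
    for event in events:
        for dimension, value in event.items():
            values[dimension].add(value)
    return {dimension: len(seen) for dimension, seen in values.items()}


# Hypothetical sample data: three requests from three different users.
events = [
    {"user_id": "u-001", "species": "human", "status": 200},
    {"user_id": "u-002", "species": "human", "status": 500},
    {"user_id": "u-003", "species": "human", "status": 200},
]

counts = cardinality(events)
print(counts)
```

In this sample, `user_id` has as many distinct values as there are users, while `species` has only one, which is why the user ID is the dimension that helps you pin a failure on a specific request.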
The second pillar, dimensionality, refers to the number of dimensions in an event. With metrics, you would only have a few dimensions, but with observability, you can have hundreds of dimensions strung together.
This comes in handy when you’re trying to find outliers, because you’ll see spikes of errors on your dashboard, unlike in the past, when you’d have to guess the source of the outliers. Hence, your team can initialize an empty structured event and populate the event with as many dimensions as possible to make it easier to figure out what is failing.
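As a rough illustration of that last step, the sketch below starts a request with an empty event and adds dimensions as the work progresses. The function name `handle_request`, the field names, and the simulated failure on `/broken` are all hypothetical; a real service would send the finished event to its observability backend rather than print it.

```python
import json
import time


def handle_request(user_id, endpoint, build_id):
    # Start with an empty event and add dimensions as the request progresses.
    event = {}
    event["timestamp"] = time.time()
    event["user_id"] = user_id
    event["endpoint"] = endpoint
    event["build_id"] = build_id

    start = time.perf_counter()
    try:
        # Stand-in for real work; "/broken" simulates a failing endpoint.
        if endpoint == "/broken":
            raise ValueError("simulated failure")
        event["status"] = 200
    except ValueError as exc:
        event["status"] = 500
        event["error"] = str(exc)
    finally:
        # Record the duration whether the request succeeded or failed.
        event["duration_ms"] = (time.perf_counter() - start) * 1000

    return event


ok_event = handle_request("u-001", "/home", "build-42")
bad_event = handle_request("u-002", "/broken", "build-42")
print(json.dumps(bad_event))
```

Each request produces one wide event, so every dimension you add is available later without changing how the data is collected.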
Finally, explorability refers to the state of being open-ended. Dashboards of the past tended to be static, and you could not dig deeper with them. Observability creates an explorable way of dealing with system data, allowing you to go back and reconstruct and correlate all the outliers using data from instrumented code, such as database queries, collected and stored in wide structured logs. This way, you can trace the problem back to its source instead of jumping between different calls.
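One way to picture explorability is a query that can break errors down by any dimension you choose after the fact, rather than by a fixed dashboard view. The sketch below assumes events are plain dictionaries with a `status` field; the helper `error_breakdown` and the sample data are hypothetical and are not how any particular product implements this.

```python
from collections import Counter


def error_breakdown(events, dimension):
    """Count failed events (status >= 500) per value of the chosen dimension."""
    return Counter(
        event.get(dimension)
        for event in events
        if event.get("status", 0) >= 500
    )


# Hypothetical wide events collected from a service.
events = [
    {"user_id": "u-001", "endpoint": "/home", "build_id": "build-41", "status": 200},
    {"user_id": "u-002", "endpoint": "/cart", "build_id": "build-42", "status": 500},
    {"user_id": "u-003", "endpoint": "/home", "build_id": "build-42", "status": 500},
    {"user_id": "u-004", "endpoint": "/cart", "build_id": "build-41", "status": 200},
]

by_endpoint = error_breakdown(events, "endpoint")
by_build = error_breakdown(events, "build_id")
print(dict(by_endpoint), dict(by_build))
```

In this sample, grouping by endpoint spreads the errors evenly, while grouping by build shows that every failure comes from one build. The point is that you can keep asking new questions of the same stored events.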
Essentially, you can only use your systems to the extent that you understand them. Charity says, “You can understand your system without prior knowledge or shipping custom codes if you adopt observability.”
Charity recommends implementing observability earlier in the process. Unfortunately, people write code without considering what happens once it’s shipped. With observability, however, when you get into the rhythm of instrumenting, shipping, and ensuring that your code is performing as expected, you’ll make fewer errors, and you’ll have cleaner code and more understandable systems.
If you had raised the concept of on-call developers years ago, you would have received a lot of pushback from the development community. But systems have become more complex, and there may be no better way to understand and build systems than having engineers be on-call for their own code.
In Charity’s words, “Working on-call makes developers more in tune with the users, write better code, and conversant with the system’s performance.” Overall, being on-call makes people better engineers and may be inevitable, since, according to Charity, “Anyone who builds a 24/7-available service should be willing to be woken up a couple of times a year.”
However, she also believes that nobody who runs a highly available service should be woken up more than a few times a year. In other words, your developers shouldn’t find working on-call severely life-impacting or dreadful. Here are some tips that should help you implement on-call work successfully:
Asking developers to work on-call should not be treated as an avenue to get them to work overtime. You should ensure that your team has enough time to fix whatever is wrong outside their call time.
Also, people on-call shouldn’t be expected to ship features or write code. Their call period should be dedicated to interacting with the system, tidying up, and fixing things.
You should also encourage developers to use their best judgment as software engineers to do something different and get some really meaningful work done while on-call.
There are as many different call rotations as there are teams. Traditionally, you could be on-call for a week. You could also rotate the platform and product teams. The platform team would be much more focused on reliability, while the product team would focus on features, interaction, and bug fixes for supporting customers directly. Customer success or internal teams that interface with customers could also be on-call and rotated weekly.
If you have a distributed team, you can also take advantage of the difference in time zones with a “follow-the-sun rotation strategy,” as Charity calls it, where each team member is on for 12 hours and off for 12 hours in their own location, or covers half a week at a time.
The rotation your team adopts may also depend on how heavy your team’s workload is and whether the team is in a transitional phase or has different configurations lined up. Either way, picking the rotation specifically tailored to your team is best.
Charity also recommends adopting a shadowing strategy for call rotations where one team member with more experience being on-call would be the primary on-call team member, and new team members would be secondary. This way, they can always escalate things to the primary when they’re stuck.
Charity defines sociotechnical systems as the complex interactions and interweaving of humans, the software we build, and the software’s impact on us. This is a much better way of thinking about software, because coding is never enough. You also have to consider the people writing it. Here are some ways to treat software development as a sociotechnical system:
Charity recommends hiring not just people managers but managers with the technical skill set to get the job done and vice versa. Managers with the right skill set are more likely to consider both the code and the humans involved in the process.
Make sure that the i’s are dotted and the t’s are crossed promptly. Also, when you have outages, allocate enough time to figure out what happened and fix the problem.
As an engineering leader, you’re most likely under a lot of pressure to ship faster, as there’s always something to fix. For starters, you should be able to estimate how long it takes each team member to ship a single line of code to production using the normal continuous integration and continuous delivery (CI/CD) process.
However, things change when your team members start waiting on each other. Your workload gets larger and it takes longer to review code and ship it to production. You may also need to attend to each team member’s code in batches and track special releases while still debugging.
Although your fastest team member may be able to ship one line of code to production within 15 minutes, this lineup of processes could create a longer interval before the whole code is eventually shipped to production. Hence, if you want to deploy software faster, you must keep the interval short and manageable. Here are some tips that can help you shorten the interval:
You can instrument your build pipeline as a trace to visualize it and see which process takes the most time, which tests are taking too long, and which tests need to be taken out. According to Charity, “It’s not the most exciting thing to do, but it’s not technically challenging, either. It just needs to be done.”
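A minimal sketch of this idea in Python is shown below: each build step is wrapped in a timed span, and the slowest span is reported at the end. The `PipelineTrace` class, the step names, and the `time.sleep` calls standing in for real work are all assumptions for illustration; a real pipeline would more likely emit spans through a tracing library or its CI system’s own tooling.

```python
import time
from contextlib import contextmanager


class PipelineTrace:
    """Collects one timed span per build step."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock  # injectable so the timing can be faked in tests
        self.spans = []

    @contextmanager
    def span(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Record the span even if the step raised an exception.
            self.spans.append({"name": name, "duration_s": self.clock() - start})

    def slowest(self):
        return max(self.spans, key=lambda s: s["duration_s"])


trace = PipelineTrace()
with trace.span("checkout"):
    time.sleep(0.01)  # placeholder for the real step
with trace.span("unit_tests"):
    time.sleep(0.05)  # placeholder for the real step
with trace.span("package"):
    time.sleep(0.01)  # placeholder for the real step

slowest_step = trace.slowest()["name"]
print(slowest_step)
```

Once every step reports a span, the same data tells you which tests are worth speeding up or removing.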
Charity recommends setting internal time slots for your team and ensuring that your team does not exceed the allotted time for each process. Whatever process exceeds the allotted time can be taken on by whoever is on-call to avoid using up the time for other processes.
It’s getting more and more difficult to run systems. Hence, technical teams have to figure out how to build and fix systems faster and more efficiently. To do that, those technical teams need to understand and assess their systems well.
They also need a sociotechnical approach to coding. Doing so helps them build systems that are easy to fix and can be deployed faster. The tips presented here should help you understand, assess, and build sociotechnical software.
Thank you for reading our article. If you found it interesting, we recommend that you check out the following resources on our blog:
Are you looking to boost your software development team or hire developers to work with you on-call? Then you should definitely take a look at the services we provide and reach out to us. We’re always willing and able to support you with successfully delivering your projects faster!