How embracing failure can help design a better system

Headshot of Aleksa Vukotic.

Since our school days we have been advised to be proactive rather than reactive – instead of waiting for a failure to happen, anticipate the problem and solve it before it affects anything. There are clear benefits to such an approach – fewer (but not zero) failures.

When it comes to the operational approach to IT systems, the aim (not always achieved) has been to be as proactive as possible – redundant systems, early alerts on performance degradation, predictive analysis of system statistics. There is a wide range of tools available today for such tasks – hardware monitoring systems that measure CPU/memory/IO stats over time, network monitoring tools that can measure response times for every peer-to-peer integration point, instrumentation-based monitoring of your application internals, etc.

However, even with such a great set of tools and constant innovation in the field, there is one truth we can’t escape – failures still happen. Hardware is unreliable, networks are flaky, software bugs are a reality – these are hard truths that anyone who has built and run software in production can vouch for. Therefore, even with the best tools in the world, and with the will and resources for proactive monitoring and management of complex IT systems, it can all fail miserably when a real failure hits a production system.

What has been the traditional approach when a failure is detected in the system? It varies from organization to organization, but the process usually involves incident management calls with personnel ranging from developers and ops/devops to product owners and customer service representatives, all trying to i) apply short-term fixes – firefight the problem to return the system to normal service; and at the same time ii) understand the root cause so it can be addressed and future incidents prevented.

It is important to have a robust process for dealing with failure – but it’s expensive and time-consuming as well. The firefighting bit is perfectly reasonable – there is not much reason to repair the houses or build new ones while ‘London is still burning’. However, understanding the root cause and preventing similar problems in the future is immeasurably more important (hands up anyone who has been involved in firefighting the same issue more than once? More than twice? Exactly.) The usual scenario is that firefighting takes up most of the time of everyone involved on the day, so there is less time and concentration left for the all-important root cause analysis and future prevention – increasing the time and cost of the exercise.

Is there a better way to approach system failure situations? And do not mention prevention – as we already discussed, failures do and will happen, typically independently of our actions and at the most inconvenient time possible.

What if we designed our systems so that they do the firefighting bit on their own, independently and without our help?

Then there is another type of production failure that is relatively frequent – failures of external systems. It’s not just our product that suffers from the universal truth that hardware and software failures are inevitable – every other system we interact with does as well. Even Google’s services fail occasionally (rarely, but very visibly if you follow any kind of social media).

And here is another truth – we can’t do much about it. If our system depends on an external one, failure of the dependency will invariably affect our system – there is no way to proactively avoid it.

However, there is something we can do – design our system in such a way that it can handle failures on its own. Allow external services to fail, with the full expectation that the parts of our product which depend on them will have degraded functionality – and let our customers know. Then, when the external service incident is resolved (you can expect Google to be quick at it, other providers maybe less so), our system should spring back to life as if nothing had happened – importantly, without any manual intervention.
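One common way to get this behaviour is a circuit breaker. The following is only a minimal sketch in Python, not a production implementation: the class name `CircuitBreaker`, its parameters `max_failures` and `reset_after`, and the `operation`/`fallback` callables are all hypothetical names chosen for illustration. After a few consecutive failures it stops calling the external service and serves a degraded response; once a cool-down period has passed it lets a trial call through, and a success restores normal service with no manual intervention.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Illustrative breaker: degrade while a dependency is down, recover on its own."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after = reset_after      # seconds to wait before a trial call
        self.clock = clock                  # injectable so the sketch is testable
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, operation: Callable[[], str], fallback: Callable[[], str]) -> str:
        # Circuit open and still cooling down: do not touch the dependency at all.
        if self.opened_at is not None and self.clock() - self.opened_at < self.reset_after:
            return fallback()
        try:
            result = operation()            # normal call, or a trial call after cool-down
        except Exception:
            self.failures += 1
            # Open the circuit (or re-open it if the trial call failed).
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # Success: close the circuit and resume normal service.
        self.failures = 0
        self.opened_at = None
        return result
```

The fallback is where "let our customers know" lives: it can return cached data, a reduced feature set, or an explicit message that part of the product is temporarily unavailable.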

Development best practices and devops techniques go a long way in making failure detection and recovery efficient. I won’t muse about those here. Instead, I’d like to explore how system design can help us handle failures in various scenarios.

The two scenarios I described above can be addressed by a different approach to system design – using reactive systems. The ideas underpinning reactive systems design have been around for a while, but have gained traction with the recent popularity of microservices architectures, containers and cloud infrastructure.

As described in the Reactive Manifesto (https://www.reactivemanifesto.org/), reactive systems are Responsive, Resilient, Elastic and Message-Driven.

Let’s consider how each of these characteristics affects the failure scenarios we described before, thinking about the design of the system as a whole:

Responsive – a responsive system will give a response to the user even under duress. If a user request cannot be served, we will at least be sure to let the user know, without hanging and potentially exhausting resources in a way that would bring the entire system to a halt. For our scenarios, this would minimise the number of firefighting incidents – the system will never be completely unusable, so we can concentrate our efforts on the parts of it exposed to failure.
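As a rough illustration of "always answer, never hang", here is a small Python sketch that puts a deadline on a request handler. The function name `respond_within`, the pool size and the response strings are assumptions made up for this example; only the standard `concurrent.futures` module is used.

```python
import concurrent.futures
from typing import Callable

# A small, bounded worker pool: slow handlers cannot spawn unlimited threads.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def respond_within(handler: Callable[[], str], timeout_s: float) -> str:
    """Run a handler, but always give the caller an answer within timeout_s."""
    future = _pool.submit(handler)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: cancel() stops work that has not started yet,
        # it cannot interrupt a handler that is already running.
        future.cancel()
        return "503 Service busy, please retry"
    except Exception:
        # The handler itself failed: tell the user instead of hanging.
        return "500 Request failed"
```

Note the limitation spelled out in the comment: the caller gets a timely answer, but a handler that is already running keeps occupying its worker until it finishes, so real systems usually combine this with timeouts inside the handler as well.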

Resilient – the system should be as resilient to failure as possible: failure of one part should not impact other components, and in case of external failure the system should be able to self-heal. Think of the compartmentalization of large ships – in case of a hull breach, the breached compartment will fill with water. But because the compartments are separated, water won’t fill the entire ship, and the ship can continue to sail at least until the next port or destination. The key techniques used to build a system with resilient characteristics are: i) component isolation (so that the failure of one component does not impact the stability of the system as a whole) and ii) replication (so that in case of component failure, there are other instances of the same component that can perform the task at hand).

What does it mean for us? Failure of a single host, disk, application or process would not affect the system as a whole – other replicated instances will continue serving users as normal. In case of external service failure, parts of our system may well be affected – but given the isolation of individual components, the failure will not cascade to other parts of the system, which will continue working as normal.
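The two techniques named above, isolation and replication, can each be sketched in a few lines of Python. Everything here is hypothetical and simplified: `Bulkhead` borrows the ship metaphor to cap how many concurrent calls one component may consume, and `call_with_failover` tries replicas in order until one answers. Real systems would add health checks, retries with delays and metrics.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence


class BulkheadFull(Exception):
    """Raised when a component has used up its own share of capacity."""


class Bulkhead:
    """Isolation: each component gets its own limited number of slots."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def enter(self) -> Iterator[None]:
        # Reject immediately instead of queueing behind a struggling component.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name} is at capacity")
        try:
            yield
        finally:
            self._slots.release()


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Replication: try each replica in turn and return the first success."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except Exception as error:
            last_error = error      # remember the failure, move to the next replica
    raise RuntimeError("all replicas failed") from last_error
```

With one bulkhead per dependency, a flooded "payments" compartment rejects new payment calls while "search" keeps its own slots and carries on as normal.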

Elastic – in any modern software system it should be possible to increase (or decrease) capacity by changing the number of deployed services. A reactive system goes one step further and reacts to changes in load automatically – by adding and removing resources as required. If you expect higher load on your system at certain times (think last day of the month for payments, or tax submission deadline day), then by allowing the system to be elastic and increase its resources – be it threads, processes or even machines – one can avoid failures due to increased demand (and all the incident phone calls that would follow).
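The decision at the heart of elasticity can be very simple. The Python sketch below is an illustration only: the function `desired_instances`, the 20% `headroom` default and the instance limits are assumptions for this example, not values taken from any particular autoscaler. It sizes the deployment from the observed load and clamps the result between a floor and a ceiling.

```python
import math


def desired_instances(requests_per_second: float,
                      capacity_per_instance: float,
                      min_instances: int = 1,
                      max_instances: int = 20,
                      headroom: float = 0.2) -> int:
    """Return how many instances to run for the observed load."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    # Provision for the observed load plus some spare room for sudden spikes.
    needed = math.ceil(requests_per_second * (1 + headroom) / capacity_per_instance)
    # Never scale to zero, and never beyond what we are willing to pay for.
    return max(min_instances, min(max_instances, needed))
```

For example, at 100 requests per second with instances that each handle 50, the sketch asks for 3 instances rather than 2, because of the headroom. The same calculation applies whether an "instance" is a thread, a process or a machine.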

Message-driven – to be truly reactive, systems should allow location and time transparency by communicating asynchronously via message passing. There are many nice features that come with fully async systems – no cascading of failures, natural load balancing due to location transparency, easy flow control using backpressure. In many ways, the message-driven nature of reactive systems complements all of their other key characteristics – responsiveness, resilience and elasticity.
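Backpressure in particular is easy to show in miniature. In this Python sketch (the names `producer`, `consumer`, `run_pipeline` and `max_pending` are hypothetical), two components communicate only through a bounded `asyncio.Queue`. When the consumer falls behind and the queue fills up, `put` suspends the producer instead of letting messages pile up without limit.

```python
import asyncio
from typing import List, Optional


async def producer(queue: "asyncio.Queue[Optional[str]]", messages: List[str]) -> None:
    for message in messages:
        # Suspends here while the queue is full: this is the backpressure.
        await queue.put(message)
    await queue.put(None)               # sentinel: no more messages


async def consumer(queue: "asyncio.Queue[Optional[str]]", processed: List[str]) -> None:
    while True:
        message = await queue.get()
        if message is None:
            break
        await asyncio.sleep(0)          # stand-in for real, possibly slow, work
        processed.append(message.upper())


async def run_pipeline(messages: List[str], max_pending: int = 2) -> List[str]:
    # The bound on the queue is the only coupling between the two components.
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue(maxsize=max_pending)
    processed: List[str] = []
    await asyncio.gather(producer(queue, messages), consumer(queue, processed))
    return processed
```

Running `asyncio.run(run_pipeline(["a", "b", "c"]))` processes the messages in order. In a distributed system the queue would be a message broker or a stream with demand signalling, but the principle is the same: the slow side controls the pace.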

That said, I should also acknowledge that it’s extremely difficult to build a fully message-driven reactive system in some use cases. It’s a topic for another blog post, but some security and data retention rules, and some short-time-span features, are sometimes hard to marry with async messaging, and fit more naturally with sync communication, using REST for example. But regardless of that, some aspects of the asynchronous approach should always be strived for – e.g. non-blocking communication, location transparency and internal backpressure mechanisms, to name a few.

These ideas are not new, but they are important for building any scalable computer system. They are even more important in the technology startup world, where we usually start small but can easily grow big, as large as world-wide-web infinity would allow. Being able to start nimble, and then use the same system design to grow easily to large scale, is a great feature to have.

And that’s all before any implementation detail – you’ll notice that we haven’t mentioned any of the typical buzzwords yet: functional reactive programming, streams, or any other language or technology. While tools and languages are important, it’s more important to realize that a well-designed system can be implemented in many ways, using different tools and technologies – one should be able to easily change or replace any component, refactor code or rewrite it in another language, or decide on a technology stack in reaction to resource availability, team size or innovation in software development practices. By getting the design decisions right, it is possible to build a POC or MVP that simply works, and at the same time can evolve and be extended into something much larger.

This is where reactive systems design shines. Its embrace of failure and its design-for-failure mantra are the key strengths you should use.

 
