Self-Healing Software Systems

A self-healing software system is one that can detect and correct its own failures without human intervention. Such a system is more dependable and fault-tolerant, which improves quality, reduces cost and bolsters customer trust. It continuously monitors for deviations from expected behavior and restores itself to normal operation when one is observed.

INTRODUCTION

No matter how carefully we design and build a software-based system, some unforeseen bugs or failures will surface once it has been deployed in production. As a software architect and system designer, you control how the system responds to these inevitable failures, and you can design it to respond quickly and efficiently. One design option for building resilience into the system is the ability to self-heal, that is, to recover from failures on its own.

It must be noted, however, that recovering from a failure is often not enough, and human intervention is still required. Even after the system has been restored to its normal state, the cause of the failure still needs to be investigated and the underlying bug fixed. For this reason, self-healing usually goes hand in hand with investigating the causes of the failure.

COMPONENTS

A self-healing software system typically includes the following two components:

1. A Monitoring Component proactively and continuously checks the system for any anomaly or deviation from expected functionality. Examples include using logging, time-to-live (TTL) checks or ping to monitor the server, network, CPU, hardware, application performance, exceptions thrown and processes terminated by the OS.

2. A Restore Component takes the actions necessary to return the system to normal operation. Examples include retry, reboot, fault masking, roll-back, graceful degradation, configuration changes and switching to redundant hardware. A restore action may be reactive (taken after a failure has been detected) or proactive (taken when a failure is predicted, before it happens).
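
The two components above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference design: the names HealthCheck, SelfHealer and run_once are made up for this example, and the "service" is a plain dictionary standing in for a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HealthCheck:
    """One monitored condition paired with the action meant to repair it."""
    name: str
    is_healthy: Callable[[], bool]   # monitoring component: False signals an anomaly
    restore: Callable[[], None]      # restore component: action that should fix it


@dataclass
class SelfHealer:
    checks: List[HealthCheck]
    log: List[str] = field(default_factory=list)

    def run_once(self) -> int:
        """Run every check once and restore anything unhealthy.

        Returns the number of restore actions that were triggered.
        """
        restored = 0
        for check in self.checks:
            if check.is_healthy():
                continue
            self.log.append(f"anomaly detected: {check.name}")
            check.restore()
            restored += 1
            # Re-check so that a failed restore is recorded, not silently ignored.
            status = "recovered" if check.is_healthy() else "still failing"
            self.log.append(f"{check.name}: {status}")
        return restored


# Usage: a fake service that is down until the restore action "restarts" it.
service = {"up": False}
healer = SelfHealer([
    HealthCheck(
        name="demo-service",
        is_healthy=lambda: service["up"],
        restore=lambda: service.update(up=True),
    ),
])
healer.run_once()
print(healer.log)
```

In a real deployment run_once would be called on a schedule, and the log entries would go to whatever alerting channel the team uses, so that the cause of each failure can still be investigated.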

BENEFITS

Self-healing capabilities in a software system save much of the cost and time required to fix a failure in production. This has become a critical research domain for large IT companies, where system downtime carries a high business cost in lost customers and reputation. Big companies like Google, IBM and Facebook are investing heavily in this domain.

EXAMPLES

A monitoring component logs an exception that crashes the application, and a restoring component restarts the application while also notifying the developers so they can fix the bug.

A monitoring component logs a null pointer exception that crashes the application, and a restoring component wraps the faulty code in a null pointer check, then builds, deploys and restarts the application.

A monitoring component treats server CPU usage that stays above 80% for 2 minutes as a deviation from normal operation and triggers the restoring component, which routes further requests to a redundant server.
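
A rule like this one needs the monitor to remember how long the metric has been above the limit, so that a single spike does not trigger a failover. The sketch below is one possible way to do that. The class SustainedThreshold, the function pick_server and the server names are hypothetical, and the CPU samples are hard-coded rather than read from a real host.

```python
from typing import Optional


class SustainedThreshold:
    """Fires when a metric stays above `limit` for at least `hold_seconds`."""

    def __init__(self, limit: float, hold_seconds: float) -> None:
        self.limit = limit
        self.hold_seconds = hold_seconds
        # Time at which the metric first exceeded the limit, or None if it is below.
        self._above_since: Optional[float] = None

    def observe(self, value: float, now: float) -> bool:
        """Record one sample taken at time `now` (seconds); True means a deviation."""
        if value <= self.limit:
            self._above_since = None   # a dip below the limit resets the window
            return False
        if self._above_since is None:
            self._above_since = now
        return now - self._above_since >= self.hold_seconds


def pick_server(overloaded: bool) -> str:
    # Hypothetical server names; a real restore action would update a load balancer.
    return "standby" if overloaded else "primary"


detector = SustainedThreshold(limit=80.0, hold_seconds=120.0)
# (time in seconds, CPU usage in percent); made-up samples for illustration.
samples = [(0, 85.0), (60, 90.0), (90, 70.0), (100, 95.0), (160, 92.0), (220, 88.0)]
for t, cpu in samples:
    overloaded = detector.observe(cpu, now=t)
    print(t, cpu, pick_server(overloaded))
```

With these samples the dip at 90 seconds resets the window, so requests only move to the standby server at 220 seconds, once usage has been above 80% for a full 2 minutes.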

A monitoring component detects that a network or database connection failed to establish, and the restoring component retries the connection.
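
A retry restore action might look like the following sketch. The exponential backoff between attempts is an added assumption (the example above only says "retry"), and retry, flaky_connect and the default values are illustrative names and numbers, not part of any library.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry(connect: Callable[[], T], attempts: int = 5, base_delay: float = 0.5,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `connect` until it succeeds or `attempts` is exhausted.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                  # out of attempts: surface the failure
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Usage: a fake connection that fails twice and then succeeds.
calls = {"n": 0}


def flaky_connect() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection refused")
    return "connected"


delays: List[float] = []
# Passing delays.append as `sleep` records the waits instead of actually sleeping.
print(retry(flaky_connect, sleep=delays.append))
print(delays)
```

Re-raising after the last attempt matters: a connection that never comes back should be escalated to a human rather than retried forever.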

A monitoring component finds that a process has been terminated by the operating system and triggers the restoring component to restart the process, while also notifying the developers so they can analyze the cause of the failure.
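
As a rough sketch of this pattern, the function below runs a command with Python's standard subprocess module and restarts it whenever it exits abnormally. The name supervise, the restart limit and the notify callback are assumptions made for this example. A production system would more likely rely on an existing process supervisor.

```python
import subprocess
import sys
from typing import Callable, List


def supervise(cmd: List[str], max_restarts: int,
              notify: Callable[[str], None]) -> int:
    """Run `cmd`, restarting it when it exits abnormally. Returns the restart count."""
    restarts = 0
    while True:
        code = subprocess.run(cmd).returncode
        if code == 0:
            return restarts            # clean exit: nothing to heal
        # On POSIX, a negative return code means the process was killed by a signal.
        notify(f"process exited with code {code}")
        if restarts == max_restarts:
            notify("restart limit reached, giving up")
            return restarts
        restarts += 1


# Usage: a child process that always crashes with exit code 3.
messages: List[str] = []
crashing = [sys.executable, "-c", "import sys; sys.exit(3)"]
print(supervise(crashing, max_restarts=2, notify=messages.append))
print(messages)
```

The restart limit is a deliberate choice: without it, a process that crashes immediately on startup would be restarted in a tight loop, hiding the failure instead of handing it to the developers.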

A monitoring component observes that memory usage has reached a critical threshold of 90% and triggers the restoring component to scale or restart the service consuming the memory, while notifying the developers so they can fix the underlying issue.

CONCLUSION

Software applications run on a wide range of hardware, from mobile devices to cloud-based clusters. Users of cloud-based applications, such as Google Translate, expect low response times and 100% availability. These expectations cannot be met by older software architectures. One of the core design principles in modern software architectures is resilience: systems are built to be significantly more tolerant of failure, and when failure does occur, they meet it with elegance rather than disaster. Popular cloud computing service providers like Amazon, Oracle, IBM, Google and Microsoft are investing heavily in self-healing and self-managing technologies to make their services more efficient.

Software Engineering for Safety-Critical Systems
