AI-Powered Self-Healing Cloud Infrastructures: A Paradigm For Autonomous Fault Recovery
Abstract
This paper is about self-healing cloud infrastructures equipped with artificial intelligence to enable autonomous recovery from unforeseen runtime faults. As cloud-in-a-robot or robot clouds become a reality and promise to go beyond the cloud paradigm by enhancing edge computing platforms to deliver ultra-low-latency services to users, making them self-reliable is crucial. Although there is plenty of prior work on building robust deep learning models and deploying them in the cloud, there is a lack of comprehensive and systematic real-time fault recovery frameworks. This paper provides a detailed delineation of the different challenges and key aspects that are overlooked in prior work on building resilient cloud infrastructures. We present a system design of AI-powered self-healing cloud infrastructures that applies AI to different levels, including autonomous fault detection, reasoning-based fault diagnosis, and many techniques that use deep reinforcement learning to ensure expedited repair times.
Metrics
Downloads
Published
How to Cite
Issue
Section
License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
CC Attribution-NonCommercial-NoDerivatives 4.0