5 things Railway’s 8-hour outage should change about how you think about redundancy
Railway experienced an eight-hour outage due to an incorrect suspension of its Google Cloud account, rather than a traditional cloud failure. This incident highlights the importance of considering account management and control plane dependencies in redundancy planning. Many teams focus on hardware and regional failures but often overlook the risks associated with automated account actions.
- Railway's outage was caused by Google Cloud incorrectly suspending its production account.
- The incident affected not only Google Cloud but also the routing control plane, which was crucial for service continuity.
- Many redundancy plans do not account for the possibility of an account being suspended by automated systems.
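
One way to make the last point concrete is to audit a deployment inventory for dependencies that every workload shares, including the account and the control plane, not just the provider and region. The sketch below is illustrative only: the `Deployment` fields, the `single_points_of_failure` helper, and the sample fleet are hypothetical and do not describe Railway's actual topology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    name: str
    provider: str       # e.g. "gcp", "aws", "metal"
    account: str        # billing/owner account the workload runs under
    region: str
    control_plane: str  # account that hosts the routing/config control plane


def single_points_of_failure(deployments):
    """Return each dimension on which every deployment shares one value."""
    shared = {}
    for dim in ("provider", "account", "region", "control_plane"):
        values = {getattr(d, dim) for d in deployments}
        if len(values) == 1:
            # All deployments depend on the same thing in this dimension.
            shared[dim] = values.pop()
    return shared


# Hypothetical inventory: three providers, but one shared control plane.
fleet = [
    Deployment("api-gcp", "gcp", "gcp-prod", "us-west1", "gcp-prod"),
    Deployment("api-aws", "aws", "aws-prod", "us-east-1", "gcp-prod"),
    Deployment("api-metal", "metal", "colo-1", "ams", "gcp-prod"),
]

print(single_points_of_failure(fleet))
```

In this made-up fleet the compute is spread across three providers, yet the audit still flags `control_plane` as shared: suspending that one account would affect all three deployments at once.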
bishwas jha, posted on May 22
#aws #runway #architecture #gcp

Railway runs on Google Cloud, AWS, and its own metal. So when I first saw that Railway was down for hours, my first thought was probably the same as yours: "How does a multi-cloud platform go dark like that?" Then I read the incident report, the Hacker News discussion, and the follow-up coverage. And the real lesson is uncomfortable.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.