Dealing with a Production Incident After Midnight
▻https://hackernoon.com/production-incidents-at-night-postmortem-c9233ec1de7e?source=rss----3a81
by Dominic Fraser
[Image: A night worker on a very different infrastructure project takes in the skyline (Shanghai, China)]
From 21:22 UTC on Wednesday, December 12th, a sustained increase in 500 error responses was seen on Skyscanner's Flight Search Results page, and in the early hours of the following morning 9% of Flight Search traffic was being served a 500 response.
As a junior software engineer, this was my first experience of helping to mitigate an out-of-hours production incident. While I had previously collaborated on comparable follow-up investigations, I had never before been the one problem-solving during the night (while sleep-deprived)! This post walks through some specifics of the incident and, by describing the event sequentially (rather than as simply a summary of actions), hopefully (...)