.Net, Azure and occasionally gamedev

Unexpected downtime

2018/04/11

From Tuesday, April 10th, 22:05 until Wednesday, April 11th, 11:30, my websites were unavailable and displayed only 502.5 error pages.

Root cause:

A background WebJob got stuck in a continuous loop and consumed 100% of the CPU and ~90% of the RAM, causing all web apps hosted within the same Azure App Service to fail with 502.5 error pages.

Turns out hosting all your websites in a single S1 instance (while arguably cheap) is really bad for uptime. I guess you get what you pay for.

What troubles me is that I received no notifications from Azure about the downtime of any of my web apps or of my app service environment.

In fact when I looked at their respective app health they all looked like this:

[Image: platform health]

I'm totally OK with the app service reporting OK even though its CPU and RAM usage are at the limit, but I did not expect all web apps to report the same healthy status (after all, every HTTP request was returning 502.5).
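One way to avoid relying on the platform's own health status is to probe the sites from the outside and treat the actual HTTP response as the source of truth. The following is only a minimal sketch of that idea, not something taken from my setup: the function names `is_healthy_status` and `probe` are hypothetical, and the rule "2xx/3xx is healthy, everything else is down" is an assumption.

```python
import urllib.error
import urllib.request


def is_healthy_status(status_code: int) -> bool:
    """Assumed rule: any 2xx or 3xx response counts as healthy."""
    return 200 <= status_code < 400


def probe(url: str, timeout_seconds: float = 10.0) -> bool:
    """Request the URL once and report whether the site looks healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return is_healthy_status(response.status)
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx responses; a 502.5 error
        # page arrives here with the HTTP status code 502.
        return is_healthy_status(error.code)
    except OSError:
        # DNS failure, refused connection or timeout: count as down.
        return False
```

Run on a schedule from a machine outside the app service, a check like `probe("https://example.com/")` (placeholder URL) would have returned `False` for as long as the sites were answering with 502.5.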

I've now added multiple manual alerts (high CPU/RAM, too many 5xx errors/hour, ...) to work around this issue.
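Roughly, rules of this kind boil down to comparing a few metrics against thresholds. The sketch below illustrates that logic only: it does not use the Azure alerting API, the `Metrics` type and `triggered_alerts` function are hypothetical, and the threshold values are made up for illustration rather than the ones I configured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    """One hour of hypothetical metrics for an app service."""
    cpu_percent: float     # average CPU usage, 0-100
    memory_percent: float  # average RAM usage, 0-100
    http_5xx_count: int    # number of 5xx responses in that hour


# Illustrative thresholds only (assumptions, not real configuration).
CPU_LIMIT = 85.0
MEMORY_LIMIT = 85.0
HTTP_5XX_LIMIT = 10


def triggered_alerts(metrics: Metrics) -> List[str]:
    """Return the name of every alert rule the metrics violate."""
    alerts = []
    if metrics.cpu_percent > CPU_LIMIT:
        alerts.append("high CPU")
    if metrics.memory_percent > MEMORY_LIMIT:
        alerts.append("high RAM")
    if metrics.http_5xx_count > HTTP_5XX_LIMIT:
        alerts.append("too many 5xx errors")
    return alerts
```

With numbers like the ones from this outage (100% CPU, ~90% RAM and a steady stream of 502.5 responses), all three rules in this sketch would fire.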

tagged as Azure, .Net Core and Dev Ops