I have been working with AWS for many years, and the service has been outstanding throughout. A few months ago I started migrating yet another business to AWS, and we had an incident. It made me wonder whether AWS is no longer as good as it used to be.
I would like to share the postmortem report with the community and hear what you think. I would like to know whether we made a fundamental mistake or whether AWS is actually degrading.
All times are UTC. Personal opinions have been removed from the report; only facts are stated.
---------- POSTMORTEM REPORT
The project consists of moving several services to AWS. The system consists of services in one autoscaling group and a PostgreSQL database in RDS.
- Sunday 4.30 am: we migrate the PostgreSQL database to RDS. RDS is configured with 200 GB of disk; the database size is 15 GB.
- Sunday 10.17 am: RDS detects that we are running out of space and decides to grow the disk from 200 GB to 999 GB. The RDS auto scaling event starts.
At this point the performance of the database is degraded. Alerts are triggered.
A test performed from the VPC network with the query "SELECT now()" took 20 seconds and 248 milliseconds.
The database performance is so bad that many of the services go down.
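For anyone who wants to repeat the "SELECT now()" test mentioned above, here is a minimal Python sketch. It is not the tool we used during the incident. It assumes the third-party psycopg2 driver is installed, and the PG_DSN environment variable is a hypothetical name chosen for this example. Note that the measured time includes opening the connection, which may differ from how the original test was timed.

```python
import os
import time


def time_call(fn):
    """Run fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def select_now(dsn):
    """Connect, run SELECT now(), and return the server timestamp."""
    import psycopg2  # third-party driver, assumed to be installed

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT now()")
            return cur.fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    # PG_DSN is a hypothetical variable name, e.g.
    # "host=... dbname=... user=... password=..."
    dsn = os.environ.get("PG_DSN")
    if dsn:
        server_time, elapsed = time_call(lambda: select_now(dsn))
        print(f"server time {server_time}, round trip {elapsed:.3f} s")
    else:
        print("Set PG_DSN to run the latency check.")
```

Run from a host inside the VPC, a healthy database should answer this in milliseconds, which is what makes a 20-second result stand out.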
- The RDS auto scaling event finished at 14:39.
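For reference, the length of the auto scaling event follows from the two timestamps in the timeline. The small sketch below does the arithmetic, assuming both times fall on the same Sunday (UTC); the function name is made up for this example.

```python
from datetime import datetime, timedelta


def event_duration(start_hhmm, end_hhmm):
    """Return the time between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    start = datetime.strptime(start_hhmm, fmt)
    end = datetime.strptime(end_hhmm, fmt)
    return end - start


if __name__ == "__main__":
    # Event start (10.17 am) and end (14:39), taken from the timeline.
    print(event_duration("10:17", "14:39"))  # prints 4:22:00
```

That is 4 hours and 22 minutes of degraded performance.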
After contacting AWS Support (see details below) we decided to roll back.
We contacted AWS Support (Business). The most important points in the transcript are:
- The fact that RDS decided to grow the disk from 200 GB to 999 GB when the actual database size is 15 GB is not a problem.
- Performance degradation is expected while an auto scaling event is in progress. As "sessions"(1) are not being dropped, AWS considers the database online, so it is working as expected.
- We pointed out the example of a SELECT now() taking 20 seconds. This did not change the position that the database is online and all is good.
- When asked for an estimate of the duration of the event, Support stated that it could take "from several minutes to several days".
- The objective of AWS Support is to communicate what is happening (which implies that you should not expect them to help you fix the actual problem).
-------
Opinions?