Getting the hell out of Amazon Web Services

If you’re a regular reader of my blog you may recall that in June I made the decision to move my production environment to Amazon Web Services. This was because I no longer fancied running my servers (however small and unobtrusive) at home. At first everything went well – the migration to the cloud was seamless and without any issues.

Then after five months, disaster struck.

On Tuesday 10 October, my email server suddenly fell off the face of the Earth. After logging into the AWS Console, I could see one of my servers was working well, but the other had failed both the system status and instance checks and was unresponsive. Restarting the instance didn’t solve anything, so after twenty minutes of the system status check failing, a button appeared enabling me to log an issue with AWS Support.

Whilst that was being dealt with, I stopped and then started the instance (thereby landing on a different host), but that made no difference. Deploying new instances from the same AMI didn’t help either – they wouldn’t boot. Nor would any other NetBSD AMI I could get my hands on – every one gave the same result.

It was starting to look like something had changed in AWS’ infrastructure, and it had crippled every NetBSD AMI available.

Shortly after, AWS Support came back with a less than helpful response. Despite crafting some clear and concise questions, they dodged every single one:

Me: Is there an explanation as to why my instance failed?
Support: AWS Community AMIs are not supported.
Me: Do we know why monitoring failed to detect multiple instances using the same AMI had crashed?
Support: AWS Community AMIs are not supported.
Me: Has anything changed on the underlying infrastructure that could have potentially caused this issue?
Support: AWS Community AMIs are not supported.

Whilst I fully expected AWS Support to hide behind the “Community AMIs are not supported” excuse at some point, I had no idea this would be all I got back.

AWS Support in action

Shortly after, my Apache box crashed too. In over twenty years of running my own servers, sixteen of them on NetBSD, I had not had a single failure. Now, in less than six months with AWS, I had lost both boxes.

Monitoring

Whilst I fully understand why community AMIs aren’t supported, I’m disappointed that the failing system status checks weren’t picked up on. Whilst NetBSD is rare on any estate, even in the Cloud, this issue has clearly affected every user of the OS on the platform. If I were monitoring the estate, I would like to think I would have included some logic so that if multiple instances begin to fail and they’re all running the same OS, an alert is generated.

Clearly this didn’t happen.
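The correlation logic described above can be sketched in a few lines. This is a hypothetical illustration – the instance records, their `os` field, and the threshold are all assumptions, not any real AWS monitoring API:

```python
# Hypothetical sketch of the correlated-failure alert described above.
# The instance records and their "os" field are assumptions, not a real AWS API.
from collections import Counter

def correlated_os_failures(failed_instances, threshold=2):
    """Given failed instances (each a dict with an 'os' key), return the set of
    operating systems with `threshold` or more failures - these warrant a
    platform-level alert rather than individual per-instance tickets."""
    counts = Counter(inst["os"] for inst in failed_instances)
    return {os_name for os_name, n in counts.items() if n >= threshold}

# Example: three NetBSD boxes down at once, one unrelated Linux failure.
failures = [
    {"id": "i-0a", "os": "NetBSD"},
    {"id": "i-0b", "os": "NetBSD"},
    {"id": "i-0c", "os": "NetBSD"},
    {"id": "i-0d", "os": "Linux"},
]
print(correlated_os_failures(failures))  # {'NetBSD'}
```

A single Linux failure stays below the threshold and is left as an isolated incident; three NetBSD failures trip the platform-level alert.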

Automation

All systems need to be patched, and cloudy hosts are no different. However, it’s imperative that thorough testing is done wherever possible. AWS have clearly taken the line that supported AMIs are tested, and community ones can get stuffed.

Apparently there are a few thousand AMIs, and no matter which provider you’re working with, it’s unrealistic to expect every one to be tested manually. But while you can’t expect your two-bit hosting provider on the High Street to test every component to death, larger players (such as AWS and Azure) have no such excuse. They have greater resources than anyone else, and I cannot find a compelling reason why all AMIs, supported or otherwise, couldn’t be tested by an automated pipeline. All it has to do is spin an instance up, check it stays alive for more than two minutes and has network connectivity, then tear it down.
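The pass/fail decision in such a pipeline is trivial. A minimal sketch, assuming a response shaped like EC2’s instance status checks – the launch/terminate driver (e.g. via boto3) is deliberately omitted, and the payload shapes are simplified assumptions:

```python
# Minimal sketch of the boot-test decision in the pipeline described above.
# The status dict mimics the shape of EC2's instance status checks; launching
# and terminating the test instance is omitted.
def ami_smoke_test_passed(status):
    """Return True only if both the system and instance status checks are 'ok'."""
    return (
        status.get("SystemStatus", {}).get("Status") == "ok"
        and status.get("InstanceStatus", {}).get("Status") == "ok"
    )

# Example payloads: a healthy boot versus a wedged one like my NetBSD boxes.
healthy = {"SystemStatus": {"Status": "ok"}, "InstanceStatus": {"Status": "ok"}}
wedged = {"SystemStatus": {"Status": "impaired"}, "InstanceStatus": {"Status": "ok"}}
print(ami_smoke_test_passed(healthy))  # True
print(ami_smoke_test_passed(wedged))   # False
```

Run that against every AMI in the catalogue on a schedule and the NetBSD breakage would have shown up the moment the underlying infrastructure changed.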

If we can continually churn machines and test them with vRealize Automation here in HobbitCloud, I don’t see why AWS can’t do the same.

Running a community AMI is like running your home lab on non-HCL kit, with no power-loss prevention and no backups – all with a company who has less than zero interest in supporting you. In other words, like playing Russian Roulette.

With a Glock 17.

Moving Forward

Thankfully, when I moved to the cloud I designed for site failure, so all is not lost. The secondary site runs Postfix and is configured as a backup MX (a higher preference value, so lower priority), and mail will continue to queue there until the primary is back online.
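For anyone unfamiliar with backup MX setups, the DNS side looks something like this – a hypothetical zone fragment with placeholder hostnames, not my actual records:

```
; Hypothetical zone fragment - hostnames are placeholders.
; The lower preference value (10) is tried first; when the primary is
; unreachable, sending servers fall back to the backup (20), which
; queues mail until the primary returns.
example.com.  IN  MX  10  mail-primary.example.com.
example.com.  IN  MX  20  mail-secondary.example.com.
```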

All Dovecot mailboxes are also replicated to the secondary site, so when I build a new primary I will simply reverse the replication flow.
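The replication side is driven by Dovecot’s replication plugin. A sketch of the relevant configuration, with a placeholder hostname – the full setup (doveadm replicator service, authentication between the sites) is omitted:

```
# Hypothetical dovecot.conf fragment; the hostname is a placeholder.
# Each site points mail_replica at the other; rebuilding the primary
# just means swapping which side this points to.
mail_plugins = $mail_plugins notify replication

plugin {
  mail_replica = tcp:mail-secondary.example.com
}
```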

This whole experience has put me off running mission-critical workloads in the Cloud. Whilst I fully acknowledge the NetBSD AMIs are unsupported, I would have at least liked an explanation from AWS as to why the issue occurred. If they can’t even take a few seconds to help me understand why, what faith do I have that they will give me the support I need should I decide to stay – albeit on a supported AMI?

Which just leaves the question – where do I move to?

One thought on “Getting the hell out of Amazon Web Services”

  1. I’m still glad I run my home lab at home. I signed up for a free year of Office 365 but still haven’t used it yet…and that’s for a grand cost of zero. I may be running a lab at home for fun/learning but I still don’t trust my servers/VMs running on some shared/public cloud infrastructure.

    Your experience with the “support” you received is exactly one of the reasons I don’t use the Cloud. At least if I have issues at home it’s due to me, and I can fix them and learn from the experience.

    Count me out of any public cloud!

    Keen to see where you go from here. I was considering Azure at one stage to get some experience but would only use it for testing and not my prod servers.

