The dreaded CREATE_FAILED message is an all too common source of frustration when deploying new stacks with CloudFormation. The AWS Console shows you which component in your stack has failed, but if you rely heavily on metadata and userdata you will more often than not get only a wait condition timeout error, which gives no indication of what has actually gone wrong under the covers.
The good news is that there are some tips and tricks for troubleshooting CloudFormation stack failures. Some revolve around CLI switches, some around knowing a bit more about the CloudFormation internals, and others around knowing where specific scripts live on your typical EC2 instance. This post documents a few approaches to troubleshooting CloudFormation stack errors and aims to help the reader take a (somewhat) structured approach to troubleshooting wait condition timeouts.
So your stack has failed. There are two typical scenarios:
I’m assuming that you’re running the create-stack operation using the AWS CLI. The first step is to disable the stack rollback so that we can log into our failed instance and start troubleshooting. If the EC2 instance has failed because it was passed an incompatible subnet or availability zone, CloudFormation will not even attempt to initialize the instance. This error needs to be fixed in the JSON template or by amending your stack parameters. If it’s a wait condition timeout, it’s likely that the instance is up and running and has a working network connection.
1. Stop the stack from rolling back by appending the --disable-rollback switch to your create-stack command.
aws cloudformation create-stack --stack-name myStack --template-body file:///myStack.json --parameters file:///myStackParams.json --disable-rollback
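Once the stack has failed with rollback disabled, the failing resource can also be listed from the CLI rather than the Console. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the function name failed_events is made up for illustration, and myStack is the example stack name used in this post.

```shell
# Hypothetical helper: list resources that reported CREATE_FAILED, with the reason.
# Assumes the AWS CLI is installed and has credentials for the stack's account and region.
failed_events() {
  local stack_name="$1"
  aws cloudformation describe-stack-events \
    --stack-name "$stack_name" \
    --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
    --output text
}
```

Calling failed_events myStack should print one line per failed resource. For a wait condition timeout the reason will usually not say much more than the Console does, which is why the steps below move onto the instance itself.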
2. Find the IP address of your instance in the EC2 Console and SSH into it. For easier access you can include the IP address as an output in your CloudFormation template and view it in the Outputs tab of the CloudFormation console.
ssh -i myInstanceKey.pem ec2-user@53.x.x.x
3. The first place to check is the cfn-init log. Check here for any obvious failures.
/var/log/cfn-init.log
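On a long log it can help to filter for lines that look like failures before reading the whole file. This is only a sketch: the function name scan_cfn_log is made up, and the search pattern is my guess at common failure wording rather than an official list of cfn-init messages.

```shell
# Hypothetical helper: show the last 20 lines of the log that look like failures.
# The pattern below is an assumption, not an exhaustive list of cfn-init errors.
scan_cfn_log() {
  local log_file="${1:-/var/log/cfn-init.log}"
  grep -inE 'error|fail|traceback|exception' "$log_file" | tail -n 20
}
```

Run scan_cfn_log with no arguments to use the default path, or pass another log file as the first argument. Each match is prefixed with its line number so you can jump to the surrounding context.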
If no obvious errors are found, let’s move on and check our userdata and metadata.
4. View the contents of the userdata script.
cat /var/lib/cloud/instance/scripts/part-001
Userdata is stored in a script. It includes any custom shell commands you wish to run along with your cfn-init and cfn-signal operations.
Find your cfn-init command (but don’t run it). It should look similar to the following:
/opt/aws/bin/cfn-init -s myStack -r myInstance --region ap-northeast-1
** Note that if using the new cn-north-1 region you need to append the “-u” switch to your cfn-init command, as cfn-init does not automatically find the Beijing region’s CloudFormation endpoint.
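If the userdata script is long, the cfn-init line can be pulled out with grep rather than read by eye. A small sketch, assuming the script lives at the path shown in step 4; the function name find_cfn_init is made up for illustration.

```shell
# Hypothetical helper: print the cfn-init invocation(s) from the userdata script
# without executing anything. The default path is the one used in step 4.
find_cfn_init() {
  local script="${1:-/var/lib/cloud/instance/scripts/part-001}"
  grep -n 'cfn-init' "$script"
}
```

The output gives you the stack name, resource name and region arguments to reuse in the next step.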
5. Taking the arguments shown above, run the cfn-get-metadata command. It will query the CloudFormation endpoint and allow us to check that our metadata is formatted correctly.
/opt/aws/bin/cfn-get-metadata -s myStack -r myInstance --region ap-northeast-1
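A quick way to confirm that the returned metadata is at least well-formed JSON is to pipe it through a parser. The sketch below assumes a Python interpreter is present on the instance (python3 or python) and uses its standard json.tool module; the function name check_metadata_json is made up.

```shell
# Hypothetical helper: read JSON on stdin and succeed only if it parses.
# Assumes python3 or python is on the PATH; returns 2 if neither is found.
check_metadata_json() {
  local py
  py="$(command -v python3 || command -v python)" || return 2
  "$py" -m json.tool > /dev/null 2>&1
}
```

For example, /opt/aws/bin/cfn-get-metadata -s myStack -r myInstance --region ap-northeast-1 | check_metadata_json && echo "metadata parses". This only checks syntax. A wrong bucket name or package version will still parse cleanly, so the values need to be read by eye as well.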
Often a parameter may be incorrect or a variable badly formatted. This can result in a garbled URL or package name and consequently a command timeout. Here is an example of our metadata. Be aware that if you use IAM Roles (which I strongly recommend over IAM Users) you need to include the Authentication component, which allows the instance access to any buckets or other AWS services you require. As shown below, our cfn-get-metadata output confirms that our Authentication is set up correctly and we can see our manifests bucket.
{