EC2 spot instance requests can fail. Often the root cause is no
availability of spot instances in the given availability zone.
This commit teaches our CI system to recognize when a spot instance
request has failed and to retry it.
To do this, we create a pair of new Lambda functions, "spot instance
request monitor" and "start job." The former is invoked periodically
via CloudWatch Event. It scans all jobs waiting on a spot instance
request. If the spot instance request is cancelled, it calls into
the "start job" function, which simply calls an internal API for
trying to start a job given a job ID.
Our strategy for retrying spot instance requests is to try to use
the "next" availability zone for the given region. We will cycle
through availability zones until a spot instance request is
acted on. In my testing, this always works. Although it could
take several minutes to cycle through all availability zones and
for an availability zone to have instance availability.
There are definitely other strategies we could try. For example,
we could try another AWS region. Or we could launch an on-demand
instance after N failures. Or we could launch a difference EC2
instance type - one with hopefully more availability. Or we could
use spot fleet requests (which allow launching from multiple
instance types). The easiest is probably to launch an on-demand
instance. But until we need this functionality, I'm inclined to
not build it, as I like the consistency of always using spot
instances.