This is an archive of the discontinued Mercurial Phabricator instance.

ci: retry expired spot instance requests
Changes PlannedPublic

Authored by indygreg on Sep 30 2019, 11:57 PM.

Details

Reviewers
None
Group Reviewers
hg-reviewers
Summary

EC2 spot instance requests can fail. Often the root cause is no
availability of spot instances in the given availability zone.

This commit teaches our CI system to recognize when a spot instance
request has failed and to retry it.

To do this, we create a pair of new Lambda functions, "spot instance
request monitor" and "start job." The former is invoked periodically
via CloudWatch Event. It scans all jobs waiting on a spot instance
request. If the spot instance request is cancelled, it calls into
the "start job" function, which simply calls an internal API for
trying to start a job given a job ID.

Our strategy for retrying spot instance requests is to try to use
the "next" availability zone for the given region. We will cycle
through availability zones until a spot instance request is
acted on. In my testing, this always works. Although it could
take several minutes to cycle through all availability zones and
for an availability zone to have instance availability.

There are definitely other strategies we could try. For example,
we could try another AWS region. Or we could launch an on-demand
instance after N failures. Or we could launch a difference EC2
instance type - one with hopefully more availability. Or we could
use spot fleet requests (which allow launching from multiple
instance types). The easiest is probably to launch an on-demand
instance. But until we need this functionality, I'm inclined to
not build it, as I like the consistency of always using spot
instances.

Diff Detail

Repository
rHG Mercurial
Lint
Lint Skipped
Unit
Unit Tests Skipped