Issue
Connections to external services, such as APIs or other websites, Postgres, Redis, or externally hosted data services, are timing out or being closed unexpectedly.
Resolution
Connections to external services may encounter difficulty for a number of reasons:
If an app's dynos are on a multi-tenant tier (Free, Eco, Basic (formerly Hobby), Standard-1X, Standard-2X), other dynos on the same host may be consuming more resources than usual, which reduces performance for every dyno on that host through CPU or other resource contention. Note that it is extremely unlikely that more than one dyno from a single app will be on the same host at the same time.
Dynos on multi-tenant hosts are allowed to "burst" to higher levels of performance when there are unused resources to spare, but when many dynos are active at the same time, they all experience lower performance. CPU time is then divided according to dyno size and allotted CPU share. In addition to these multi-tenancy issues, a dyno on a multi-tenant host can also be affected by the same problems as single-tenant dynos.
On single-tenant dynos (Performance dynos and all Private Space types), these errors have different causes. Applications with high concurrency, such as too many threads per process, can exhaust available resources. Many threads working under high load at the same time may cause a CPU backlog, so a thread cannot respond to network traffic before a timeout is reached. That timeout may be the limit set by the Heroku Router, a protocol-specific timeout (DNS: 2 seconds, Postgres: 30 seconds, Redis: 300 seconds), or a timeout set on a given operation within the app's code.
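One way to keep thread counts from outgrowing the available CPU is to run work on a fixed-size pool and put an explicit timeout on each wait. The following is a minimal sketch in Python, not a Heroku-provided API: the names run_bounded and MAX_WORKERS, the cap of 8 workers, and the 5-second timeout are all assumptions chosen for illustration and should be tuned against the dyno size and measured load.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical cap: at most 8 threads, or 2 per visible CPU if that is lower.
MAX_WORKERS = min(8, (os.cpu_count() or 1) * 2)


def run_bounded(tasks, per_task_timeout=5.0, max_workers=MAX_WORKERS):
    """Run zero-argument callables on a fixed-size thread pool.

    Returns (results, timed_out): results maps task index to return value,
    timed_out lists the indexes whose result was not ready in time.
    """
    results = {}
    timed_out = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        for index, future in enumerate(futures):
            try:
                results[index] = future.result(timeout=per_task_timeout)
            except FutureTimeout:
                # The wait gave up; the underlying thread is not interrupted
                # and still finishes before the pool shuts down.
                timed_out.append(index)
    return results, timed_out
```

Because the pool size is fixed, extra work queues up instead of spawning more threads that compete for the same CPU. The timeout only bounds how long the caller waits for a result; the operation itself still needs its own timeout (for example on the socket or client library) to stop early.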
It's also possible that requests routed across the internet are affected by problems that cannot be specifically diagnosed. To make apps more resilient, put error handling in place for all external requests, with at least one retry. If timeout limits permit, additional retries can be performed; after 2 or more failures, each retry should preferably be preceded by a random wait (sleep) of between 0 and 3 seconds. If a task must retry until it succeeds, we recommend using an exponential backoff for each additional retry, capped at a reasonable maximum wait such as 3 or 5 minutes.
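The retry guidance above can be sketched as follows. This is an illustrative Python example under stated assumptions, not an official implementation: the function names (call_with_retries, backoff_delays, retry_until_success), the default of 3 attempts, the 1-second base delay, the doubling factor, and the choice of ConnectionError and TimeoutError as retryable errors are all hypothetical and should be adapted to the client library in use.

```python
import random
import time

RETRYABLE = (ConnectionError, TimeoutError)  # assumed set of transient errors


def call_with_retries(operation, max_attempts=3, max_jitter=3.0,
                      retry_on=RETRYABLE, sleep=time.sleep):
    """Call operation(); retry on failure up to max_attempts total calls.

    The first retry is immediate. After 2 or more failures, wait a random
    0 to max_jitter seconds before trying again.
    """
    failures = 0
    while True:
        try:
            return operation()
        except retry_on:
            failures += 1
            if failures >= max_attempts:
                raise  # out of attempts: surface the last error
            if failures >= 2:
                sleep(random.uniform(0, max_jitter))


def backoff_delays(base=1.0, factor=2.0, cap=300.0):
    """Yield exponentially growing delays, never above cap (5 minutes here)."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def retry_until_success(operation, retry_on=RETRYABLE, sleep=time.sleep,
                        **backoff_kwargs):
    """Keep calling operation() until it succeeds, backing off between tries."""
    for delay in backoff_delays(**backoff_kwargs):
        try:
            return operation()
        except retry_on:
            sleep(delay)
```

A call might look like call_with_retries(lambda: fetch_report(url)), where fetch_report is a placeholder for the app's own request function. The sleep parameter exists only so the wait can be replaced in tests; in production the default time.sleep applies.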