My application is getting H12 timeout errors and the following are true:
- Using Unicorn or Gunicorn for my web server
- Gunicorn logs show
[CRITICAL] WORKER TIMEOUT
- Unicorn logs show
Other possible symptoms:
- Increase in database connections
- Timeouts may correlate with restarts or dyno cycling
The below explanation applies to both Unicorn and Gunicorn web servers.
Your Unicorn master process is timing out and killing web workers. With (temporarily) fewer web workers, your app can't handle as many requests, hence the H12 timeouts.
Sometimes this can turn into a bigger problem during restarts. Forking new worker processes can be slow and expensive. If a worker takes too long to start up, or takes too long on its first request, Unicorn kills it and starts a new worker. This can result in a seemingly unexplained death spiral: workers are killed on startup, there aren't enough workers left to handle requests, requests time out, and workers are once again killed.
Since the workers are sent a SIGKILL (no SIGTERM) they aren't able to log any errors during this time, so most of the time you'll just see a SIGKILL in your logs when this happens and nothing else.
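To see why nothing gets logged, here is a minimal sketch of a parent process that SIGKILLs a slow child. It is not Unicorn's actual code; `run_worker_with_timeout`, `WORKER_TIMEOUT`, and the sleep durations are hypothetical names and values chosen for the demo. It relies on `fork`, so it assumes a Unix-like system.

```ruby
# Illustrative only: mimics a master that SIGKILLs a worker after a timeout.
WORKER_TIMEOUT = 1 # seconds (made-up value for the demo)

def run_worker_with_timeout(work_seconds, timeout)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.sync = true
    begin
      sleep work_seconds # simulate a slow boot or a slow first request
      writer.puts "worker: finished"
    ensure
      # Runs when the worker finishes normally; never runs after SIGKILL.
      writer.puts "worker: cleanup logged"
    end
    exit!(0)
  end
  writer.close

  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  status = nil
  loop do
    # Non-blocking check: returns nil while the worker is still running.
    result = Process.wait2(pid, Process::WNOHANG)
    if result
      status = result[1]
      break
    end
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
      Process.kill("KILL", pid) # SIGKILL cannot be trapped or rescued
      _, status = Process.wait2(pid)
      break
    end
    sleep 0.05
  end

  output = reader.read # everything the worker managed to write
  reader.close
  [status, output]
end

status, output = run_worker_with_timeout(5, WORKER_TIMEOUT)
puts "signaled=#{status.signaled?} output=#{output.inspect}"
```

The killed worker leaves no output at all, because even its `ensure` block is skipped. That matches the silent failure described above.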
Heroku and tools like New Relic are unaware of this happening, so it can seem like you're getting H12 timeout errors for no reason. It's an unfortunately harsh behavior of Unicorn servers.
How to mitigate the Unicorn problem
Make sure that you've set timeouts on all of your dependencies/initializers so you can reduce the amount of time it takes to start up a worker and ensure that nothing holds up the booting/forking process.
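As one example of a dependency timeout, the sketch below sets explicit connect and read timeouts on Ruby's standard `Net::HTTP` client. The helper name `build_http_client`, the URL, and the timeout values are assumptions for illustration; pick values that suit your own services.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: returns an HTTP client that gives up quickly instead of
# blocking worker boot or a request on a slow upstream service.
def build_http_client(url, open_timeout: 2, read_timeout: 5)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port) # does not connect yet
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = open_timeout # seconds to wait for the connection to open
  http.read_timeout = read_timeout # seconds to wait for each read
  http
end

client = build_http_client("https://example.com")
puts "open_timeout=#{client.open_timeout}s read_timeout=#{client.read_timeout}s"
```

Database drivers, cache clients, and other network libraries usually expose similar settings under their own names.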
Using Ruby? Try using the rack-timeout gem.
Instead of having Unicorn time out your requests, rack-timeout times them out inside the application, which allows the errors to bubble up to the application logs instead of dying silently with a SIGKILL.
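The idea of timing out inside the application can be sketched with a small Rack-style middleware built on Ruby's standard `Timeout` module. This is not how the gem itself is implemented, and `RequestTimeoutSketch` and its parameters are hypothetical names; in a real app you would install and configure the gem as its own documentation describes.

```ruby
require "timeout"
require "logger"

# Hypothetical middleware: illustrates in-app request timeouts only.
class RequestTimeoutSketch
  def initialize(app, seconds:, logger: Logger.new($stdout))
    @app = app
    @seconds = seconds
    @logger = logger
  end

  def call(env)
    Timeout.timeout(@seconds) { @app.call(env) }
  rescue Timeout::Error
    # The process is still alive here, so the failure can be logged.
    @logger.error("request exceeded #{@seconds}s: #{env['PATH_INFO']}")
    [503, { "content-type" => "text/plain" }, ["Request timed out\n"]]
  end
end

# Usage with a deliberately slow stand-in app.
slow_app = ->(_env) { sleep 2; [200, {}, ["ok"]] }
app = RequestTimeoutSketch.new(slow_app, seconds: 0.5)
response_status, _headers, _body = app.call("PATH_INFO" => "/slow")
puts "status=#{response_status}"
```

For this to help, the in-app timeout needs to be shorter than the web server's worker timeout, so the application reports the slow request before the master process steps in.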
You can also check out more timeout solutions for Ruby apps here.