Understanding Heroku Postgres Out of Memory Errors

Issue

As a user of Heroku Postgres, I am getting out of memory errors, but metrics are still showing memory in cached and free. I'd like to understand why that is.

Resolution

Postgres out-of-memory (OOM) errors can be surprising, especially when there is still memory available in the memory-cached. This happens is due to a combination of VM settings in the server hosting your Heroku Postgres database, and Postgres' preference for contiguous memory allocation. Smaller Postgres plans may run out of memory with half of their memory still in memory-cached. This does not mean that Postgres is not using this memory, as it leverages the OS cache (a majority of memory-cached) to improve query performance. However, Postgres is prevented from using this cached memory directly for things like creating new connections or complex querying that use work_mem.

Understanding Postgres Memory Pressure and Out of Memory Errors

In order to monitor memory pressure on your server, track the amount of memory available to Postgres and other server processes. The two most important metrics for doing so, memory-free and memory-cached, are available through the Heroku Postgres Metrics Logs.

Server memory is a valuable resource, so the operating system uses all available memory rather than letting it sit idle. It grants memory to processes that request it, or allocates it to memory-cached to improve overall performance. Because of this, it is not uncommon to see low numbers for the memory-free metric. These low numbers don't necessarily indicate that your server is low on memory, so in order to understand if there is memory contention we must look at another metric, memory-cached.

memory-cached represents the amount of memory held in the operating system cache. It is common to see large portions of your total-memory being held by memory-cached. This is a feature, not a bug. It means that Postgres processes don't currently need additional memory and the operating system is using it to optimize read and write performance.

In the event that Postgres needs additional memory, the operating system will begin to free cached memory. When this happens you should see an upward spike in the memory-free metric as well as a similar downward spike in utilization for the memory-cached metric.

The operating system is not able to free up memory indefinitely though. After certain limits are reached the OS will no longer be able to move memory from the memory-cached bucket into the memory-free bucket. This is why you may see smaller Postgres instances run out of memory with close to half of their memory in the memory-cached bucket.

When this happens your Postgres instance will begin reporting out of memory (OOM) errors. These errors, as you would expect, indicate that the Postgres processes are no longer able to obtain the memory from memory-free that they are requesting.

If this happens frequently, consider upgrading your Heroku Postgres plan to get more memory.

What is the best way to monitor memory pressure on an instance?

The best way to monitor memory pressure on a Postgres instance is to monitor the memory-free and memory-cached metrics. As discussed above, when a Postgres process requests more memory, the operating system will release cached memory to free memory where it can then be consumed by the process requesting it. When the operating system can no longer release cached memory and there is very little memory available in memory-free, Postgres will throw "out of memory" errors. This can begin to happen when memory-cached + memory-free drops below ~50% of the databases memory-total. On smaller plans this may be skewed closer to 60% and on larger plans closer to 40%.

Why is my Postgres server out of memory with memory still available in memory-free and memory-cached?

Due to configuration (overcommit_ratio) on the server meant to protect Postgres from being killed unexpectedly, not all of the memory in the memory-cached is available to be used by Postgres processes directly.

The reported cached memory still benefits Postgres processes due to way that Postgres relies on the operating system's page cache when its own shared buffers cache is full.

The Role of Committed Memory

When Postgres processes report out of memory errors even when a large amount of memory is available to the server (memory-cached + memory_free), the underlying cause is generally committed memory. Committed memory is memory that the kernel has promised to processes, even if they haven't used it yet (and therefore isn't visible through the normal usage metrics).

Heroku Postgres servers on classic plans generally set the overcommit ratio to 80% to prevent the server from running out of memory and impacting Postgres availability. When the server's committed memory reaches 80% of physical RAM, the kernel refuses new allocations even if memory-cached + memory-free show available memory.

This is why OOM errors can happen when a large portion of memory appears available in cache: the database server hits the committed memory limit, but not the physical memory limit. The overcommit setting protects Postgres from being killed by the OOM killer, which would cause downtime.

To reduce committed memory pressure: use connection pooling to minimize active Postgres backends (each connection commits memory even when idle), lower the work_mem setting to reduce per-query memory commitments for sorts and hash operations, and consider upgrading to a larger plan if optimization alone doesn't provide enough headroom.

Ask on Stack Overflow

Engage with a community of passionate experts to get the answers you need

Ask on Stack Overflow

Heroku Support

Create a support ticket and our support experts will get back to you

Contact Heroku Support