Supabase Connection Limit: Cloud Run Outage Fix
Introduction: When Your Database Says "No More!"
Ever had that sinking feeling when your production application suddenly grinds to a halt? It's a developer's worst nightmare, and for one team, that nightmare involved Supabase's free plan connection limit being hit, causing a critical outage in their Cloud Run production environment. This isn't just a minor glitch; it's a full-blown production incident that requires immediate attention and a deep dive into the root cause. Our journey today is to unravel this mystery, understand why this happened, and chart a course to prevent it from ever happening again. We'll be exploring the intricate dance between Cloud Run's scaling capabilities and Supabase's resource constraints, particularly focusing on how a seemingly innocent task like resizing images can inadvertently bring down your entire system. Get ready to dive deep into database connections, Cloud Run configurations, and the critical importance of understanding your service's limits.
The Scenario: Image Resizing Triggered a Supabase Meltdown
Let's set the stage for this production drama. The core issue arose during a high-throughput image resizing job. This particular process, churning out approximately 500 requests per minute (or about 8.3 requests per second), needed to write its results to Supabase, which uses PostgreSQL under the hood. Normally, this wouldn't be a problem. However, during peak load, the system hit a hard wall: Supabase's maximum connection limit of 16. This meant new connection attempts were being rejected, leading to a cascade of errors and a complete failure of the resizing job's critical write operations. What's particularly puzzling is that the application, a NestJS app running on Cloud Run, was designed with database connection management in mind. The database client was initialized only once globally within the application's lifecycle, and crucially, the client pool size (max) was intentionally set to a lean 1. Furthermore, connections to Supabase were routed through PgBouncer, a connection pooler. This setup, in theory, should have prevented such a bottleneck. The fact that the development environment running the same process didn't exhibit this behavior strongly suggested that the difference lay in the Cloud Run configurations between production and development, specifically concerning settings like maximum instances and concurrency.
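To make the described setup concrete, here is a minimal sketch of what a "single global client with a pool size of 1" might look like in a NestJS provider. The module, token, and environment variable names are illustrative assumptions, not the team's actual code.

```typescript
// db.module.ts -- illustrative sketch of one globally shared pg Pool
// (names and env vars are assumptions, not taken from the real codebase).
import { Global, Module } from '@nestjs/common';
import { Pool } from 'pg';

export const PG_POOL = 'PG_POOL';

@Global() // registered once, available app-wide, so the factory runs a single time
@Module({
  providers: [
    {
      provide: PG_POOL,
      useFactory: () =>
        new Pool({
          // Connection string points at Supabase's PgBouncer (pooler) endpoint.
          connectionString: process.env.SUPABASE_POOLER_URL,
          max: 1, // one connection per Cloud Run instance, as described above
          idleTimeoutMillis: 30_000,
        }),
    },
  ],
  exports: [PG_POOL],
})
export class DbModule {}
```

With a setup like this, each process (and therefore each Cloud Run instance) holds at most one database connection, which is exactly why the incident points at the number of instances rather than the code inside any single one.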
The Nitty-Gritty: What Exactly Went Wrong?
When the image resizing job ramped up to approximately 500 requests per minute, it put a significant strain on the database. The real culprit, however, wasn't just the load itself, but how Cloud Run responded to it. It's believed that during these spikes, the production Cloud Run service scaled out quite aggressively, potentially reaching around 16 container instances. Now, remember that each of these containers was configured with a database client pool size of max = 1. When these instances connected to Supabase through PgBouncer, the total number of active connections effectively became number of instances × pool size per instance. In this scenario, with roughly 16 instances and a pool size of 1, the total connection count hit the Supabase free plan's limit of 16. Once this ceiling was reached, any new attempts to write data to the database failed, resulting in connection refusal or timeout errors. This meant the image resizing job, a crucial part of the workflow, was unable to complete its final write step, leading to the production incident.
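The arithmetic is simple, but it's worth spelling out. The sketch below just restates the numbers from the incident; the instance count is the estimated peak, not a measured figure.

```typescript
// Back-of-the-envelope check of the connection math from the incident.
const estimatedInstances = 16;      // Cloud Run containers at peak (estimate)
const poolMaxPerInstance = 1;       // pg Pool `max` configured in the app
const supabaseConnectionLimit = 16; // free-plan ceiling described above

const totalConnections = estimatedInstances * poolMaxPerInstance; // 16

if (totalConnections >= supabaseConnectionLimit) {
  console.warn(
    `At ${totalConnections} connections the budget is saturated; ` +
      'any additional instance or write attempt is refused or times out.',
  );
}
```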
Unraveling the Mystery: Potential Causes
Several factors could have contributed to this Supabase connection limit crisis. Let's break down the most likely culprits:
1. Unchecked Cloud Run Production Instance Scaling
This is perhaps the most probable cause. If the maximum number of instances allowed for the production Cloud Run service was not set or was set too high, Cloud Run could have spun up a large number of containers in response to traffic spikes. Each of these containers, even with a minimal pool size, adds to the total connection count. The development environment, likely with stricter instance limits, wouldn't have hit this ceiling, explaining the difference in behavior.
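One way to reason about a safe cap is to derive the maximum instance count from the connection budget and then apply that number as the service's maximum instances setting. The helper below is a hypothetical capacity-planning sketch, not part of any Cloud Run or Supabase API, and the headroom value is an assumption.

```typescript
// Hypothetical helper: derive a max-instances cap from the connection budget,
// leaving headroom for non-app clients (migrations, dashboards, ad-hoc queries).
function safeMaxInstances(
  connectionLimit: number,    // e.g. 16 on the Supabase free plan
  poolMaxPerInstance: number, // e.g. 1, as configured in the app
  reservedConnections = 4,    // headroom for other clients (assumption)
): number {
  return Math.max(
    1,
    Math.floor((connectionLimit - reservedConnections) / poolMaxPerInstance),
  );
}

console.log(safeMaxInstances(16, 1)); // => 12 instances at most
```

The resulting number would then be set as the production service's maximum instances in the Cloud Run configuration, so that scaling can never outrun the database's connection budget.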
2. Discrepancies in Cloud Run Concurrency Settings
Concurrency dictates how many requests a single Cloud Run instance can handle simultaneously. If the production environment had a lower concurrency setting than development, each instance could serve fewer requests at once, forcing Cloud Run to spin up more instances even for moderate traffic increases and directly contributing to the connection limit issue. Conversely, a higher concurrency in development allows a single instance to absorb more load, so fewer instances are needed overall.
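To see how concurrency drives instance count, a rough Little's Law estimate helps: the number of in-flight requests is roughly the request rate times the average request latency, and Cloud Run needs enough instances to cover that divided by the per-instance concurrency. The latency figure below is an assumption for illustration only.

```typescript
// Rough estimate of how many instances Cloud Run needs for the resizing load,
// using Little's Law: in-flight requests ≈ request rate × average latency.
const requestsPerSecond = 8.3; // ~500 requests/minute from the resizing job
const avgLatencySeconds = 2;   // assumed processing time per request

const inFlightRequests = requestsPerSecond * avgLatencySeconds; // ≈ 16.6

// Instances needed ≈ in-flight requests / concurrency per instance.
console.log(Math.ceil(inFlightRequests / 1));  // concurrency 1  -> ~17 instances
console.log(Math.ceil(inFlightRequests / 10)); // concurrency 10 -> 2 instances
console.log(Math.ceil(inFlightRequests / 80)); // concurrency 80 -> 1 instance
```

Under these assumed numbers, a concurrency of 1 lands almost exactly on the ~16 instances suspected in the incident, while a modestly higher concurrency keeps the fleet, and therefore the connection count, far below the Supabase limit.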
3. Accidental Multiple Database Client Initializations
While the intention was to initialize the database client only once globally, there's a small but non-zero chance that it was effectively initialized multiple times in production. This could happen due to complex module import cycles or misconfigurations in dependency injection scopes (like using REQUEST or TRANSIENT scopes incorrectly). If each of these initializations created its own client, and therefore its own connection pool, the real number of connections per instance would exceed the intended single connection, multiplying the total across all running instances.
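A common way this happens in NestJS is accidentally registering the client with request scope, which creates a fresh provider, and potentially a fresh pool, for every incoming request. The snippet below is a minimal illustration of the pitfall next to the safer default; class names are hypothetical and the Pool here relies on environment-based connection settings.

```typescript
import { Injectable, Scope } from '@nestjs/common';
import { Pool } from 'pg';

// Anti-pattern (illustrative): Scope.REQUEST means a new provider -- and a
// new Pool -- is created for every incoming request, multiplying connections.
@Injectable({ scope: Scope.REQUEST })
export class LeakyDbService {
  readonly pool = new Pool({ max: 1 });
}

// Safer default: singleton scope (the NestJS default), so the Pool is created
// once per process regardless of request volume.
@Injectable()
export class SharedDbService {
  readonly pool = new Pool({ max: 1 });
}
```

Auditing provider scopes and watching the connection count in Supabase during load is a quick way to confirm whether this failure mode is in play.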