Fixing AWS Lambda VPC cold start timeouts under load
Lambda with VPC config hits 15+ sec cold starts when scaling. Real fix: use hyperplane ENI attachments and switch to SDK v3 with keep-alive.
Quick answer (for the impatient)
Switch your Lambda function to AWS SDK v3 and enable keep-alive on the Node.js HTTP handler (keepAlive on the HTTPS agent). That alone drops cold start time from 15s to under 2s in most VPC setups. If you're still stuck, work through the steps below.
What's actually happening here
You wrote a Lambda that talks to a database inside a VPC — maybe RDS, maybe ElastiCache. Works fine at low traffic. Then you get a traffic spike, new concurrent invocations fire up, and suddenly half of them time out after 15, 20, even 30 seconds. The errors look like a generic timeout, but your function code isn't slow. The problem isn't your logic. It's the cold start itself being murdered by the VPC networking stack.
Here's the root cause. Lambda needs an Elastic Network Interface (ENI) attached to your VPC to get an IP and route traffic. When a new execution environment spins up (that's a cold start), Lambda's orchestrator asks the VPC to create an ENI for it. This is not fast: it takes several round trips through the AWS control plane, and under concurrency pressure those requests queue up. Lambda waits on the ENI attachment before it even runs handler(). So your function's billed duration starts only after the ENI is ready, but the client calling Lambda sees a timeout because the total wall-clock time, ENI wait included, exceeds the caller's own timeout.
The real kicker: the default HTTP agent in SDK v2 for Node.js opens a new TCP connection for every request to AWS services (like Secrets Manager or DynamoDB) unless you turn on connection reuse. That adds another 100–300ms per call on top of the ENI delay. With multiple SDK calls in cold start, you're easily past 10 seconds.
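To make that arithmetic concrete, here is a rough sketch of how the delays stack up. The function name estimateColdStartMs and every input number are illustrative assumptions, not measurements, and the sketch assumes all SDK calls go to the same endpoint, so connection reuse leaves a single handshake.

```javascript
// Rough cold start budget: ENI wait plus one connection setup per SDK call
// when connections are not reused. All names and numbers are illustrative.
function estimateColdStartMs({ eniDelayMs, sdkCalls, connectMs, keepAlive }) {
  // With keep-alive, only the first call to the endpoint pays the handshake.
  const handshakes = keepAlive ? Math.min(sdkCalls, 1) : sdkCalls;
  return eniDelayMs + handshakes * connectMs;
}

const inputs = { eniDelayMs: 8000, sdkCalls: 8, connectMs: 300 };
const withoutReuse = estimateColdStartMs({ ...inputs, keepAlive: false });
const withReuse = estimateColdStartMs({ ...inputs, keepAlive: true });

console.log({ withoutReuse, withReuse });
```

Swap in your own measured ENI wait and per-call connection time. The point is only that per-call handshakes add up linearly with the number of SDK calls in the cold start path.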
The fix: step by step
- Upgrade to AWS SDK v3 for JavaScript. If you're on Node.js 18+, the Lambda runtime already bundles it. The v2 SDK is end-of-life and its default HTTP agent doesn't enable keep-alive by default. Run npm install @aws-sdk/client-dynamodb @aws-sdk/lib-dynamodb @smithy/node-http-handler (or whatever services you use).
- Explicitly enable TCP keep-alive. In SDK v3, you configure this via the requestHandler option. Example for DynamoDB:

      import https from 'node:https';
      import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
      import { NodeHttpHandler } from '@smithy/node-http-handler';

      // Create the client at module scope so warm invocations reuse it.
      const client = new DynamoDBClient({
        requestHandler: new NodeHttpHandler({
          connectionTimeout: 5000, // ms allowed to establish a connection
          requestTimeout: 5000,    // ms to wait for a response before aborting
          // keepAlive and maxSockets are Agent options, so they go on the HTTPS agent.
          httpsAgent: new https.Agent({ keepAlive: true, maxSockets: 50 })
        })
      });

  This reuses TCP connections across warm invocations, that is, whenever the execution environment is reused. A true cold start still pays for the first connection. The maxSockets of 50 lets concurrent SDK calls share the same connection pool.
- Reduce ENI creation delay with Hyperplane ENIs. AWS's newer Hyperplane ENI attachment mode is faster because network interfaces are set up ahead of time instead of during the cold start. You don't manually configure this; it's enabled automatically for newer Lambda functions in a VPC. But you must ensure your subnets have enough free IPs for your max concurrency. As a conservative rule, budget one IP per concurrent execution. If your subnet is a /24 (256 addresses, 251 usable after the 5 that AWS reserves) but you're scaling to 300 concurrent, Lambda falls back to the slow path. Check your subnet CIDR and size it for peak concurrency + 20% buffer.
- Set a realistic reservedConcurrency on the function that matches your RDS connection pool size. If your database only handles 50 connections, cap Lambda to 50 concurrent. Otherwise you'll have timeouts from the DB side, which look identical to cold start timeouts. This isn't a Lambda fix per se, but it's the most common misdiagnosis I see.
- Use Provisioned Concurrency for the lowest-latency use cases. Yes, you pay for idle. But if you need sub-second response under load, provision 10–20% of your expected peak. Lambda keeps those environments warm, ENI already attached. Downside: you must adjust the provisioned count as traffic patterns shift, either by hand or with Application Auto Scaling. I use this only for critical APIs with known traffic floors.
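The sizing rules in the last three steps can be sketched as a small planning helper. This is a minimal sketch, not an AWS API: usableIps and planCapacity are hypothetical names, and it applies this article's rules of thumb (one IP per concurrent execution plus a 20% buffer, reserved concurrency capped at the database connection limit, 10–20% of peak as provisioned concurrency). The 5 reserved addresses per subnet and the /16 to /28 subnet size range are standard VPC limits.

```javascript
// AWS reserves 5 addresses in every subnet (the first four and the last one).
const AWS_RESERVED_IPS_PER_SUBNET = 5;

// Usable addresses for a subnet given its CIDR prefix length, e.g. 24 for a /24.
function usableIps(prefixLength) {
  if (!Number.isInteger(prefixLength) || prefixLength < 16 || prefixLength > 28) {
    throw new RangeError('VPC subnets must be between /16 and /28');
  }
  return 2 ** (32 - prefixLength) - AWS_RESERVED_IPS_PER_SUBNET;
}

// Applies the rules of thumb from the steps above. Integer math avoids
// floating-point surprises in the percentage calculations.
function planCapacity({ subnetPrefixes, peakConcurrency, dbMaxConnections }) {
  const availableIps = subnetPrefixes.reduce((sum, p) => sum + usableIps(p), 0);
  const requiredIps = Math.ceil((peakConcurrency * 120) / 100); // peak + 20%
  return {
    availableIps,
    requiredIps,
    subnetsLargeEnough: availableIps >= requiredIps,
    reservedConcurrency: Math.min(peakConcurrency, dbMaxConnections),
    provisionedConcurrencyRange: [
      Math.ceil((peakConcurrency * 10) / 100),
      Math.ceil((peakConcurrency * 20) / 100),
    ],
  };
}

const plan = planCapacity({ subnetPrefixes: [24], peakConcurrency: 300, dbMaxConnections: 50 });
console.log(plan);
```

With a single /24 and a 300-concurrency peak, the plan flags the subnet as too small and caps reserved concurrency at the database limit, which is the same conclusion the steps reach in prose.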
Alternative fixes if the above doesn't work
Sometimes the SDK upgrade alone doesn't cut it because your cold start code hits something else slow. Try these:
- Switch runtime to Python 3.12+ or Go. Python's boto3 has pooled HTTP connections for years, but TCP keep-alive is off unless you set config = Config(tcp_keepalive=True). Go's SDK v2 builds on net/http, which pools and reuses connections by default, so repeated calls skip the TCP handshake. I've seen Go Lambdas cold start in 800ms even in a VPC.
- Pre-warm via a CloudWatch Events (now EventBridge) rule every 5 minutes. Hacky but works. You invoke the function with a dummy event that does nothing, keeping the environment warm. Problem: you pay for those invocations and the ENI stays attached. Also, AWS can still recycle your environment at any time, so it's not a guarantee; it just reduces the probability of a cold start.
- Move stateful data out of the cold start path. If you're loading a 50MB model or connecting to a slow service during init, push that to an @lambda/init layer or a sidecar extension. The init phase runs before the handler and doesn't count toward the timeout, but it does count toward the total wall-clock time the caller experiences. So cache credentials and pre-warm connections with afterConnect patterns.
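The pre-warm and caching ideas above can be sketched together as follows. This is a minimal sketch under stated assumptions: connectToDatabase is a hypothetical stand-in for whatever slow init you have, and the warmup field is something your own scheduled rule would put in the dummy event, not a Lambda feature.

```javascript
// Hypothetical stand-in for an expensive init step (DB connect, secret fetch).
let connectCalls = 0;
async function connectToDatabase() {
  connectCalls += 1;
  return { query: async (sql) => `result of ${sql}` };
}

// Module scope survives between invocations in the same execution environment.
// Cache the promise (not the resolved value) so overlapping calls share one connect.
let connectionPromise;
function getConnection() {
  if (!connectionPromise) {
    connectionPromise = connectToDatabase().catch((err) => {
      connectionPromise = undefined; // allow a retry on the next invocation
      throw err;
    });
  }
  return connectionPromise;
}

// `warmup` is an assumed field set by your own scheduled rule.
async function handler(event) {
  if (event && event.warmup === true) {
    await getConnection(); // keep the connection warm, skip business logic
    return { warmed: true };
  }
  const db = await getConnection();
  return { rows: await db.query('SELECT 1') };
}
```

Export handler in whatever way your module system expects. The design choice worth noting is that the connection is created lazily and cached, so a warm-up ping and a real request both end up using the same connection.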
Prevention: design for cold starts from day one
The most opinionated piece of advice I can give: test your Lambda cold start under load before writing a single line of business logic. Create a simple function that connects to your VPC resource and runs. Invoke it with 10 concurrent requests using aws lambda invoke --cli-binary-format raw-in-base64-out --function-name yourFunction --payload '{}' /dev/null & in a shell loop (the trailing /dev/null is the output file that aws lambda invoke requires). Measure the response time of the slowest call. That number is your baseline.
If it's over 3 seconds, you've got a networking or SDK issue. Don't start coding features until that number is under 1.5 seconds. A bad baseline never gets better with more code.
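If you'd rather drive the baseline test from Node.js than from a shell loop, it could look roughly like this. It is a sketch: measureConcurrent and classifyBaseline are hypothetical helper names, and invoke is any async function you supply that calls your Lambda. The fake invoke below only simulates latency so the example runs without AWS credentials.

```javascript
// Fire `count` invocations at once and record each wall-clock latency in ms.
// `invoke` is any async function you supply that triggers your function.
async function measureConcurrent(invoke, count) {
  const timed = Array.from({ length: count }, async () => {
    const start = Date.now();
    await invoke();
    return Date.now() - start;
  });
  return Promise.all(timed);
}

// Thresholds from the text: over 3000 ms signals a networking or SDK issue,
// and under 1500 ms is the target before feature work starts.
function classifyBaseline(latenciesMs) {
  const slowestMs = Math.max(...latenciesMs);
  let verdict = 'ok';
  if (slowestMs > 3000) verdict = 'networking-or-sdk-issue';
  else if (slowestMs >= 1500) verdict = 'needs-work';
  return { slowestMs, verdict };
}

// Simulated invoke so the sketch is runnable; replace with a real Lambda call.
const fakeInvoke = () => new Promise((resolve) => setTimeout(resolve, 20));
measureConcurrent(fakeInvoke, 10).then((latencies) => {
  console.log(classifyBaseline(latencies));
});
```

Only the slowest call matters for the baseline, which is why the classifier looks at the maximum rather than an average.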
Also, never put Lambda in a VPC unless it must access private resources. If you're just calling public APIs, leave it out of the VPC. The cold start goes from 10+ seconds to under 500ms. That's not an exaggeration — I measured it on a production workload in us-east-1 in late 2024. The difference is entirely the ENI attachment delay.
Final thought: the biggest sin I see is people throwing more memory at the function hoping it speeds up cold starts. Memory affects CPU, not network provisioning. It does not help here. Save your money.