Scaling Kamailio to 1,000 CPS and 60,000 concurrent calls — a journey through 13 bottlenecks
TL;DR — Kamailio itself is rarely the ceiling. We hit 1,000 CPS of stable signaling and 60,000 concurrent dialogs only after fixing 13 distinct bottlenecks across SHM, the TM layer, dispatcher probes, the worker pool, RTPEngine, the database hot path, and Redis. Almost none of those 13 were Kamailio-the-process slowing down — they were defaults that don't survive contact with real load.
This is the long-form writeup of a multi-week capacity project on a production VoIP platform. Names and infra details are kept generic; the techniques are not.
The honest goal
The headline target was 1,000 CPS sustained, with 60,000 concurrent calls in flight. Two numbers, one promise: a single SBC pair must be able to carry a peak hour without queuing, dropping calls, or degrading post-dial delay (PDD).
Unless stated otherwise, the CPS numbers below refer to the active SBC node under test. Production headroom then depends on whether the SBC pair is operated active/standby or active/active.
Two things matter before we go further:
- CPS = new INVITEs accepted per second. It stresses the signalling path — TM transactions, dispatcher selection, route logic, htable lookups, optionally a few SQL round-trips per call.
- Concurrent = active dialogs at any moment. It stresses state: SHM, the dialog module, the dispatcher table, RTPEngine session slots, downstream UAS limits, and — most importantly — your session-timer hygiene.
A platform that does 1,000 CPS for ten minutes and a platform that sustains 1,000 CPS at 60k concurrent are very different machines. The second one is the one we wanted.
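The two axes are linked by Little's law: in steady state, calls in flight are roughly the arrival rate times the mean call duration. The Python sketch below is only that arithmetic; the function names are illustrative, the steady-state assumption is ours, and the 5-second test call is a hypothetical load-test profile rather than a figure from the project.

```python
def concurrent_calls(cps, mean_hold_seconds):
    """Little's law: calls in flight = arrival rate x mean time in system."""
    return cps * mean_hold_seconds


def implied_mean_hold(cps, concurrent):
    """Mean hold time (seconds) that makes a CPS / concurrency pair consistent."""
    return concurrent / cps


# The headline targets only coincide if calls last about a minute on average.
print(implied_mean_hold(1_000, 60_000))

# A burst test with hypothetical 5-second calls never builds much dialog state,
# which is why "1,000 CPS for ten minutes" proves little about concurrency.
print(concurrent_calls(1_000, 5))
```

Read this way, a CPS test with short synthetic calls exercises the signalling path but leaves the state-heavy path (SHM, dialog, media sessions) almost idle.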
The number behind the number
The first lesson, learned painfully:
| Architecture | Sustained CPS | Limiting factor |
|---|---|---|
| Kamailio rejection-only (no backend, no media) | 1,000+ | None — Kamailio itself is fine |
| Full path with media-proxy + heavyweight UAS | 200–300 | Media-proxy ng-protocol single UDP socket |
| Direct media + heavyweight UAS | ~500 | UAS CPU per call |
| Direct media + lightweight UAS | 500–600 | Kernel UDP socket on Kamailio (no SO_REUSEPORT in 5.8) |
Read the table top to bottom: Kamailio's SIP processing is not the constraint. Every other layer is. The single biggest mistake teams make is treating "Kamailio CPS" as the platform's CPS — it isn't, and you'll spend weeks tuning the wrong process.
The path to 1,000 CPS full-path is therefore a stack-wide exercise: parallel media instances, lightweight UAS where it makes sense, multi-IP listeners, and aggressive caching of every per-call SQL query.
The 13 bottlenecks, condensed
These appeared in roughly this order as CPS increased. Anyone scaling Kamailio will hit some subset of them, usually starting with #1 and #4.
1. SHM exhaustion from per-call htables
A per-call htable used for state (one entry per call ID, ~30 keys) accumulated entries because they were only freed at dialog:end. Failed calls — CANCELs, backend errors, dispatcher exhaustion — never reached dialog:end and leaked state. At 50 CPS we crashed in ~70 seconds.
Fix: explicit cleanup of every per-call key in both event_route[dialog:end] and event_route[dialog:failed]. Do not rely on dialog:failed as the only safety net: depending on where the call failed and whether dialog tracking had already been established, explicit cleanup may also be needed in failure_route, CANCEL handling, backend rejection paths, and dispatcher-exhausted branches. Dropped autoexpire from 3,600s to 1,800s as a belt-and-braces safety net. Bumped SHM from 64 MB to 1 GB — but the cleanup is what stopped the leak; the SHM bump just bought time during diagnosis.
Generalisation: any htable, dialog AVP, or $xavp you create per-call needs a cleanup path on every terminal route, including failure routes. SHM accounting is not garbage-collected.
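The leak is easy to reproduce in miniature. The Python sketch below is a toy model, not Kamailio configuration: `PerCallState` stands in for the per-call htable, the event names are hypothetical labels for the terminal routes, and the 20% failure mix is an assumption chosen for illustration.

```python
TERMINAL_EVENTS = {
    "dialog_end", "dialog_failed", "cancel",
    "backend_reject", "dispatcher_exhausted",
}


class PerCallState:
    """Toy stand-in for a per-call htable keyed by Call-ID."""

    def __init__(self, cleanup_events):
        self.entries = {}
        self.cleanup_events = set(cleanup_events)

    def on_invite(self, call_id, keys=30):
        # Roughly 30 keys of state are written per call.
        self.entries[call_id] = {f"k{i}": "v" for i in range(keys)}

    def on_terminal(self, call_id, event):
        # State is only released if this terminal event has a cleanup path.
        if event in self.cleanup_events:
            self.entries.pop(call_id, None)

    def leaked_keys(self):
        return sum(len(keys) for keys in self.entries.values())


def run(cleanup_events, calls):
    state = PerCallState(cleanup_events)
    for call_id, outcome in calls:
        state.on_invite(call_id)
        state.on_terminal(call_id, outcome)
    return state.leaked_keys()


# 100 calls, every fifth one is cancelled instead of ending normally.
calls = [(f"call-{i}", "cancel" if i % 5 == 0 else "dialog_end")
         for i in range(100)]

print(run({"dialog_end"}, calls))   # cleanup only on normal end
print(run(TERMINAL_EVENTS, calls))  # cleanup on every terminal event
```

With cleanup wired only to the normal end of a dialog, every cancelled call leaves its full set of keys behind. The leak rate therefore scales with CPS times the failure ratio, which is why it shows up under load tests long before it shows up in quiet production hours.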
2. Dispatcher probe sensitivity
Default ds_probing_threshold=1 means a single missed OPTIONS marks a backend dead. Under load the backend can miss one. We watched five healthy backends get marked Inactive in 60 seconds at 300 CPS, leaving zero dispatch targets.
Fix: ds_probing_threshold 1→5, ds_inactive_threshold 1→5, ping interval 60→30s. The shorter interval increased sampling frequency; the higher thresholds prevented a single missed OPTIONS from causing a state flip.
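To see why the thresholds matter more than the interval, here is a small Python state machine for one backend. It is a simplified model that assumes both thresholds count consecutive probe results; the function name and the probe sequence are illustrative and not taken from the dispatcher module.

```python
def probe_states(results, fail_threshold, recover_threshold):
    """Return the backend state after each probe.

    results: iterable of booleans, True = OPTIONS answered, False = missed.
    The backend goes inactive after `fail_threshold` consecutive misses and
    becomes active again after `recover_threshold` consecutive answers.
    """
    active, fails, oks = True, 0, 0
    states = []
    for ok in results:
        if ok:
            oks += 1
            fails = 0
            if not active and oks >= recover_threshold:
                active = True
        else:
            fails += 1
            oks = 0
            if active and fails >= fail_threshold:
                active = False
        states.append("active" if active else "inactive")
    return states


# A healthy but busy backend that misses exactly one OPTIONS.
probes = [True, True, False, True, True]

print(probe_states(probes, fail_threshold=1, recover_threshold=1))
print(probe_states(probes, fail_threshold=5, recover_threshold=5))
```

The trade-off is detection time for a backend that is genuinely down. If probes were the only failure signal, five misses at a 30 s interval would take roughly 150 s to flip the state, compared with about 60 s for one miss at the old interval.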
3. Worker starvation
64 UDP workers were enough for INVITE processing but not for INVITE + ACK + the retransmissions that result when ACKs are delayed. The retransmissions amplify the load: a stuck queue creates more queue.
Fix: workers 64→256 on the upgraded SBC hardware. Note the implication for dependencies — see #7.
4. TM transaction pile-up
Long INVITE transaction lifetimes are dangerous at high CPS. In our setup, incomplete INVITE transactions could remain resident long enough to accumulate rapidly under load. At 1,000 CPS, even a 120–180 second effective lifetime can mean well over 100,000 transactions competing for SHM. We observed roughly 45,000 stuck transactions before SHM pressure became visible.
Fix: reduce the effective INVITE timeout window to 30,000 ms where appropriate for our traffic profile. The TM layer cleaned up faster, and SHM stopped drifting upward. We also tightened non-INVITE timers, but the INVITE transaction lifetime was the dominant lever.
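The arithmetic behind that fix is worth writing down. The Python sketch below computes an upper bound on resident transactions, assuming a given fraction of transactions stays resident for the full lifetime; the function name and the `stuck_fraction` parameter are illustrative.

```python
def resident_transactions(cps, lifetime_seconds, stuck_fraction=1.0):
    """Steady-state count of transactions held in SHM.

    stuck_fraction is the share of transactions that stay resident for the
    whole lifetime; 1.0 gives the worst case.
    """
    return cps * lifetime_seconds * stuck_fraction


for lifetime in (180, 120, 30):
    worst = resident_transactions(1_000, lifetime)
    print(f"{lifetime:>3}s lifetime -> up to {worst:,.0f} resident transactions")
```

At 1,000 CPS the worst case drops from 120,000 to 180,000 resident transactions down to 30,000 once the window is 30 seconds. Real traffic sits well below the worst case because most transactions complete quickly, but the bound shrinks by the same ratio.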
5. RTPEngine session leak
Media sessions were torn down on BYE but not on every failure terminus — CANCEL, backend rejection, dispatcher exhaustion, dialog timeout. Stale sessions accumulated at thousands per hour and load average climbed past 25 on 32-core media servers.
Fix: explicit rtpengine_delete() in failure_route (CANCEL + dispatcher exhausted), event_route[dialog:failed], and the branch BYE path. Same principle as #1 — every terminal route in routing.cfg needs the cleanup.
6. UAS-side session limits
Backend defaults are conservative. sessions-per-second=30 and max-sessions=1000 evaporate at 100 CPS. Bumping them is necessary but not sufficient — they must also match your CPS-per-source-IP rate-limit policy upstream, or one provider can starve another.
7. PostgreSQL connection exhaustion
256 Kamailio workers × N database connections per worker = enough to starve the database. We hit "too many clients already" on startup. Two fixes:
1. Increased max_connections on the database fleet.
2. Removed an unused database connection that was being opened per-worker for a feature that no longer existed. ~256 connections per SBC reclaimed — for free.
Generalisation: any DB-backed module or configured DB handle that initializes per worker can multiply connection usage when worker count increases. Audit database handles before scaling workers.
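A quick way to run that audit is to multiply it out before changing the worker count. In the Python sketch below, the handle counts are hypothetical (the source does not state N) and the limit of 100 is PostgreSQL's usual stock default for max_connections; substitute your own values.

```python
def db_connections(workers, handles_per_worker):
    """Connections one SBC opens if every worker initialises every handle."""
    return workers * handles_per_worker


def fits(workers, handles_per_worker, max_connections, reserved=0):
    """True if the SBC stays within the database connection limit."""
    return db_connections(workers, handles_per_worker) <= max_connections - reserved


# Hypothetical: two per-worker handles before cleanup, one after.
before = db_connections(256, 2)
after = db_connections(256, 1)
print(before, after, before - after)

# Even one handle per worker does not fit a stock limit of 100.
print(fits(256, 1, max_connections=100))
```

Removing one per-worker handle reclaims exactly one connection per worker, which is where the roughly 256 reclaimed connections per SBC come from.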
8. Unused module overhead
Eight modules loaded, never used, all consuming SHM + PKG per worker. Fourteen MOS AVP computations per call — never read. Thirty-nine benchmark timer calls per INVITE — never aggregated. Dead config grows over the years; scaling is a great forcing function to clean it up.
9. Redis in the hot path
A daily-counter rate limiter took 8–16 Redis round-trips per call. At a few hundred CPS that's a millisecond-class tax that compounds in the worker pool.
Fix: moved local-counter rate limiting into an htable ($shtinc-driven CPS, dialog profiles for concurrent). Redis stayed for cluster-shared state and the live-call push channel — its hot-path role disappeared. Per-SBC enforcement only; cluster-wide is a Phase 2 concern.
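The in-process counter is conceptually simple. The Python class below is an analogue of the idea, not the Kamailio configuration we run: one counter per source and per second, incremented locally with no network round-trip. The class name, the provider labels, and the limit of 3 are illustrative.

```python
import time


class LocalCpsLimiter:
    """Per-source calls-per-second limiter held entirely in local memory."""

    def __init__(self, limit_per_second, clock=time.time):
        self.limit = limit_per_second
        self.clock = clock
        self.windows = {}  # source -> [epoch_second, count]

    def allow(self, source):
        now = int(self.clock())
        window = self.windows.get(source)
        if window is None or window[0] != now:
            # New second: start a fresh counter for this source.
            window = [now, 0]
            self.windows[source] = window
        window[1] += 1
        return window[1] <= self.limit


# Deterministic demo with a fake clock.
t = [100.0]
limiter = LocalCpsLimiter(3, clock=lambda: t[0])
print([limiter.allow("provider-a") for _ in range(5)])
print(limiter.allow("provider-b"))  # separate counter per source
t[0] = 101.2
print(limiter.allow("provider-a"))  # next second, counter resets
```

Each decision is one dictionary lookup and one increment, so the cost per call is constant and independent of Redis latency. The price is the one already stated above: the limit is enforced per SBC, not across the cluster.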
10. Dialog module synchronous writes
dialog db_mode=1 can perform blocking database updates on dialog state transitions. At 1,000 CPS, even a handful of dialog writes per call can become thousands of blocking writes per second. The blocking is what kills you, not only the raw write rate.
Fix: db_mode 1→2 (delayed periodic flush). State is still persisted, just not synchronously on the hot path. We also enabled auto_inv_100, disabled cdr_on_failed, and roughly doubled htable sizes across the board.
11. Per-call SQL on the routing path
Some routing decisions did SELECT … WHERE prefix LIKE … || '%' against a six-figure-row table per call. Even with the right index this is a millisecond-scale lookup running per INVITE.
Fix: migrate the prefix-match path to mtree — Kamailio's longest-prefix-match trie loaded from a database view at startup and reloadable via JSON-RPC. After migration, sub-millisecond lookups in-process; SQL kept only as a fallback (or removed entirely on the paths where the trie is authoritative).
Generalisation: if you run LIKE 'prefix%' against a hot table per call, you almost certainly want mtree instead. The 5-line modparam cost is worth weeks of re-tuning.
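For readers who have not used a trie for this, the Python sketch below shows the longest-prefix-match technique itself. It illustrates the data structure, not mtree's implementation, and the prefixes and carrier names are made up.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.value = None


class PrefixTrie:
    """Digit trie supporting longest-prefix lookup."""

    def __init__(self):
        self.root = Node()

    def insert(self, prefix, value):
        node = self.root
        for digit in prefix:
            node = node.children.setdefault(digit, Node())
        node.value = value

    def longest_match(self, number):
        node, best = self.root, None
        for digit in number:
            node = node.children.get(digit)
            if node is None:
                break
            if node.value is not None:
                best = node.value  # remember the deepest prefix seen so far
        return best


trie = PrefixTrie()
trie.insert("44", "carrier-uk")
trie.insert("4420", "carrier-london")
trie.insert("1", "carrier-na")

print(trie.longest_match("442071234567"))
print(trie.longest_match("447700900123"))
print(trie.longest_match("33123456789"))
```

The lookup walks at most one node per dialled digit, so its cost depends on the length of the number and not on how many prefixes are loaded. That is the property a per-call SQL LIKE cannot give you.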
12. Cold-start cliff on warmup caches
At restart, the warmup htables (provider IP map, prefix → carrier cache) are empty for ~10 minutes while the rtimer populates them. Every call during that window falls back to per-call SQL — which collapses under load.
Fix: synchronous warmup of the smallest, most critical caches before the listener accepts traffic. Larger caches stay on the rtimer, but the SBC no longer accepts traffic in a state where every call hits the database.
13. Linux kernel UDP socket
On our Kamailio 5.8 deployment, UDP receive parallelism was effectively bounded per listener socket. Past ~600 CPS on a single SBC IP, packet delivery to that listener became the bottleneck.
Fix: multi-IP listener topology. Each major provider/source-IP pair gets dedicated listener IPs, distributing UDP receive work across multiple sockets. This was cheaper and safer than re-architecting the worker model during the capacity project.
The 60,000 concurrent number
CPS is one axis; concurrent dialogs is the other. The concurrent figure is determined by how long calls stay alive more than by how fast you accept them. Three pieces had to be right:
- RFC 4028 session timers with min_se=90 and a sane Session-Expires (we settled on 1,800 s). Without session timers, network blips orphan dialogs that linger for hours.
- Dialog default_timeout=1800. Bounded stuck-call lifetime to 30 minutes regardless of what the endpoints do.
- Media-proxy max-sessions scaled to fleet capacity. We landed at 20,000 per instance × 4 instances = 80,000, comfortably above 60k peak with headroom for re-INVITEs and short-lived overlap.
The SBCs themselves carry concurrent state cheaply once SHM is sized correctly. Concurrent capacity is mostly an sst + dialog + media-fleet conversation, not a Kamailio-process conversation.
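The media-fleet sizing is simple enough to check in a few lines. The Python sketch below assumes sessions are spread evenly across instances and that max-sessions is the binding limit; both are simplifications, and the function names are illustrative.

```python
def fleet_capacity(instances, max_sessions_per_instance):
    """Total media sessions the fleet can hold."""
    return instances * max_sessions_per_instance


def headroom_ratio(capacity, peak):
    """Spare capacity as a fraction of peak load."""
    return (capacity - peak) / peak


PEAK = 60_000

full = fleet_capacity(4, 20_000)
degraded = fleet_capacity(3, 20_000)  # one instance out of service

print(full, round(headroom_ratio(full, PEAK), 2))
print(degraded, round(headroom_ratio(degraded, PEAK), 2))
```

Under these assumptions the full fleet carries about a third more than peak, while losing one instance leaves capacity equal to peak with no margin. Whether that is acceptable depends on how often peak and an instance failure are expected to coincide.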
What we deliberately rolled back
A late-stage proposal: a per-provider boolean to bypass the media proxy for direct-peering providers, breaking the ng-protocol single-socket ceiling for that subset of traffic. We built it, tested it, and rolled it back end-to-end.
Why:
- It introduced a runtime decision in routing that affected the call's media path, a class of decision that's expensive to debug post-incident.
- The same throughput could be reached by adding media-proxy instances horizontally, a known-quantity scaling pattern.
- Operationally, "every call goes through the media proxy" is a much simpler invariant to defend than "every call goes through the media proxy unless this provider has the bypass flag set". Invariants you can state in one sentence are the invariants you can monitor.
The rollback was clean: schema column dropped, routing changes never reached production, spec text reverted. Ten weeks later we don't regret it. The lesson here isn't that direct media is bad — it's that adding configurable behaviour to the call path is a load-bearing decision and deserves its own spec, not an appendix to a capacity feature.
The pattern, in one paragraph
If you want a single takeaway: scaling Kamailio is a discipline of finding the per-call cost that grows linearly with CPS and removing it. SHM that grows because of a missing event_route[dialog:failed] cleanup. A Redis call on the hot path. A LIKE against a hot table. Worker starvation that creates retransmits that create more worker starvation. The SBC becomes fast not because Kamailio gets faster, but because the loop around each INVITE gets emptier.
Build the empty loop first. Then add what you must. Validate that what you must add is genuinely O(1) per call.
What's next
In follow-up posts:
- The mtree migration in detail — schema view, module setup, the JSON-RPC reload pattern, and the bugs that stopped us from cargo-culting it onto a second hot path.
- A long view of dialog db_mode and what state survives an SBC restart, and what doesn't.
- The session-timer story: why min_se=90 and not the textbook 1800, and what that does to mid-dialog re-INVITEs from misbehaving endpoints.
If you want a topic next, reach out.