ReactorNettyClient advertises SETTINGS_MAX_FRAME_SIZE = 65536 (64 KB) when negotiating HTTP/2 with the Cosmos Gateway, but the Gateway sends DATA frames larger than this — observed up to 3,683,373 bytes (~3.7 MB) against *-eastus2.documents.azure.com (multi-write, eventual). When this happens, Netty rejects the frame and the entire HTTP/2 parent TCP connection becomes unusable.
The problem shows up with a workload mix that includes large reads / queries (point reads, ReadFeed of a logical partition, parallel queries).
Within ~30 s of warmup, the writes JVM logs 100+ events like:
io.netty.handler.codec.http2.Http2Exception: Frame length: 3683373 exceeds maximum: 65536
at io.netty.handler.codec.http2.DefaultHttp2FrameReader.preProcessFrame(DefaultHttp2FrameReader.java:195)
...
each immediately followed by a cascade:
WARN com.azure.cosmos.implementation.http.Http2ParentChannelExceptionHandler -
Exception on HTTP/2 parent connection
[channel=[id: ..., L:/10.0.0.4:33134 - R:thin-client-mwr-eventual-ci-eastus2.documents.azure.com/40.84.77.110:443],
activeStreams=2, channelActive=true, ...]
io.netty.util.IllegalReferenceCountException: Http2FrameCodec#decode() might have released its input buffer...
Caused by: io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1
The IllegalReferenceCountException is the secondary symptom — the failed frame leaves the inbound ByteBuf in a bad refCount state, and the next handler's release() trips over refCnt == 0.
Observed counts in a single 3-minute window of one writes JVM at concurrency 25: ~168 events. Each event takes down the parent TCP connection along with 2–3 in-flight streams, which then retry on a fresh connection — successful end-to-end but with a tail-latency hit and very noisy logs.
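To put numbers like the ~168 events above on a firmer footing, the log can be scanned for the Http2Exception message. The following is a minimal sketch, assuming the message format quoted in this issue; the class and method names are hypothetical and not part of azure-cosmos or Netty.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log-scanning helper; not part of azure-cosmos or Netty. */
final class OversizedFrameLogScan {
    // Matches the Http2Exception message format quoted in this issue.
    private static final Pattern OVERSIZED =
        Pattern.compile("Frame length: (\\d+) exceeds maximum: (\\d+)");

    /** Returns {event count, largest frame length seen, maximum reported by the last match}. */
    static long[] summarize(List<String> logLines) {
        long count = 0;
        long largest = 0;
        long advertised = 0;
        for (String line : logLines) {
            Matcher m = OVERSIZED.matcher(line);
            if (m.find()) {
                count++;
                largest = Math.max(largest, Long.parseLong(m.group(1)));
                advertised = Long.parseLong(m.group(2));
            }
        }
        return new long[] {count, largest, advertised};
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "io.netty.handler.codec.http2.Http2Exception: Frame length: 3683373 exceeds maximum: 65536",
            "WARN com.azure.cosmos.implementation.http.Http2ParentChannelExceptionHandler - ...");
        long[] r = summarize(sample);
        System.out.println("events=" + r[0] + " largest=" + r[1] + " advertisedMax=" + r[2]);
    }
}
```

Feeding it the lines of a benchmark log gives the event count and the largest frame length seen, which helps decide whether 1 MB or 4 MB is the more sensible default.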
The 64 KB cap is far below what the Cosmos Gateway sends back for normal reads/queries. RFC 7540 allows SETTINGS_MAX_FRAME_SIZE up to 2^24 - 1 (16,777,215).
Proposed fix
Raise the default maxFrameSize to 1 MB (16× current). 1 MB does not cover the observed ~3.7 MB outliers, so the server may still send oversized frames occasionally, but raising the limit should substantially cut event frequency. (If we want to fully cover the observed payload, 4 MB is safer.)
Bump initialWindowSize to match (≥ maxFrameSize) so flow control does not become the new bottleneck — e.g. initialWindowSize = max(1 MB, 2 × maxFrameSize).
Expose both as knobs on Http2ConnectionConfig (setMaxFrameSize, setInitialWindowSize) so callers with unusual workloads (very large docs, big paginated queries) can tune them.
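The validation and derivation behind these knobs could look roughly like the sketch below. It is plain Java with hypothetical names (Http2FrameSettingsSketch and its methods are illustrative, not the actual Http2ConnectionConfig API). It assumes the RFC 7540 bounds for SETTINGS_MAX_FRAME_SIZE (2^14 to 2^24 - 1) and the max(1 MB, 2 × maxFrameSize) rule suggested above.

```java
/** Hypothetical helper illustrating the proposed knobs; not the real Http2ConnectionConfig. */
final class Http2FrameSettingsSketch {
    // RFC 7540 bounds for SETTINGS_MAX_FRAME_SIZE.
    static final int MIN_MAX_FRAME_SIZE = 1 << 14;        // 16,384
    static final int MAX_MAX_FRAME_SIZE = (1 << 24) - 1;  // 16,777,215
    static final int ONE_MB = 1 << 20;

    /** Rejects values outside the range the protocol allows. */
    static int validateMaxFrameSize(int maxFrameSize) {
        if (maxFrameSize < MIN_MAX_FRAME_SIZE || maxFrameSize > MAX_MAX_FRAME_SIZE) {
            throw new IllegalArgumentException(
                "maxFrameSize must be in [" + MIN_MAX_FRAME_SIZE + ", " + MAX_MAX_FRAME_SIZE
                    + "], got " + maxFrameSize);
        }
        return maxFrameSize;
    }

    /** initialWindowSize = max(1 MB, 2 x maxFrameSize), as suggested in the proposed fix. */
    static int deriveInitialWindowSize(int maxFrameSize) {
        // 2 x (2^24 - 1) still fits in an int, so the cast below cannot overflow.
        long doubled = 2L * validateMaxFrameSize(maxFrameSize);
        return (int) Math.max(ONE_MB, doubled);
    }

    public static void main(String[] args) {
        int[] candidates = {65536, ONE_MB, 4 * ONE_MB};
        for (int frameSize : candidates) {
            System.out.println("maxFrameSize=" + frameSize
                + " -> initialWindowSize=" + deriveInitialWindowSize(frameSize));
        }
    }
}
```

Deriving the window from the frame size in one place keeps the two values from drifting apart when a caller tunes only one of them.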
Side effects to consider
| Side effect | Severity | Mitigation |
| --- | --- | --- |
| Per-frame memory: Netty must hold a contiguous ByteBuf of up to maxFrameSize while decoding. 1 MB × 30 concurrent streams = ~30 MB worst case per HTTP/2 connection. Pooled DirectByteBuf from PooledByteBufAllocator handles this efficiently. | Low | Default JVM MaxDirectMemorySize is usually sufficient. Document the new ceiling. |
| Head-of-line blocking on parent TCP: A frame is transmitted atomically. A 1 MB frame on stream A delays interleaving for streams B, C, D on the same connection until it finishes. Larger frame → more HOL impact. | Medium | 1 MB is a reasonable middle ground (gRPC's 4 MB default is a message-size limit, and Envoy's default is also conservative). 16 MB would noticeably hurt tail latency of small concurrent requests, so don't max it out by default. |
| Pooled allocator fragmentation: Default arena chunk is 16 MB / 8 KB pages. 1 MB allocations compose cleanly; 4 MB still pools well; 16 MB may force unpooled allocation in some configs. | Low at 1 MB | Stay ≤ 4 MB by default. |
| Flow-control mismatch: If maxFrameSize > initialWindowSize, the server stalls between frames waiting for WINDOW_UPDATE. | Medium | Always raise initialWindowSize together. |
| TLS record overhead: TLS records are ≤ 16 KB, so SslHandler accumulates more partial decrypts before passing them up for larger frames. CPU impact is negligible. | Negligible | None needed. |
| DoS surface from a misbehaving peer: maxConcurrentStreams × maxFrameSize is the worst-case memory pin per connection. With current maxConcurrentStreams and 1 MB, this is bounded and small. | Low | Cosmos Gateway is trusted; mTLS + TLS validates origin. |
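The memory and TLS figures in the table are simple arithmetic. The sketch below works them out for a few candidate frame sizes, assuming the 30 concurrent streams used in the per-frame memory row and a 16 KB TLS record payload. The class and method names are hypothetical, for illustration only.

```java
/** Hypothetical arithmetic helper for the side-effect estimates; not part of the SDK. */
final class FrameBudgetSketch {
    static final int TLS_RECORD_PAYLOAD = 16 * 1024; // upper bound on TLS record plaintext

    /** Worst-case bytes pinned per connection: maxConcurrentStreams x maxFrameSize. */
    static long worstCaseBytesPerConnection(int maxConcurrentStreams, int maxFrameSize) {
        return (long) maxConcurrentStreams * maxFrameSize;
    }

    /** Minimum number of full-size TLS records needed to carry one maximum-size frame. */
    static int tlsRecordsPerFrame(int maxFrameSize) {
        return (maxFrameSize + TLS_RECORD_PAYLOAD - 1) / TLS_RECORD_PAYLOAD;
    }

    public static void main(String[] args) {
        int streams = 30; // figure used in the per-frame memory row
        int[] frameSizes = {64 * 1024, 1 << 20, 4 << 20};
        for (int frameSize : frameSizes) {
            long bytes = worstCaseBytesPerConnection(streams, frameSize);
            System.out.println("maxFrameSize=" + frameSize
                + " worstCaseMiB=" + (bytes / (1024.0 * 1024.0))
                + " tlsRecordsPerFrame=" + tlsRecordsPerFrame(frameSize));
        }
    }
}
```

At 1 MB this gives 30 MiB per connection and 64 TLS records per frame; at 4 MB, 120 MiB and 256 records. The actual maxConcurrentStreams value should be substituted before relying on the totals.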
Repro
main + PR "Add HTTP/2 PING for broken connection detection" (#49095) (HEAD 3d10f06fe4a at the time)
azure-cosmos-benchmark
AsyncWriteBenchmark / AsyncQueryBenchmark
HTTP/2 enabled (Http2ConnectionConfig.setEnabled(true)), thin-client disabled
Root cause (code reference)
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyClient.java#L153-L157
Environment
-Xmx6g -Xms6g -XX:MaxDirectMemorySize=4g -XX:+UseG1GC
azure-cosmos 4.0.1-beta.1 build from main + PR "Add HTTP/2 PING for broken connection detection" (#49095)
Happy to grab full thread dumps, GC logs, or netty wiretap captures if useful.