This happens when NewServerTransport() returns nil, nil. The rawConn is
closed when the transport is closed, which will never happen in this
case (since the returned transport is nil).
Currently (go1.13.4), the default stack size for newly spawned
goroutines is 2048 bytes. This is insufficient when processing gRPC
requests as the we often require more than 4 KiB stacks. This causes the
Go runtime to call runtime.morestack at least twice per RPC, which
causes performance to suffer needlessly as stack reallocations require
all sorts of internal work such as changing pointers to point to new
addresses.
Since this stack growth is guaranteed to happen at least twice per RPC,
reusing goroutines gives us two wins:
1. The stack is already grown to 8 KiB after the first RPC, so
subsequent RPCs do not call runtime.morestack.
2. We eliminate the need to spawn a new goroutine for each request
(even though they're relatively inexpensive).
Performance improves across the board. The improvement is especially
visible in small, unary requests as the overhead of stack reallocation
is higher, percentage-wise. QPS is up anywhere between 3% and 5%
depending on the number of concurrent RPC requests in flight. Latency is
down ~3%. There is even a 1% decrease in memory footprint in some cases,
though that is an unintended, but happy coincidence.
unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_8-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false
Title Before After Percentage
TotalOps 2613512 2701705 3.37%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 8657.00 8654.17 -0.03%
Allocs/op 173.37 173.28 0.00%
ReqT/op 348468.27 360227.33 3.37%
RespT/op 348468.27 360227.33 3.37%
50th-Lat 174.601µs 167.378µs -4.14%
90th-Lat 233.132µs 229.087µs -1.74%
99th-Lat 438.98µs 441.857µs 0.66%
Avg-Lat 183.263µs 177.26µs -3.28%
Performance benchmarks can be found below. Obviously, a 8 KiB
request/response is tailored to showcase this improvement as this is
where codec buffer reuse shines, but I've run other benchmarks too (like
1-byte requests and responses) and there's no discernable impact on
performance.
We do not allow reuse of buffers when stat handlers or binlogs are
turned on. This is because those two may need access to the data and
payload even after the data has been written to the wire. In such cases,
we never return the data back to the pool.
A buffer reuse threshold of 1 KiB was determined after several
experiments. There's diminished returns when buffer reuse is enabled for
smaller messages (actually, a negative impact).
unary-networkMode_none-bufConn_false-keepalive_false-benchTime_40s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_6-reqSize_8192B-respSize_8192B-compressor_off-channelz_false-preloader_false
Title Before After Percentage
TotalOps 839638 906223 7.93%
SendOps 0 0 NaN%
RecvOps 0 0 NaN%
Bytes/op 103788.29 80592.47 -22.35%
Allocs/op 183.33 189.30 3.27%
ReqT/op 1375662899.20 1484755763.20 7.93%
RespT/op 1375662899.20 1484755763.20 7.93%
50th-Lat 238.746µs 225.019µs -5.75%
90th-Lat 514.253µs 456.439µs -11.24%
99th-Lat 711.083µs 702.466µs -1.21%
Avg-Lat 285.45µs 264.456µs -7.35%
Continuing the war on stacks, we can reduce the amount of stack required
per-RPC by combining defers from different components into one.
Each defer statement in process{Unary,Streaming}RPC goes on the stack
and occupies about 56-64 bytes the entire lifetime of an RPC, which
could be very long. More importantly, a call to runtime.morestack is
often required to allocate a new, larger stack when the handler
goroutine runs out of stack memory (Go's default stack size is 2 KiB).
Before:
$ go tool objdump <binary> | grep "TEXT.*processUnaryRPC(SB)" -A 10 | grep "SUBQ.*SP"
server.go:867 0x9132fb 4881ec80030000 SUBQ $0x380, SP
$ go tool objdump <binary> | grep "TEXT.*processStreamingRPC(SB)" -A 10 | grep "SUBQ.*SP"
server.go:1099 0x9151bb 4881ec68020000 SUBQ $0x268, SP
After:
$ go tool objdump <binary> | grep "TEXT.*processUnaryRPC(SB)" -A 10 | grep "SUBQ.*SP"
server.go:867 0x9132fb 4881ecd0020000 SUBQ $0x2d0, SP
$ go tool objdump <binary> | grep "TEXT.*processStreamingRPC(SB)" -A 10 | grep "SUBQ.*SP"
server.go:1116 0x9150fb 4881ecf8010000 SUBQ $0x1f8, SP
As one can observe, the processUnaryRPC's stack goes down from 0x380
bytes to 0x2d0 bytes (896 - 720 = 176 bytes) while processStreamingRPC's
stack goes down from 0x2d8 bytes to 0x1f8 bytes (616 - 504 = 112 bytes).
There are probably other things we can do here, but these are some low
hanging fruits to pick off.
The client-side traces were otherwise only showing `RPC: to <nil>`,
which is not helpful.
Also clean up construction of traceInfo and firstLine in a few places.
Also includes:
- Export `NewLoggerFromConfigString` so it can be also used when config string is specified in another way (e.g. command line flag)
- Export `Logger` so user can install custom sink
- Add temp file sink implementation