
zstdMaxBufferedEncoders value is too low which caused frequent zstd encoder object creation #2965

Open
HenryCaiHaiying opened this issue Aug 19, 2024 · 11 comments



HenryCaiHaiying commented Aug 19, 2024

Description

The zstdMaxBufferedEncoders value is too low, which causes many temporary zstd encoder objects to be created. Each encoder object is quite large (200KB), and this adds to the woes of frequent GCs.

Versions

| Sarama | Kafka | Go |
|--------|-------|--------|
| 1.4.0  | 3.8.0 | 1.21.5 |

Additional Context

This is related to #2964. After I did more CPU and memory profiling, I realized the frequent-GC problem is caused by excessive creation of zstd encoder objects.

Below is a snippet of the program's top memory allocations. In the pprof listing, the top 2 allocation sites are from creating zstd encoder objects (300KB each). The next 3 allocation sites are from encoding messages (2-4KB per message); the top 2 allocation sites dominate memory usage. With this high volume of allocation, goroutines were frequently put into "GC assist wait" mode once their memory allocation credit went negative.

The root cause of the frequent zstd encoder object creation is that zstdMaxBufferedEncoders is hard-coded to 1 in https://github.com/IBM/sarama/blob/main/zstd.go#L11. In a real production system with hundreds of brokers and producer objects, you will have hundreds of goroutines going through the compression path, and this causes frequent encoder object creation and destruction.
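For context, the pooling in zstd.go behaves roughly like the sketch below (a paraphrase of the mechanism as I understand it, not the exact source): idle encoders sit in a buffered channel whose capacity is zstdMaxBufferedEncoders, so with the capacity hard-coded to 1 any goroutine that finds the channel empty allocates a brand-new encoder, and most released encoders are immediately dropped for the GC.

```go
// Paraphrased sketch of the current pooling behaviour, not the actual zstd.go source.
package zstdpool

import "github.com/klauspost/compress/zstd"

const zstdMaxBufferedEncoders = 1

var zstdEncoderPool = make(chan *zstd.Encoder, zstdMaxBufferedEncoders)

func getZstdEncoder() *zstd.Encoder {
	select {
	case enc := <-zstdEncoderPool:
		return enc // reuse the single idle encoder, if any
	default:
		// Pool is empty: allocate a fresh encoder (a few hundred KB of buffers each).
		enc, _ := zstd.NewWriter(nil, zstd.WithEncoderConcurrency(1))
		return enc
	}
}

func releaseZstdEncoder(enc *zstd.Encoder) {
	select {
	case zstdEncoderPool <- enc: // keep at most zstdMaxBufferedEncoders idle
	default:
		// An idle encoder is already buffered: drop this one and let the GC reclaim it.
	}
}
```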

(pprof) top 30
Showing nodes accounting for 4701.54GB, 96.18% of 4888.34GB total
Dropped 712 nodes (cum <= 24.44GB)
Showing top 30 nodes out of 49
      flat  flat%   sum%        cum   cum%
 3606.36GB 73.77% 73.77%  3606.36GB 73.77%  github.com/klauspost/compress/zstd.encoderOptions.encoder
  840.66GB 17.20% 90.97%   840.66GB 17.20%  github.com/klauspost/compress/zstd.(*blockEnc).init
   67.22GB  1.38% 92.35%    67.22GB  1.38%  github.com/klauspost/compress/zstd.(*fastBase).ensureHist (inline)
   45.37GB  0.93% 93.28%  4656.45GB 95.26%  github.com/IBM/sarama.encode
   33.41GB  0.68% 93.96%  4605.47GB 94.21%  github.com/klauspost/compress/zstd.(*Encoder).EncodeAll

HenryCaiHaiying commented Aug 19, 2024

A sample stack trace of a goroutine being put into GC Assist Wait when its memory allocation credit is negative:

goroutine 10667 [GC assist wait]:
runtime.gopark(0xc045f85100?, 0x12daae8?, 0x0?, 0x0?, 0x74b7fd7f60a8?)
        /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc045f85048 sp=0xc045f85028 pc=0x13028ae
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:404
runtime.gcParkAssist()
        /usr/local/go/src/runtime/mgcmark.go:652 +0xe7 fp=0xc045f85080 sp=0xc045f85048 pc=0x12e47e7
runtime.gcAssistAlloc(0xc008282d00)
        /usr/local/go/src/runtime/mgcmark.go:509 +0x22a fp=0xc045f850e0 sp=0xc045f85080 pc=0x12e424a
runtime.deductAssistCredit(0x12d2593?)
        /usr/local/go/src/runtime/malloc.go:1291 +0x54 fp=0xc045f85108 sp=0xc045f850e0 pc=0x12d27f4
runtime.mallocgc(0x140048, 0x37cb1c0, 0x1)
        /usr/local/go/src/runtime/malloc.go:1006 +0xc9 fp=0xc045f85170 sp=0xc045f85108 pc=0x12d2029
runtime.newobject(0x4000000?)
        /usr/local/go/src/runtime/malloc.go:1328 +0x25 fp=0xc045f85198 sp=0xc045f85170 pc=0x12d28e5
github.com/klauspost/compress/zstd.encoderOptions.encoder({0x1, 0x2, 0x0, 0x0, 0x20000, 0x800000, 0x1, 0x1, 0x0, 0x0, ...})
        /go/pkg/mod/github.com/klauspost/[email protected]/zstd/encoder_options.go:60 +0x11d fp=0xc045f851b8 sp=0xc045f85198 pc=0x1fd467d
github.com/klauspost/compress/zstd.(*Encoder).initialize(0xc020c89540)
        /go/pkg/mod/github.com/klauspost/[email protected]/zstd/encoder.go:88 +0x197 fp=0xc045f852a0 sp=0xc045f851b8 pc=0x1fd1937
github.com/klauspost/compress/zstd.(*Encoder).initialize-fm()
        <autogenerated>:1 +0x25 fp=0xc045f852b8 sp=0xc045f852a0 pc=0x1fe3b45
sync.(*Once).doSlow(0x0?, 0xc000000000?)
        /usr/local/go/src/sync/once.go:74 +0xbf fp=0xc045f85318 sp=0xc045f852b8 pc=0x1344d7f
sync.(*Once).Do(...)
        /usr/local/go/src/sync/once.go:65
github.com/klauspost/compress/zstd.(*Encoder).EncodeAll(0xc020c89540, {0xc03606a380, 0x380, 0x380}, {0x0, 0x0, 0x0})
        /go/pkg/mod/github.com/klauspost/[email protected]/zstd/encoder.go:494 +0x19f fp=0xc045f85498 sp=0xc045f85318 pc=0x1fd3a1f
github.com/IBM/sarama.zstdCompress({0x0?}, {0x0, 0x0, 0x0}, {0xc03606a380, 0x380, 0x380})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/zstd.go:71 +0x86 fp=0xc045f85510 sp=0xc045f85498 pc=0x20495c6
github.com/IBM/sarama.compress(0x20?, 0xc00bf10210?, {0xc03606a380, 0x380, 0x380})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/compress.go:190 +0xdae fp=0xc045f85a38 sp=0xc045f85510 pc=0x200bb4e
github.com/IBM/sarama.(*RecordBatch).encodeRecords(0xc046511860, {0x3ea55e8?, 0xc0538882c0?})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/record_batch.go:206 +0x9a fp=0xc045f85a80 sp=0xc045f85a38 pc=0x203d7da
github.com/IBM/sarama.(*RecordBatch).encode(0xc046511860, {0x3ea55e8, 0xc0538882c0})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/record_batch.go:88 +0x231 fp=0xc045f85ae0 sp=0xc045f85a80 pc=0x203cd31
github.com/IBM/sarama.(*Records).encode(0xc0538882c0?, {0x3ea55e8?, 0xc0538882c0?})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/records.go:62 +0x98 fp=0xc045f85b28 sp=0xc045f85ae0 pc=0x203dab8
github.com/IBM/sarama.(*ProduceRequest).encode(0xc053888280, {0x3ea55e8, 0xc0538882c0})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/produce_request.go:108 +0x5e7 fp=0xc045f85cd8 sp=0xc045f85b28 pc=0x2035367
github.com/IBM/sarama.(*request).encode(0xc03bb15650, {0x3ea55e8, 0xc0538882c0})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/request.go:43 +0x13c fp=0xc045f85d00 sp=0xc045f85cd8 pc=0x203e4dc
github.com/IBM/sarama.encode({0x3e67160, 0xc03bb15650}, {0x3e91630?, 0xc00baeb440})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/encoder_decoder.go:29 +0x91 fp=0xc045f85d70 sp=0xc045f85d00 pc=0x201af31
github.com/IBM/sarama.(*Broker).sendInternal(0xc008281c00, {0x3e8dc30, 0xc053888280}, 0xc0091a2780)
        /go/pkg/mod/github.com/!i!b!m/[email protected]/broker.go:1001 +0x16b fp=0xc045f85e50 sp=0xc045f85d70 pc=0x1ffc94b
github.com/IBM/sarama.(*Broker).sendWithPromise(0xc008281c00, {0x3e8dc30, 0xc053888280}, 0x0?)
        /go/pkg/mod/github.com/!i!b!m/[email protected]/broker.go:991 +0x9e fp=0xc045f85e88 sp=0xc045f85e50 pc=0x1ffc75e
github.com/IBM/sarama.(*Broker).AsyncProduce(0xc008281c00, 0xc053888280, 0xc0538882a0?)
        /go/pkg/mod/github.com/!i!b!m/[email protected]/broker.go:470 +0x1cc fp=0xc045f85f10 sp=0xc045f85e88 pc=0x1ffa88c
/encoder.opt

@HenryCaiHaiying (Author)

The fix is actually quite simple: allow zstdMaxBufferedEncoders to be specified through a Config param.
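A hypothetical usage sketch of that idea (the field name and its placement on the config are illustrative only, not an existing sarama API):

```go
package main

import "github.com/IBM/sarama"

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Compression = sarama.CompressionZSTD
	// Hypothetical knob proposed in this issue (not in released sarama):
	// keep more idle zstd encoders around so concurrent produce goroutines
	// can reuse them instead of allocating new ones.
	cfg.Producer.MaxBufferedCompressionEncoders = 64
	_ = cfg
}
```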


dnwe commented Aug 19, 2024

Yes, this was pinned to 1 as part of #2375, which aimed to control memory usage because the prior default behaviour had been to allocate one encoder per goroutine. The aim was to move to a shared pool but only keep a single idle spare. It does sound like it would be useful to trade off static memory usage against GC pressure by keeping a larger pool of re-usable encoders ready for use.


HenryCaiHaiying commented Aug 19, 2024

@dnwe I think we should allow the user to specify the pool size. By the way, I only measured about 300KB for each zstd encoder object (when using CompressionLevel: SpeedDefault) based on github.com/klauspost/compress v1.17.4 code.

HenryCaiHaiying added a commit to HenryCaiHaiying/sarama that referenced this issue Aug 21, 2024
…pressionEncoders a configurable parameter

This PR is addressing issue: IBM#2965

For Zstd compression, the pool size of encoders (controlled by the MaxBufferedEncoders param) is currently hard-coded to 1. This causes an excessive number of large encoder objects to be created under high concurrency, which causes GC performance issues.

Introduce a new config parameter: MaxBufferedCompressionEncoders (default value 1) and modify the calling paths to pass this parameter through message/record_batch and down to compressor/encoder.

Signed-off-by: Henry Cai <[email protected]>
@rtreffer (Contributor)

I got nerd sniped by this a bit...

I have an experimental zstd encoder pool (read: don't use it without full understanding of the code).

An initial (probably flawed) pseudo-benchmark showed roughly a 20% CPU improvement under high goroutine counts, in addition to better memory efficiency, even with encoder reuse (removing the limit on max buffered encoders means 100% reuse, to be verified). Most systems should see much less of an effect, but that's normal for benchmarks that test a subset. I want to turn this into a proper benchmark though.

I don't want to suggest this as an alternative right now (it is highly experimental and has quite some complexity), but I would really like to know the settings of the original application, as there is a good chance that the absolute maximum number of encoders has a higher impact than the absolute minimum, especially with a high goroutine/GOMAXPROCS ratio, which seems to be symptomatic of the original issue.

There might be an easier way to just cap the maximum number of outstanding (handed-out) encoders; I haven't investigated this yet though.
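A rough sketch of what such a cap could look like (my own illustration under assumed names, not the experimental pool code): pair the idle-encoder channel with a semaphore sized to GOMAXPROCS, so at most GOMAXPROCS encoders are ever handed out at once.

```go
package zstdpool

import (
	"runtime"

	"github.com/klauspost/compress/zstd"
)

var (
	// Tokens bounding how many encoders can be checked out simultaneously.
	encoderSlots = make(chan struct{}, runtime.GOMAXPROCS(0))
	// Idle encoders waiting to be reused.
	idleEncoders = make(chan *zstd.Encoder, runtime.GOMAXPROCS(0))
)

func acquireEncoder() *zstd.Encoder {
	encoderSlots <- struct{}{} // blocks once GOMAXPROCS encoders are in use
	select {
	case enc := <-idleEncoders:
		return enc // reuse an idle encoder
	default:
		enc, _ := zstd.NewWriter(nil, zstd.WithEncoderConcurrency(1))
		return enc
	}
}

func releaseEncoder(enc *zstd.Encoder) {
	select {
	case idleEncoders <- enc: // always fits: both channels share the same capacity
	default:
	}
	<-encoderSlots // free a slot for the next caller
}
```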

@rtreffer (Contributor)

Ok, I managed to get a benchmark going to test different variants of pool configurations that is hopefully somewhat informative:

  1. GOMAXPROCS (1 & 4)
  2. max idle encoders (1, 1000, no limit)
  3. max number of outstanding (live) encoders (goroutines, GOMAXPROCS)

Each test launches 1000 goroutines competing for encoders. Each benchmark is executed with and without alloc measurement to avoid accounting issues. The first iteration is regarded as a "pool warmup" and always discarded.
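Roughly, such a benchmark can be structured like the sketch below (a simplified illustration, not the actual benchmark code; zstdCompress and ZstdEncoderParams are sarama-internal names taken from the stack trace above and stand in for "acquire an encoder from the pool under test, compress, release"):

```go
package sarama

import (
	"runtime"
	"testing"
)

func BenchmarkZstdCpuAndMemoryHighConcurrency(b *testing.B) {
	payload := make([]byte, 2048) // small record payload, as in the discussion
	b.SetBytes(int64(len(payload)))
	// Aim for roughly 1000 goroutines competing for encoders.
	b.SetParallelism(1000 / runtime.GOMAXPROCS(0))
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			// Acquire an encoder from the pool under test, compress, release.
			if _, err := zstdCompress(ZstdEncoderParams{}, nil, payload); err != nil {
				b.Error(err)
				return
			}
		}
	})
}
```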

Results:

$ go test -run=^$ -count 1 -timeout 15m -benchtime=10s -bench ^BenchmarkZstdCpuAndMemoryHighConcurrency github.com/IBM/sarama
goos: linux
goarch: amd64
pkg: github.com/IBM/sarama
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics     
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCEncoderLimit-16                                 1506000              7970 ns/op         256.96 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCEncoderLimitAlloc-16                            1500000              7966 ns/op         257.09 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op         2197 B/op          4 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdle1GOMAXPROCEncoderLimit-16                            1469000              8317 ns/op         246.25 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdle1GOMAXPROCEncoderLimitAlloc-16                       1492000              8087 ns/op         253.25 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op         2197 B/op          4 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdle1-16                                                 1495000              8035 ns/op         254.90 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdle1Alloc-16                                            1488000              7999 ns/op         256.04 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op         2197 B/op          4 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdleMaxGOMAXPROCEncoderLimit-16                          1510000              7955 ns/op         257.46 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdleMaxGOMAXPROCEncoderLimitAlloc-16                     1504000              8002 ns/op         255.93 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op         2197 B/op          4 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdleMax-16                                               1482000              8085 ns/op         253.30 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdleMaxAlloc-16                                          1503000              7960 ns/op         257.29 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op         2197 B/op          4 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4GOMAXPROCEncoderLimit-16                      4646000              2646 ns/op         774.06 MB/s             0 new/op                4.000 open_encs                 1.000 reuses/op                 0.9779 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4GOMAXPROCEncoderLimitAlloc-16                 4495000              2683 ns/op         763.38 MB/s             0 new/op                4.000 open_encs                 1.000 reuses/op                 0.9770 waits/op            2149 B/op          3 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4Idle1GOMAXPROCEncoderLimit-16                 3700000              3256 ns/op         629.03 MB/s             0.004118 new/op                 4.000 open_encs                 0.9962 reuses/op                0.9729 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4Idle1GOMAXPROCEncoderLimitAlloc-16            3723000              3256 ns/op         629.06 MB/s             0.004092 new/op                 4.000 open_encs                 0.9962 reuses/op                0.9721 waits/op            8891 B/op          3 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4Idle1-16                                       525000             27478 ns/op          74.53 MB/s             0.3383 new/op         122.0 open_encs           0.6636 reuses/op                0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4Idle1Alloc-16                                  481000             26266 ns/op          77.97 MB/s             0.3322 new/op         119.7 open_encs           0.6698 reuses/op                0 waits/op       548680 B/op         15 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4IdleMaxGOMAXPROCEncoderLimit-16               4329000              2704 ns/op         757.39 MB/s             0 new/op                4.000 open_encs                 1.000 reuses/op                 0.9786 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4IdleMaxGOMAXPROCEncoderLimitAlloc-16          4478000              2697 ns/op         759.50 MB/s             0 new/op                4.000 open_encs                 1.000 reuses/op                 0.9780 waits/op            2149 B/op          3 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4IdleMax-16                                    2541000              4454 ns/op         459.78 MB/s             0.0002601 new/op               12.31 open_encs          1.000 reuses/op                 0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4IdleMaxAlloc-16                               2723000              4796 ns/op         427.06 MB/s             0.0002629 new/op               11.83 open_encs          1.000 reuses/op                 0 waits/op         2651 B/op          3 allocs/op
PASS
ok      github.com/IBM/sarama   367.935s

The GOMAXPROCS=1 results turn out to be the same for all variants. The GOMAXPROCS=4 cases turn out to be interesting.

| Max Idle Encoders | Max Outstanding Encoders | Throughput | alloc bytes per op |
|-------------------|--------------------------|------------|--------------------|
| -                 | GOMAXPROCS               | 774.06 MB/s | 2149 |
| 1000              | GOMAXPROCS               | 757.39 MB/s | 2149 |
| 1                 | GOMAXPROCS               | 629.03 MB/s | 8891 |
| 1000              | 1000                     | 459.78 MB/s | 2651 |
| 1                 | 1000                     | 74.53 MB/s  | 548680 |

At high goroutine-to-GOMAXPROCS ratios, it seems like limiting the in-use encoders to GOMAXPROCS beats raising the idle encoders. On top of that, it should conserve memory and avoid issues for high-compression use cases.

There is a high memory tax for keeping many idle or in-use encoders. So I am a bit skeptical about just raising the idle encoder max, as I think limiting the maximum number of encoders in use is highly beneficial, and potentially more beneficial for more use cases. The benchmark indicates that it can beat this synthetic case by another 68%. I would really like to see the goroutine-to-GOMAXPROCS ratio and how this plays out. There is a chance that either of the two directions is better for the specific setup.

That said: the benchmark shows the potential for a 10x improvement for a rather common (anti)pattern: get a batch, split it into entries, launch a goroutine per entry. This is pretty huge. And even the min-encoders change alone is a 5x improvement.

If only I could find a low-complexity solution that offers these benefits...

@HenryCaiHaiying (Author)

@rtreffer Thanks for so quickly creating the encoder pool code and the benchmark test. I think the reason maxIdleEncoders=4's throughput is better than maxIdleEncoders=100's is that maxIdleEncoders=4 spends much less time creating encoder objects (the max is capped at 4). Instead, maxIdleEncoders=4 spends more time waiting for an encoder to become available from the channel: https://github.com/rtreffer/sarama/blob/zstd-pool/zstd_pool.go#L113. In your benchmark test, the goroutine only does zstd compression on 2KB of data, which is very fast, so it quickly turns around and returns the encoder to the pool/channel. I don't remember all the details of the sarama code, but that goroutine might have other work to do (e.g. sending data through the TCP connection to downstream). So depending on how the goroutine's time is split between its real computation work and pool-related activities, the pool reuse might not be that good.

Also, this new pool code (especially the line waiting for an encoder from the channel: https://github.com/rtreffer/sarama/blob/zstd-pool/zstd_pool.go#L113) changes the object-creation dynamics compared to the old static MaxBufferedEncoders=1 code, where there was no upper limit on how many encoders could be created. In the old code, I can see hundreds of those encoder objects being created all the time.

@rtreffer (Contributor)

I think I finally found an easy way to express this all in code: rtreffer@23168b5

$ git diff --stat HEAD~1
 zstd.go | 53 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 40 insertions(+), 13 deletions(-)

I can try to get this into a PR form soon, ideally with a benchmark. I wouldn't want to propose this without cleaning up the benchmark.

@puellanivis (Contributor)

I’m not sure you have to worry about using an atomic.Int32 for zstdCheckedOutEncoders. Your code only ever manipulates or accesses it while holding the global lock. (Though, with the atomic, the whole releaseEncoder function doesn’t need to be grabbing the lock at all.)
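A minimal illustration of that second option (hypothetical names modeled on the comment, not the code in the referenced commit): with an atomic counter, releaseEncoder can skip the mutex entirely.

```go
package zstdpool

import (
	"sync/atomic"

	"github.com/klauspost/compress/zstd"
)

var (
	zstdCheckedOutEncoders atomic.Int32                   // encoders currently handed out
	idleEncoders           = make(chan *zstd.Encoder, 16) // idle pool; capacity is illustrative
)

func releaseEncoder(enc *zstd.Encoder) {
	zstdCheckedOutEncoders.Add(-1) // atomic decrement, so no global lock is needed here
	select {
	case idleEncoders <- enc: // park the encoder for reuse
	default: // pool full: drop it and let the GC reclaim it
	}
}
```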

@HenryCaiHaiying (Author)

@rtreffer Are you still moving forward with your PR (rtreffer@23168b5)? I left a comment in your PR.


rtreffer commented Sep 4, 2024

I've been pretty busy over the last few days; I expect to have some time for this over the weekend.
