
zstdMaxBufferedEncoders value is too low which caused frequent zstd encoder object creation #2965

Open
HenryCaiHaiying opened this issue Aug 19, 2024 · 11 comments



HenryCaiHaiying commented Aug 19, 2024

Description

The zstdMaxBufferedEncoders value is too low, which causes many temporary zstd encoder objects to be created. Each encoder object is quite large (200KB), and this adds to the woes of frequent GCs.

Versions

| Sarama | Kafka | Go |
|--------|-------|--------|
| 1.4.0  | 3.8.0 | 1.21.5 |

Additional Context

This is related to #2964. After I did more CPU and memory profiling, I realized the frequent-GC problem is caused by excessive creation of zstd encoder objects.

Below is a snippet of the program's top memory allocations. In the pprof listing, the top 2 allocation sites are from creating zstd encoder objects (300KB each). The next 3 allocation sites are from encoding messages (2-4KB per message); the top 2 allocation sites dominate memory usage. With this high volume of allocation, goroutines were frequently put into "GC assist wait" mode once their memory allocation credit went negative.

The root cause of the frequent zstd encoder object creation is that zstdMaxBufferedEncoders is hard-coded to 1 in https://github.com/IBM/sarama/blob/main/zstd.go#L11. In a real production system with hundreds of brokers and producer objects, you will have hundreds of goroutines going through the compression path, and this causes frequent encoder object creation and destruction.
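For context, the pooling in zstd.go behaves roughly like the sketch below (a paraphrase of the mechanism as I understand it, not the exact source): idle encoders sit in a buffered channel whose capacity is zstdMaxBufferedEncoders, so with the capacity hard-coded to 1 any goroutine that finds the channel empty allocates a brand-new encoder, and most released encoders are immediately dropped for the GC.

```go
// Paraphrased sketch of the current pooling behaviour, not the actual zstd.go source.
package zstdpool

import "github.com/klauspost/compress/zstd"

const zstdMaxBufferedEncoders = 1

var zstdEncoderPool = make(chan *zstd.Encoder, zstdMaxBufferedEncoders)

func getZstdEncoder() *zstd.Encoder {
	select {
	case enc := <-zstdEncoderPool:
		return enc // reuse the single idle encoder, if any
	default:
		// Pool is empty: allocate a fresh encoder (a few hundred KB of buffers each).
		enc, _ := zstd.NewWriter(nil, zstd.WithEncoderConcurrency(1))
		return enc
	}
}

func releaseZstdEncoder(enc *zstd.Encoder) {
	select {
	case zstdEncoderPool <- enc: // keep at most zstdMaxBufferedEncoders idle
	default:
		// An idle encoder is already buffered: drop this one and let the GC reclaim it.
	}
}
```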

(pprof) top 30
Showing nodes accounting for 4701.54GB, 96.18% of 4888.34GB total
Dropped 712 nodes (cum <= 24.44GB)
Showing top 30 nodes out of 49
      flat  flat%   sum%        cum   cum%
 3606.36GB 73.77% 73.77%  3606.36GB 73.77%  github.com/klauspost/compress/zstd.encoderOptions.encoder
  840.66GB 17.20% 90.97%   840.66GB 17.20%  github.com/klauspost/compress/zstd.(*blockEnc).init
   67.22GB  1.38% 92.35%    67.22GB  1.38%  github.com/klauspost/compress/zstd.(*fastBase).ensureHist (inline)
   45.37GB  0.93% 93.28%  4656.45GB 95.26%  github.com/IBM/sarama.encode
   33.41GB  0.68% 93.96%  4605.47GB 94.21%  github.com/klauspost/compress/zstd.(*Encoder).EncodeAll

HenryCaiHaiying commented Aug 19, 2024

A sample stack trace of a goroutine being put into GC Assist Wait when its memory allocation credit is negative:

goroutine 10667 [GC assist wait]:
runtime.gopark(0xc045f85100?, 0x12daae8?, 0x0?, 0x0?, 0x74b7fd7f60a8?)
        /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc045f85048 sp=0xc045f85028 pc=0x13028ae
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:404
runtime.gcParkAssist()
        /usr/local/go/src/runtime/mgcmark.go:652 +0xe7 fp=0xc045f85080 sp=0xc045f85048 pc=0x12e47e7
runtime.gcAssistAlloc(0xc008282d00)
        /usr/local/go/src/runtime/mgcmark.go:509 +0x22a fp=0xc045f850e0 sp=0xc045f85080 pc=0x12e424a
runtime.deductAssistCredit(0x12d2593?)
        /usr/local/go/src/runtime/malloc.go:1291 +0x54 fp=0xc045f85108 sp=0xc045f850e0 pc=0x12d27f4
runtime.mallocgc(0x140048, 0x37cb1c0, 0x1)
        /usr/local/go/src/runtime/malloc.go:1006 +0xc9 fp=0xc045f85170 sp=0xc045f85108 pc=0x12d2029
runtime.newobject(0x4000000?)
        /usr/local/go/src/runtime/malloc.go:1328 +0x25 fp=0xc045f85198 sp=0xc045f85170 pc=0x12d28e5
github.com/klauspost/compress/zstd.encoderOptions.encoder({0x1, 0x2, 0x0, 0x0, 0x20000, 0x800000, 0x1, 0x1, 0x0, 0x0, ...})
        /go/pkg/mod/github.com/klauspost/[email protected]/zstd/encoder_options.go:60 +0x11d fp=0xc045f851b8 sp=0xc045f85198 pc=0x1fd467d
github.com/klauspost/compress/zstd.(*Encoder).initialize(0xc020c89540)
        /go/pkg/mod/github.com/klauspost/[email protected]/zstd/encoder.go:88 +0x197 fp=0xc045f852a0 sp=0xc045f851b8 pc=0x1fd1937
github.com/klauspost/compress/zstd.(*Encoder).initialize-fm()
        <autogenerated>:1 +0x25 fp=0xc045f852b8 sp=0xc045f852a0 pc=0x1fe3b45
sync.(*Once).doSlow(0x0?, 0xc000000000?)
        /usr/local/go/src/sync/once.go:74 +0xbf fp=0xc045f85318 sp=0xc045f852b8 pc=0x1344d7f
sync.(*Once).Do(...)
        /usr/local/go/src/sync/once.go:65
github.com/klauspost/compress/zstd.(*Encoder).EncodeAll(0xc020c89540, {0xc03606a380, 0x380, 0x380}, {0x0, 0x0, 0x0})
        /go/pkg/mod/github.com/klauspost/[email protected]/zstd/encoder.go:494 +0x19f fp=0xc045f85498 sp=0xc045f85318 pc=0x1fd3a1f
github.com/IBM/sarama.zstdCompress({0x0?}, {0x0, 0x0, 0x0}, {0xc03606a380, 0x380, 0x380})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/zstd.go:71 +0x86 fp=0xc045f85510 sp=0xc045f85498 pc=0x20495c6
github.com/IBM/sarama.compress(0x20?, 0xc00bf10210?, {0xc03606a380, 0x380, 0x380})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/compress.go:190 +0xdae fp=0xc045f85a38 sp=0xc045f85510 pc=0x200bb4e
github.com/IBM/sarama.(*RecordBatch).encodeRecords(0xc046511860, {0x3ea55e8?, 0xc0538882c0?})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/record_batch.go:206 +0x9a fp=0xc045f85a80 sp=0xc045f85a38 pc=0x203d7da
github.com/IBM/sarama.(*RecordBatch).encode(0xc046511860, {0x3ea55e8, 0xc0538882c0})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/record_batch.go:88 +0x231 fp=0xc045f85ae0 sp=0xc045f85a80 pc=0x203cd31
github.com/IBM/sarama.(*Records).encode(0xc0538882c0?, {0x3ea55e8?, 0xc0538882c0?})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/records.go:62 +0x98 fp=0xc045f85b28 sp=0xc045f85ae0 pc=0x203dab8
github.com/IBM/sarama.(*ProduceRequest).encode(0xc053888280, {0x3ea55e8, 0xc0538882c0})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/produce_request.go:108 +0x5e7 fp=0xc045f85cd8 sp=0xc045f85b28 pc=0x2035367
github.com/IBM/sarama.(*request).encode(0xc03bb15650, {0x3ea55e8, 0xc0538882c0})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/request.go:43 +0x13c fp=0xc045f85d00 sp=0xc045f85cd8 pc=0x203e4dc
github.com/IBM/sarama.encode({0x3e67160, 0xc03bb15650}, {0x3e91630?, 0xc00baeb440})
        /go/pkg/mod/github.com/!i!b!m/[email protected]/encoder_decoder.go:29 +0x91 fp=0xc045f85d70 sp=0xc045f85d00 pc=0x201af31
github.com/IBM/sarama.(*Broker).sendInternal(0xc008281c00, {0x3e8dc30, 0xc053888280}, 0xc0091a2780)
        /go/pkg/mod/github.com/!i!b!m/[email protected]/broker.go:1001 +0x16b fp=0xc045f85e50 sp=0xc045f85d70 pc=0x1ffc94b
github.com/IBM/sarama.(*Broker).sendWithPromise(0xc008281c00, {0x3e8dc30, 0xc053888280}, 0x0?)
        /go/pkg/mod/github.com/!i!b!m/[email protected]/broker.go:991 +0x9e fp=0xc045f85e88 sp=0xc045f85e50 pc=0x1ffc75e
github.com/IBM/sarama.(*Broker).AsyncProduce(0xc008281c00, 0xc053888280, 0xc0538882a0?)
        /go/pkg/mod/github.com/!i!b!m/[email protected]/broker.go:470 +0x1cc fp=0xc045f85f10 sp=0xc045f85e88 pc=0x1ffa88c
/encoder.opt

@HenryCaiHaiying (Author)

The fix is actually quite simple: allow zstdMaxBufferedEncoders to be specified through a Config param.
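A hypothetical usage sketch of that idea (the field name and its placement on the config are illustrative only, not an existing sarama API):

```go
package main

import "github.com/IBM/sarama"

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Compression = sarama.CompressionZSTD
	// Hypothetical knob proposed in this issue (not in released sarama):
	// keep more idle zstd encoders around so concurrent produce goroutines
	// can reuse them instead of allocating new ones.
	cfg.Producer.MaxBufferedCompressionEncoders = 64
	_ = cfg
}
```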


dnwe commented Aug 19, 2024

Yes, this was pinned to 1 as part of #2375, which aimed to control memory usage because the prior default behaviour had been to allocate one encoder per goroutine. The aim was to move to a shared pool but only keep a single idle spare. It does sound like it would be useful to trade off static memory usage against GC pressure by keeping a larger pool of re-usable encoders ready for use.


HenryCaiHaiying commented Aug 19, 2024

@dnwe I think we should allow the user to specify the pool size. By the way, I only measured about 300KB for each zstd encoder object (when using CompressionLevel: SpeedDefault) based on github.com/klauspost/compress v1.17.4 code.

HenryCaiHaiying added a commit to HenryCaiHaiying/sarama that referenced this issue Aug 21, 2024
…pressionEncoders a configurable parameter

This PR is addressing issue: IBM#2965

For Zstd compression, the pool size of encoders (controlled by the MaxBufferedEncoders param) is currently hard-coded to 1. This causes an excessive number of large encoder objects to be created under high concurrency, which causes GC performance issues.

Introduce a new config parameter: MaxBufferedCompressionEncoders (default value 1) and modify the calling paths to pass this parameter through message/record_batch and down to compressor/encoder.

Signed-off-by: Henry Cai <[email protected]>
@rtreffer (Contributor)

I got nerd sniped by this a bit...

I have an experimental zstd encoder pool (read: don't use it without full understanding of the code).

An initial (probably flawed) pseudo-benchmark showed roughly a 20% CPU improvement under high goroutine counts, in addition to better memory efficiency, even with encoder reuse (removing the limit on max buffered encoders means 100% reuse, to be verified). Most systems should see much less of an effect, but that's normal for benchmarks that test a subset. I want to turn this into a proper benchmark though.

I don't want to suggest this as an alternative right now (it is highly experimental and has quite some complexity), but I would really like to know the settings of the original application, as there is a good chance that the absolute maximum number of encoders has a higher impact than the absolute minimum, especially with a high goroutine/GOMAXPROCS ratio, which seems to be symptomatic of the original issue.

There might be an easier way to just cap the maximum number of outstanding (handed-out) encoders; I haven't investigated this yet though.
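A rough sketch of what such a cap could look like (my own illustration under assumed names, not the experimental pool code): pair the idle-encoder channel with a semaphore sized to GOMAXPROCS, so at most GOMAXPROCS encoders are ever handed out at once.

```go
package zstdpool

import (
	"runtime"

	"github.com/klauspost/compress/zstd"
)

var (
	// Tokens bounding how many encoders can be checked out simultaneously.
	encoderSlots = make(chan struct{}, runtime.GOMAXPROCS(0))
	// Idle encoders waiting to be reused.
	idleEncoders = make(chan *zstd.Encoder, runtime.GOMAXPROCS(0))
)

func acquireEncoder() *zstd.Encoder {
	encoderSlots <- struct{}{} // blocks once GOMAXPROCS encoders are in use
	select {
	case enc := <-idleEncoders:
		return enc // reuse an idle encoder
	default:
		enc, _ := zstd.NewWriter(nil, zstd.WithEncoderConcurrency(1))
		return enc
	}
}

func releaseEncoder(enc *zstd.Encoder) {
	select {
	case idleEncoders <- enc: // always fits: both channels share the same capacity
	default:
	}
	<-encoderSlots // free a slot for the next caller
}
```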

@rtreffer (Contributor)

Ok, I managed to get a benchmark going to test different variants of pool configurations that is hopefully somewhat informative:

  1. GOMAXPROCS (1 & 4)
  2. max idle encoders (1, 1000, no limit)
  3. max number of outstanding (live) encoders (goroutines, GOMAXPROCS)

Each test launches 1000 goroutines competing for encoders. Each benchmark is executed with and without alloc measurement to avoid accounting issues. The first iteration is regarded as a "pool warmup" and always discarded.
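Roughly, such a benchmark can be structured like the sketch below (a simplified illustration, not the actual benchmark code; zstdCompress and ZstdEncoderParams are sarama-internal names taken from the stack trace above and stand in for "acquire an encoder from the pool under test, compress, release"):

```go
package sarama

import (
	"runtime"
	"testing"
)

func BenchmarkZstdCpuAndMemoryHighConcurrency(b *testing.B) {
	payload := make([]byte, 2048) // small record payload, as in the discussion
	b.SetBytes(int64(len(payload)))
	// Aim for roughly 1000 goroutines competing for encoders.
	b.SetParallelism(1000 / runtime.GOMAXPROCS(0))
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			// Acquire an encoder from the pool under test, compress, release.
			if _, err := zstdCompress(ZstdEncoderParams{}, nil, payload); err != nil {
				b.Error(err)
				return
			}
		}
	})
}
```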

Results:

$ go test -run=^$ -count 1 -timeout 15m -benchtime=10s -bench ^BenchmarkZstdCpuAndMemoryHighConcurrency github.com/IBM/sarama
goos: linux
goarch: amd64
pkg: github.com/IBM/sarama
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics     
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCEncoderLimit-16                                 1506000              7970 ns/op         256.96 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCEncoderLimitAlloc-16                            1500000              7966 ns/op         257.09 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op         2197 B/op          4 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdle1GOMAXPROCEncoderLimit-16                            1469000              8317 ns/op         246.25 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdle1GOMAXPROCEncoderLimitAlloc-16                       1492000              8087 ns/op         253.25 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op         2197 B/op          4 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdle1-16                                                 1495000              8035 ns/op         254.90 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdle1Alloc-16                                            1488000              7999 ns/op         256.04 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op         2197 B/op          4 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdleMaxGOMAXPROCEncoderLimit-16                          1510000              7955 ns/op         257.46 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdleMaxGOMAXPROCEncoderLimitAlloc-16                     1504000              8002 ns/op         255.93 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op         2197 B/op          4 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdleMax-16                                               1482000              8085 ns/op         253.30 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyIdleMaxAlloc-16                                          1503000              7960 ns/op         257.29 MB/s             0 new/op                1.000 open_encs                 1.001 reuses/op                 0 waits/op         2197 B/op          4 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4GOMAXPROCEncoderLimit-16                      4646000              2646 ns/op         774.06 MB/s             0 new/op                4.000 open_encs                 1.000 reuses/op                 0.9779 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4GOMAXPROCEncoderLimitAlloc-16                 4495000              2683 ns/op         763.38 MB/s             0 new/op                4.000 open_encs                 1.000 reuses/op                 0.9770 waits/op            2149 B/op          3 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4Idle1GOMAXPROCEncoderLimit-16                 3700000              3256 ns/op         629.03 MB/s             0.004118 new/op                 4.000 open_encs                 0.9962 reuses/op                0.9729 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4Idle1GOMAXPROCEncoderLimitAlloc-16            3723000              3256 ns/op         629.06 MB/s             0.004092 new/op                 4.000 open_encs                 0.9962 reuses/op                0.9721 waits/op            8891 B/op          3 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4Idle1-16                                       525000             27478 ns/op          74.53 MB/s             0.3383 new/op         122.0 open_encs           0.6636 reuses/op                0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4Idle1Alloc-16                                  481000             26266 ns/op          77.97 MB/s             0.3322 new/op         119.7 open_encs           0.6698 reuses/op                0 waits/op       548680 B/op         15 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4IdleMaxGOMAXPROCEncoderLimit-16               4329000              2704 ns/op         757.39 MB/s             0 new/op                4.000 open_encs                 1.000 reuses/op                 0.9786 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4IdleMaxGOMAXPROCEncoderLimitAlloc-16          4478000              2697 ns/op         759.50 MB/s             0 new/op                4.000 open_encs                 1.000 reuses/op                 0.9780 waits/op            2149 B/op          3 allocs/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4IdleMax-16                                    2541000              4454 ns/op         459.78 MB/s             0.0002601 new/op               12.31 open_encs          1.000 reuses/op                 0 waits/op
BenchmarkZstdCpuAndMemoryHighConcurrencyGOMAXPROCS4IdleMaxAlloc-16                               2723000              4796 ns/op         427.06 MB/s             0.0002629 new/op               11.83 open_encs          1.000 reuses/op                 0 waits/op         2651 B/op          3 allocs/op
PASS
ok      github.com/IBM/sarama   367.935s

The GOMAXPROCS=1 results turn out to be the same for all variants. The GOMAXPROCS=4 cases turn out to be interesting.

| Max Idle Encoders | Max Outstanding Encoders | Throughput | alloc bytes per op |
|-------------------|--------------------------|------------|--------------------|
| -                 | GOMAXPROCS               | 774.06 MB/s | 2149 |
| 1000              | GOMAXPROCS               | 757.39 MB/s | 2149 |
| 1                 | GOMAXPROCS               | 629.03 MB/s | 8891 |
| 1000              | 1000                     | 459.78 MB/s | 2651 |
| 1                 | 1000                     | 74.53 MB/s  | 548680 |

At high goroutine-to-GOMAXPROCS ratios, it seems like limiting the in-use encoders to GOMAXPROCS beats raising the idle encoders. On top of that, it should conserve memory and avoid issues for high-compression use cases.

There is a high memory tax for keeping many idle or in-use encoders. So I am a bit skeptical about just raising the idle encoder max, as I think limiting the maximum number of encoders in use is highly beneficial, and potentially more beneficial for more use cases. The benchmark indicates that it can beat this synthetic case by another 68%. I would really like to see the goroutine-to-GOMAXPROCS ratio and how this plays out. There is a chance that either of the two directions is better for the specific setup.

That said: the benchmark shows the potential for a 10x improvement for a rather common (anti)pattern: get a batch, split it into entries, launch a goroutine per entry. This is pretty huge. And even the min-encoders change alone is a 5x improvement.

If only I could find a low-complexity solution that offers these benefits...

@HenryCaiHaiying (Author)

@rtreffer Thanks for so quickly creating the encoder pool code and the benchmark test. I think the reason maxIdleEncoders=4's throughput is better than maxIdleEncoders=100's is that maxIdleEncoders=4 spends much less time creating encoder objects (the max is capped at 4). Instead, maxIdleEncoders=4 spends more time waiting for an encoder to become available from the channel: https://github.com/rtreffer/sarama/blob/zstd-pool/zstd_pool.go#L113. In your benchmark test, the goroutine only does zstd compression on 2KB of data, which is very fast, so it quickly turns around and returns the encoder to the pool/channel. I don't remember all the details of the sarama code, but that goroutine might have other work to do (e.g. sending data through the TCP connection to downstream). So depending on how the goroutine's time is split between its real computation work and pool-related activities, the pool reuse might not be that good.

Also, this new pool code (especially the line waiting for an encoder from the channel: https://github.com/rtreffer/sarama/blob/zstd-pool/zstd_pool.go#L113) changes the object-creation dynamics compared to the old static MaxBufferedEncoders=1 code, where there was no upper limit on how many encoders could be created. In the old code, I can see hundreds of those encoder objects being created all the time.

@rtreffer (Contributor)

I think I finally found an easy way to express this all in code: rtreffer@23168b5

$ git diff --stat HEAD~1
 zstd.go | 53 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 40 insertions(+), 13 deletions(-)

I can try to get this into a PR form soon, ideally with a benchmark. I wouldn't want to propose this without cleaning up the benchmark.

@puellanivis (Contributor)

I’m not sure you have to worry about using an atomic.Int32 for zstdCheckedOutEncoders. Your code only ever manipulates or accesses it while holding the global lock. (Though, with the atomic, the whole releaseEncoder function doesn’t need to be grabbing the lock at all.)
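A minimal illustration of that second option (hypothetical names modeled on the comment, not the code in the referenced commit): with an atomic counter, releaseEncoder can skip the mutex entirely.

```go
package zstdpool

import (
	"sync/atomic"

	"github.com/klauspost/compress/zstd"
)

var (
	zstdCheckedOutEncoders atomic.Int32                   // encoders currently handed out
	idleEncoders           = make(chan *zstd.Encoder, 16) // idle pool; capacity is illustrative
)

func releaseEncoder(enc *zstd.Encoder) {
	zstdCheckedOutEncoders.Add(-1) // atomic decrement, so no global lock is needed here
	select {
	case idleEncoders <- enc: // park the encoder for reuse
	default: // pool full: drop it and let the GC reclaim it
	}
}
```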

@HenryCaiHaiying (Author)

@rtreffer Are you still moving forward with your PR (rtreffer@23168b5)? I left a comment in your PR.


rtreffer commented Sep 4, 2024

I've been pretty busy over the last few days; I expect to have some time for this over the weekend.
