Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Java: the misses in rocksdbjni are very expensive #13023

Open
trazfr opened this issue Sep 19, 2024 · 8 comments · May be fixed by #13049
Open

Java: the misses in rocksdbjni are very expensive #13023

trazfr opened this issue Sep 19, 2024 · 8 comments · May be fixed by #13049

Comments

@trazfr
Copy link

trazfr commented Sep 19, 2024

Expected behavior

This is only for rocksdbjni.

In rocksdbjni, I would expect the misses to be less expensive than the hits, or at least as expensive.
I have tested v9.5.2, but the last v9.6.1 has no change in java/rocksjni/rocksjni.cc so the issue remains.

Actual behavior

The misses are far more expensive than the hits due to a C++ exception which is thrown then caught a few lines after.

For instance in java/rocksjni/rocksjni.cc in Java_org_rocksdb_RocksDB_get__J_3BII():

    // throws a rocksdb::KVException...
    ROCKSDB_NAMESPACE::KVException::ThrowOnError(
        env,
        db->Get(ROCKSDB_NAMESPACE::ReadOptions(), db->DefaultColumnFamily(),
                key.slice(), &value.pinnable_slice()));
    return value.NewByteArray();

  } catch (ROCKSDB_NAMESPACE::KVException&) {
    // ... which is caught a few lines after
    return nullptr;
  }

Just changing the above code to the following speeds up a lot the misses:

    const auto status =
        db->Get(ROCKSDB_NAMESPACE::ReadOptions(), db->DefaultColumnFamily(),
                key.slice(), &value.pinnable_slice()));
    if (status.ok()) {
      return value.NewByteArray();
    } else if (!status.IsNotFound()) {
      ROCKSDB_NAMESPACE::RocksDBExceptionJni::ThrowNew(env, status);
    }
  } catch (ROCKSDB_NAMESPACE::KVException&) {
  }
  return nullptr;

Steps to reproduce the behavior

I have created a small reproducer in trazfr/rocksdbjni-miss-perfs which runs the two following rocksdbjni jar and send the results to the Datadog profiler:

  • the vanilla rocksdb 9.5.2
  • a patched version with only the change above
# needs docker-compose
./run.sh --api-key <DATADOG_API_KEY> --site <DATADOG_SITE>

In this reproducer, we create a DB with 1M entries, then we perform:

  • 1M hits with all the keys in the DB (method App.readDatabaseHits())
  • 1M misses (method App.readDatabaseMisses())
  • loop

I don't compare vanilla vs patched as the compilation parameters are different, but only the CPU time spent on hits vs misses in each case. The following tests are done on an AWS m5d.4xlarge machine (linux/x84_64), but I have also been able to reproduce an a M1 Mac (MacOS/aarch64 and qemu+linux/aarch64).

On the vanilla version, in case of miss, we can see that a lot of CPU time is spent in rocksdb::KVException::ThrowOnError(). This makes the misses slower than the hits by far:

image

With the small patch above, as we don't throw exceptions, the misses are a bit less expensive than the hits:

image

@rhubner
Copy link
Contributor

rhubner commented Sep 20, 2024

Hello @trazfr,

thank you for amazing report. I agree that using exceptions for misses is probably not the best. I will check with my colleagues and see how we can optimize this. Probably leave exception on for error cases.

Radek

cc: @adamretter

@adamretter
Copy link
Collaborator

adamretter commented Sep 21, 2024

Thanks @rhubner, it looks like this try/catch approach in the C++ code was originally pioneered by @alanpaxton (see 5a063ec). @alanpaxton Would you be able to take a look at this please? I think a good first step would be to construct some JMH Benchmarks outside of RocksDB in our JNI Benchmarks project to look at the different approaches we could use here and their performance, before we decide how to move forward here. What do you think?

@adamretter
Copy link
Collaborator

@trazfr from your report and attached images I can't seem to ascertain the numbers for the performance difference. Would you be able to tell me a bit more in detail about the performance difference please?

@trazfr
Copy link
Author

trazfr commented Sep 23, 2024

👋 Yes, no problem.

Basically the first thing we can see is the difference of time spent between the 1M cache misses and 1M cache hits.
To see it, you don't need a Datadog account, just run with the dummy API key and a dummy datadog site, the datadog agent will just fail to upload the profiles:

$ git clone https://github.com/trazfr/rocksdbjni-miss-perfs.git
$ cd rocksdbjni-miss-perfs
# The datadog agent will fail to POST to localhost, but this does not change the run.
$ ./run.sh --api-key xxx --site localhost

Running on AWS m5d.4xlarge, in the logs, there is a huge difference between performing 1M hits which take about 3 seconds and 1M misses which takes 5 seconds...:

$ docker logs rocksdbjni-miss-perfs-test-rocksdb-vanilla-1 --tail 11
Read the database with 1000000 keys that are in the database
Time: 3312.024621ms
Query the database with 1000000 keys that are NOT in the database
Time: 5206.450207ms
Read the database with 1000000 keys that are in the database
Time: 3301.095231ms
Query the database with 1000000 keys that are NOT in the database
Time: 5174.598296ms
Read the database with 1000000 keys that are in the database
Time: 3309.393501ms

This is not architecture-specific, as running on Mac M1 with linux/aarch64 under qemu:

$ docker logs rocksdbjni-miss-perfs-test-rocksdb-vanilla-1 --tail 11
Read the database with 1000000 keys that are in the database
Time: 2531.553673ms
Query the database with 1000000 keys that are NOT in the database
Time: 4084.447483ms
Read the database with 1000000 keys that are in the database
Time: 2523.187492ms
Query the database with 1000000 keys that are NOT in the database
Time: 4069.455081ms
Read the database with 1000000 keys that are in the database
Time: 2513.968017ms

Note: the keys to perform hit and miss are interleaved to try to compare them:

  • the hits are with keys in [00000000, 00000002, ..., 01999998]
  • the misses are with keys in [00000001, 00000003, ..., 01999999]

@trazfr
Copy link
Author

trazfr commented Sep 23, 2024

Please find attached some new screenshots of profiles running on m5d.4xlarge. As both the misses and hits run one just after the other, they are both displayed on the same graphs.

  • hits: RocksDB.get(byte[]) is 38% of the profile RocksDB.get(byte[]) - hits
  • misses: RocksDB.get(byte[]) is 60% of the profile RocksDB.get(byte[]) - misses

Overall: the misses are >50% more expensive than miss in terms of CPU.

  • hits: rocksdb::DBImpl::Get() is ~33% of the profile
    rocksdb::DBImpl::Get() - hits
  • misses: rocksdb::DBImpl::Get() is ~37% of the profile
    rocksdb::DBImpl::Get() - misses

Overall: the misses in rocksdb::DBImpl::Get() are ~10% more expensive than miss in terms of CPU, but for now, I am concentrating on the next part.

In the end rocksdb::KVException::ThrowOnError() is ~21% of the profile rocksdb::KVException::ThrowOnError() - misses

Throwing the KVException exception accounts for a bit more than 1/3 of the overall CPU (~13s out of ~36s) of the RocksDB.get() method in case of miss.

@trazfr
Copy link
Author

trazfr commented Sep 25, 2024

Please find attached one of the jfr which were used to generate the screenshots above.
main.jfr.zip

For instance under JDK Mission Control: Event Browser -> Datadog -> Profiling -> Method CPU Profiling Sample, we get the same kind of flame graph on 1 minute:

image

alanpaxton added a commit to evolvedbinary/rocksdb that referenced this issue Oct 2, 2024
Performance of NotFound case has been bad - see facebook#13023
Some jmh tests to measure this, prior to fixing it.
@trazfr
Copy link
Author

trazfr commented Oct 3, 2024

👋 Thanks a lot @alanpaxton, I have just tested the PR #13049

I have updated my small repo to test it: https://github.com/trazfr/rocksdbjni-miss-perfs

This is fixing this issue. The performances of misses are about the same as hits (and even a bit faster).

Datadog profile under an AWS m5d.4xlarge at 3519c07:

Hits:
image

Misses:
image

Example of JFR: main.jfr.zip

@alanpaxton
Copy link
Contributor

Hi @trazfr great news, thanks - that more than confirms the results I was seeing from some JMH tests I added in the PR - hopefully once a colleague has reviewed it, the RocksDB core team can help us to get this merged for you soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants