Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-8511: Overflow appeared when the batch reached rows limit #2943

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

rymarm
Copy link
Member

@rymarm rymarm commented Sep 18, 2024

DRILL-8511: Overflow appeared when the batch reached rows limit

Description

The size-aware scan framework fails to end the batch.

Framework tries to reallocate the vector on batch end, due to a hidden, minor bug in BitColumnWriter - which in general is not notable, but in a specific case, when the initial vector allocation size limit is exceeded and a reader reaches the batch row size limit.

BitColumnWriter uses instead of a write index a value count and this causes unexpected vector reallocation (look at the changes).

Documentation

No changes required.

Testing

Manual tests

@rymarm rymarm marked this pull request as draft September 19, 2024 13:29
@paul-rogers
Copy link
Contributor

@rymarm, thanks for finding this issue. I believe your fix is correct. The prepareWrite() call exists to fill any remaining bits with 0s (i.e. "fill empties."). The number to be filled is, indeed, valueCount - 1. So, I think this is a good fix.

I would feel more confident if we had a unit test. That this bug exists suggests that, despite the rather excessive number of tests I wrote way back when, I somehow omitted this case. Perhaps a reason is that the bit vector is special: it uses code different than the other vectors, and so needs its own tests. Tests for the fixed-width vectors are in TestFixedWidthWriter and TestFillEmpties. Perhaps you can whip up a TestBitVector that does a few tests, including for the special case you uncovered. In particular, it would be good to make sure that the following cases work:

  • Write some values to the bit vector to simulate the first few records setting the bits.
  • Claim that the batch has ended and the batch wrote a bunch of other records without writing to the bit vector.
  • Call setValueCount() with a number that is a multiple of 8, a multiple of 8 plus one, and a multiple of 8 minus one. (three cases)
  • Do this also so the value count/8 equals the initial allocation size, is below that size, or is above that size (three cases)

This gives us 9 cases, some that will force a resize, some that should just fit in the initial allocation.

Then, read the bits out of the vector, verifying that the "empty" bits are the defaultValue (the "default default" is 0).

Why write tests? It is less embarrassing for a test to point out a problem than to find it in a production system. I apologize that I somehow missed doing the tests in the first place.

The second concern is whether this same error was made for other writers. The answer seems to be "no": the other writers use a different structure for setValueCount(). As the comment in BitColumnWriter says, it is the only vector that defers to the mechanism in the vector's own mutator. The result is that the bit vector code differs from the other vectors which handle the entire write process themselves.

@rymarm
Copy link
Member Author

rymarm commented Sep 23, 2024

@paul-rogers thank you so much for the review! I found and fixed one mistake which the failed tests highlighted already and will write additional tests for the cases you described.

I agree that test-driven development is much safer and better than a simple fix without test coverage of the affected place. It is just hard to me to write tests for Drill, they looks complicated and not isolated to me. But I'll write them.

@paul-rogers
Copy link
Contributor

@rymarm, thanks for looking at the tests. While writing tests for Drill can be hard, we developed a bunch of tools for particular cases. In the case of the vector writers, those tests I referenced above should show you what to do.

What I did, back when I wrote this stuff, is to run the tests in my IDE. I've not had much luck in recent years running the whole unit test suite locally. But, I can still generally run individual unit tests. A handy way to proceed is to first create a new file that has the test frameworks included. Then, write one simple test that just writes to the vector and reads back the data. At that point, it should be clear how to create other cases.

It would be great to temporarily revert your change and create a case that fails, then restore the change and ensure things work.

Please let me know if creating the tests takes more than a small amount of time. I can perhaps help out by creating a basic bit vector use case (if I can remember how all that code works!)

@cgivre
Copy link
Contributor

cgivre commented Sep 23, 2024

Just an FYI, but the Github actions are not working well at the moment. The two that use Hadoop 2 seem to have some connectivity issue with Hive. @jnturton is working on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants