-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-17413: ulog replay should copy SolrQueryRequest #2765
base: main
Are you sure you want to change the base?
SOLR-17413: ulog replay should copy SolrQueryRequest #2765
Conversation
SolrQueryRequest is a non-threadsafe type, but was being shared across executor threads during UpdateLog replay. This introduces a number of issues, not the least being a ConcurrentModificationException if multiple threads happen to tweak the request 'context' simultaneously. This commit fixes this issue by giving each executor thread a unique LocalSolrQueryRequest instance to use.
Is the not-anymore-used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your change looks good;
But I noticed a problem. This ThreadLocal has a value that registers itself with procPool, a List local to the doReplay method. But the executor is shared on the server; the 2nd doReplay is called (from any UpdateLog on the server), any existing threads in the Executor will have ThreadLocals referencing procPools from a previous doReplay invocation! That previous invocation will have already finished their lifecycle (called finish()), so weirdly URPs would end up being used beyond finish() and wouldn't get another finish() called on them. Might not be super serious in practice. This could be its own JIRA issue & PR, or we generalize this issue to cover both concerns that affect the same lines of code.
The fix for the above is not so clear. I suppose both sharing an Executor and use of ThreadLocals that reference local state is a dangerous combination. Maybe we do away with the ThreadLocal and instead simply use the procPool to check out (creating if necessary) and return back instances.
Yes, actually! Meant to call this out in a comment here or in the code - the "original" LocalSolrQueryRequest is technically still accessible on the UpdateCommand seen by individual processors. So there's still the possibility this object could be messed with by multiple threads. That said - I don't think this is a risk in practice. I spent a good bit of time tracing through all of the Both from this and from the problem David described above, it's clear that the UpdateLog replay threading could use a larger rethink, but I'm reluctant to dive into that here. |
I'd like to contribute to this PR to replace the ThreadLocal with a re-using pool that we mostly already have. |
Sure - feel free to push to the branch if you've got an idea. Not totally sure what you mean by a "re-using pool", but I'm sure that'll be clearer on seeing the code. (Unless it's a large enough change that it's worth going through in detail before attempting the code? Up to you...) |
Oops meant to tag you - @dsmiley |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong about a bug regarding the ThreadLocal. This ThreadLocal is local to a method; it's not static (global). So there won't be unanticipated sharing of URP instances from one doReplay call to the next since different ThreadLocal instances are used.
final var params = | ||
new MapSolrParams( | ||
Map.of( | ||
DISTRIB_UPDATE_PARAM, | ||
FROMLEADER.toString(), | ||
DistributedUpdateProcessor.LOG_REPLAY, | ||
"true")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is duplicative with other code in this class; it should be factored out. You can just refer to the very same params of req
, even use the same immutable variant you just coded here for that.
I started looking at why the duplication is there, and I see we create a single req & rsp for the replay as a whole that will be separate from the LocalSolrRequest in the ThreadLocal. The UpdateCommand is created from the parent request but the URP is created from the local request. That feels wrong even if it works. Admittedly it's probably too much scope to refactor (I started locally and it's not trivial).
Description
SolrQueryRequest is a non-threadsafe type, but was being shared across executor threads during UpdateLog replay. This introduces a number of issues, not the least being a ConcurrentModificationException if multiple threads happen to tweak the request 'context' simultaneously.
Solution
This commit fixes this issue by giving each executor thread a unique LocalSolrQueryRequest instance to use.
Tests
The problem is reproducible (on my hardware at least) via 'beasting'
BasicDistributedZkTest
for several hundred iterations but is otherwise hard to trigger. The test passes reliably with this fix, which can give us some confidence I think. Though if anyone can think of a better way to capture this in an automated test, I'd love to get some additional validation!Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.