Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-17413: ulog replay should copy SolrQueryRequest #2765

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

gerlowskija
Copy link
Contributor

Description

SolrQueryRequest is a non-threadsafe type, but was being shared across executor threads during UpdateLog replay. This introduces a number of issues, not the least being a ConcurrentModificationException if multiple threads happen to tweak the request 'context' simultaneously.

Solution

This commit fixes this issue by giving each executor thread a unique LocalSolrQueryRequest instance to use.

Tests

The problem is reproducible (on my hardware at least) via 'beasting' BasicDistributedZkTest for several hundred iterations but is otherwise hard to trigger. The test passes reliably with this fix, which can give us some confidence I think. Though if anyone can think of a better way to capture this in an automated test, I'd love to get some additional validation!

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

SolrQueryRequest is a non-threadsafe type, but was being shared across
executor threads during UpdateLog replay.  This introduces a number of
issues, not the least being a ConcurrentModificationException if
multiple threads happen to tweak the request 'context' simultaneously.

This commit fixes this issue by giving each executor thread a unique
LocalSolrQueryRequest instance to use.
@HoustonPutman
Copy link
Contributor

Is the not-anymore-used req object still used somewhere else? And we are sure there aren't other settings that live in that request?

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your change looks good;

But I noticed a problem. This ThreadLocal has a value that registers itself with procPool, a List local to the doReplay method. But the executor is shared on the server; the 2nd doReplay is called (from any UpdateLog on the server), any existing threads in the Executor will have ThreadLocals referencing procPools from a previous doReplay invocation! That previous invocation will have already finished their lifecycle (called finish()), so weirdly URPs would end up being used beyond finish() and wouldn't get another finish() called on them. Might not be super serious in practice. This could be its own JIRA issue & PR, or we generalize this issue to cover both concerns that affect the same lines of code.

The fix for the above is not so clear. I suppose both sharing an Executor and use of ThreadLocals that reference local state is a dangerous combination. Maybe we do away with the ThreadLocal and instead simply use the procPool to check out (creating if necessary) and return back instances.

solr/core/src/java/org/apache/solr/update/UpdateLog.java Outdated Show resolved Hide resolved
@gerlowskija
Copy link
Contributor Author

Is the not-anymore-used req object still used somewhere else?

Yes, actually! Meant to call this out in a comment here or in the code - the "original" LocalSolrQueryRequest is technically still accessible on the UpdateCommand seen by individual processors. So there's still the possibility this object could be messed with by multiple threads.

That said - I don't think this is a risk in practice. I spent a good bit of time tracing through all of the UpdateCommand.getReq() invocations, and they're all "read only" calls to get either the SolrCore or the IndexSchema relevant to this request. (Both of which are commonly used across threads.)

Both from this and from the problem David described above, it's clear that the UpdateLog replay threading could use a larger rethink, but I'm reluctant to dive into that here.

@dsmiley
Copy link
Contributor

dsmiley commented Oct 16, 2024

I'd like to contribute to this PR to replace the ThreadLocal with a re-using pool that we mostly already have.

@gerlowskija
Copy link
Contributor Author

Sure - feel free to push to the branch if you've got an idea.

Not totally sure what you mean by a "re-using pool", but I'm sure that'll be clearer on seeing the code. (Unless it's a large enough change that it's worth going through in detail before attempting the code? Up to you...)

@gerlowskija
Copy link
Contributor Author

Oops meant to tag you - @dsmiley

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wrong about a bug regarding the ThreadLocal. This ThreadLocal is local to a method; it's not static (global). So there won't be unanticipated sharing of URP instances from one doReplay call to the next since different ThreadLocal instances are used.

Comment on lines +2101 to +2107
final var params =
new MapSolrParams(
Map.of(
DISTRIB_UPDATE_PARAM,
FROMLEADER.toString(),
DistributedUpdateProcessor.LOG_REPLAY,
"true"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is duplicative with other code in this class; it should be factored out. You can just refer to the very same params of req, even use the same immutable variant you just coded here for that.

I started looking at why the duplication is there, and I see we create a single req & rsp for the replay as a whole that will be separate from the LocalSolrRequest in the ThreadLocal. The UpdateCommand is created from the parent request but the URP is created from the local request. That feels wrong even if it works. Admittedly it's probably too much scope to refactor (I started locally and it's not trivial).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants