[Core] Retryable grpc client #47981
base: master
Conversation
Signed-off-by: Jiajun Yao <[email protected]>
const int64_t timeout_ms) {
  auto executor = new Executor(
      [callback](const ray::Status &status) { callback(status, Reply()); });
  std::weak_ptr<RetryableGrpcClient> weak_self = shared_from_this();
This weak_ptr handles the case where RetryableGrpcClient has already been destructed by the time operation_callback is called. However, we don't give the same treatment to &grpc_client, which faces a similar issue. We need to think through the lifetimes of a single call, a retryable call, &grpc_client, and RetryableGrpcClient.
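A minimal, self-contained sketch of the weak_ptr guard pattern being discussed (names here are illustrative, not the actual PR code): the callback captures a weak_ptr so it degrades to a no-op once the owning object is gone, which is exactly the protection a raw &grpc_client capture would lack.

```cpp
#include <functional>
#include <iostream>
#include <memory>

class Client : public std::enable_shared_from_this<Client> {
 public:
  std::function<void()> MakeCallback() {
    // Capture a weak_ptr, not `this`, so the callback can detect destruction.
    std::weak_ptr<Client> weak_self = shared_from_this();
    return [weak_self]() {
      auto self = weak_self.lock();
      if (!self) {
        // Client already destructed; skip the retry logic entirely.
        return;
      }
      std::cout << "client still alive, safe to retry\n";
    };
  }
};

int main() {
  std::function<void()> cb;
  {
    auto client = std::make_shared<Client>();
    cb = client->MakeCallback();
  }      // client destroyed here
  cb();  // safe no-op thanks to the weak_ptr guard
}
```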
Remove weak_ptr and rely on shutdown_.
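A hypothetical sketch of that alternative, assuming the client object outlives every in-flight callback and shutdown_ is an atomic flag flipped before teardown (names are illustrative, not the actual PR code):

```cpp
#include <atomic>
#include <functional>
#include <iostream>

class Client {
 public:
  void Shutdown() { shutdown_ = true; }

  std::function<void()> MakeCallback() {
    // No weak_ptr: callbacks check the shutdown flag instead.
    return [this]() {
      if (shutdown_) {
        // Shutting down: fail fast instead of retrying.
        std::cout << "shutdown, not retrying\n";
        return;
      }
      std::cout << "retrying\n";
    };
  }

 private:
  std::atomic<bool> shutdown_{false};
};

int main() {
  Client client;
  auto cb = client.MakeCallback();
  cb();  // prints "retrying"
  client.Shutdown();
  cb();  // prints "shutdown, not retrying"
}
```

This only works if nothing can destroy the client while a callback is still pending, which is the lifetime question raised above.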
Signed-off-by: Jiajun Yao <[email protected]>
executor->Abort(ray::Status::TimedOut(absl::StrFormat(
    "Timed out while waiting for %s to become available.", server_name_)));
pending_requests_bytes_ -= request_bytes;
delete executor;
How can we make sure Executor::Execute's callback is never called by grpc after the executor is deleted here?
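To illustrate the concern with a hypothetical, simplified example (not the PR's actual types): if grpc still holds a handler that points at the executor when the timeout path runs `delete executor`, invoking that handler later is a use-after-free.

```cpp
#include <functional>
#include <iostream>

struct Executor {
  std::function<void()> callback = []() { std::cout << "reply handled\n"; };
};

int main() {
  auto *executor = new Executor();
  // Simulate a completion handler grpc might still hold onto.
  std::function<void()> grpc_completion = [executor]() { executor->callback(); };
  delete executor;      // timeout path frees the executor
  // grpc_completion();  // would be undefined behavior (use-after-free)
  std::cout << "done\n";
}
```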
<< "limit. Blocking the current thread until network is recovered"; | ||
while (self->server_is_unavailable_ && !self->shutdown_) { | ||
self->CheckChannelStatus(false); | ||
std::this_thread::sleep_for(std::chrono::milliseconds( |
I'm wondering: is it harmful to sleep this thread, since it may be on an asio event loop? Can we make it a post instead?
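One possible shape of that suggestion, sketched with a plain boost::asio timer that re-schedules the check instead of blocking the calling thread (illustrative only; the actual PR has its own members and loop):

```cpp
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>
#include <memory>

// Re-arm a timer to check the channel again later, keeping the io_context
// free to run other handlers in the meantime.
void CheckChannelStatusLater(boost::asio::io_context &io, int attempt) {
  auto timer = std::make_shared<boost::asio::steady_timer>(
      io, std::chrono::milliseconds(100));
  timer->async_wait([&io, timer, attempt](const boost::system::error_code &ec) {
    if (ec) return;
    std::cout << "checking channel status, attempt " << attempt << "\n";
    if (attempt < 3) {
      // Server still unavailable (simulated): schedule another check.
      CheckChannelStatusLater(io, attempt + 1);
    }
  });
}

int main() {
  boost::asio::io_context io;
  CheckChannelStatusLater(io, 1);
  io.run();
}
```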
/*server_unavailable_timeout_seconds=*/
std::numeric_limits<uint64_t>::max(),
/*server_unavailable_timeout_callback=*/
[]() { RAY_LOG(FATAL) << "Server unavailable should never timeout"; },
Are there other possible handlers for server unavailable? If not, we can build the FATAL behavior into RetryableGrpcClient.
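If FATAL really is the only sensible handler, the option could default to it inside RetryableGrpcClient so call sites don't each pass the same callback and max() timeout. A hypothetical sketch of such a default (names are made up for illustration, and std::abort stands in for RAY_LOG(FATAL)):

```cpp
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <limits>

struct RetryableClientOptions {
  // Default: never time out, and crash if we somehow do.
  uint64_t server_unavailable_timeout_seconds =
      std::numeric_limits<uint64_t>::max();
  std::function<void()> server_unavailable_timeout_callback = []() {
    std::cerr << "Server unavailable should never timeout\n";
    std::abort();
  };
};

int main() {
  RetryableClientOptions options;  // callers override only when needed
  std::cout << options.server_unavailable_timeout_seconds << "\n";
}
```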
Signed-off-by: Jiajun Yao <[email protected]>
Why are these changes needed?
Currently gcs_rpc_client has retries for gcs rpc calls. This PR moves the retry functionality into RetryableGrpcClient so that it can be used by non-gcs rpc clients (e.g., the core worker client).
Also enable retry for the ReportGeneratorItemReturns rpc since it's idempotent (see the sketch below).
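As a simplified illustration (not the PR's actual RetryableGrpcClient API; all names below are made up) of why idempotency is the precondition for retrying: re-sending the same request after an ambiguous failure must leave the server in the same end state, so a blind retry loop is safe.

```cpp
#include <iostream>
#include <string>

// Simulate one attempt: the first call fails transiently, later calls succeed.
bool SendOnce(const std::string &request, bool &transient_failure) {
  if (transient_failure) {
    transient_failure = false;
    return false;
  }
  std::cout << "server applied: " << request << "\n";
  return true;
}

// Retry an idempotent request until it succeeds or attempts run out.
void CallWithRetry(const std::string &request, int max_attempts) {
  bool transient_failure = true;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (SendOnce(request, transient_failure)) {
      std::cout << "succeeded on attempt " << attempt << "\n";
      return;
    }
    std::cout << "attempt " << attempt << " failed, retrying\n";
  }
}

int main() { CallWithRetry("ReportGeneratorItemReturns", 3); }
```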
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I have added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.