[Scanner] Handle leader availability #102
Conversation
@luoyuxia would appreciate suggestions on how to test this per the current test setup. I've attempted local tests:
Neither scenario reaches the code path. The second test triggers a failure in the form of an RPC failure.
Let's focus on case 2. Will the RPC failure panic the program? If it panics the program, could you please create an issue to post the exception? I'll dive into that issue. But we can still push this one forward.
luoyuxia
left a comment
@leekeiabstraction Thanks for the PR. Besides,
I think we should add https://github.com/apache/fluss/blob/main/fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogFetcher.java#L173
to update metadata if the leader is not found. Otherwise, we'll never be able to fetch the bucket if the leader is not found at first. A retry should then get the leader and make it possible to fetch the bucket.
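For illustration, here is a minimal sketch of the suggested behavior, assuming hypothetical `Cluster` and `MetadataUpdater` types rather than the actual fluss-rust definitions: when no leader is known for a bucket, request a metadata update and skip the bucket for now, so that a later retry can fetch it once a leader has been elected.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the client metadata types; not the real fluss-rust API.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct TableBucket(u64);

#[derive(Default)]
struct Cluster {
    // bucket -> leader server id, if currently known
    leaders: HashMap<TableBucket, i32>,
}

#[derive(Default)]
struct MetadataUpdater {
    cluster: Cluster,
    update_requested: bool,
}

impl MetadataUpdater {
    /// Returns the leader for the bucket if known; otherwise marks the
    /// metadata as stale so the next round trip refreshes it.
    fn leader_or_request_update(&mut self, bucket: &TableBucket) -> Option<i32> {
        match self.cluster.leaders.get(bucket).copied() {
            Some(leader) => Some(leader),
            None => {
                // Leader unknown: schedule a metadata update and let the
                // caller skip this bucket and retry on a later fetch.
                self.update_requested = true;
                None
            }
        }
    }
}

fn main() {
    let mut updater = MetadataUpdater::default();
    let bucket = TableBucket(0);

    // No leader known yet: the call requests a metadata update instead of giving up for good.
    assert_eq!(updater.leader_or_request_update(&bucket), None);
    assert!(updater.update_requested);

    // After the (simulated) metadata refresh, the leader is known and fetching can proceed.
    updater.cluster.leaders.insert(bucket.clone(), 1);
    assert_eq!(updater.leader_or_request_update(&bucket), Some(1));
}
```

This mirrors the linked LogFetcher.java logic only in spirit; the real client would drive the update through its RPC and metadata layers.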
With the latest changes from main, I no longer see any RPC failure causing a panic.
@luoyuxia Thanks for pointing me to this. Just one question: I notice that we currently do not track partitions on the cluster as the Java implementation does. Is this intentional or a gap that needs to be addressed?
Yes, it's a gap that needs to be addressed. In #99 we will return an error to say it's not supported yet. I think we can just consider non-partitioned tables for now.
Force-pushed from 8e6d5d5 to a7c063c
I've pushed the changes to handle this specific case. However, I am not certain how to simulate the particular scenario. When using a local cluster:
1: Yes, you're correct. It's not easy to mock the leader not being available; you would have to create lots of replicas to force the leader election to take a while. If it's too hard, I think we can just skip testing this.
2: It looks weird; the exception should only be thrown while creating a table. I think there must be some issue in the fluss-rust client.
Would appreciate a review of the latest changes. I've tested with a changing leader and with all tablet servers down.
Manual tests performed on the latest changes are as follows:
@luoyuxia Would appreciate a review here. Thank you!
@leekeiabstraction Thanks for the great work. It makes the scanner stable. I'll review later today.
Pull request overview
This PR addresses issue #42 by implementing handling for scenarios when bucket leaders are not available, aligning with the Java client's behavior of logging warnings in such cases. The changes enhance error resilience in the scanner and metadata subsystems.
Key Changes:
- Added poison detection for server connections with automatic cleanup on connection errors
- Modified the cluster API to return `Option` instead of panicking when no servers are available
- Implemented metadata invalidation for failed servers and graceful retry logic in the scanner
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| crates/fluss/src/rpc/server_connection.rs | Added poison state detection and connection cleanup on connection errors |
| crates/fluss/src/cluster/cluster.rs | Added invalidate_server method and changed get_one_available_server to return Option |
| crates/fluss/src/client/table/scanner.rs | Added metadata update checks, improved error handling with retry logic, and leader availability handling |
| crates/fluss/src/client/metadata.rs | Added server invalidation, cluster reinitialization support, and fixed variable name typos |
| crates/fluss/src/client/credentials.rs | Updated to handle optional server availability with expect |
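To make the table above more concrete, here is a small sketch, with an assumed struct layout rather than the actual crates/fluss/src/cluster/cluster.rs code, of the two cluster-side changes it names: `get_one_available_server` returning `Option` instead of panicking, and `invalidate_server` dropping a failed server so it is no longer chosen.

```rust
use std::collections::HashMap;

// Illustrative types only; field names and layout are assumptions.
#[derive(Clone, Debug, PartialEq)]
struct ServerNode {
    id: i32,
    endpoint: String,
}

#[derive(Default)]
struct Cluster {
    alive_servers: HashMap<i32, ServerNode>,
}

impl Cluster {
    /// Returning Option lets callers decide to retry, refresh metadata, or
    /// surface an error instead of panicking when no server is alive.
    fn get_one_available_server(&self) -> Option<&ServerNode> {
        self.alive_servers.values().next()
    }

    /// Remove a server that produced a connection error so that subsequent
    /// lookups stop routing requests to it.
    fn invalidate_server(&mut self, server_id: i32) {
        self.alive_servers.remove(&server_id);
    }
}

fn main() {
    let mut cluster = Cluster::default();
    cluster.alive_servers.insert(
        1,
        ServerNode { id: 1, endpoint: "127.0.0.1:9123".to_string() },
    );

    if let Some(node) = cluster.get_one_available_server() {
        println!("routing request to server {} at {}", node.id, node.endpoint);
    }

    // A connection error was observed against server 1: invalidate it.
    cluster.invalidate_server(1);

    // No servers left; the caller sees None instead of a panic.
    assert!(cluster.get_one_available_server().is_none());
}
```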
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
luoyuxia
left a comment
@leekeiabstraction Thanks for the PR. LGTM!
Purpose
Linked issue: close #42
Brief change log
Add server unavailability handling in the log scanner, allowing the client to recover when the leader fails over in various scenarios.
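As a rough sketch of that recovery behavior, and not the actual scanner implementation, the loop below skips buckets whose leader is currently unknown, logs a warning (matching the Java client's behavior described in the overview above), and picks them up again on a later poll once leadership has settled; names such as `LogScanner::poll_once` and `BucketState` are hypothetical.

```rust
use std::collections::HashMap;

// Simplified, made-up scanner state; the real client tracks offsets and fetch sessions.
struct BucketState {
    leader: Option<i32>, // None while leadership is failing over
    next_record: u64,
}

struct LogScanner {
    buckets: HashMap<u64, BucketState>,
}

impl LogScanner {
    /// Fetch from every bucket that currently has a leader; buckets without a
    /// leader are skipped with a warning and retried on the next poll.
    fn poll_once(&mut self) -> Vec<(u64, u64)> {
        let mut records = Vec::new();
        for (bucket_id, state) in &mut self.buckets {
            match state.leader {
                Some(_leader) => {
                    // Simulated fetch: return the next offset for this bucket.
                    records.push((*bucket_id, state.next_record));
                    state.next_record += 1;
                }
                None => {
                    // Warn and retry later rather than aborting the whole scan.
                    eprintln!("WARN: no leader for bucket {bucket_id}, will retry");
                }
            }
        }
        records
    }
}

fn main() {
    let mut scanner = LogScanner {
        buckets: HashMap::from([
            (0, BucketState { leader: Some(1), next_record: 0 }),
            (1, BucketState { leader: None, next_record: 0 }), // failing over
        ]),
    };

    // First poll: only bucket 0 returns data; bucket 1 is skipped with a warning.
    assert_eq!(scanner.poll_once(), vec![(0, 0)]);

    // Leadership recovers for bucket 1; the next poll picks it up again.
    scanner.buckets.get_mut(&1).unwrap().leader = Some(2);
    let mut out = scanner.poll_once();
    out.sort();
    assert_eq!(out, vec![(0, 1), (1, 0)]);
}
```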
Tests
Manual tests performed on the latest changes are as follows: