Conversation

@leekeiabstraction (Contributor) commented Dec 20, 2025

Purpose

Linked issue: close #42

Brief change log

Add server unavailability handling to the log scanner. This allows the client to recover when the leader fails over in various scenarios.
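The recovery path can be summarized as: when a bucket's leader is unknown (or a fetch to it fails), the scanner refreshes metadata and retries on a later poll instead of surfacing an error. A minimal self-contained sketch of that shape — the types and names below are hypothetical stand-ins, not the actual fluss-rust scanner API:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the client's per-bucket leader metadata.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ServerId(u32);

#[derive(Default)]
struct Metadata {
    leaders: HashMap<u32, ServerId>, // bucket id -> current leader
}

impl Metadata {
    // The real client would issue a metadata RPC here; this stub just
    // pretends the cluster answered that server 1 now leads bucket 0.
    fn refresh(&mut self) {
        self.leaders.insert(0, ServerId(1));
    }
}

// One poll attempt for a single bucket: if the leader is unknown, ask for
// fresh metadata and skip the bucket this round instead of failing.
fn poll_bucket(metadata: &mut Metadata, bucket: u32) -> Option<ServerId> {
    match metadata.leaders.get(&bucket).copied() {
        Some(leader) => Some(leader),
        None => {
            metadata.refresh();
            None
        }
    }
}

fn main() {
    let mut metadata = Metadata::default();
    assert_eq!(poll_bucket(&mut metadata, 0), None); // leader not yet available
    assert_eq!(poll_bucket(&mut metadata, 0), Some(ServerId(1))); // recovered
}
```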

Tests

Manual tests performed on the latest changes are as follows:

| Scenario | Expected Behaviour | Testing Result |
| --- | --- | --- |
| Start scanner with leader unavailable | Scanner connects to leader when leader becomes available | Passed |
| Single replica table leader becomes unavailable | Scanner connects to leader when leader becomes available | Passed |
| Multi replica table leader becomes unavailable | Scanner connects to new leader | Passed |

@leekeiabstraction (Contributor, author)

@luoyuxia would appreciate suggestions on how to test this per the current test setup.

I've attempted local tests:

  1. Delete the table in between polls
  2. Remove the tablet server in between polls

Neither scenario reaches the code path. The second test triggers a failure in the form of an RPC failure.

@luoyuxia (Contributor) commented Dec 21, 2025

> @luoyuxia would appreciate suggestions on how to test this per the current test setup.
>
> I've attempted local tests:
>
> 1. Delete the table in between polls
> 2. Remove the tablet server in between polls
>
> Neither scenario reaches the code path. The second test triggers a failure in the form of an RPC failure.

Let's focus on case 2. Will the RPC failure panic the program? If it panics the program, could you please create an issue and post the exception? I'll dive into that issue, but we can still push this PR forward.

@luoyuxia (Contributor) left a comment

@leekeiabstraction Thanks for the PR. Besides, I think we should add https://github.com/apache/fluss/blob/main/fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogFetcher.java#L173
to update the metadata if the leader is not found. Otherwise, we'll never be able to fetch the bucket if the leader is not found at first. A retry should then get the leader and make it possible to fetch the bucket.
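A rough sketch of what that pre-fetch check could look like on the Rust side, loosely following the Java LogFetcher.checkAndUpdateMetadata — the names here (MetadataUpdater, check_and_update_metadata, leader_for) are illustrative assumptions, not the actual fluss-rust API:

```rust
use std::collections::{HashMap, HashSet};

// Illustrative stand-in for the client-side cluster view.
struct Cluster {
    // bucket id -> leader server id; absent means the leader is unknown
    leader_for: HashMap<u64, u32>,
}

struct MetadataUpdater {
    cluster: Cluster,
    update_requested: bool,
}

impl MetadataUpdater {
    /// Returns true if any assigned bucket has no known leader; in that
    /// case a metadata refresh is requested so a later retry can fetch it.
    fn check_and_update_metadata(&mut self, assigned_buckets: &HashSet<u64>) -> bool {
        let missing_leader = assigned_buckets
            .iter()
            .any(|b| !self.cluster.leader_for.contains_key(b));
        if missing_leader {
            self.update_requested = true; // real code would trigger a metadata RPC
        }
        missing_leader
    }
}

fn main() {
    let mut updater = MetadataUpdater {
        cluster: Cluster { leader_for: HashMap::new() },
        update_requested: false,
    };
    let buckets: HashSet<u64> = [0].into_iter().collect();
    // No leader known for bucket 0, so a refresh is requested.
    assert!(updater.check_and_update_metadata(&buckets));
    assert!(updater.update_requested);
}
```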

@leekeiabstraction (Contributor, author)

> Let's focus on case 2. Will the RPC failure panic the program? If it panics the program, could you please create an issue and post the exception?

With the latest changes from main, I no longer see any RPC failure causing a panic.

@leekeiabstraction leekeiabstraction changed the title [Scanner] Logs warning when bucket leader does not exist [Scanner] Handle leader availability Dec 22, 2025
@leekeiabstraction (Contributor, author)

> I think we should add https://github.com/apache/fluss/blob/main/fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogFetcher.java#L173 to update the metadata if the leader is not found. Otherwise, we'll never be able to fetch the bucket if the leader is not found at first. A retry should then get the leader and make it possible to fetch the bucket.

@luoyuxia Thanks for pointing me to this. Just one question: I notice that we currently do not track partitions on the cluster as the Java implementation does. Is this intentional or a gap that needs to be addressed?

@luoyuxia (Contributor)

> > I think we should add https://github.com/apache/fluss/blob/main/fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogFetcher.java#L173 to update the metadata if the leader is not found. Otherwise, we'll never be able to fetch the bucket if the leader is not found at first. A retry should then get the leader and make it possible to fetch the bucket.
>
> @luoyuxia Thanks for pointing me to this. Just one question: I notice that we currently do not track partitions on the cluster as the Java implementation does. Is this intentional or a gap that needs to be addressed?

Yes, it's a gap that needs to be addressed. In #99 we will return an error saying it's not supported yet. I think we can just consider non-partitioned tables for now.

@leekeiabstraction (Contributor, author)

> @leekeiabstraction Thanks for the PR. Besides, I think we should add https://github.com/apache/fluss/blob/main/fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogFetcher.java#L173 to update the metadata if the leader is not found. Otherwise, we'll never be able to fetch the bucket if the leader is not found at first. A retry should then get the leader and make it possible to fetch the bucket.

I've pushed the changes to handle this specific case. However, I am not certain how to simulate the particular scenario. When using a local cluster:

  1. Scenario A, stop a tablet server in between polls: If I understand correctly, the current Java implementation of checkAndUpdateMetadata() (which we're following) does not update the metadata if the bucket leader was previously available; it only updates it if the bucket leader was previously unavailable. Additionally, LogFetcher.send_fetches() currently just logs fetch failures, and the leader status is not updated (see the sketch after this list).
  2. Scenario B, start LogScanner with the tablet server down: I run into an issue before instantiating LogScanner, when calling admin.get_table(): Error: FlussAPIError { api_error: ApiError { code: 17, message: "Replication factor: 1 larger than available tablet servers: 0." } }.
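To make point 1 concrete, here is a minimal illustrative sketch — not the actual fluss-rust code, and invalidate_server plus the field names are assumptions — of how a fetch failure could invalidate the affected leaders so the pre-fetch metadata check triggers a refresh on the next poll:

```rust
use std::collections::HashMap;

#[derive(Default)]
struct ClientMetadata {
    leader_for: HashMap<u64, u32>, // bucket id -> leader server id
}

impl ClientMetadata {
    // Drop every bucket that was led by the failed server.
    fn invalidate_server(&mut self, server_id: u32) {
        self.leader_for.retain(|_, leader| *leader != server_id);
    }
}

// Instead of only logging the fetch failure, mark the leader as unknown so
// the pre-fetch check requests fresh metadata on the next poll.
fn on_fetch_error(metadata: &mut ClientMetadata, server_id: u32) {
    metadata.invalidate_server(server_id);
}

fn main() {
    let mut metadata = ClientMetadata::default();
    metadata.leader_for.insert(0, 1);
    on_fetch_error(&mut metadata, 1);
    assert!(metadata.leader_for.is_empty()); // next poll sees no leader and refreshes
}
```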

@luoyuxia (Contributor)

> > @leekeiabstraction Thanks for the PR. Besides, I think we should add https://github.com/apache/fluss/blob/main/fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogFetcher.java#L173 to update the metadata if the leader is not found. Otherwise, we'll never be able to fetch the bucket if the leader is not found at first. A retry should then get the leader and make it possible to fetch the bucket.
>
> I've pushed the changes to handle this specific case. However, I am not certain how to simulate the particular scenario. When using a local cluster:
>
> 1. Scenario A, stop a tablet server in between polls: If I understand correctly, the current Java implementation of checkAndUpdateMetadata() (which we're following) does not update the metadata if the bucket leader was previously available; it only updates it if the bucket leader was previously unavailable. Additionally, LogFetcher.send_fetches() currently just logs fetch failures, and the leader status is not updated.
> 2. Scenario B, start LogScanner with the tablet server down: I run into an issue before instantiating LogScanner, when calling admin.get_table(): Error: FlussAPIError { api_error: ApiError { code: 17, message: "Replication factor: 1 larger than available tablet servers: 0." } }.

1: Yes, you're correct. It's not easy to mock the leader being unavailable. You would have to create lots of replicas to force the leader election to take a while. If that's hard, I think we can just skip testing it.

2: That looks weird; the exception should only be thrown while creating a table. I think there must be some issue in the fluss-rust client.

@leekeiabstraction (Contributor, author) commented Dec 23, 2025

@luoyuxia

Would appreciate a review of the latest changes. I've tested the changing-leader scenario and the all-tablet-servers-down-then-recovered scenario; the latest changes seem to be working fine, with one exception:

We currently do not treat ConnectionError("IO Error: Connection refused (os error 61)") as poisoned, and I think we should. In my manual testing (at least on macOS), once a connection gets into this state it is not recoverable, even if the server behind it is running again.
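A minimal sketch of what "treat such an error as poisoned" could mean — the pool, the Error enum, and the method names are assumptions for illustration, not the fluss-rust rpc types. The idea is that a poisoned connection is evicted so the next request dials a fresh one instead of reusing the dead socket:

```rust
use std::collections::HashMap;

// Hypothetical error and connection-pool types for illustration only.
#[derive(Debug)]
enum Error {
    ConnectionError(String),
    Other(String),
}

struct Connection {
    poisoned: bool,
}

struct ConnectionPool {
    by_server: HashMap<u32, Connection>,
}

impl ConnectionPool {
    // Mark the connection poisoned when the failure is a connection error;
    // other errors leave the cached connection usable.
    fn on_error(&mut self, server_id: u32, err: &Error) {
        if matches!(err, Error::ConnectionError(_)) {
            if let Some(conn) = self.by_server.get_mut(&server_id) {
                conn.poisoned = true;
            }
        }
    }

    // A poisoned connection is evicted and reported as absent, so the
    // caller establishes a fresh one.
    fn get(&mut self, server_id: u32) -> Option<&Connection> {
        let poisoned = self
            .by_server
            .get(&server_id)
            .map(|c| c.poisoned)
            .unwrap_or(false);
        if poisoned {
            self.by_server.remove(&server_id);
            return None;
        }
        self.by_server.get(&server_id)
    }
}

fn main() {
    let mut pool = ConnectionPool { by_server: HashMap::new() };
    pool.by_server.insert(1, Connection { poisoned: false });

    // A non-connection error keeps the connection around.
    pool.on_error(1, &Error::Other("request timed out".to_string()));
    assert!(pool.get(1).is_some());

    // A connection error poisons it; the next lookup evicts it.
    pool.on_error(1, &Error::ConnectionError("Connection refused (os error 61)".to_string()));
    assert!(pool.get(1).is_none());
}
```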

@leekeiabstraction (Contributor, author)

Manual tests performed on the latest changes are as follows:

| Scenario | Expected Behaviour | Testing Result |
| --- | --- | --- |
| Start scanner with leader unavailable | Scanner connects to leader when leader becomes available | Passed |
| Single replica table leader becomes unavailable | Scanner connects to leader when leader becomes available | Passed |
| Multi replica table leader becomes unavailable | Scanner connects to new leader | Passed |

@luoyuxia Would appreciate a review here. Thank you!

@luoyuxia (Contributor) commented Dec 24, 2025

@leekeiabstraction Thanks for the great work. It makes the scanner stable. I'll review later today.

Copilot AI left a comment

Pull request overview

This PR addresses issue #42 by implementing handling for scenarios when bucket leaders are not available, aligning with the Java client's behavior of logging warnings in such cases. The changes enhance error resilience in the scanner and metadata subsystems.

Key Changes:

  • Added poison detection for server connections with automatic cleanup on connection errors
  • Modified cluster API to return Option instead of panicking when no servers are available (see the sketch after this list)
  • Implemented metadata invalidation for failed servers and graceful retry logic in the scanner
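For the Option-returning cluster accessor in the second bullet, here is a sketch of the shape of that change; the types are illustrative, and the real signature in crates/fluss/src/cluster/cluster.rs may differ:

```rust
// Illustrative types only: the point is that callers decide whether "no
// server available yet" is retriable, instead of the cluster panicking.
struct ServerNode {
    id: u32,
}

struct Cluster {
    alive_servers: Vec<ServerNode>,
}

impl Cluster {
    // Returning Option pushes the "no server available" decision to the caller.
    fn get_one_available_server(&self) -> Option<&ServerNode> {
        self.alive_servers.first()
    }
}

fn main() {
    let empty = Cluster { alive_servers: vec![] };
    assert!(empty.get_one_available_server().is_none()); // caller can retry later

    let cluster = Cluster {
        alive_servers: vec![ServerNode { id: 1 }],
    };
    assert_eq!(cluster.get_one_available_server().map(|s| s.id), Some(1));
}
```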

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.

Summary per file:

| File | Description |
| --- | --- |
| crates/fluss/src/rpc/server_connection.rs | Added poison state detection and connection cleanup on connection errors |
| crates/fluss/src/cluster/cluster.rs | Added invalidate_server method and changed get_one_available_server to return Option |
| crates/fluss/src/client/table/scanner.rs | Added metadata update checks, improved error handling with retry logic, and leader availability handling |
| crates/fluss/src/client/metadata.rs | Added server invalidation, cluster reinitialization support, and fixed variable name typos |
| crates/fluss/src/client/credentials.rs | Updated to handle optional server availability with expect |


leekeiabstraction and others added 5 commits December 24, 2025 12:36
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
leekeiabstraction and others added 2 commits December 24, 2025 12:46
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@luoyuxia (Contributor) left a comment

@leekeiabstraction Thanks for the PR. LGTM!

@luoyuxia luoyuxia merged commit f166314 into apache:main Dec 25, 2025
13 checks passed

Development

Successfully merging this pull request may close these issues.

scanner should handle when bucket leader doesn't exist.
