Conversation

@leekeiabstraction (Contributor) commented Dec 20, 2025

Purpose

Linked issue: close #42

Brief change log

Add server unavailability handling to the log scanner. This allows the client to recover when the leader fails over in various scenarios.
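The recovery path can be summarized as: when a bucket's leader is unknown (or a fetch to it fails), the scanner refreshes metadata and retries on a later poll instead of surfacing an error. A minimal self-contained sketch of that shape — the types and names below are hypothetical stand-ins, not the actual fluss-rust scanner API:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the client's per-bucket leader metadata.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ServerId(u32);

#[derive(Default)]
struct Metadata {
    leaders: HashMap<u32, ServerId>, // bucket id -> current leader
}

impl Metadata {
    // The real client would issue a metadata RPC here; this stub just
    // pretends the cluster answered that server 1 now leads bucket 0.
    fn refresh(&mut self) {
        self.leaders.insert(0, ServerId(1));
    }
}

// One poll attempt for a single bucket: if the leader is unknown, ask for
// fresh metadata and skip the bucket this round instead of failing.
fn poll_bucket(metadata: &mut Metadata, bucket: u32) -> Option<ServerId> {
    match metadata.leaders.get(&bucket).copied() {
        Some(leader) => Some(leader),
        None => {
            metadata.refresh();
            None
        }
    }
}

fn main() {
    let mut metadata = Metadata::default();
    assert_eq!(poll_bucket(&mut metadata, 0), None); // leader not yet available
    assert_eq!(poll_bucket(&mut metadata, 0), Some(ServerId(1))); // recovered
}
```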

Tests

Manual tests performed on the latest changes are as follows:

| Scenario | Expected Behaviour | Testing Result |
| --- | --- | --- |
| Start scanner with leader unavailable | Scanner connects to leader when leader becomes available | Passed |
| Single replica table leader becomes unavailable | Scanner connects to leader when leader becomes available | Passed |
| Multi replica table leader becomes unavailable | Scanner connects to new leader | Passed |

@leekeiabstraction (Contributor, author)

@luoyuxia would appreciate suggestions on how to test this per the current test setup.

I've attempted local tests:

  1. Delete the table in between polls
  2. Remove the tablet server in between polls

Neither scenario reaches the code path. The second test triggers a failure in the form of an RPC failure.

@luoyuxia (Contributor) commented Dec 21, 2025

> @luoyuxia would appreciate suggestions on how to test this per the current test setup.
>
> I've attempted local tests:
>
> 1. Delete the table in between polls
> 2. Remove the tablet server in between polls
>
> Neither scenario reaches the code path. The second test triggers a failure in the form of an RPC failure.

Let's focus on case 2. Will the RPC failure panic the program? If it panics the program, could you please create an issue and post the exception? I'll dive into that issue, but we can still push this PR forward.

@luoyuxia (Contributor) left a comment

@leekeiabstraction Thanks for the PR. Besides, I think we should add https://github.com/apache/fluss/blob/main/fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogFetcher.java#L173
to update the metadata if the leader is not found. Otherwise, we'll never be able to fetch the bucket if the leader is not found at first. A retry should then get the leader and make it possible to fetch the bucket.
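A rough sketch of what that pre-fetch check could look like on the Rust side, loosely following the Java LogFetcher.checkAndUpdateMetadata — the names here (MetadataUpdater, check_and_update_metadata, leader_for) are illustrative assumptions, not the actual fluss-rust API:

```rust
use std::collections::{HashMap, HashSet};

// Illustrative stand-in for the client-side cluster view.
struct Cluster {
    // bucket id -> leader server id; absent means the leader is unknown
    leader_for: HashMap<u64, u32>,
}

struct MetadataUpdater {
    cluster: Cluster,
    update_requested: bool,
}

impl MetadataUpdater {
    /// Returns true if any assigned bucket has no known leader; in that
    /// case a metadata refresh is requested so a later retry can fetch it.
    fn check_and_update_metadata(&mut self, assigned_buckets: &HashSet<u64>) -> bool {
        let missing_leader = assigned_buckets
            .iter()
            .any(|b| !self.cluster.leader_for.contains_key(b));
        if missing_leader {
            self.update_requested = true; // real code would trigger a metadata RPC
        }
        missing_leader
    }
}

fn main() {
    let mut updater = MetadataUpdater {
        cluster: Cluster { leader_for: HashMap::new() },
        update_requested: false,
    };
    let buckets: HashSet<u64> = [0].into_iter().collect();
    // No leader known for bucket 0, so a refresh is requested.
    assert!(updater.check_and_update_metadata(&buckets));
    assert!(updater.update_requested);
}
```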

@leekeiabstraction (Contributor, author)

> Let's focus on case 2. Will the RPC failure panic the program? If it panics the program, could you please create an issue and post the exception?

With the latest changes from main, I no longer see any RPC failure causing a panic.

@leekeiabstraction leekeiabstraction changed the title [Scanner] Logs warning when bucket leader does not exist [Scanner] Handle leader availability Dec 22, 2025
@leekeiabstraction (Contributor, author)

> I think we should add https://github.com/apache/fluss/blob/main/fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogFetcher.java#L173 to update the metadata if the leader is not found. Otherwise, we'll never be able to fetch the bucket if the leader is not found at first. A retry should then get the leader and make it possible to fetch the bucket.

@luoyuxia Thanks for pointing me to this. Just one question: I notice that we currently do not track partitions on the cluster as the Java implementation does. Is this intentional or a gap that needs to be addressed?

@luoyuxia (Contributor)

> > I think we should add https://github.com/apache/fluss/blob/main/fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogFetcher.java#L173 to update the metadata if the leader is not found. Otherwise, we'll never be able to fetch the bucket if the leader is not found at first. A retry should then get the leader and make it possible to fetch the bucket.
>
> @luoyuxia Thanks for pointing me to this. Just one question: I notice that we currently do not track partitions on the cluster as the Java implementation does. Is this intentional or a gap that needs to be addressed?

Yes, it's a gap that needs to be addressed. In #99 we will return an error saying it's not supported yet. I think we can just consider non-partitioned tables for now.

@leekeiabstraction (Contributor, author)

> @leekeiabstraction Thanks for the PR. Besides, I think we should add https://github.com/apache/fluss/blob/main/fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogFetcher.java#L173 to update the metadata if the leader is not found. Otherwise, we'll never be able to fetch the bucket if the leader is not found at first. A retry should then get the leader and make it possible to fetch the bucket.

I've pushed the changes to handle this specific case. However, I am not certain how to simulate the particular scenario. When using a local cluster:

  1. Scenario A, stop a tablet server in between polls: If I understand correctly, the current Java implementation of checkAndUpdateMetadata() (which we're following) does not update the metadata if the bucket leader was previously available; it only updates it if the bucket leader was previously unavailable. Additionally, LogFetcher.send_fetches() currently just logs fetch failures, and the leader status is not updated (see the sketch after this list).
  2. Scenario B, start LogScanner with the tablet server down: I run into an issue before instantiating LogScanner, when calling admin.get_table(): Error: FlussAPIError { api_error: ApiError { code: 17, message: "Replication factor: 1 larger than available tablet servers: 0." } }.
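To make point 1 concrete, here is a minimal illustrative sketch — not the actual fluss-rust code, and invalidate_server plus the field names are assumptions — of how a fetch failure could invalidate the affected leaders so the pre-fetch metadata check triggers a refresh on the next poll:

```rust
use std::collections::HashMap;

#[derive(Default)]
struct ClientMetadata {
    leader_for: HashMap<u64, u32>, // bucket id -> leader server id
}

impl ClientMetadata {
    // Drop every bucket that was led by the failed server.
    fn invalidate_server(&mut self, server_id: u32) {
        self.leader_for.retain(|_, leader| *leader != server_id);
    }
}

// Instead of only logging the fetch failure, mark the leader as unknown so
// the pre-fetch check requests fresh metadata on the next poll.
fn on_fetch_error(metadata: &mut ClientMetadata, server_id: u32) {
    metadata.invalidate_server(server_id);
}

fn main() {
    let mut metadata = ClientMetadata::default();
    metadata.leader_for.insert(0, 1);
    on_fetch_error(&mut metadata, 1);
    assert!(metadata.leader_for.is_empty()); // next poll sees no leader and refreshes
}
```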

@luoyuxia (Contributor)

> > @leekeiabstraction Thanks for the PR. Besides, I think we should add https://github.com/apache/fluss/blob/main/fluss-client/src/main/java/org/apache/fluss/client/table/scanner/log/LogFetcher.java#L173 to update the metadata if the leader is not found. Otherwise, we'll never be able to fetch the bucket if the leader is not found at first. A retry should then get the leader and make it possible to fetch the bucket.
>
> I've pushed the changes to handle this specific case. However, I am not certain how to simulate the particular scenario. When using a local cluster:
>
> 1. Scenario A, stop a tablet server in between polls: If I understand correctly, the current Java implementation of checkAndUpdateMetadata() (which we're following) does not update the metadata if the bucket leader was previously available; it only updates it if the bucket leader was previously unavailable. Additionally, LogFetcher.send_fetches() currently just logs fetch failures, and the leader status is not updated.
> 2. Scenario B, start LogScanner with the tablet server down: I run into an issue before instantiating LogScanner, when calling admin.get_table(): Error: FlussAPIError { api_error: ApiError { code: 17, message: "Replication factor: 1 larger than available tablet servers: 0." } }.

1: Yes, you're correct. It's not easy to mock the leader being unavailable. You would have to create lots of replicas to force the leader election to take a while. If that's hard, I think we can just skip testing it.

2: That looks weird; the exception should only be thrown while creating a table. I think there must be some issue in the fluss-rust client.

@leekeiabstraction (Contributor, author) commented Dec 23, 2025

@luoyuxia

Would appreciate a review of the latest changes. I've tested the changing-leader scenario and the all-tablet-servers-down-then-recovered scenario; the latest changes seem to be working fine, with one exception:

We currently do not treat ConnectionError("IO Error: Connection refused (os error 61)") as poisoned, and I think we should. In my manual testing (at least on macOS), once a connection gets into this state it is not recoverable, even if the server behind it is running again.
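A minimal sketch of what "treat such an error as poisoned" could mean — the pool, the Error enum, and the method names are assumptions for illustration, not the fluss-rust rpc types. The idea is that a poisoned connection is evicted so the next request dials a fresh one instead of reusing the dead socket:

```rust
use std::collections::HashMap;

// Hypothetical error and connection-pool types for illustration only.
#[derive(Debug)]
enum Error {
    ConnectionError(String),
    Other(String),
}

struct Connection {
    poisoned: bool,
}

struct ConnectionPool {
    by_server: HashMap<u32, Connection>,
}

impl ConnectionPool {
    // Mark the connection poisoned when the failure is a connection error;
    // other errors leave the cached connection usable.
    fn on_error(&mut self, server_id: u32, err: &Error) {
        if matches!(err, Error::ConnectionError(_)) {
            if let Some(conn) = self.by_server.get_mut(&server_id) {
                conn.poisoned = true;
            }
        }
    }

    // A poisoned connection is evicted and reported as absent, so the
    // caller establishes a fresh one.
    fn get(&mut self, server_id: u32) -> Option<&Connection> {
        let poisoned = self
            .by_server
            .get(&server_id)
            .map(|c| c.poisoned)
            .unwrap_or(false);
        if poisoned {
            self.by_server.remove(&server_id);
            return None;
        }
        self.by_server.get(&server_id)
    }
}

fn main() {
    let mut pool = ConnectionPool { by_server: HashMap::new() };
    pool.by_server.insert(1, Connection { poisoned: false });

    // A non-connection error keeps the connection around.
    pool.on_error(1, &Error::Other("request timed out".to_string()));
    assert!(pool.get(1).is_some());

    // A connection error poisons it; the next lookup evicts it.
    pool.on_error(1, &Error::ConnectionError("Connection refused (os error 61)".to_string()));
    assert!(pool.get(1).is_none());
}
```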

@leekeiabstraction (Contributor, author)

Manual tests performed on the latest changes are as follows:

| Scenario | Expected Behaviour | Testing Result |
| --- | --- | --- |
| Start scanner with leader unavailable | Scanner connects to leader when leader becomes available | Passed |
| Single replica table leader becomes unavailable | Scanner connects to leader when leader becomes available | Passed |
| Multi replica table leader becomes unavailable | Scanner connects to new leader | Passed |

@luoyuxia Would appreciate a review here. Thank you!

@luoyuxia (Contributor) commented Dec 24, 2025

@leekeiabstraction Thanks for the great work. It makes the scanner stable. I'll review later today.

Copilot AI left a comment

Pull request overview

This PR addresses issue #42 by implementing handling for scenarios when bucket leaders are not available, aligning with the Java client's behavior of logging warnings in such cases. The changes enhance error resilience in the scanner and metadata subsystems.

Key Changes:

  • Added poison detection for server connections with automatic cleanup on connection errors
  • Modified cluster API to return Option instead of panicking when no servers are available (see the sketch after this list)
  • Implemented metadata invalidation for failed servers and graceful retry logic in the scanner
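For the Option-returning cluster accessor in the second bullet, here is a sketch of the shape of that change; the types are illustrative, and the real signature in crates/fluss/src/cluster/cluster.rs may differ:

```rust
// Illustrative types only: the point is that callers decide whether "no
// server available yet" is retriable, instead of the cluster panicking.
struct ServerNode {
    id: u32,
}

struct Cluster {
    alive_servers: Vec<ServerNode>,
}

impl Cluster {
    // Returning Option pushes the "no server available" decision to the caller.
    fn get_one_available_server(&self) -> Option<&ServerNode> {
        self.alive_servers.first()
    }
}

fn main() {
    let empty = Cluster { alive_servers: vec![] };
    assert!(empty.get_one_available_server().is_none()); // caller can retry later

    let cluster = Cluster {
        alive_servers: vec![ServerNode { id: 1 }],
    };
    assert_eq!(cluster.get_one_available_server().map(|s| s.id), Some(1));
}
```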

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.

Summary per file:

| File | Description |
| --- | --- |
| crates/fluss/src/rpc/server_connection.rs | Added poison state detection and connection cleanup on connection errors |
| crates/fluss/src/cluster/cluster.rs | Added invalidate_server method and changed get_one_available_server to return Option |
| crates/fluss/src/client/table/scanner.rs | Added metadata update checks, improved error handling with retry logic, and leader availability handling |
| crates/fluss/src/client/metadata.rs | Added server invalidation, cluster reinitialization support, and fixed variable name typos |
| crates/fluss/src/client/credentials.rs | Updated to handle optional server availability with expect |


leekeiabstraction and others added 5 commits December 24, 2025 12:36
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
leekeiabstraction and others added 2 commits December 24, 2025 12:46
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@luoyuxia (Contributor) left a comment

@leekeiabstraction Thanks for the PR. LGTM!

@luoyuxia luoyuxia merged commit f166314 into apache:main Dec 25, 2025
13 checks passed

Development

Successfully merging this pull request may close these issues.

scanner should handle when bucket leader doesn't exist.
