HDDS-14212. Make maxFailovers in S3g apply on a per-request basis. #9546

greenwich · 2025-12-23T01:52:24Z

What changes were proposed in this pull request?

This PR makes maxFailovers and failoverCount in S3g apply on a per-request basis.

Rationale:

in hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.java line 93: we have private int failoverCount = 0; - All threads share this counter; it never resets.
Also, in GrpcOmTransport.shouldRetry(258) we run action = retryPolicy.shouldRetry((Exception)ex, 0, failoverCount++, true); which is also shared between requests.
Next in OMFailoverProxyProviderBase.getRetryPolicy.getRetryAction, we still use that global failoverCount checking if (failovers < maxFailovers)(258), which always returns return RetryAction.FAIL;(263) once we reached the maxFailovers
maxFailovers from above is defined by OZONE_CLIENT_FAILOVER_MAX_ATTEMPTS_KEY

I propose to change the value of failoverCount per request, rather than making it a global flag. So basically, it makes OZONE_CLIENT_FAILOVER_MAX_ATTEMPTS_KEY configuration to serve on a per-request basis.

Detailed explanation and discussion are here: #9477

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14212

How was this patch tested?

Branch pipeline: https://github.com/greenwich/ozone/actions/runs/20430980506

Added a unittest to check that failoverCount is set for each request
Before fixing the bug, I created a new integration test that mimics concurrent user requests during the S3G failover from om0 to om2. It reproduces the issue, then I used it to test the fix.

Concurrent test results before the fix - demonstrating the bug -> failover didn't happen to om2

--- Failed Requests (Failover Attempts) ---
om0: 500 failures (10.0 %)
om1: 4510 failures (90.0 %)
om2: 0 failures (0.0 %) NEVER TRIED!

--- Successful Requests ---
om0: 5 successes (100.0 %) LEADER
om1: 0 successes (0.0 %)
om2: 0 successes (0.0 %)

Concurrent test results after the fix - demonstrating the correct behaviour -> failover to om2

--- Failed Requests (Failover Attempts) ---
om0: 500 failures (97.1 %)
om1: 15 failures (2.9 %)
om2: 0 failures (0.0 %) NEVER TRIED!
--- Successful Requests ---
om0: 5 successes (0.1 %) LEADER
om1: 0 successes (0.0 %)
om2: 5000 successes (99.9 %) LEADER

Gargi-jais11

Thanks @greenwich for working on this patch.
Left some comments.

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.java

Gargi-jais11 · 2026-01-05T05:38:08Z

@greenwich your added test cases looks great, but we can have test for below case as well in TestS3GrpcOmTransport.java:

The current tests validate per-request isolation for concurrent threads, but don't verify that a SINGLE thread making MULTIPLE sequential requests gets a fresh counter for each request.

Why we need this?
In production, S3G handler threads serve many sequential requests. Need to verify requestFailoverCount doesn't leak between them.

Test Scenario:

Configure low max attempts (e.g., 2) to quickly exhaust budget
First request: Configure mock to keep failing (completeFailover = false)
- Should throw exception after exhausting all retries
- Proves counter reached the limit
Second request: Configure mock to fail once then succeed
- Should succeed despite first request exhausting its budget
- Proves counter reset to 0
Assert: Second request completes successfully with OK status

Without this test: We rely on inference that local variables reset, but don't explicitly verify the fix works for sequential requests.

greenwich · 2026-01-05T08:33:23Z

@greenwich your added test cases looks great, but we can have test for below case as well in TestS3GrpcOmTransport.java:

The current tests validate per-request isolation for concurrent threads, but don't verify that a SINGLE thread making MULTIPLE sequential requests gets a fresh counter for each request.

Why we need this? In production, S3G handler threads serve many sequential requests. Need to verify requestFailoverCount doesn't leak between them.

Test Scenario:

Configure low max attempts (e.g., 2) to quickly exhaust budget

First request: Configure mock to keep failing (completeFailover = false)

Should throw exception after exhausting all retries

Proves counter reached the limit

Second request: Configure mock to fail once then succeed

Should succeed despite first request exhausting its budget

Proves counter reset to 0

Assert: Second request completes successfully with OK status

Without this test: We rely on inference that local variables reset, but don't explicitly verify the fix works for sequential requests.

@Gargi-jais11, is it possible you missed testGrpcFailoverProxyCalculatesFailoverCountPerRequest? It tests the two-consequent-request scenario using a single thread and implements the same logic. As a side comment, it tests the behaviour rather than explicitly checking the counter is 0, which is an implementation detail.

Gargi-jais11 · 2026-01-06T04:09:39Z

@Gargi-jais11, is it possible you missed testGrpcFailoverProxyCalculatesFailoverCountPerRequest? It tests the two-consequent-request scenario using a single thread and implements the same logic. As a side comment, it tests the behaviour rather than explicitly checking the counter is 0, which is an implementation detail.

@greenwich You're absolutely right! I missed that testGrpcFailoverProxyCalculatesFailoverCountPerRequest()
already covers the sequential request scenario on a single thread.
The existing test is sufficient. Thank you for catching this!

Gargi-jais11

Thanks @greenwich for updating the patch. Overall LGTM. Just some minor comments.

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.java

greenwich changed the title ~~HDDS-14212 Make failoverCount in GrpcOmTransport to be per request~~ HDDS-14212 Make maxFailovers in S3g apply on a per-request basis. Dec 23, 2025

adoroszlai requested a review from ChenSammi December 23, 2025 06:39

ivandika3 self-requested a review December 27, 2025 03:03

ivandika3 changed the title ~~HDDS-14212 Make maxFailovers in S3g apply on a per-request basis.~~ HDDS-14212. Make maxFailovers in S3g apply on a per-request basis. Dec 30, 2025

Gargi-jais11 reviewed Jan 5, 2026

View reviewed changes

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.java Outdated Show resolved Hide resolved

greenwich force-pushed the HDDS-14212 branch from b5f3389 to 8f81483 Compare January 6, 2026 03:36

greenwich requested a review from Gargi-jais11 January 6, 2026 03:38

Gargi-jais11 suggested changes Jan 6, 2026

View reviewed changes

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.java Outdated Show resolved Hide resolved

hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.java Show resolved Hide resolved

HDDS-14212 Change failoverCount to be per request

e2bacf7

greenwich force-pushed the HDDS-14212 branch from 8f81483 to e2bacf7 Compare January 6, 2026 05:03

greenwich requested a review from Gargi-jais11 January 6, 2026 05:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HDDS-14212. Make maxFailovers in S3g apply on a per-request basis. #9546

HDDS-14212. Make maxFailovers in S3g apply on a per-request basis. #9546

greenwich commented Dec 23, 2025 •

edited

Loading

Uh oh!

Gargi-jais11 left a comment

Uh oh!

Uh oh!

Gargi-jais11 commented Jan 5, 2026

Uh oh!

greenwich commented Jan 5, 2026 •

edited

Loading

Uh oh!

Gargi-jais11 commented Jan 6, 2026

Uh oh!

Gargi-jais11 left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

HDDS-14212. Make maxFailovers in S3g apply on a per-request basis. #9546

Are you sure you want to change the base?

HDDS-14212. Make maxFailovers in S3g apply on a per-request basis. #9546

Conversation

greenwich commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

Gargi-jais11 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Gargi-jais11 commented Jan 5, 2026

Uh oh!

greenwich commented Jan 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Gargi-jais11 commented Jan 6, 2026

Uh oh!

Gargi-jais11 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greenwich commented Dec 23, 2025 •

edited

Loading

greenwich commented Jan 5, 2026 •

edited

Loading