fix: Validate SetStatisticsUpdate correctly (fixes #2865) #2866

ragnard · 2025-12-26T22:18:20Z

Previously the pydantic @model_validator on SetStatisticsUpdate would fail because it assumed statistics was a model instance. In a "before" validator that is not the case: the validator receives the raw input, so statistics may still be a dict.

Use an "after" validator instead, where we can use instantiated and validated fields.
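
To make the difference between the two validator modes concrete, here is a minimal sketch. StatsSketch, UpdateSketch and seen_types are hypothetical names introduced for illustration only; they are not pyiceberg classes, and the sketch assumes pydantic v2.

```python
from typing import Any

from pydantic import BaseModel, Field, model_validator


class StatsSketch(BaseModel):
    """Hypothetical stand-in for StatisticsFile."""

    snapshot_id: int = Field(alias="snapshot-id")


# Records the type of `statistics` that each validator mode observes.
seen_types: list[str] = []


class UpdateSketch(BaseModel):
    """Hypothetical stand-in for SetStatisticsUpdate."""

    statistics: StatsSketch

    @model_validator(mode="before")
    @classmethod
    def observe_raw_input(cls, data: Any) -> Any:
        # Runs before field validation, so nested values are still raw input.
        if isinstance(data, dict):
            seen_types.append(type(data.get("statistics")).__name__)
        return data

    @model_validator(mode="after")
    def observe_validated_model(self) -> "UpdateSketch":
        # Runs after field validation, so `statistics` is a model instance.
        seen_types.append(type(self.statistics).__name__)
        return self


update = UpdateSketch.model_validate({"statistics": {"snapshot-id": 1234}})
print(seen_types)  # ['dict', 'StatsSketch']
```

The "before" hook sees a plain dict, which is why attribute access such as stats.snapshot_id fails there.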

Before

>>> import pyiceberg.table.update
>>> pyiceberg.table.update.SetStatisticsUpdate.model_validate({'statistics': {'snapshot-id': 1234, 'file-size-in-bytes': 0, 'statistics-path': '', 'file-footer-size-in-bytes': 0, 'blob-metadata': []}})
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    pyiceberg.table.update.SetStatisticsUpdate.model_validate({'statistics': {'snapshot-id': 1234, 'file-size-in-bytes': 0, 'statistics-path': '', 'file-footer-size-in-bytes': 0, 'blob-metadata': []}})
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ragge/projects/github.com/ragnard/iceberg-python/.venv/lib/python3.14/site-packages/pydantic/main.py", line 716, in model_validate
    return cls.__pydantic_validator__.validate_python(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        obj,
        ^^^^
    ...<5 lines>...
        by_name=by_name,
        ^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ragge/projects/github.com/ragnard/iceberg-python/pyiceberg/table/update/__init__.py", line 191, in validate_snapshot_id
    data["snapshot_id"] = stats.snapshot_id
                          ^^^^^^^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'snapshot_id'

After

>>> import pyiceberg.table.update
>>> pyiceberg.table.update.SetStatisticsUpdate.model_validate({'statistics': {'snapshot-id': 1234, 'file-size-in-bytes': 0, 'statistics-path': '', 'file-footer-size-in-bytes': 0, 'blob-metadata': []}})
SetStatisticsUpdate(action='set-statistics', statistics=StatisticsFile(snapshot_id=1234, statistics_path='', file_size_in_bytes=0, file_footer_size_in_bytes=0, key_metadata=None, blob_metadata=[]), snapshot_id=1234)

Rationale for this change

Are these changes tested?

Yes, but only using the two-liners above.

Are there any user-facing changes?

No.

kevinjqliu

Thanks for the PR. I think we should just deprecate the top-level snapshot_id entirely.
For context, the "before" model_validator was added in 3b53edc#diff-769b43e1d8beaa86141f694679de2bbea3604a65f987a9acd7d9e9efca193b7eR181-R193 to maintain backwards compatibility and prepare for deprecation.

kevinjqliu

Oops, didn't mean to approve.

ragnard · 2025-12-27T07:53:04Z

@kevinjqliu Ok, do you want me to change the fix so that snapshot_id is still there, but just not automatically populated?

kevinjqliu

Looks like the Java implementation has not deprecated the top-level snapshot_id. Let's proceed with this change since it improves the current validation logic.

Thanks for the PR! Let's add a test and it should be good to go!

kevinjqliu · 2025-12-28T19:14:50Z

pyiceberg/table/update/__init__.py

+    @model_validator(mode="after")
+    def validate_snapshot_id(self) -> Self:
+        return self.model_copy(update={"snapshot_id": self.statistics.snapshot_id})


Suggested change

-    @model_validator(mode="after")
-    def validate_snapshot_id(self) -> Self:
-        return self.model_copy(update={"snapshot_id": self.statistics.snapshot_id})
+    @model_validator(mode="after")
+    def validate_snapshot_id(self) -> Self:
+        self.snapshot_id = self.statistics.snapshot_id
+        return self

nit: use direct assignment

ragnard · 2025-12-29T19:30:43Z

> Looks like the Java implementation has not deprecated the top-level snapshot_id. Let's proceed with this change since it improves the current validation logic.
>
> Thanks for the PR! Let's add a test and it should be good to go!

@kevinjqliu Thanks for the (quick!) review. I've changed the fix a bit:

Since the model is frozen, direct assignment is not possible. Using model_copy as I did is not possible either, because an "after" validator needs to return the same model instance. I've reverted to a "before" validator, but it now handles the dict case properly.
For testing, the issue is not really about tables with statistics, but that it was not possible to instantiate the SetStatisticsUpdate model from a dict. I added a model_roundtrips helper that is now used to check that any model roundtrips (model -> dict -> model) correctly.

Please let me know if you want further changes.
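
For readers following along, the two points above (frozen models reject direct assignment, and a model -> dict -> model roundtrip check) can be sketched roughly as follows. FrozenStats and roundtrips are hypothetical names for illustration; this is not the model_roundtrips helper from the PR, whose implementation is not shown here, and pydantic v2 is assumed.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class FrozenStats(BaseModel):
    """Hypothetical frozen model with an aliased field."""

    model_config = ConfigDict(frozen=True)

    snapshot_id: int = Field(alias="snapshot-id")


def roundtrips(model: BaseModel) -> bool:
    """Hypothetical helper: dump to a dict by alias, re-validate, compare."""
    dumped = model.model_dump(by_alias=True)
    return type(model).model_validate(dumped) == model


stats = FrozenStats.model_validate({"snapshot-id": 1234})

try:
    stats.snapshot_id = 5678  # direct assignment on a frozen model
    assignment_rejected = False
except ValidationError:
    # pydantic v2 reports a frozen-instance error as a ValidationError.
    assignment_rejected = True

print(assignment_rejected, roundtrips(stats))
```

Dumping by alias matters here: the dict that goes back into model_validate is keyed by "snapshot-id", which is the same shape a "before" validator would see.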

Previously the pydantic @model_validator would fail because it assumed statistics was a model instance. In a "before" validator that is not necessarily the case. Check the type explicitly with isinstance instead, and handle the `dict` case too.

kevinjqliu

Thanks! Looks good. I found a small bug; it would be great to add a test for this case.

kevinjqliu · 2025-12-29T20:36:40Z

pyiceberg/table/update/__init__.py

+        elif isinstance(stats, dict):
+            snapshot_id = stats.get("snapshot_id")
+


Suggested change

-        elif isinstance(stats, dict):
-            snapshot_id = stats.get("snapshot_id")
+        elif isinstance(stats, dict):
+            snapshot_id = stats.get("snapshot_id")
+        else:
+            snapshot_id = None

nit: I think we can inline the else here

kevinjqliu · 2025-12-29T20:39:31Z

pyiceberg/table/update/__init__.py

+        if isinstance(stats, StatisticsFile):
+            snapshot_id = stats.snapshot_id
+        elif isinstance(stats, dict):
+            snapshot_id = stats.get("snapshot_id")


Suggested change

-            snapshot_id = stats.get("snapshot_id")
+            snapshot_id = stats.get("snapshot-id")

I think this should be snapshot-id, since a "before" validator takes the raw JSON as input, and the JSON key is the field alias:

iceberg-python/pyiceberg/table/statistics.py

Lines 32 to 40 in fa03e08

class StatisticsCommonFields(IcebergBaseModel):
    """Common fields between table and partition statistics structs found on metadata."""

    snapshot_id: int = Field(alias="snapshot-id")
    statistics_path: str = Field(alias="statistics-path")
    file_size_in_bytes: int = Field(alias="file-size-in-bytes")


class StatisticsFile(StatisticsCommonFields):

Could you add a test case for this (and possibly one for the else case too)?

The current test only tests the StatisticsFile instance branch:

iceberg-python/tests/table/test_init.py

Lines 1370 to 1381 in fa03e08

def test_set_statistics_update(table_v2_with_statistics: Table) -> None:
    snapshot_id = table_v2_with_statistics.metadata.current_snapshot_id

    blob_metadata = BlobMetadata(
        type="apache-datasketches-theta-v1",
        snapshot_id=snapshot_id,
        sequence_number=2,
        fields=[1],
        properties={"prop-key": "prop-value"},
    )

    statistics_file = StatisticsFile(
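
As a rough illustration of the three branches discussed in this review (model instance, dict keyed by alias, anything else), here is a self-contained sketch. StatsModel, UpdateModel and copy_snapshot_id are hypothetical stand-ins, not the pyiceberg code that was merged; the sketch assumes pydantic v2 and writes the derived value under the "snapshot-id" alias.

```python
from typing import Any, Optional

from pydantic import BaseModel, Field, ValidationError, model_validator


class StatsModel(BaseModel):
    """Hypothetical stand-in for StatisticsFile."""

    snapshot_id: int = Field(alias="snapshot-id")


class UpdateModel(BaseModel):
    """Hypothetical stand-in for SetStatisticsUpdate."""

    statistics: StatsModel
    snapshot_id: Optional[int] = Field(default=None, alias="snapshot-id")

    @model_validator(mode="before")
    @classmethod
    def copy_snapshot_id(cls, data: Any) -> Any:
        if not isinstance(data, dict):
            return data
        stats = data.get("statistics")
        if isinstance(stats, StatsModel):
            snapshot_id = stats.snapshot_id
        elif isinstance(stats, dict):
            # Raw input is keyed by alias, hence "snapshot-id".
            snapshot_id = stats.get("snapshot-id")
        else:
            snapshot_id = None
        # Return a new dict rather than mutating the caller's input.
        return {**data, "snapshot-id": snapshot_id}


# Dict branch: nested statistics given as raw JSON-like data.
from_dict = UpdateModel.model_validate({"statistics": {"snapshot-id": 1234}})

# Instance branch: nested statistics already validated.
instance = StatsModel.model_validate({"snapshot-id": 1234})
from_instance = UpdateModel.model_validate({"statistics": instance})

# Else branch: an unsupported type is left for field validation to reject.
try:
    UpdateModel.model_validate({"statistics": 42})
    rejected = False
except ValidationError:
    rejected = True
```

A test for the dict branch would fail if the lookup used "snapshot_id" instead of the alias, since the top-level value would then come back as None.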

kevinjqliu approved these changes Dec 26, 2025

kevinjqliu requested changes Dec 26, 2025

kevinjqliu reviewed Dec 28, 2025

ragnard force-pushed the fix-set-statistics-validation branch from 6f2b0d7 to aa85657 on December 29, 2025 19:23

ragnard force-pushed the fix-set-statistics-validation branch 3 times, most recently from c739a0e to 00cfd92 on December 29, 2025 19:40

ragnard force-pushed the fix-set-statistics-validation branch from 00cfd92 to fa456e3 on December 29, 2025 19:48

kevinjqliu reviewed Dec 29, 2025
