Skip to content

Conversation

@ragnard
Copy link

@ragnard ragnard commented Dec 26, 2025

Previously the pydantic @model_validator on SetStatisticsUpdate would fail because it assumed statistics was a model instance. In a "before"" validator that is not the case.

Use an "after" validator instead, where we can use instantiated and validated fields.

Before

>>> import pyiceberg.table.update
>>> pyiceberg.table.update.SetStatisticsUpdate.model_validate({'statistics': {'snapshot-id': 1234, 'file-size-in-bytes': 0, 'statistics-path': '', 'file-footer-size-in-bytes': 0, 'blob-metadata': []}})
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    pyiceberg.table.update.SetStatisticsUpdate.model_validate({'statistics': {'snapshot-id': 1234, 'file-size-in-bytes': 0, 'statistics-path': '', 'file-footer-size-in-bytes': 0, 'blob-metadata': []}})
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ragge/projects/github.com/ragnard/iceberg-python/.venv/lib/python3.14/site-packages/pydantic/main.py", line 716, in model_validate
    return cls.__pydantic_validator__.validate_python(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        obj,
        ^^^^
    ...<5 lines>...
        by_name=by_name,
        ^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ragge/projects/github.com/ragnard/iceberg-python/pyiceberg/table/update/__init__.py", line 191, in validate_snapshot_id
    data["snapshot_id"] = stats.snapshot_id
                          ^^^^^^^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'snapshot_id'

After

>>> import pyiceberg.table.update
>>> pyiceberg.table.update.SetStatisticsUpdate.model_validate({'statistics': {'snapshot-id': 1234, 'file-size-in-bytes': 0, 'statistics-path': '', 'file-footer-size-in-bytes': 0, 'blob-metadata': []}})
SetStatisticsUpdate(action='set-statistics', statistics=StatisticsFile(snapshot_id=1234, statistics_path='', file_size_in_bytes=0, file_footer_size_in_bytes=0, key_metadata=None, blob_metadata=[]), snapshot_id=1234)

Rationale for this change

Are these changes tested?

Yes, but only using the two-liners above.

Are there any user-facing changes?

No.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I think we should just deprecate the top-level snapshot_id entirely.
For context, the "before model_validator` was added in 3b53edc#diff-769b43e1d8beaa86141f694679de2bbea3604a65f987a9acd7d9e9efca193b7eR181-R193 to maintain backwards compatibility and prep for deprecation

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops didnt mean to approve

@ragnard
Copy link
Author

ragnard commented Dec 27, 2025

@kevinjqliu Ok, do you want me to change the fix so that snapshot_id is still there, but just not automatically populated?

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the java implementation has not deprecated the top-level snapshot_id. lets proceed with this change since it improves the current validation logic.

Thanks for the PR! Lets add a test and it should be good to go!

Comment on lines 182 to 184
@model_validator(mode="after")
def validate_snapshot_id(self) -> Self:
return self.model_copy(update={"snapshot_id": self.statistics.snapshot_id})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@model_validator(mode="after")
def validate_snapshot_id(self) -> Self:
return self.model_copy(update={"snapshot_id": self.statistics.snapshot_id})
@model_validator(mode="after")
def validate_snapshot_id(self) -> Self:
self.snapshot_id = self.statistics.snapshot_id
return self

nit: use direct assignment

@ragnard ragnard force-pushed the fix-set-statistics-validation branch from 6f2b0d7 to aa85657 Compare December 29, 2025 19:23
@ragnard
Copy link
Author

ragnard commented Dec 29, 2025

Looks like the java implementation has not deprecated the top-level snapshot_id. lets proceed with this change since it improves the current validation logic.

Thanks for the PR! Lets add a test and it should be good to go!

@kevinjqliu Thanks for the (quick!) review. I've changed the fix a bit:

  1. Since the model is frozen it is not possible to use direct assignment, but it is also not possible to use model_copy like I did because an "after" validator needs to return the same model instance. I've reverted to a "before" validator, but handle the dict case properly.

  2. For testing, the issue is not really about tables with statistics, but that it was not possible to instantiate the SetStatisticsUpdate model from a dict. I added an model_roundtrips helper that is now used to check that any model roundtrips (model -> dict -> model) correctly.

Please let me know if you want further changes.

@ragnard ragnard force-pushed the fix-set-statistics-validation branch 3 times, most recently from c739a0e to 00cfd92 Compare December 29, 2025 19:40
Previously the pydantic @model_validator would fail because it assumed
statistics was a model instance. In a "before"" validator that is not
necessarily the case.

Check type explicitly with isinstance instead, and handle `dict` case
too.
@ragnard ragnard force-pushed the fix-set-statistics-validation branch from 00cfd92 to fa456e3 Compare December 29, 2025 19:48
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks good, i found a small bug. Would be great to add a test for this case

Comment on lines +189 to +191
elif isinstance(stats, dict):
snapshot_id = stats.get("snapshot_id")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
elif isinstance(stats, dict):
snapshot_id = stats.get("snapshot_id")
elif isinstance(stats, dict):
snapshot_id = stats.get("snapshot_id")
else:
snapshot_id = None

nit: i think we can inline the else here

if isinstance(stats, StatisticsFile):
snapshot_id = stats.snapshot_id
elif isinstance(stats, dict):
snapshot_id = stats.get("snapshot_id")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
snapshot_id = stats.get("snapshot_id")
snapshot_id = stats.get("snapshot-id")

i think this should be snapshot-id since before validator takes in json as input

class StatisticsCommonFields(IcebergBaseModel):
"""Common fields between table and partition statistics structs found on metadata."""
snapshot_id: int = Field(alias="snapshot-id")
statistics_path: str = Field(alias="statistics-path")
file_size_in_bytes: int = Field(alias="file-size-in-bytes")
class StatisticsFile(StatisticsCommonFields):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a test case for this (and possibly one for the else case too)?

The current test only test the StatisticsFile instance branch

def test_set_statistics_update(table_v2_with_statistics: Table) -> None:
snapshot_id = table_v2_with_statistics.metadata.current_snapshot_id
blob_metadata = BlobMetadata(
type="apache-datasketches-theta-v1",
snapshot_id=snapshot_id,
sequence_number=2,
fields=[1],
properties={"prop-key": "prop-value"},
)
statistics_file = StatisticsFile(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants