
ParquetDataset filtering on Hive-partitioned datasets accesses unrelated directories and files #48671

@rouault

Description


Describe the bug, including details regarding any error messages, version, and platform.

With pyarrow == 22.0.0,

given a Hive-partitioned dataset generated with

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
                  'n_legs': [2, 2, 4, 4, 5, 100],
                  'animal': ["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"]})
metadata_collector = []
pq.write_to_dataset(table, root_path='dataset_v2_read', partition_cols=['year'],
                    metadata_collector=metadata_collector)
pq.write_metadata(metadata_collector[0].schema.to_arrow_schema(), 'dataset_v2_read/_metadata',
                  metadata_collector=metadata_collector)
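
For reference, a quick sketch to see the layout this produces (the parquet file names are generated by write_to_dataset and will differ on every run; only the year=... directory structure matters):

# Sketch: list the files produced by the snippet above.
import pathlib

for path in sorted(pathlib.Path('dataset_v2_read').rglob('*')):
    print(path)
# dataset_v2_read/_metadata
# dataset_v2_read/year=2019
# dataset_v2_read/year=2019/<uuid>-0.parquet
# dataset_v2_read/year=2020
# ...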

one can observe that, when applying a year=2019 filter, the other partition subdirectories and their files are accessed as well. This is unexpected and can become very slow on cloud storage:

$ strace python3 -c 'import pyarrow.parquet as pq;dataset = pq.ParquetDataset("dataset_v2_read",filters=[("year", "==", 2019)]); table = dataset.read(); print(table)'  2>&1|grep dataset_v2_read
stat("dataset_v2_read", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
stat("dataset_v2_read", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
stat("dataset_v2_read", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
openat(AT_FDCWD, "dataset_v2_read", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
stat("dataset_v2_read/_metadata", {st_mode=S_IFREG|0664, st_size=1637, ...}) = 0
stat("dataset_v2_read/year=2021", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
openat(AT_FDCWD, "dataset_v2_read/year=2021", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
stat("dataset_v2_read/year=2021/cffaf5cf4ab148d89a5a6047f2be2757-0.parquet", {st_mode=S_IFREG|0664, st_size=771, ...}) = 0
stat("dataset_v2_read/year=2022", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
openat(AT_FDCWD, "dataset_v2_read/year=2022", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
stat("dataset_v2_read/year=2022/cffaf5cf4ab148d89a5a6047f2be2757-0.parquet", {st_mode=S_IFREG|0664, st_size=768, ...}) = 0
stat("dataset_v2_read/year=2020", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
openat(AT_FDCWD, "dataset_v2_read/year=2020", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
stat("dataset_v2_read/year=2020/cffaf5cf4ab148d89a5a6047f2be2757-0.parquet", {st_mode=S_IFREG|0664, st_size=763, ...}) = 0
stat("dataset_v2_read/year=2019", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
openat(AT_FDCWD, "dataset_v2_read/year=2019", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
stat("dataset_v2_read/year=2019/cffaf5cf4ab148d89a5a6047f2be2757-0.parquet", {st_mode=S_IFREG|0664, st_size=788, ...}) = 0
openat(AT_FDCWD, "dataset_v2_read/year=2019/cffaf5cf4ab148d89a5a6047f2be2757-0.parquet", O_RDONLY) = 3
openat(AT_FDCWD, "dataset_v2_read/year=2019/cffaf5cf4ab148d89a5a6047f2be2757-0.parquet", O_RDONLY) = 3

I'm providing a Python reproducer for convenience, but I've reproduced the same issue with the C++ API as well.
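
(The exact C++ reproduction isn't shown here; as a rough sketch, the generic pyarrow.dataset API, which is a thin binding over the C++ Arrow Dataset implementation, exercises the same discovery code path:)

# Rough sketch only, not the C++ reproduction mentioned above.
import pyarrow.dataset as ds

dataset = ds.dataset("dataset_v2_read", format="parquet", partitioning="hive")
table = dataset.to_table(filter=ds.field("year") == 2019)
print(table)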

Component(s)

Python, Parquet, C++
