Introduce APPROX_COUNT_DISTINCT(...) as a first-class aggregation type backed by HyperLogLog (HLL) sketches. HLL sketches are composable: sketches built over separate slices of the data can be merged into a single sketch, which yields a combined cardinality estimate with ~1-2% error.
name: unique_customers
type: metric
expression: APPROX_COUNT_DISTINCT(customer_id)
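To make the composability claim concrete, here is a minimal Python sketch of a HyperLogLog-style structure. It is a toy for illustration only: TinyHLL, its SHA-256-based hash, and the sample ID ranges are assumptions made here, not DJ's implementation or the sketch format used by Spark or Druid. It shows that merging two sketches (register-wise maximum) gives exactly the registers of a sketch built over the combined input.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the value's string form.
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)               # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # The union of two sketches is the register-wise maximum.
        assert self.p == other.p, "sketches must share the same precision"
        merged = TinyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Hypothetical data: two overlapping ranges of customer IDs, 50,000 distinct in total.
east, west, both = TinyHLL(), TinyHLL(), TinyHLL()
for cid in range(0, 30000):
    east.add(cid)
    both.add(cid)
for cid in range(20000, 50000):
    west.add(cid)
    both.add(cid)

merged = east.merge(west)
print(merged.registers == both.registers)  # merging loses nothing vs. one pass
print(round(merged.estimate()))            # approximate count of distinct IDs
```

Because the merge is a register-wise maximum, it is associative and commutative, so sketches can be combined in any order and at any level of a rollup.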
Materialization (Build Time)
DJ generates sketch accumulation SQL:
-- Spark
SELECT date, region, hll_sketch_agg(customer_id, 12) AS cust_sketch
FROM orders
GROUP BY date, region
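The second argument to hll_sketch_agg (12 here) is the log2 of the number of registers in the sketch. As a rough guide to how that setting relates to the ~1-2% error figure, the snippet below evaluates the classic HyperLogLog standard-error bound, 1.04 / sqrt(m) with m = 2^lg_k. This is an assumption for illustration: the helper name is hypothetical, and the exact error of the sketch implementation behind Spark or Druid may differ from this textbook bound.

```python
import math


def hll_relative_error(lg_k):
    """Classic HyperLogLog standard error: 1.04 / sqrt(m), with m = 2**lg_k registers."""
    m = 2 ** lg_k
    return 1.04 / math.sqrt(m)


# Compare a few precision settings: more registers means lower error, larger sketches.
for lg_k in (10, 12, 14):
    print(lg_k, 2 ** lg_k, hll_relative_error(lg_k))
```

With lg_k = 12 the sketch has 4096 registers, and the bound works out to about 1.6%, which is consistent with the ~1-2% range quoted earlier.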
Query Time (Rollup)
When a query rolls up materialized data, DJ generates SQL that merges the stored sketches and then estimates the cardinality:
-- Druid
SELECT
HLL_SKETCH_ESTIMATE(HLL_SKETCH_UNION(cust_sketch))
FROM materialized_cube
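The reason the rollup merges sketches instead of summing per-group counts can be shown with a small Python example. Here plain Python sets act as an exact stand-in for HLL sketches (union plays the role of the sketch merge, len plays the role of the estimate); the rows and variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical order rows: (date, region, customer_id)
orders = [
    ("2024-01-01", "east", "c1"),
    ("2024-01-01", "east", "c2"),
    ("2024-01-01", "west", "c2"),
    ("2024-01-02", "east", "c1"),
    ("2024-01-02", "west", "c3"),
]

# Build time: one "sketch" per (date, region) group, as in the materialization step.
cube = defaultdict(set)
for date, region, customer_id in orders:
    cube[(date, region)].add(customer_id)

# Incorrect rollup: summing per-group distinct counts double-counts
# customers who appear in more than one group.
summed = sum(len(sketch) for sketch in cube.values())

# Correct rollup: merge (union) the per-group sketches, then estimate.
merged = set().union(*cube.values())
rolled_up = len(merged)

print(summed, rolled_up)
```

Distinct counts are not additive across groups, so a pre-aggregated count column cannot be rolled up safely, whereas a sketch column can.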
Fallback (No Materialization)
SELECT hll_sketch_estimate(hll_sketch_agg(customer_id, 12)) AS unique_customers
FROM orders