This project explores home sales data using PySpark and SparkSQL. I completed it as part of a data engineering and analytics module, focusing on querying, caching, partitioning, and performance benchmarking within a big data context.
Using PySpark, I analyzed a dataset of home sales to calculate various aggregated metrics and optimize query performance. Key tasks included:
- Creating temporary views for SQL-style analysis
- Writing multiple queries to extract insights from housing data
- Caching data in memory to reduce query times
- Partitioning large datasets by year for performance
- Comparing execution times with and without caching
- Writing and reading Parquet files
- `Home_Sales.ipynb` — Jupyter notebook containing all the code and answers
- `home_sales_revised.csv` — The source dataset, read from an AWS S3 bucket
- `home_sales_partitioned/` — Output Parquet files partitioned by `date_built` (if committed)
- Loaded the dataset from S3 into a Spark DataFrame
- Registered a temporary view called `home_sales` for SparkSQL queries (sketched below)
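
The loading step might look roughly like the following; the exact S3 URL isn't reproduced in this README, so the path below is a placeholder to substitute with the bucket used in the notebook.

```python
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HomeSales").getOrCreate()

# Placeholder URL -- substitute the actual S3 path used in the notebook.
url = "https://<your-bucket>.s3.amazonaws.com/home_sales_revised.csv"
spark.sparkContext.addFile(url)

# Read the CSV into a DataFrame, inferring column types from the data.
home_sales_df = spark.read.csv(
    SparkFiles.get("home_sales_revised.csv"), header=True, inferSchema=True
)

# Register a temporary view so the data can be queried with SparkSQL.
home_sales_df.createOrReplaceTempView("home_sales")
```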
I answered the following questions using SparkSQL (example queries are sketched after the list):
- Q1: What is the average price for a four-bedroom house sold each year?
- Q2: What is the average price of 3 bed / 3 bath homes per year built?
- Q3: What is the average price of 3 bed / 3 bath homes with 2 floors and ≥ 2,000 sqft per year?
- Q4: What is the average price per "view" rating for homes with an average price ≥ $350,000? (also measured query runtime)
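
As an illustration, here is roughly what Q1 and Q4 look like in SparkSQL. Column names such as `date`, `bedrooms`, `price`, and `view` are assumptions based on the dataset description, and the runtime is measured with plain wall-clock timing.

```python
import time

# Q1: average price of four-bedroom houses, grouped by the year they were sold.
spark.sql("""
    SELECT YEAR(date) AS year_sold,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    WHERE bedrooms = 4
    GROUP BY YEAR(date)
    ORDER BY year_sold
""").show()

# Q4: average price per "view" rating, keeping only ratings whose average
# price is at least $350,000, with a simple runtime measurement.
start_time = time.time()
spark.sql("""
    SELECT view,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()
print(f"--- {time.time() - start_time} seconds ---")
```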
- Cached the `home_sales` temporary view
- Verified it was cached using `spark.catalog.isCached()`
- Re-ran the Q4 query using cached data and measured the performance improvement (sketched below)
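
A minimal sketch of the caching step, reusing the Q4 query from above:

```python
import time

# Cache the temporary view in memory (spark.sql("CACHE TABLE home_sales") also works).
spark.catalog.cacheTable("home_sales")

# Verify the view is cached.
print(spark.catalog.isCached("home_sales"))  # expected: True

# Re-run Q4 against the cached data and compare the runtime to the uncached run.
start_time = time.time()
spark.sql("""
    SELECT view,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()
print(f"--- {time.time() - start_time} seconds ---")
```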
- Partitioned the DataFrame by `date_built` and saved it in Parquet format (see the sketch after this list)
- Loaded the partitioned data back into Spark
- Created a temp view for the Parquet data and re-ran the analysis
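
The partitioning and Parquet round-trip might look like the following. The output path matches the `home_sales_partitioned/` folder listed above; the view name `home_sales_parquet` is my own choice for illustration.

```python
# Write the DataFrame partitioned by the year each home was built.
home_sales_df.write.partitionBy("date_built").mode("overwrite").parquet("home_sales_partitioned")

# Read the partitioned Parquet data back into Spark.
parquet_df = spark.read.parquet("home_sales_partitioned")

# Register a temp view over the Parquet data and re-run the Q4 aggregation.
parquet_df.createOrReplaceTempView("home_sales_parquet")
spark.sql("""
    SELECT view,
           ROUND(AVG(price), 2) AS avg_price
    FROM home_sales_parquet
    GROUP BY view
    HAVING AVG(price) >= 350000
    ORDER BY view DESC
""").show()
```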
- Uncached the `home_sales` table
- Verified the table was no longer cached (one-liner sketched below)
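
Uncaching is a one-liner, followed by a check that the cache is gone:

```python
# Release the cached view and confirm it is no longer cached.
spark.catalog.uncacheTable("home_sales")
print(spark.catalog.isCached("home_sales"))  # expected: False
```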
Using `cacheTable()` significantly reduced the runtime of repeated queries. Partitioning by `date_built` also improved performance for queries scoped to specific years, illustrating two key Spark optimization strategies.
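
To illustrate the partition-pruning point under the same assumptions as above (a `home_sales_parquet` view backed by data partitioned on `date_built`), a query that filters on the partition column only scans the matching `date_built=YYYY` folders:

```python
# Only the files under date_built=2017 are read, not the whole dataset.
# 2017 is just an example year for illustration.
spark.sql("""
    SELECT ROUND(AVG(price), 2) AS avg_price
    FROM home_sales_parquet
    WHERE date_built = 2017
""").show()
```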
- Python / PySpark
- SparkSQL
- Jupyter Notebook
- Parquet file format
- AWS S3 (data source)
This project reinforced my understanding of distributed computing and big data querying in Spark. It demonstrates how to blend SQL-style analysis with performance tuning techniques for efficient data workflows.
Feel free to fork this repo or use it as a reference if you're learning PySpark or optimizing large datasets. ✨