🚕

NYC Yellow Taxi Analytics Pipeline

End-to-End Data Engineering Solution for TLC Trip Records

Problem & Motivation

The NYC Taxi and Limousine Commission (TLC) generates millions of trip records monthly, creating a massive opportunity for data-driven insights. However, raw trip data alone cannot answer critical business questions: Which pickup zones generate the most revenue? How do payment methods vary by time and location? What factors influence tip amounts?

This project builds a scalable analytics pipeline that transforms TLC's Yellow Taxi trip records into actionable insights. By implementing dimensional modeling, automated ETL processes, and interactive dashboards, this solution enables taxi companies, city planners, and analysts to make informed decisions about operations, pricing, and service optimization.

Star Schema Data Model

Architecture Overview

Data pipeline architecture (diagram): source data in SQL Server and PostgreSQL → Python ETL (data ingestion, validation, quality checks) → Snowflake (RAW, STAGING, ANALYTICS layers) → dbt transformations → Power BI dashboards and reports for business users. Modern ELT architecture: Extract → Load → Transform → Visualize.

The pipeline follows a modern ELT architecture: Raw TLC trip data is extracted from SQL Server and PostgreSQL using Python scripts, loaded into Snowflake's landing zone, then transformed through multiple layers using dbt (staging → intermediate → marts). The dimensional model enables efficient querying for Power BI dashboards, with automated daily refreshes ensuring stakeholders always have access to the latest insights.
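As a concrete illustration of the extract-and-load half of this flow, the sketch below pulls one month of trips from the PostgreSQL source and lands it unchanged in Snowflake's RAW layer. Connection details, table names, and the date filter are illustrative assumptions, not the project's actual configuration.

```python
# Extract -> load sketch, assuming SQLAlchemy-style connection URLs and a
# RAW.YELLOW_TRIPS landing table (illustrative names, placeholder credentials).
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas
from sqlalchemy import create_engine

# Extract: pull one month of trips from the PostgreSQL source.
pg_engine = create_engine("postgresql+psycopg2://etl_user:***@pg-host/taxi")
trips = pd.read_sql(
    "SELECT * FROM yellow_trips WHERE pickup_datetime >= '2023-01-01'",
    pg_engine,
)

# Load: land the rows as-is in Snowflake's RAW layer; dbt transforms downstream.
conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="ETL_WH", database="TAXI", schema="RAW",
)
write_pandas(conn, trips, table_name="YELLOW_TRIPS", auto_create_table=True)
conn.close()
```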

Key Features

🤖

Automated Data Ingestion

Python-based ETL scripts download trip records from the web and load them efficiently into SQL Server, PostgreSQL, and Snowflake.
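A minimal sketch of the web-ingestion step follows. The CloudFront URL pattern matches how the TLC publishes monthly Parquet files at the time of writing, but verify it before relying on it; the function and path names are assumptions.

```python
# Download one month of Yellow Taxi trip records from the TLC's public hosting.
from pathlib import Path
import requests

BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"

def download_month(year: int, month: int, dest: Path = Path("landing")) -> Path:
    """Fetch one monthly Parquet file into a local landing directory."""
    dest.mkdir(exist_ok=True)
    name = f"yellow_tripdata_{year}-{month:02d}.parquet"
    out = dest / name
    resp = requests.get(f"{BASE}/{name}", timeout=120)
    resp.raise_for_status()  # fail loudly on 404s or throttling
    out.write_bytes(resp.content)
    return out

download_month(2023, 1)
```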

🏗️

dbt Transformations

Modular transformation layers with comprehensive testing, documentation, and lineage tracking. Staging models clean raw data, intermediate models apply business logic, and marts create the dimensional model.
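The project's transformation layers are presumably standard dbt SQL models, but dbt also supports Python models on Snowflake (via Snowpark), which makes for a compact illustration of what a staging model does. Source, column, and file names below are assumptions.

```python
# models/staging/stg_yellow_trips.py -- illustrative dbt Python model only;
# the real staging layer would more likely be SQL. Names are assumptions.
from snowflake.snowpark.functions import col

def model(dbt, session):
    dbt.config(materialized="table")          # Python models materialize as tables
    raw = dbt.source("raw", "yellow_trips")   # declared in sources.yml
    # Staging responsibility: drop obviously bad rows before business logic.
    return (
        raw.filter(col("FARE_AMOUNT") >= 0)
           .filter(col("TPEP_PICKUP_DATETIME") < col("TPEP_DROPOFF_DATETIME"))
    )
```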

Star Schema Design

Optimized dimensional model with 8 dimension tables and 1 central fact table, enabling fast analytical queries for trip analysis, revenue tracking, and location intelligence.
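To make the dimension/fact split concrete, here is a toy pandas sketch that derives a date dimension with surrogate keys and a fact table that references it. Column names are illustrative, not the project's actual schema.

```python
# Toy illustration of the star-schema split: a conformed date dimension
# plus a fact table keyed by surrogate IDs.
import pandas as pd

trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2023-01-01 08:15", "2023-01-02 09:30"]),
    "fare_amount": [12.5, 8.0],
    "tip_amount": [2.0, 1.5],
})

# dim_date: one row per calendar day, with a surrogate key.
dim_date = (
    trips["pickup_datetime"].dt.date.drop_duplicates()
    .rename("date").to_frame().reset_index(drop=True)
)
dim_date["date_key"] = dim_date.index + 1

# fact_trips: measures plus foreign keys into the dimensions.
fact_trips = trips.assign(date=trips["pickup_datetime"].dt.date).merge(
    dim_date, on="date"
)[["date_key", "fare_amount", "tip_amount"]]
print(fact_trips)
```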

🗄️

Massive Data Handling

Each monthly file contains more than 2 million records, including nulls and malformed rows that must be cleaned before analysis.
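One way to handle files of that size without exhausting memory is to stream the Parquet file in batches and filter each batch, as sketched below; the thresholds and column names are illustrative cleaning rules, not the project's exact ones.

```python
# Stream a 2M+ row monthly file in batches instead of loading it whole.
import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile("landing/yellow_tripdata_2023-01.parquet")
clean_chunks = []
for batch in pf.iter_batches(batch_size=500_000):
    df = batch.to_pandas()
    # Drop rows with missing timestamps and obviously bad measures.
    df = df.dropna(subset=["tpep_pickup_datetime", "tpep_dropoff_datetime"])
    df = df[(df["fare_amount"] > 0) & (df["trip_distance"] > 0)]
    clean_chunks.append(df)

clean = pd.concat(clean_chunks, ignore_index=True)
```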

Parquet Files

Handles the Parquet file format natively in SQL Server and Snowflake, and converts it to CSV for loading into PostgreSQL.
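Since PostgreSQL has no built-in Parquet reader, a common pattern (sketched below with illustrative table and file names) is to convert each file to CSV in memory and bulk-load it with COPY.

```python
# Parquet -> CSV -> PostgreSQL COPY sketch; names and credentials are placeholders.
import io
import pandas as pd
import psycopg2

df = pd.read_parquet("landing/yellow_tripdata_2023-01.parquet")

# Render the frame as CSV in memory (no header; COPY maps columns by position).
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)

conn = psycopg2.connect("dbname=taxi user=etl_user password=***")
with conn, conn.cursor() as cur:
    cur.copy_expert("COPY yellow_trips FROM STDIN WITH (FORMAT csv)", buf)
```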

🚀

Interactive Analytics

Power BI dashboards provide real-time insights into trip patterns, revenue analysis by time/location, payment method trends, tip analysis, and operational metrics for taxi fleet optimization.

📊 Live Dashboard Demo

Explore the dashboard below

NYC Taxi Trip Analysis

Technical Stack

Project Resources

Key Learnings & Challenges

  • Dimensional Modeling for Analytics: Implementing a star schema with properly designed dimension and fact tables dramatically improved query performance. Understanding when to denormalize data and how to handle slowly changing dimensions was crucial for maintaining historical accuracy.
  • Multi-Source Data Integration: Extracting and harmonizing data from both SQL Server and PostgreSQL required careful schema mapping and data type conversions. Building a robust Python ETL framework with error handling and retry logic ensured reliable data ingestion.
  • dbt Best Practices: Organizing transformations into staging → intermediate → marts layers created a clear separation of concerns. Writing comprehensive tests and documentation made the pipeline maintainable and enabled collaboration.
  • Handling Large Volumes: Processing millions of taxi trip records required implementing incremental loading strategies based on pickup timestamps (see the watermark sketch after this list). This reduced processing time from hours to minutes and minimized warehouse compute costs.
  • Data Quality at Scale: Discovered that ~5% of trip records had data quality issues (negative fares, impossible timestamps, missing locations). Implementing validation rules in the staging layer prevented bad data from polluting downstream models.
  • Snowflake Optimization: Learned to leverage Snowflake-specific features like clustering keys on datetime and location columns, result caching, and warehouse sizing strategies to balance performance and cost for analytical workloads.
  • Location Data Challenges: Working with latitude/longitude coordinates required understanding spatial data concepts and implementing logic to map coordinates to NYC taxi zones for better analytical grouping (a spatial-join sketch follows this list).
  • Power BI Integration: Optimizing the data model for Power BI required understanding DirectQuery vs Import mode trade-offs, creating aggregated tables for better performance, and designing measures that leverage the star schema efficiently.
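As referenced in the incremental-loading item above, a watermark-driven extract might look like the following sketch; the query, connection details, and watermark source are assumptions, not the project's code.

```python
# Incremental extract: only pull rows newer than the warehouse's high-water mark.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://etl_user:***@pg-host/taxi")

def extract_increment(watermark: str) -> pd.DataFrame:
    """Fetch only trips picked up after the last loaded timestamp."""
    return pd.read_sql(
        text("SELECT * FROM yellow_trips WHERE pickup_datetime > :wm"),
        engine,
        params={"wm": watermark},
    )

# In practice the watermark would come from the warehouse, e.g.
# SELECT MAX(pickup_datetime) FROM the fact table in Snowflake.
new_rows = extract_increment("2023-01-31 23:59:59")
```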

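For the location-data item, a spatial join against the TLC taxi-zone shapefile is one way to map raw coordinates to zones. The file path and the `zone` column name are assumptions based on the publicly distributed shapefile.

```python
# Map pickup coordinates to NYC taxi zones via a point-in-polygon spatial join.
import geopandas as gpd
import pandas as pd

trips = pd.DataFrame({"lon": [-73.985, -73.776], "lat": [40.758, 40.645]})
points = gpd.GeoDataFrame(
    trips, geometry=gpd.points_from_xy(trips.lon, trips.lat), crs="EPSG:4326"
)

# Reproject the zone polygons to match the points' coordinate system.
zones = gpd.read_file("taxi_zones.shp").to_crs("EPSG:4326")
with_zones = gpd.sjoin(points, zones, how="left", predicate="within")
print(with_zones[["lon", "lat", "zone"]])
```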
Get In Touch

Interested in discussing this project or exploring collaboration opportunities? Let's connect!