Building ELT data pipelines with Airflow Assignment 3



Aim

The aim of this assignment is to build production-ready ELT data pipelines using Apache Airflow and dbt Cloud. You will process and transform Airbnb and Census data for Sydney, load it into a data warehouse following a medallion architecture (Bronze, Silver, Gold), and create a data mart for analytical insights. The assignment also includes performing ad-hoc analyses to address key business questions.

Introduction to the Datasets

1. Airbnb

Airbnb is an online marketplace that connects people looking for accommodation (Airbnb guests) with people looking to rent out their properties (Airbnb hosts) on a short-term or long-term basis. The rental properties include apartments (the dominant category), homes, boats, and much more. As of 2019, Airbnb had 150 million users across 191 countries, making it a major disruptor of the traditional hospitality industry (much as Uber and other emerging transport services have disrupted traditional intra-city transportation). As a rental ecosystem, Airbnb generates enormous amounts of data, including but not limited to: the density of rentals across regions (cities and neighbourhoods), price variations across rentals, host-guest interactions in the form of reviews, and so forth.

2. Census

The Census of Population and Housing (Census) is Australia’s largest statistical collection, undertaken by the Australian Bureau of Statistics (ABS). For more than 100 years, the Census has provided a snapshot of Australia, showing how the country has changed over time and helping it plan for the future. The aim of the Census is to accurately collect data on the key characteristics of people in Australia on Census night and the dwellings in which they live. In 2016, the Census counted close to 10 million dwellings and approximately 24 million people, the largest number counted to date.

The information provided in the Census helps estimate Australia’s population, which is used to distribute government funds and plan services for the community – housing, transport, education, industry, hospitals and the environment. Census data is also used by individuals and organisations in the public and private sectors to make informed decisions on policy and planning issues that impact the lives of all Australians.

Tasks

You will need to set up an Airflow and Postgres environment on GCP (Cloud Composer and a Cloud SQL instance) along with dbt Cloud.

Part 0: Download the datasets

  1. 12 months of Airbnb listing data for Sydney: link
  2. The tables G01 (“Selected Person Characteristics by Sex”) and G02 (“Selected Medians and Averages”) of the General Community Profile Pack from the 2016 census at the LGA level: link.
  3. A dataset to help you join the two datasets based on LGA codes, and a mapping between LGAs and suburbs: link.

Part 1: Use Airflow to load the initial raw data into Postgres

  1. Upload the datasets: upload the first month of Airbnb data (05_2020.csv), plus the Census dataset and the LGA mapping, to the Airflow storage bucket.
  2. Using DBeaver, set up a Bronze schema in your Postgres instance and create the necessary raw tables to store the initial data (a DDL sketch for one such table follows this list).
  3. Build an Airflow DAG with no set schedule interval (schedule_interval=None) that reads the data from the storage bucket and loads it into the raw tables within the Bronze schema on Postgres.
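For reference, below is a minimal DDL sketch for one Bronze raw table. The schema name and the column subset are assumptions based on a typical Airbnb listings export (e.g. scraped_date, listing_neighbourhood) and should be adjusted to match the actual CSV headers; the Airflow DAG then simply copies each CSV from the bucket into the matching table.

```sql
-- Minimal sketch of a Bronze raw table for the monthly listings file.
-- Column names/types are assumptions; mirror the real CSV headers.
CREATE SCHEMA IF NOT EXISTS bronze;

CREATE TABLE IF NOT EXISTS bronze.listings (
    listing_id            BIGINT,
    scrape_id             BIGINT,
    scraped_date          DATE,
    host_id               BIGINT,
    host_name             TEXT,
    host_neighbourhood    TEXT,
    listing_neighbourhood TEXT,
    property_type         TEXT,
    room_type             TEXT,
    accommodates          INT,
    price                 NUMERIC,
    has_availability      TEXT,
    availability_30       INT,
    number_of_reviews     INT,
    review_scores_rating  NUMERIC
);
```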

Part 2: Design a data warehouse with dbt

  1. Create a data warehouse on Postgres following the Medallion architecture (Bronze, Silver, Gold) using dbt. Include at least 4 dimension tables (e.g., host, suburb, LGA, etc.) along with the two Census tables as reference data in the Gold layer. The layers are defined as follows:
  • Bronze: Stores the raw tables loaded from Airflow.
  • Silver: Contains cleaned and transformed versions of the Bronze tables with consistent naming conventions. This layer includes snapshots of your dimensions using a timestamp strategy, addressing any issues with listing dates and LGAs. You will have to decompose the listings table into different entities and snapshot them with the timestamp strategy (use the correct timestamp column). Snapshots are only for tables that will be transformed into dimensions; do not snapshot the fact table (a dbt snapshot sketch appears after this list).
  • Gold:
    • Implements a star schema consisting of dimension and fact tables, where fact tables contain only IDs and metrics (e.g., price).
    • Datamart: This layer stores the answers to the key business questions. It should be built as views on top of the fact table joined with the relevant dimensions. When you join, you must use Slowly Changing Dimensions Type 2 (SCD2) logic so that each fact row is shown with the dimension values that were valid at that point in time. In other words, the datamart should always reflect how the data actually looked during that specific period, not just the latest version of a dimension (see the SCD2 join sketch after this list).
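To make the Silver and Gold requirements concrete, here are two rough dbt sketches. They are illustrations only: the model, source, and column names (host_snapshot, fact_listing, dim_host, scraped_date, estimated_revenue) are assumptions, while dbt_valid_from / dbt_valid_to are the columns dbt itself adds to snapshot tables.

A Silver-layer snapshot using the timestamp strategy, for one hypothetical dimension decomposed out of the listings table:

```sql
-- snapshots/host_snapshot.sql -- hypothetical names and columns
{% snapshot host_snapshot %}
{{
    config(
        target_schema='silver',
        unique_key='host_id',
        strategy='timestamp',
        updated_at='scraped_date'
    )
}}

-- One row per host per run (latest scrape); dbt tracks history by
-- maintaining dbt_valid_from / dbt_valid_to across snapshot runs.
select distinct on (host_id)
    host_id,
    host_name,
    host_neighbourhood,
    scraped_date
from {{ source('bronze', 'listings') }}
order by host_id, scraped_date desc

{% endsnapshot %}
```

And a Gold-layer datamart view that joins the fact table to a dimension with SCD2 logic, so each fact row picks up the dimension version valid on its date:

```sql
-- models/gold/dm_host_revenue.sql -- hypothetical star-schema names
select
    d.host_name,
    d.host_neighbourhood,
    date_trunc('month', f.scraped_date) as month,
    sum(f.estimated_revenue)            as estimated_revenue
from {{ ref('fact_listing') }} f
join {{ ref('dim_host') }} d
  on  f.host_id = d.host_id
  -- SCD2 join: match the dimension row whose validity window
  -- contains the fact's date, not just the latest version.
  and f.scraped_date >= d.dbt_valid_from
  and f.scraped_date <  coalesce(d.dbt_valid_to, '9999-12-31'::timestamp)
group by 1, 2, 3
```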

Part 3: Load the Remaining Airbnb Data

  1. Run your Airflow DAG to load the remaining Airbnb datasets month by month in chronological order, and manually trigger the dbt job after each load. Ensure that each month’s data is processed sequentially to maintain the correct order and data integrity throughout the pipeline.

Part 4: Ad-hoc analysis

Answer the following questions with supporting results (write SQL on Postgres):

  1. What are the demographic differences (e.g., age group distribution, household size) between the top 3 performing and lowest 3 performing LGAs, based on estimated revenue per active listing over the last 12 months?
  2. Is there a correlation between the median age of a neighbourhood (from Census data) and the revenue generated per active listing in that neighbourhood? (See the SQL sketch after this list.)
  3. What will be the best type of listing (property type, room type and accommodates) for the top 5 “listing_neighbourhood” values (in terms of estimated revenue per active listing) to have the highest number of stays?
  4. For hosts with multiple listings in NSW, are their properties concentrated within the same LGA, or are they distributed across different LGAs?
  5. For hosts with a single Airbnb listing in NSW, does the estimated revenue over the last 12 months cover the annualised median mortgage repayment in the corresponding LGA? Which LGA has the highest percentage of hosts that can cover it?
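As an indication of the SQL expected, here is a rough sketch for question 2 using Postgres’s corr() aggregate. The schema, table, and column names (gold.dm_neighbourhood, median_age, revenue_per_active_listing) are hypothetical placeholders for whatever your datamart actually exposes.

```sql
-- Hypothetical sketch for question 2: correlation between a
-- neighbourhood's median age (Census) and its revenue per active listing.
with neighbourhood_stats as (
    select
        listing_neighbourhood,
        avg(median_age)                 as median_age,
        avg(revenue_per_active_listing) as revenue_per_active_listing
    from gold.dm_neighbourhood          -- assumed datamart view
    group by listing_neighbourhood
)
select
    corr(median_age, revenue_per_active_listing) as age_revenue_correlation
from neighbourhood_stats;
```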

Assessment Summary

The primary objective of this assessment was to build production-ready ELT data pipelines using Apache Airflow and dbt Cloud to process and transform Airbnb and Census data for Sydney. The task aimed to integrate, clean, and model large datasets within a Medallion Architecture (Bronze, Silver, Gold) data warehouse framework.
The key requirements included:
  1. Setting up a Cloud-based ELT environment using Google Cloud Platform (Cloud Composer and SQL instance) and dbt Cloud.
  2. Loading Airbnb and Census datasets into a PostgreSQL database using Airflow DAGs.
  3. Designing a Data Warehouse following the Medallion architecture:
    • Bronze Layer: Raw data storage.
    • Silver Layer: Cleaned, transformed, and standardized data with snapshots.
    • Gold Layer: Star schema with fact and dimension tables.
  4. Building a Datamart for analytical queries, using Slowly Changing Dimensions Type 2 (SCD2) logic.
  5. Sequential data loading and pipeline orchestration to ensure data consistency.
  6. Performing ad-hoc analysis using SQL to answer complex business questions, including revenue estimation, demographic comparisons, and host performance analysis.

The overall aim was to demonstrate a complete data engineering lifecycle from ingestion and transformation to modeling and analysis using modern ELT tools.

Academic Mentor’s Step-by-Step Guidance Process

Step 1: Understanding the Assessment Scope

The mentor began by helping the student understand the objective and scope of the project, emphasizing the relationship between Airflow (orchestration), dbt (transformation), and Postgres (data storage). The mentor clarified the importance of medallion architecture and its role in maintaining data lineage and integrity across layers.

Step 2: Environment Setup and Data Ingestion

The mentor guided the student through GCP setup, including creating a Cloud Composer environment, connecting it to a PostgreSQL instance, and configuring storage buckets. Together, they designed the initial Airflow DAG to automate the loading of raw Airbnb and Census datasets into the Bronze schema, ensuring proper task dependencies and error handling.

Step 3: Data Transformation and Modeling with dbt

In this stage, the mentor explained how to structure dbt models to transition data through Bronze → Silver → Gold layers. The student was taught how to create dimension and fact tables, apply naming conventions, and use timestamp-based snapshotting for historical tracking of dimension changes.
The mentor also demonstrated SCD Type 2 implementation to ensure data mart queries reflect historically accurate values.

Step 4: Building the Data Mart and Analytical Queries

Once the Gold layer was established, the mentor guided the student in developing datamart views that joined fact and dimension tables. These were used to answer ad-hoc business questions related to Airbnb’s performance, demographic patterns, and host behavior.
The mentor emphasized writing efficient SQL queries, validating outputs, and using data visualization for clear insights.

Step 5: Sequential Data Loading and Pipeline Validation

The mentor instructed the student to run the Airflow DAG iteratively for each month’s dataset, ensuring chronological integrity and successful data propagation through the dbt models. Debugging sessions were conducted to address data quality and dependency issues.

Step 6: Final Review and Presentation of Findings

Finally, the mentor assisted the student in preparing a summary of results, highlighting insights such as correlations between demographics and revenue, as well as regional performance variations. The focus was on demonstrating data-driven decision-making and end-to-end automation within the pipeline.

Outcome and Learning Objectives Achieved

By completing this assessment under guided mentorship, the student:

  • Gained hands-on experience with Apache Airflow for workflow orchestration and task automation.
  • Developed advanced data modeling skills using dbt Cloud and applied Medallion architecture for scalable data management.
  • Understood how to maintain data consistency and historical accuracy through snapshotting and SCD2 logic.
  • Enhanced SQL querying and analytical reasoning to extract actionable insights from complex datasets.
  • Demonstrated capability to build a production-grade ELT pipeline, bridging technical implementation with business analysis objectives.

Get Inspired by This Sample – But Submit Your Own Unique Work

Looking for guidance on how to approach your assignment? You can download this sample solution to understand the structure, flow, and academic formatting required for top-quality submissions. This reference file will help you learn how to analyze, research, and present your work effectively.

However, remember that this sample is meant for reference and learning purposes only. Submitting it as your own may lead to plagiarism penalties under academic integrity policies. Use it wisely to enhance your understanding, not as a direct submission.

If you want a custom-written, plagiarism-free assignment, our team of professional academic writers can craft a solution tailored to your specific topic, university requirements, and grading criteria. You’ll receive 100% original content backed by proper research, citations, and formatting, ensuring you achieve the grades you aim for with complete confidence.

Why Choose a Fresh Custom Solution?

  • Written by qualified subject experts
  • 100% plagiarism-free and Turnitin-safe
  • Structured as per your course rubrics
  • On-time delivery with guaranteed confidentiality

Disclaimer: The sample provided on this page is strictly for educational reference only. We do not recommend submitting it as your own work.

