Daggy McDagface: Orchestrating Data Pipelines at Scale at CTA
At CTA, we help progressive mission‐driven organizations harness their data to scale their impact. In this post, we are excited to share how we deploy and orchestrate a multitude of data pipelines using DAGs—and how we’ve scaled our process with a nifty DAG generator we fondly call Daggy McDagface (named in homage to Boaty McBoatface).
(You can check out this 7-minute video to learn more in a thrilling audiovisual format. Or keep reading below. Or do both!)
What Is a DAG?
A DAG (directed acyclic graph, pronounced “dag,” rhymes with “drag”) is, for our purposes, a set of tasks that execute one after the other. Software engineers commonly deploy code as DAGs that run in the cloud, executing different scripts in a predetermined order and on a specific schedule. For example, a DAG that runs a simple data pipeline to sync data from S3 into BigQuery might look like this:
Wait for a file to appear in a specific S3 bucket,
Download that file and upload it to Google Cloud Storage, and then
Sync the data from GCS into BigQuery.
An example of a simple DAG that runs a few tasks one after another. This can be scheduled to run at whatever frequency is needed (weekly, daily, hourly, etc.).
Why Use DAGs for Data Pipelines?
Consider the following scenario: your company has an executive team that wants to review its business metrics every morning. You, a data analyst, have written a script that processes data to create reports for them. How do you deliver those critical business reports to the executives every morning? One option is to come into the office at 7 AM to run the script manually on your laptop and pray that it succeeds in time to email the report to leadership… every single day. That is technically a viable option, and one that some of us at CTA have (unfortunately) lived through in past jobs.
But there’s a better way to live! At CTA, we are passionate about automating data pipelines as much as possible, and DAGs enable us to run our code in the cloud. This means any task you can run with a script on your computer can instead run automatically on the internet, without anyone needing to commute anywhere at 7 AM to push a button. And that means your analyst, and the entire team, can sleep a little better at night.
Deploying DAGs with Airflow
At CTA, we deploy DAGs using Apache Airflow—a tried-and-true tool for orchestrating DAGs (though not without its foibles). With Airflow, you define a DAG in a Python script and place it in a designated folder. Airflow then scans the folder and turns those scripts into executable DAGs.
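As a rough sketch, the S3-to-BigQuery pipeline from earlier could live in a single DAG file along these lines. Every bucket, dataset, and schedule below is a placeholder rather than CTA’s real configuration, and the operators assume a recent Airflow 2.x install with the Amazon and Google provider packages:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.amazon.aws.transfers.s3_to_gcs import S3ToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="example_s3_to_bigquery",
    schedule="@daily",  # could just as easily be weekly, hourly, etc.
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # 1. Wait for the file to appear in a specific S3 bucket
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="example-vendor-bucket",
        bucket_key="exports/daily_report.csv",
    )

    # 2. Copy the file from S3 into Google Cloud Storage
    copy_to_gcs = S3ToGCSOperator(
        task_id="copy_to_gcs",
        bucket="example-vendor-bucket",
        prefix="exports/daily_report.csv",
        dest_gcs="gs://example-landing-bucket/exports/",
    )

    # 3. Load the file from GCS into BigQuery
    load_to_bigquery = GCSToBigQueryOperator(
        task_id="load_to_bigquery",
        bucket="example-landing-bucket",
        source_objects=["exports/daily_report.csv"],
        destination_project_dataset_table="example_project.example_dataset.daily_report",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
    )

    wait_for_file >> copy_to_gcs >> load_to_bigquery
```

The >> operator is how Airflow expresses task ordering: each task runs only after the one before it succeeds.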
If you only have a few DAGs to run, it’s simple enough to use one script per DAG. However, as your work scales up and the number of DAGs grows, maintaining all those scripts inevitably becomes a challenge.
For example, suppose your team decides to add a step that runs tests on your data before finalizing a sync. To add that task to a whole bunch of DAGs, you have to make the same change in each and every script. Not only is that tedious, it’s also an easy way to make mistakes along the way.
If you only have a few DAGs, using one Python script per DAG is fine. But if you need to run a lot of DAGs, that starts to get painful!
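To picture the problem: with one Python file per DAG, the dags/ folder for even a handful of partners and vendors quickly fills up with near-identical scripts (all of these filenames are invented):

```
dags/
├── sync_actblue_org_a.py
├── sync_actblue_org_b.py
├── sync_hubspot_org_a.py
├── sync_hubspot_org_c.py
├── sync_snapchat_org_b.py
└── ...and dozens more, each mostly copy-pasted from the last
```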
Scaling Up: Enter Daggy McDagface
When our DAG count began to climb as we onboarded our first partners in 2022, we knew we needed a more scalable way to deploy pipelines. The answer? A DAG generator script we call Daggy McDagface.
Instead of maintaining hundreds of separate Python files, we use a single generator script that reads lightweight, human-readable configuration YAML files. Each YAML file contains the minimal details needed to define a new pipeline, such as where the raw data is located, who the vendor is (ActBlue? HubSpot? Snapchat?), which dbt project to use to normalize the data (all of CTA’s dbt code is publicly viewable in this GitHub repository), and where in BigQuery the final data should be delivered.
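As a heavily simplified sketch of the idea (the directory path, config fields, and stand-in tasks here are hypothetical; CTA’s real generator builds the full set of sync, dbt, and BigQuery tasks described above), the generator can loop over the config files and register one DAG per file:

```python
import glob
from datetime import datetime

import yaml  # PyYAML
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Hypothetical location for the per-pipeline YAML configs
CONFIG_DIR = "/opt/airflow/dags/configs"


def build_dag(config: dict) -> DAG:
    """Turn one YAML config into one Airflow DAG."""
    with DAG(
        dag_id=f"sync_{config['vendor']}_{config['client']}",
        schedule=config.get("schedule", "@daily"),
        start_date=datetime(2022, 1, 1),
        catchup=False,
    ) as dag:
        # Stand-in tasks: a real generator would create the vendor sync,
        # dbt run/test, and BigQuery delivery tasks from the config.
        extract = EmptyOperator(task_id="extract")
        transform = EmptyOperator(task_id="run_dbt")
        deliver = EmptyOperator(task_id="deliver_to_bigquery")
        extract >> transform >> deliver
    return dag


# Airflow picks up any DAG object bound to a module-level variable,
# so create one DAG per config file found.
for path in glob.glob(f"{CONFIG_DIR}/*.yml"):
    with open(path) as f:
        config = yaml.safe_load(f)
    dag = build_dag(config)
    globals()[dag.dag_id] = dag
```

With this pattern, adding the data-testing step from earlier means editing build_dag() once, and every generated DAG picks it up the next time Airflow parses the file.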
This approach has worked really well for us, with several distinct advantages:
Flexibility: A new client or data sync only requires adding a new YAML file.
Ease of Updates: Making a change to all pipelines is as simple as modifying one file—the DAG generator script.
Reduced Human Error: Consistency is maintained across all DAGs, minimizing the risk of mistakes.
Example of a (fake!) YAML configuration that would generate a DAG running a HubSpot sync using Airbyte.
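A made-up config along those lines might look roughly like this; every field name and value is illustrative rather than CTA’s actual schema:

```yaml
# Fake pipeline config: field names and values are illustrative only
client: example_org
vendor: hubspot
sync_tool: airbyte
schedule: "0 6 * * *"  # every day at 6:00 AM UTC
airbyte_connection_id: "00000000-0000-0000-0000-000000000000"
dbt_project: hubspot
destination:
  bigquery_project: example-project
  bigquery_dataset: hubspot_example_org
```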
Bringing It All Together
By automating our data pipeline orchestration with Airflow and scaling effortlessly with our DAG generator, we’ve transformed a potentially unwieldy process into a streamlined, efficient operation.
Want to give this a try yourself? We got you! Check out this GitHub repository for a demonstration of how to run a DAG generator alongside configuration files in Airflow. If you also find yourself running lots of DAGs and want a better way to manage them, maybe this can inspire you to make your own Daggy McDagface! (Or, you could give it a more serious name… I guess.)
Happy DAG-ing,
-Emily and the CTA team