How CTA Uses CTA
Our mission is about delivering timely, secure, and reliable data to our partner organizations — so it only makes sense that we’d use our own tools, as well.
In a nutshell, CTA uses our data to monitor our data. Data is the currency of optimizing all things, including itself! Here’s how we do it:
Monitoring of Data Pipelines: First, we use our own data in PAD to monitor the status of the many data pipelines we orchestrate in Google Cloud Composer. What data, you might be asking? As part of our ELT process (“Extract, Load, and Transform”), we use dbt to transform data, run unit tests to ensure data quality, and deliver it into PAD accounts. While running dbt, we use Elementary, a tool that works with dbt to collect data about each run and send that data into our own BigQuery project. The data Elementary provides is a treasure trove for observing our own data pipelines. For example, Elementary tells us how many errors or failures there were when running dbt models and tests, even reporting those failures automatically to Slack so we’re alerted right away.
Visualizing the Data: We then connect that data into a Looker Studio, generating a report, called our “Pipeline Health Dashboard” (or its cheeky acronym, “PHD”). The report gives us a clean and crisp view of all the syncs we run each day, flagging any errors that happened, and even displays the text of an error message.
If you’re anything like us, this might be getting you excited about monitoring your own data. The good news is that BigQuery gives you everything you need to get started right away! Every dataset in BigQuery has a set of views, called INFORMATION_SCHEMA views (such as INFORMATION_SCHEMA.TABLES), that you can query to access basic metadata about your data. For example, you can measure data freshness using a query like this:
> SELECT > > > table_name AS table_id, > > TIMESTAMP_MILLIS(last_modified_time) AS last_updated_timestamp, > > TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), TIMESTAMP_MILLIS(last_modified_time), MINUTE) AS minutes_since_last_updated > > FROM > > `your_project_id.your_dataset_name.INFORMATION_SCHEMA.TABLES` > > ORDER BY > > last_updated_timestamp DESC; >
There are lots of cool things you can look at using the INFORMATION_SCHEMA views; you can read more about them here.
All of CTA’s internal data monitoring is powered by data we deliver into, and retrieve from, BigQuery. We love the reliability, convenience, and flexibility of BigQuery, which is why it’s our data platform of choice both for ourselves and PAD users.