Firefox-CI ETL#

The Firefox-CI ETL is a data pipeline designed to expose data such as cost and worker metrics to Firefox-CI stakeholders, to help them make informed decisions about the operation and configuration of the Taskcluster instance.

Data#

This ETL stores data in several tables in the moz-fx-data-shared-prod project. They are:

1. tasks_v1
2. task_runs_v1
3. worker_metrics_v1
4. worker_costs_v1

Components#

The ETL consists of several components spread across various repositories.

Docker Image and Python Module#

A taskcluster-fxci-export docker image is defined in the mozilla/docker-etl repository. Changes to the image's directory in that repository cause the image to be re-built and pushed to Google Artifact Registry via CircleCI.

Warning

As of this writing, a member of the data engineering team must merge the PR. Otherwise the image will be built by CI, but not pushed to Google Artifact Registry.

The image contains the fxci-etl Python module, which provides the fxci-etl binary containing subcommands to run any necessary business logic pertaining to the Firefox-CI ETL. It is designed to be extensible, so it can be re-used for future needs.

See the README for information on supported configuration and commands.
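As a rough illustration of the extensible subcommand design, here is a minimal sketch of such a CLI using argparse. The actual fxci-etl module's structure and command names may differ; "pulse-drain" and "metric-export" here are hypothetical names chosen for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an fxci-etl-style CLI where each ETL job is a subcommand.

    New jobs can be added by registering another subparser, which is
    what makes this layout easy to extend for future needs.
    """
    parser = argparse.ArgumentParser(prog="fxci-etl")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Hypothetical subcommand: drain pulse queues into BigQuery.
    pulse = subparsers.add_parser("pulse-drain", help="drain pulse queues")
    pulse.add_argument("--dry-run", action="store_true")

    # Hypothetical subcommand: export Cloud Monitoring metrics.
    metric = subparsers.add_parser("metric-export", help="export metrics")
    metric.add_argument("--dry-run", action="store_true")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.command)
```

Each subcommand would dispatch to its own module of business logic, keeping jobs isolated from one another.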

Telemetry Airflow DAGs#

Telemetry Airflow is the name of the Apache Airflow instance run by Mozilla’s data infrastructure team.

There are two Firefox-CI ETL related DAGs defined in the telemetry-airflow repo. These DAGs use the aforementioned docker image to run an fxci-etl command on a cron schedule. The DAGs are:

1. fxci_pulse_export - This DAG is responsible for draining some Taskcluster pulse queues and inserting records into the tasks_v1 and task_runs_v1 BigQuery tables.

2. fxci_metric_export - This DAG is responsible for querying metrics from Google Cloud Monitoring (namely worker uptime for now), and inserting records into the worker_metrics_v1 BigQuery table.
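Conceptually, the pulse-export job turns queued Taskcluster messages into BigQuery rows. The sketch below is a simplified, hypothetical illustration of that flow; the real message shapes and table schemas live in the fxci-etl module and are not reproduced here.

```python
def message_to_row(message: dict) -> dict:
    """Flatten one pulse message into a row for a tasks-style table.

    The field names here are illustrative, not the actual schema of
    the tasks_v1 table.
    """
    status = message["status"]
    return {
        "task_id": status["taskId"],
        "state": status["state"],
    }


def drain(messages: list[dict]) -> list[dict]:
    """Convert a drained batch of messages into rows ready to insert."""
    return [message_to_row(m) for m in messages]
```

In the real pipeline the resulting rows would then be streamed into BigQuery in batches.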

The DAGs use the latest version of the image, so no changes to the DAGs are required when a new version of the image is pushed.

Derived Tables#

Finally, there is a derived table that uses infrastructure in the bigquery-etl repository.

The derived table is defined in a .sql file that extracts worker cost data from the GCP billing table and inserts the result into the worker_costs_v1 BigQuery table.
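To make the aggregation concrete, here is a hypothetical Python sketch of the kind of roll-up the derived-table SQL performs: summing billing line items into per-worker-pool cost totals. The field names are invented for the example and do not reflect the actual billing or worker_costs_v1 schemas.

```python
from collections import defaultdict


def aggregate_costs(billing_rows: list[dict]) -> dict[str, float]:
    """Sum billing line items into a total cost per worker pool.

    Each input row is assumed to carry a "worker_pool" label and a
    "cost" amount; the output maps pool name to its total cost.
    """
    totals: dict[str, float] = defaultdict(float)
    for row in billing_rows:
        totals[row["worker_pool"]] += row["cost"]
    return dict(totals)
```

The SQL version expresses the same idea as a GROUP BY over the billing table, scheduled and materialized by bigquery-etl.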

See this bigquery-etl tutorial for more information on how the process works.