# Firefox-CI ETL
The Firefox-CI ETL is a data pipeline designed to expose data such as cost and worker metrics to Firefox-CI stakeholders, to help them make informed decisions about the operation and configuration of the Taskcluster instance.
## Data
This ETL stores data in a number of tables in the `moz-fx-data-shared-prod`
project, listed below along with their primary keys (an example query against
these tables follows the list):

- `moz-fx-data-shared-prod.fxci.tasks_v1` (primary key: `task_id`)
- `moz-fx-data-shared-prod.fxci.task_runs_v1` (primary key: `task_id`, `run_id`)
- `moz-fx-data-shared-prod.fxci.worker_metrics_v1` (primary key: `project`, `zone`, `instance_id`)
- `moz-fx-data-shared-prod.fxci_derived.worker_costs_v1` (primary key: `project`, `zone`, `instance_id`)
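For example, a consumer can join `task_runs_v1` to `tasks_v1` on the shared
`task_id` key. The sketch below uses the `google-cloud-bigquery` Python
client; the table and key names come from the list above, but anything else
about the schemas is an assumption, so adjust the SQL as needed.

```python
from google.cloud import bigquery

# Count how many runs each task had by joining the two pulse-export tables
# on their shared task_id key. Only the table and key names come from the
# documentation above; the rest of the schema is an assumption.
QUERY = """
SELECT
  t.task_id,
  COUNT(r.run_id) AS num_runs
FROM `moz-fx-data-shared-prod.fxci.tasks_v1` AS t
JOIN `moz-fx-data-shared-prod.fxci.task_runs_v1` AS r
  ON t.task_id = r.task_id
GROUP BY t.task_id
ORDER BY num_runs DESC
LIMIT 10
"""

client = bigquery.Client()  # billed to your default GCP project
for row in client.query(QUERY).result():
    print(row.task_id, row.num_runs)
```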
## Components
The ETL consists of several components spread across various repositories.
### Docker Image and Python Module
A `taskcluster-fxci-export` Docker image is defined in the mozilla/docker-etl repository. Changes to the image's directory in that repository cause the image to be re-built and pushed to Google Artifact Registry via CircleCI.
**Warning:** As of this writing, a member of the data engineering team must merge the PR. Otherwise the image will be built by CI, but not pushed to Google Artifact Registry.
The image contains the `fxci-etl` Python module, which provides the `fxci-etl` binary. The binary offers subcommands that run the business logic of the Firefox-CI ETL, and it is designed to be extensible so it can be re-used for future needs.
See the README for information on supported configuration and commands.
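As a rough illustration only (not the actual `fxci-etl` code, and not its real
subcommand names, which are documented in the README), an extensible
command-line entry point of this kind is often built from argparse subparsers,
so new subcommands can be added without touching existing ones:

```python
import argparse


def export_example(args: argparse.Namespace) -> int:
    # Placeholder for one piece of ETL business logic (hypothetical).
    print(f"would export records using config at {args.config}")
    return 0


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="example-etl")
    parser.add_argument("--config", default="config.toml")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Each subcommand registers its own handler; adding a new one is just
    # another add_parser call plus a function, keeping the tool extensible.
    export = subparsers.add_parser("export-example", help="run a hypothetical export")
    export.set_defaults(func=export_example)

    args = parser.parse_args(argv)
    return args.func(args)


if __name__ == "__main__":
    raise SystemExit(main())
```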
### Telemetry Airflow DAGs
Telemetry Airflow is the name of the Apache Airflow instance run by Mozilla’s data infrastructure team.
There are two DAGs related to the Firefox-CI ETL defined in the telemetry-airflow repo. These DAGs use the aforementioned Docker image to run an `fxci-etl` command on a cron schedule. The DAGs are:
1. `fxci_pulse_export` - This DAG is responsible for draining some Taskcluster
   pulse queues and inserting records into the `tasks_v1` and `task_runs_v1`
   BigQuery tables.
2. `fxci_metric_export` - This DAG is responsible for querying metrics from
   Google Cloud Monitoring (namely worker uptime for now), and inserting
   records into the `worker_metrics_v1` BigQuery table.
The DAGs use the latest version of the image, so no changes to the DAGs are required to pick up newly published image builds.
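For orientation, a stripped-down DAG following this pattern might look like the
sketch below. It is not taken from telemetry-airflow: the operator, schedule,
image tag, and `fxci-etl` arguments are all placeholder assumptions; the real
DAGs live in the telemetry-airflow repo and use the subcommands documented in
the fxci-etl README.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# Minimal sketch of a DAG that runs the ETL image on a cron schedule.
# The schedule, image tag, and command below are hypothetical placeholders,
# and the KubernetesPodOperator import path varies by provider version.
with DAG(
    dag_id="fxci_pulse_export_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 * * * *",  # hourly
    catchup=False,
) as dag:
    export = KubernetesPodOperator(
        task_id="fxci_etl_export",
        name="fxci-etl-export",
        image="us-docker.pkg.dev/<project>/<repo>/taskcluster-fxci-export:latest",
        cmds=["fxci-etl"],
        arguments=["<subcommand>"],  # see the fxci-etl README for real commands
    )
```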
### Derived Tables
Finally, there is a derived table that uses infrastructure in the bigquery-etl repository. It is defined in a `.sql` file and extracts worker cost data from the GCP billing table, inserting the result into the `worker_costs_v1` BigQuery table.
See this bigquery-etl tutorial for more information on how the process works.
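Since `worker_costs_v1` and `worker_metrics_v1` share the same primary key
(`project`, `zone`, `instance_id`), the derived cost data can be joined back to
the worker metrics. The sketch below shows one way to do this with the BigQuery
Python client; the `uptime` and `cost` column names are assumptions, so check
the actual table schemas first.

```python
from google.cloud import bigquery

# Join worker costs to worker metrics on their shared primary key.
# The `uptime` and `cost` column names are assumptions; only the table
# names and key columns come from the documentation above.
QUERY = """
SELECT
  m.project,
  m.zone,
  m.instance_id,
  SUM(m.uptime) AS total_uptime,
  SUM(c.cost) AS total_cost
FROM `moz-fx-data-shared-prod.fxci.worker_metrics_v1` AS m
JOIN `moz-fx-data-shared-prod.fxci_derived.worker_costs_v1` AS c
  USING (project, zone, instance_id)
GROUP BY m.project, m.zone, m.instance_id
ORDER BY total_cost DESC
LIMIT 20
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(row.project, row.zone, row.instance_id, row.total_uptime, row.total_cost)
```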