Comparing Nextflow and Airflow for Scientific Workflows

Mamarcus64
Feb 11, 2021

Background

The importance of functional and efficient data pipelines in bioinformatics cannot be overstated — they are the crux of any and all scientific workflows. The demand for frameworks that can create and manage these pipelines, which I will refer to as pipeline orchestrators, has led to the rise of several programs that aim to tie together processes, data streaming, and dependency packaging. This article will examine two of the more popular options for data-driven workflows, Nextflow and Airflow.

Nextflow

Channels and Processes

Nextflow is built around the idea of processes, which are reactive methods that receive input and produce output through channels. Each process runs every time its channel receives a new input, as shown in the following example:

Trivial example of a Nextflow script.
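The original screenshot is not reproduced here, so below is a minimal sketch of what such a script might look like, assuming DSL1-style syntax; the process names (firstProcess, secondProcess), the channel names (value, appended), and the echo commands are illustrative stand-ins based on the description that follows.

```nextflow
// The script itself seeds the "value" channel; each item triggers firstProcess.
value = Channel.from('alpha', 'beta', 'gamma')

process firstProcess {
    input:
    val x from value

    output:
    stdout into appended   // send this process's output to the "appended" channel

    // A plain bash command, run once per input value
    """
    echo "${x}-appended"
    """
}

process secondProcess {
    input:
    val y from appended

    """
    echo "secondProcess received: ${y}"
    """
}
```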

In this script, both processes are reactive: once the script is compiled, they simply wait for values to arrive on their input channels. The script itself seeds the “value” channel, which prompts firstProcess to run; for each input, it executes the given bash shell command (other scripting languages can be used as well). The output is then sent to the “appended” channel, which is where secondProcess picks up execution.

Environment Setup and Code Sharing

One of Nextflow’s main selling points is its ease of access; after installing Nextflow, no other environment setup is required. Each pipeline is fully contained within a single .nf file (potentially with some settings in the nextflow.config file as well). In the same vein, sharing a Nextflow pipeline is as simple as sharing that .nf file, and integration with git is trivial: the Nextflow CLI will run pipelines from online git repositories just as easily as from local files. Other pipeline services keep their setup outside the project scope, so additional files and configuration have to be arranged once a version is pulled from the repository. Because Nextflow’s pipeline code is self-contained in a single file inside the project directory, it travels with the git repository from the start, and there are no dependency issues when sharing pipeline code between users.
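For example, installation is a single command, and a pipeline hosted on GitHub can be run directly by name (nextflow-io/hello is Nextflow’s own demo pipeline; the local folder name below is illustrative):

```bash
# Install Nextflow (requires Java); this drops a `nextflow` launcher in the current directory
curl -s https://get.nextflow.io | bash

# Run a local pipeline
nextflow run my_project_folder

# ...or pull and run a pipeline straight from a GitHub repository
nextflow run nextflow-io/hello
```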

Platform-Agnostic Process Execution

Nextflow also supports process execution on virtually any platform. Nextflow works by defining an executor, set through the process.executor variable in the config file, and that’s it! By changing that one line, a pipeline can be run locally, on network clusters, on the AWS cloud, or on a variety of other executors. In the case of cloud computing like AWS, credentials can be declared either in the config file or in a separate file location. Regardless of the platform, Nextflow parallelizes process execution automatically.
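A minimal nextflow.config sketch, assuming a SLURM cluster or AWS Batch as the target; the executor names are standard options, but the region value is illustrative:

```groovy
// nextflow.config
process.executor = 'slurm'      // or 'local', 'sge', 'awsbatch', ...

// Only needed when targeting AWS Batch; credentials can also come from
// the usual AWS credential files or environment variables.
aws.region = 'us-east-1'
```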

Docker Integration

Integrating with Docker is also fairly trivial with Nextflow. To reuse our old example, let’s set the first process to run in the Ubuntu Docker image and the second process to run in the Debian Docker image.

Adding Docker containers is as easy as adding one line of code.
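Again, the original screenshot is not reproduced here; the sketch below simply adds a container directive to each process from the earlier example (the image tags are illustrative):

```nextflow
process firstProcess {
    container 'ubuntu:20.04'     // run this process inside the Ubuntu image

    input:
    val x from value

    output:
    stdout into appended

    """
    echo "${x}-appended"
    """
}

process secondProcess {
    container 'debian:buster'    // run this process inside the Debian image

    input:
    val y from appended

    """
    echo "secondProcess received: ${y}"
    """
}
```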

The specification of the image in the first line of each process tells Nextflow that we want to run that process in a Docker container; Nextflow first looks for a local image and otherwise pulls it from Docker Hub. Additionally, if we wanted to spin up the Docker containers with other options, for example to mount a volume, docker run options can be configured in the Nextflow config file.
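For instance, a couple of lines in nextflow.config enable Docker globally and pass extra docker run options (the volume path here is illustrative):

```groovy
// nextflow.config
docker.enabled    = true
docker.runOptions = '-v /data:/data'   // extra flags appended to `docker run`
```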

Error Handling and Logs

Nextflow also deals with errors very efficiently. Because of the asynchronous nature of its processes, Nextflow stores the output and success flag of every process; thus, if a certain process in the DAG fails, developers can fix that one process node and resume the work from there rather than start over. On a similar note, each process logs its values, and individual processes can be examined after a pipeline run. Additionally, Nextflow can produce trace reports, timeline reports, DAG visualizations, and execution reports.

An important thing to note is that Nextflow runs almost exclusively from the command line, which does limit its UI capabilities. Nextflow is much more focused on the development and iteration of a single pipeline at a time, as opposed to the concurrent management of a collection of pipelines. This philosophy corresponds to its command line interface: while running a single pipeline is as easy as “nextflow run my_project_folder”, running and scheduling many pipelines at once is much harder and does not scale well. When it comes to a pipeline manager that focuses on a much broader scope and a dynamic pipeline testing environment, Airflow checks more of the boxes.
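Before moving on, here is what the resume-and-report workflow described above looks like on the command line (the folder name is illustrative; the flags are standard Nextflow CLI options):

```bash
# Re-run a failed pipeline, reusing the cached results of every process that already succeeded
nextflow run my_project_folder -resume

# Ask for execution, timeline, trace, and DAG reports alongside the run
nextflow run my_project_folder -with-report -with-timeline -with-trace -with-dag flowchart.png
```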

Airflow

A Better Pipeline Manager

Airflow’s design philosophy revolves around creating workflows that can be run, scheduled, and visualized automatically. Its main strength and differentiator come from the way pipelines are run.

Image taken from the Airflow web server UI.

From the image above, we can already see that Airflow presents all of its pipelines neatly in one place, and each can be run from a centralized location. Airflow’s scope sits one layer of abstraction higher than Nextflow’s: its architecture is built around this web server UI, and it puts a lot more effort into visualization as well.

Creating DAGs within Airflow

Airflow’s DAGs are created through Python scripting. Below is a screenshot of Airflow’s demo BashOperator pipeline.

Basic Airflow script for a pipeline, written in Python.
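The screenshot is not reproduced here, so below is a minimal sketch in the same spirit, assuming the Airflow 2.x import paths; the DAG id, task ids, bash commands, and schedule are illustrative rather than taken from the original demo.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="demo_bash_pipeline",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),   # every DAG needs a start date...
    schedule_interval="@daily",        # ...and a schedule interval
    catchup=False,
) as dag:

    first_task = BashOperator(
        task_id="first_task",
        bash_command='echo "running first task"',
    )

    second_task = BashOperator(
        task_id="second_task",
        bash_command='echo "running second task"',
    )

    # Tasks run in the order declared here: first_task, then second_task.
    first_task >> second_task
```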

Similar to Nextflow, Airflow has a series of tasks (each with an associated operator) that run in a specified order. There is noticeably more boilerplate code in Airflow’s scripts, and much less flexibility, especially when it comes to the transfer and communication of data. Each task runs independently as a Python method call, and data can be passed in through parameter variables, but the system is not as robust as Nextflow’s channel system.

Scheduling Pipelines

If the web server UI can be thought of as the body that holds all of the pipeline pieces together, then the Airflow scheduler is the heartbeat that keeps everything running. The scheduler is in charge of making sure pipelines get run, that their tasks execute properly, and that communication between tasks happens. Additionally, users can have certain pipelines run automatically at predetermined intervals, which, depending on the nature of one’s workflow, can be extremely useful.
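A quick sketch of what interval scheduling looks like; the DAG ids and dates are illustrative, and the interval can be either a preset such as @daily or a cron expression:

```python
from datetime import datetime

from airflow import DAG

# Run once per day at midnight
daily_dag = DAG("daily_report", start_date=datetime(2021, 1, 1), schedule_interval="@daily")

# Run at the top of every hour, expressed as a cron string
hourly_dag = DAG("hourly_sync", start_date=datetime(2021, 1, 1), schedule_interval="0 * * * *")
```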

Error Handling

Airflow does not handle errors as well as Nextflow, but it still does a good job of highlighting the critical nodes in a pipeline that cause failures. Because it generates DAG visualizations automatically, Airflow can illustrate the success or failure of each node. However, unlike Nextflow, the pipeline must be run from the start every time a DAG fails.

Airflow seems great! However…

A quick survey of the scientific community reveals that virtually no one in a data-driven environment really uses Airflow, whereas Nextflow is gaining popularity. There are a few key reasons for this:

  • Hard to set up locally: setting up Airflow for a single user on a single computer is a tough task in itself. While Nextflow can be installed with a single command, Airflow has many moving parts: a web server, an executor, a scheduler, and a metadata database.
  • Cannot transfer data between tasks: Airflow struggles with ETL operations because data dependencies cannot be passed directly between tasks. The best you can do is pass a reference to a database table or to a file that the previous task wrote, as sketched below.
  • Hard to iterate fast: Airflow is not meant for pipelines to be run reactively or on demand; ALL pipelines need a start time and a schedule interval, even if they are only meant to run when a user requests it.
  • Not widely used in data science: while many companies do use Airflow, it has very few users in the data science field. That fact alone doesn’t explain why, but it is a reasonable litmus test for being wary of Airflow in data science applications.
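A minimal sketch of that reference-passing workaround, assuming Airflow 2.x; the file path, DAG id, and task names are all illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    path = "/tmp/extracted.csv"
    with open(path, "w") as f:
        f.write("id,value\n1,42\n")
    # Push only a reference (the file path) between tasks, not the data itself.
    context["ti"].xcom_push(key="csv_path", value=path)

def transform(**context):
    path = context["ti"].xcom_pull(task_ids="extract", key="csv_path")
    with open(path) as f:
        print(f.read())

with DAG(
    dag_id="reference_passing_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@once",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task
```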

Conclusion

Even though I have spent this article differentiating Nextflow and Airflow, it is not entirely fair to compare the two as direct substitutes. While there is much overlap in functionality, Airflow and Nextflow have different design philosophies that drive their features. Nextflow focuses on the creation of fast, iterable pipelines with extensive logs and error handling. Airflow directs its attention to managing those pipelines in a more robust manner and controlling the environment in which they run. As mentioned previously, Airflow also does not seem to fare well in the data science industry.

So, when deciding between Airflow and Nextflow, it really depends on the situation. In an ETL-heavy environment, I would probably not recommend Airflow, since its scheduling philosophy does not really match the use cases that ETL operations require; if the goal is fast pipelines that package easily with their project source code and can be run with a single command, Nextflow is superior. If you instead need to manage many pipelines simultaneously and care more about the broader scope than about each individual pipeline, then Airflow is your best bet. Nextflow and Airflow are only two of the possible pipeline managers, as well; there are alternatives like Luigi, Prefect, and CWL, to name a few. Each workflow manager brings its own features, so before selecting one, it is well worth the extra time to investigate all of the options.
