Running GATK4 Spark on Quest


New Documentation Platform for Northwestern IT Research Computing and Data Services

Please reference https://rcdsdocs.it.northwestern.edu for the new technical documentation related to Research Computing and Data Services. In December of 2025, the Knowledge Base content on this page will be removed and replaced with a link to the new documentation platform. Until that time, this Knowledge Base article will no longer be updated; the latest information can be found on the new documentation platform. If you have any questions, please reach out to quest-help@northwestern.edu for assistance.

The Broad Institute’s Genome Analysis Toolkit (GATK) is a widely used best-practices pipeline for variant calling. As of GATK version 4, many GATK tools are also available to run on Apache Spark, a unified analytics engine for large-scale data processing that can significantly reduce computation time. Note that GATK4’s Spark tools are currently in beta. Because GATK4 Spark runs on multiple nodes, it must be launched with a job submission script and cannot be run on a login node.

GATK4 Spark Tools

To see a list of available GATK4 tools:

module load gatk/4.0.4

gatk --list

GATK Spark tools have the word “Spark” in their name and can be listed explicitly with grep:

gatk --list | grep Spark

Additional help is available for each tool with:

gatk ToolName --help
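
For example, to see all options for the Spark version of HaplotypeCaller used later in this article:

gatk HaplotypeCallerSpark --help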

Converting FASTA Files for Parallel Runs

Standard FASTA files do not support parallel operation and must be converted to 2bit files. To convert them, use the faToTwoBit binary in $SPARK_TOOLS, which the Spark module adds to your path:

module load spark/2.3.0

faToTwoBit exampleFASTA.fasta exampleFASTA.2bit

This step only needs to be done once.
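
Because the conversion only needs to happen once, you may want to guard it in your scripts so it is skipped when the 2bit file already exists. A minimal sketch, using the example filenames above:

# Convert the FASTA only if the 2bit file is not already present
if [ ! -f exampleFASTA.2bit ]; then
    faToTwoBit exampleFASTA.fasta exampleFASTA.2bit
fi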

Grouping BAM Files

Tools using BAM files are optimized to run on queryname-grouped alignments (that is, all reads with the same queryname are together in the input file). If given coordinate-sorted alignments, the tool will first spend additional time queryname-sorting the reads internally, which can make processing up to 2x slower in some circumstances.
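
If your alignments are coordinate-sorted, you can queryname-sort them once ahead of time rather than paying that cost on every Spark run. A minimal sketch, assuming samtools is available (for example, via a samtools module on Quest; the module name and filenames here are illustrative):

module load samtools

# Sort the BAM by queryname (-n) so GATK Spark tools can skip the internal sort
samtools sort -n -o exampleBAM.qnamesorted.bam exampleBAM.bam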

Example Job Submission

Below is an example job submission file for Quest using a converted exampleFASTA.2bit file.

GATKSpark_example.sh
#!/bin/bash
#SBATCH -A <allocation>      # Allocation 
#SBATCH -p <partition_name>  # Partition
#SBATCH -t 00:20:00          # Walltime/duration of job
#SBATCH -N 2                 # Number of nodes
#SBATCH --ntasks-per-node=24 # Number of cores (processors)
#SBATCH --mem-per-cpu=5G     # GB needed per-core for a job
#SBATCH -J "SparkTest"       # Name of job

# Load environment
module purge all
module load spark/2.3.0 gatk/4.0.4

cd $SLURM_SUBMIT_DIR

# Initialize spark cluster on hosts allocated to your job
$SPARK_TOOLS/initialize_spark.sh

# Run GATK HaplotypeCaller in Spark
gatk HaplotypeCallerSpark \
    --reference $SLURM_SUBMIT_DIR/exampleFASTA.2bit \
    --input $SLURM_SUBMIT_DIR/exampleBAM.bam \
    --output $SLURM_SUBMIT_DIR/exampleVCF.vcf \
    --spark-runner SPARK \
    --spark-master spark://`hostname -i`:7077 \
    -- --driver-cores=2 --driver-memory=6g \
    --executor-cores=22 --executor-memory=114g 2>&1

# Cleanup spark cluster on hosts allocated to your job
$SPARK_TOOLS/cleanup_spark.sh
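
Save the script and submit it to the scheduler from the directory containing your input files:

sbatch GATKSpark_example.sh

Output from the job, including Spark and GATK log messages, will appear in the Slurm output file in the submission directory.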
