Developer’s Guide

There are three main triggers to initate new development:

  1. A new lineage dataset has been released.

  2. A new nextclade-cli has been released.

  3. A new recombinant lineage has been proposed.

  4. A new recombinant lineage has been designated.

    • New designations should be labelled as milestones.

Beging by creating a development conda environment.

mamba env create -f workflow/envs/environment.yaml -n ncov-recombinant-dev

Note: After completing a development trigger, proceed with the Validation section.

Trigger 1 | New Lineage Dataset

  1. Obtain the new dataset tag from https://github.com/nextstrain/nextclade_data/releases.

  2. Update all 3 dataset tags in defaults/parameters.yaml.

    For example, change all strings of "2022-10-27T12:00:00Z" to the new dataset tag for November such as "2022-11-27T12:00:00Z".

    - name: nextclade_dataset
      dataset: sars-cov-2
      tag: "2022-10-27T12:00:00Z"
      dataset_no-recomb: sars-cov-2-no-recomb
      tag_no-recomb: "2022-10-27T12:00:00Z"
      dataset_immune-escape: sars-cov-2-21L
      tag_immune-escape: "2022-10-27T12:00:00Z"
    

Trigger 2 | New Nextclade CLI

  1. Check that the new nextclade-cli has been made available on conda: https://anaconda.org

  2. Update the following line in workflow/envs/environment.yaml to the newest version:

    - bioconda::nextclade=2.8.0
    
  3. Update the development conda environment.

    mamba env update -f workflow/envs/environment.yaml -n ncov-recombinant-dev
    

Trigger 3 | New Proposed Lineage

  1. Create a new data directory.

    mkdir -p data/proposedXXX
    
  2. Check the corresponding pango-designation issue for a list of GISAID accessions.

  3. Download 10 of these GISAID accessions from https://gisaid.org/. Please review the GISAID Download section to ensure the sequences and metadata are correctly formatted.

  4. Create a new pipeline profile for this lineage.

    scripts/create_profile.sh --data data/proposedXXX
    
  5. Run the pipeline up to sc2rf breakpoint detection.

    snakemake --profile my_profiles/proposedXXX results/proposedXXX/sc2rf/stats.tsv
    

    When finished, check the stats file and confirm whether sc2rf_status is positive, sc2rf_parents match the pango-designation issue information. If so, skip to the Validation section.

    csvtk pretty -t results/proposedXXX/sc2rf/stats.tsv | less -S
    
    # Or
    csvtk cut -t -f "strain,sc2rf_status,sc2rf_parents,sc2rf_breakpoints" results/XBB/sc2rf/stats.tsv
    
  6. Update/create sc2rf modes.

    If sc2rf_status is not positive, command-line parameters for sc2rf will need to be tweaked. To begin with, run sc2rf manually with the following debugging parameters. And changing --clades BA.2 BA.5.2 to a space separated list of potential parents.

    python3 sc2rf/sc2rf.py results/proposedXXX/nextclade/alignment.fasta \
      --ansi \
      --max-ambiguous 20 \
      --max-intermission-length 2 \
      --ignore-shared \
      --mutation-threshold 0.25 \
      --max-intermission-count 20 \
      --parents 0-10 \
      --breakpoints 0-10 \
      --unique 0 \
      --clades BA.2 BA.5.2
    

    If the potential parents don’t appear in the output, they will need to be added to sc2rf/mapping.csv and the allele frequency database updated with:

    cd sc2rf/
    python3 sc2rf.py --rebuild-examples
    cd ..
    

    Then rerun the previous command and verify that the parents appear in the output. The next step is to increase the stringency of --parents, --breakpoints, --unique, and --max-intermission-count as much as possible.

    Once a parameter set is found, review the existing sc2rf modes in defaults/parameters.yaml. It is ideal to have as few modes as possible, so the first check is to see whether the new parameter set can be integrated into an existing mode. Failing that, a new mode should be created in the list to capture this recombinant.

    - name: sc2rf
      mode:
      # Lineage specific validation)
        - XA:              "--clades 20I 20E 20F 20D             --ansi --parents 2   --breakpoints 1-3  --unique 2 --max-ambiguous 20 --max-intermission-length 2 --max-intermission-count 3  --ignore-shared --mutation-threshold 0.25"
        ...
        # Current Variants of Concern and Dominant Lineages
        - voc:             "--clades BA.2.10 BA.2.3.17 BA.2.3.20 BA.2.75 BA.4.6 BA.5.2 BA.5.3 XBB --ansi --parents 2-4 --breakpoints 1-5 --unique 1 --max-ambiguous 20 --max-intermission-length 2 --max-intermission-count 3  --ignore-shared --mutation-threshold 0.25"
    
  7. Repeat steps 5 and 6, until the recombinant is successully identified by sc2rf.

Trigger 4 | New Designated Lineage

  1. Complete all steps for Trigger 3 | New Proposed Lineage except:

    • Include no more than 10 representative sequences.

    • Make the data directory data/X* instead of data/proposed*.

  2. Add the new designated lineage to the controls-gisaid dataset.

    data_dir="data/XBM"
    
    # Add in new control strains
    csvtk cut -t -f "strain" ${data_dir}/metadata.tsv | tail -n+2 >> data/controls-gisaid/strains.txt
    
    # Add in new control metadata
    cols=$(csvtk headers -t data/controls-gisaid/metadata.tsv | tr "\n" "," | sed 's/,$/\n/g')
    csvtk cut -t -f "$cols" ${data_dir}/metadata.tsv | tail -n+2 >> data/controls-gisaid/metadata.tsv  
    
    # Add in new control sequences
    cat ${data_dir}/sequences.fasta >> data/controls-gisaid/sequences.fasta
    

Validation

Run the following profiles to validate the new changes. These profiles all contain strains with expected pipeline output in defaults/validation.tsv.

  1. Tutorial

    scripts/slurm.sh --profile profiles/tutorial --conda-env ncov-recombinant-dev
    
  2. Genbank Controls

    scripts/slurm.sh --profile profiles/controls-hpc --conda-env ncov-recombinant-dev
    
  3. GISAID Controls

    scripts/slurm.sh --profile profiles/controls-gisaid-hpc --conda-env ncov-recombinant-dev
    

If the pipeline failed validation, check the end of the log for details.

less -S logs/ncov-recombinant/tutorial_$(date +'%Y-%m-%d').log

If the column that failed is only lineage, because lineage assignments have changed with the new nextclade dataset, simply update the values in defaults/validation.tsv. This is expected when upgrading nextclade-cli or the nextclade dataset.

If the column that failed is breakpoints or parents_clade, this indicates a more complicated issue with breakpoint detection. The most common reason is because sc2rf modes has been changed to capture a new lineage (see Trigger 3 for more information). This presents an optimization challenge, and is solved by working through steps 5 and 6 of Trigger 3 | New Proposed Lineage until sensitivity and specificity are restore for all lineages.

Once validation is completed successfully for all profiles, update defaults/validation.tsv as follows:

# Construct headers
echo -e "strain\tstatus\tlineage\tbreakpoints\tparents_clade\tdataset" > defaults/validation.tsv

# Tutorial
csvtk cut -t -f "strain,status,lineage,breakpoints,parents_clade"  results/tutorial/linelists/linelist.tsv \
    | tail -n+2 \
    >> defaults/validation.tsv

# Controls Genbank
csvtk cut -t -f "strain,status,lineage,breakpoints,parents_clade"  results/controls/linelists/linelist.tsv \
    | csvtk mutate2 -t -n "dataset" -e '"controls"' \
    | tail -n+2 \
    >> defaults/validation.tsv

# Controls GISAID
csvtk cut -t -f "strain,status,lineage,breakpoints,parents_clade"  results/controls-gisaid/linelists/linelist.tsv \
    | csvtk mutate2 -t -n "dataset" -e '"controls-gisaid"' \
    | tail -n+2 \
    >> defaults/validation.tsv