Developer’s Guide
There are three main triggers to initate new development:
A new lineage dataset has been released.
A new nextclade-cli has been released.
A new recombinant lineage has been designated.
New designations should be labelled as milestones.
Beging by creating a development conda environment.
mamba env create -f workflow/envs/environment.yaml -n ncov-recombinant-dev
Note: After completing a development trigger, proceed with the Validation section.
Trigger 1 | New Lineage Dataset
Obtain the new dataset tag from https://github.com/nextstrain/nextclade_data/releases.
Update all 3 dataset tags in
defaults/parameters.yaml.For example, change all strings of
"2022-10-27T12:00:00Z"to the new dataset tag for November such as"2022-11-27T12:00:00Z".- name: nextclade_dataset dataset: sars-cov-2 tag: "2022-10-27T12:00:00Z" dataset_no-recomb: sars-cov-2-no-recomb tag_no-recomb: "2022-10-27T12:00:00Z" dataset_immune-escape: sars-cov-2-21L tag_immune-escape: "2022-10-27T12:00:00Z"
Trigger 2 | New Nextclade CLI
Check that the new nextclade-cli has been made available on conda: https://anaconda.org
Update the following line in
workflow/envs/environment.yamlto the newest version:- bioconda::nextclade=2.8.0
Update the development conda environment.
mamba env update -f workflow/envs/environment.yaml -n ncov-recombinant-dev
Trigger 3 | New Proposed Lineage
Create a new data directory.
mkdir -p data/proposedXXX
Check the corresponding pango-designation issue for a list of GISAID accessions.
Download 10 of these GISAID accessions from https://gisaid.org/. Please review the GISAID Download section to ensure the sequences and metadata are correctly formatted.
Create a new pipeline profile for this lineage.
scripts/create_profile.sh --data data/proposedXXX
Run the pipeline up to
sc2rfbreakpoint detection.snakemake --profile my_profiles/proposedXXX results/proposedXXX/sc2rf/stats.tsv
When finished, check the stats file and confirm whether
sc2rf_statusispositive,sc2rf_parentsmatch the pango-designation issue information. If so, skip to the Validation section.csvtk pretty -t results/proposedXXX/sc2rf/stats.tsv | less -S # Or csvtk cut -t -f "strain,sc2rf_status,sc2rf_parents,sc2rf_breakpoints" results/XBB/sc2rf/stats.tsv
Update/create
sc2rfmodes.If
sc2rf_statusis notpositive, command-line parameters forsc2rfwill need to be tweaked. To begin with, runsc2rfmanually with the following debugging parameters. And changing--clades BA.2 BA.5.2to a space separated list of potential parents.python3 sc2rf/sc2rf.py results/proposedXXX/nextclade/alignment.fasta \ --ansi \ --max-ambiguous 20 \ --max-intermission-length 2 \ --ignore-shared \ --mutation-threshold 0.25 \ --max-intermission-count 20 \ --parents 0-10 \ --breakpoints 0-10 \ --unique 0 \ --clades BA.2 BA.5.2
If the potential parents don’t appear in the output, they will need to be added to
sc2rf/mapping.csvand the allele frequency database updated with:cd sc2rf/ python3 sc2rf.py --rebuild-examples cd ..
Then rerun the previous command and verify that the parents appear in the output. The next step is to increase the stringency of
--parents,--breakpoints,--unique, and--max-intermission-countas much as possible.Once a parameter set is found, review the existing
sc2rfmodes indefaults/parameters.yaml. It is ideal to have as few modes as possible, so the first check is to see whether the new parameter set can be integrated into an existing mode. Failing that, a new mode should be created in the list to capture this recombinant.- name: sc2rf mode: # Lineage specific validation) - XA: "--clades 20I 20E 20F 20D --ansi --parents 2 --breakpoints 1-3 --unique 2 --max-ambiguous 20 --max-intermission-length 2 --max-intermission-count 3 --ignore-shared --mutation-threshold 0.25" ... # Current Variants of Concern and Dominant Lineages - voc: "--clades BA.2.10 BA.2.3.17 BA.2.3.20 BA.2.75 BA.4.6 BA.5.2 BA.5.3 XBB --ansi --parents 2-4 --breakpoints 1-5 --unique 1 --max-ambiguous 20 --max-intermission-length 2 --max-intermission-count 3 --ignore-shared --mutation-threshold 0.25"
Repeat steps 5 and 6, until the recombinant is successully identified by
sc2rf.
Trigger 4 | New Designated Lineage
Complete all steps for Trigger 3 | New Proposed Lineage except:
Include no more than 10 representative sequences.
Make the data directory
data/X*instead ofdata/proposed*.
Add the new designated lineage to the
controls-gisaiddataset.data_dir="data/XBM" # Add in new control strains csvtk cut -t -f "strain" ${data_dir}/metadata.tsv | tail -n+2 >> data/controls-gisaid/strains.txt # Add in new control metadata cols=$(csvtk headers -t data/controls-gisaid/metadata.tsv | tr "\n" "," | sed 's/,$/\n/g') csvtk cut -t -f "$cols" ${data_dir}/metadata.tsv | tail -n+2 >> data/controls-gisaid/metadata.tsv # Add in new control sequences cat ${data_dir}/sequences.fasta >> data/controls-gisaid/sequences.fasta
Validation
Run the following profiles to validate the new changes. These profiles all contain strains with expected pipeline output in defaults/validation.tsv.
Tutorial
scripts/slurm.sh --profile profiles/tutorial --conda-env ncov-recombinant-dev
Genbank Controls
scripts/slurm.sh --profile profiles/controls-hpc --conda-env ncov-recombinant-dev
GISAID Controls
scripts/slurm.sh --profile profiles/controls-gisaid-hpc --conda-env ncov-recombinant-dev
If the pipeline failed validation, check the end of the log for details.
less -S logs/ncov-recombinant/tutorial_$(date +'%Y-%m-%d').log
If the column that failed is only lineage, because lineage assignments have changed with the new nextclade dataset, simply update the values in defaults/validation.tsv. This is expected when upgrading nextclade-cli or the nextclade dataset.
If the column that failed is breakpoints or parents_clade, this indicates a more complicated issue with breakpoint detection. The most common reason is because sc2rf modes has been changed to capture a new lineage (see Trigger 3 for more information). This presents an optimization challenge, and is solved by working through steps 5 and 6 of Trigger 3 | New Proposed Lineage until sensitivity and specificity are restore for all lineages.
Once validation is completed successfully for all profiles, update defaults/validation.tsv as follows:
# Construct headers
echo -e "strain\tstatus\tlineage\tbreakpoints\tparents_clade\tdataset" > defaults/validation.tsv
# Tutorial
csvtk cut -t -f "strain,status,lineage,breakpoints,parents_clade" results/tutorial/linelists/linelist.tsv \
| tail -n+2 \
>> defaults/validation.tsv
# Controls Genbank
csvtk cut -t -f "strain,status,lineage,breakpoints,parents_clade" results/controls/linelists/linelist.tsv \
| csvtk mutate2 -t -n "dataset" -e '"controls"' \
| tail -n+2 \
>> defaults/validation.tsv
# Controls GISAID
csvtk cut -t -f "strain,status,lineage,breakpoints,parents_clade" results/controls-gisaid/linelists/linelist.tsv \
| csvtk mutate2 -t -n "dataset" -e '"controls-gisaid"' \
| tail -n+2 \
>> defaults/validation.tsv