Commands and Usage

ViralQC provides three main commands through the command-line interface (vqc).

get-nextclade-datasets

Downloads and configures Nextclade datasets locally.

Important

This command must be run at least once before using run.

Usage

vqc get-nextclade-datasets --cores 2

Parameters

Parameter

Type

Default

Description

--datasets-dir

String

datasets

Directory where datasets will be stored

--cores

Integer

1

Number of threads/cores to use

--verbose

Boolean

False

Show snakemake logs

Output Structure

datasets/
├── nextclade_data/
│   ├── denv1/
│   ├── denv2/
│   └── ...
├── external_datasets/
│   └── zikav/
└── external_datasets_minimizers.json

get-blast-database

Creates a local BLAST database containing all viral genomes from NCBI RefSeq.

Important

This command must be run at least once before using run.

Usage

vqc get-blast-database

Parameters

Parameter

Type

Default

Description

--output-dir

String

datasets

Directory where the BLAST database will be stored

--release-date

String

None

Filter sequences by release date (YYYY-MM-DD). Only sequences released on or before this date will be included

--cores

Integer

1

Number of threads/cores to use

--verbose

Boolean

False

Show snakemake logs

Release Date Filtering

The --release-date parameter allows you to create a reproducible BLAST database by filtering sequences based on their NCBI release date:

# Create database with all sequences released up to June 15, 2023
vqc get-blast-database --release-date 2023-06-15

Behavior:

  • When --release-date is provided:

    • Only sequences with release_date <= specified_date are included

    • The specified date is used as the database version identifier

  • When not provided:

    • All available RefSeq sequences are included

    • Current date is used as the database version identifier

This is useful for:

  • Reproducibility: Recreate the same database at different points in time

  • Auditing: Track which sequences were available at a specific date

  • Comparative studies: Analyze how results change with database updates

Database Version

The database version is recorded in blast.tsv metadata file:

  • Format: ncbi-refseq-virus_YYYY-MM-DD

  • Uses the --release-date value if provided, otherwise the current date

Output Structure

datasets/
├── blast.fasta          # Reference sequences
├── blast.fasta.ndb      # BLAST database files
├── blast.fasta.nhr
├── blast.fasta.nin
├── blast.fasta.nsq
├── blast.tsv            # Metadata with version info
└── blast_gff/           # GFF3 files for generic analysis

run

Main analysis command. Identifies viruses, performs quality control, and extracts target regions.

Usage

vqc run --input my_sequences.fasta

Required Parameters

Parameter

Type

Description

--input

String

Path to the input FASTA file

Output Parameters

Parameter

Type

Default

Description

--output-dir

String

outputs

Working directory. Results will be stored in an outputs/ subdirectory within this folder.

--output-file

String

results.tsv

Results file (.tsv, .csv, or .json)

Dataset Parameters

Parameter

Type

Default

Description

--datasets-dir

String

datasets

Path to Nextclade datasets directory

Nextclade Sort Parameters

Parameter

Type

Default

Description

--ns-min-score

Float

0.1

Minimum score for valid match

--ns-min-hits

Integer

10

Minimum hits for valid match

BLAST Parameters

Parameter

Type

Default

Description

--blast-database

String

datasets/blast.fasta

Path to BLAST database

--blast-database-metadata

String

datasets/blast.tsv

Path to BLAST metadata

--blast-pident

Integer

80

Minimum percent identity (0-100)

--blast-evalue

Float

1e-10

Maximum E-value

--blast-qcov

Integer

80

Minimum query coverage (0-100)

--blast-task

String

megablast

BLAST task type

BLAST Task Types

The --blast-task parameter controls the BLAST algorithm sensitivity:

Task

Description

Use Case

megablast

Highly similar sequences (default)

Fast, same species

dc-megablast

Discontiguous megablast

Cross-species, more sensitive

blastn

Traditional BLASTN

More distant sequences

blastn-short

Short sequences

Sequences < 50 bp

Examples:

# Default (megablast) - fast, for similar sequences
vqc run --input seqs.fasta

# More sensitive search for distant viruses
vqc run --input seqs.fasta --blast-task dc-megablast

# Traditional BLASTN for divergent sequences
vqc run --input seqs.fasta --blast-task blastn

System Parameters

Parameter

Type

Default

Description

--cores

Integer

1

Number of threads/cores

--verbose

Boolean

False

Show snakemake logs

Complete Example

vqc run \
  --input samples.fasta \
  --output-dir results \
  --output-file report.tsv \
  --blast-pident 75 \
  --blast-task dc-megablast \
  --cores 8

Analysis Workflow

  1. Nextclade Sort: Maps sequences to local datasets

  2. BLAST Analysis: Identifies unmapped sequences

  3. Nextclade Run: Quality control analysis

  4. Post-processing: Combines and scores results

  5. Region Extraction: Extracts target regions based on quality

API

How the viralQC is designed to integrate with viral genomic databases, it is possible to integrate the analysis module in the code of other applications.

This can be done by importing the RunAnalysis class from the viralqc.core.run_analysis module. This class has the run method that executes the quality analysis of a viral genome, receiving as parameter the path to the FASTA file containing the sequences to be analyzed. Other parameters can be informed in an optimized way.

Usage

from viralqc.core.run_analysis import RunAnalysis

input_file = "seqs.fasta"
output_directory = "results"
output_file = "results.json"
run_analysis = RunAnalysis()

snakemake_response = run_analysis.run(
        sequences_fasta=input_file,
        output_dir=output_directory,
        output_file=output_file
)

Or a flexible approach:

from viralqc.core.run_analysis import RunAnalysis

input_file = "seqs.fasta"
output_directory = "results"
output_file = "results.json"
run_analysis = RunAnalysis()

snakemake_response = run_analysis.run(
        sequences_fasta=input_file,
        output_dir=output_directory,
        output_file=output_file,
        cores=2,
        datasets_local_path="datasets",
        nextclade_sort_min_score=0.1,
        nextclade_sort_min_hits=10,
        blast_database="datasets/blast.fasta",
        blast_database_metadata="datasets/blast.tsv",
        blast_identity_threshold=0,
        blast_evalue=0.01,
        blast_qcov=0,
        blast_task="blastn"
)

To check the results:

if snakemake_response.status == 200:
    results_data = snakemake_response.get_results()
    for seq_result in results_data:
        virus = seq_result.get("virus")
        quality = seq_result.get("genomeQuality")
        coverage = seq_result.get("coverage")
        print(virus, quality, coverage)
else:
    raise Exception(snakemake_response.format_log())

Attributes and Methods

The run method returns a SnakemakeResponse object that has the following attributes:

Attribute

Type

Description

run_id

str

Execution ID

status

RunStatus

Execution status, which can be 200 (success) or 500 (failure)

log_path

str

Path to the log file

results_path

str

Path to the results file

And the following methods:

Method

Description

format_log()

Returns the log file content formatted

get_results()

Returns the results file content in dictionary format