Output Structure
When you run run-from-fasta, ViralQC creates the following output:
outputs/                                  # User specified output directory (e.g., --output-dir my_results)
├── .snakemake/                           # Snakemake run files
└── outputs/                              # ViralQC output files
    ├── identified_datasets/
    │   ├── datasets_selected.tsv
    │   ├── viruses.tsv
    │   ├── viruses.external_datasets.tsv
    │   ├── unmapped_sequences.txt
    │   └── <virus>/sequences.fa
    ├── blast_results/
    │   ├── unmapped_sequences.blast.tsv
    │   └── blast_viruses.list
    ├── nextclade_results/
    │   ├── <virus>.nextclade.tsv
    │   └── <accession>.generic.nextclade.tsv
    ├── gff_files/
    │   ├── <virus>.nextclade.gff
    │   ├── <accession>.generic.nextclade.gff
    │   └── per_sample/
    │       └── <id>_<sample_name>.gff
    ├── tbl_files/
    │   ├── <virus>.nextclade.tbl
    │   ├── <accession>.generic.nextclade.tbl
    │   └── per_sample/
    │       └── <id>_<sample_name>.tbl
    ├── logs/
    │   ├── nextclade_sort.log
    │   ├── blast.log
    │   └── ...
    ├── results.tsv
    ├── sequences_target_regions.bed
    └── sequences_target_regions.fasta
Main File: results.tsv (or .csv, .json)
This file contains the consolidated results from all analyses:
1. Sequence Identification
| Column | Type | Description |
|---|---|---|
| | String | Sequence name in the input FASTA file |
| | String | Identified virus name |
| | Integer | Virus taxonomic ID in NCBI Taxonomy |
| | String | Viral species name |
| | Integer | Species taxonomic ID |
| | String | Genomic segment (e.g., “HA”, “NA”, “Unsegmented”) |
| | String | Reference genome accession in NCBI |
| | String | Dataset identifier used |
| | String | Dataset version/tag |
| | String | Phylogenetic clade (when available) |
3. Quality Metrics (Nextclade)
| Column | Type | Description |
|---|---|---|
| | Float | Nextclade overall quality score |
| | String | Nextclade quality status (good, mediocre, bad) |
| | Integer | Total private mutations (Nextclade) |
| | Float | Private mutations score (Nextclade) |
| | String | Private mutations status (Nextclade) |
| | Float | Missing data score (Nextclade) |
| | String | Missing data status (Nextclade) |
| | Integer | Total mixed sites (Nextclade) |
| | Float | Mixed sites score (Nextclade) |
| | String | Mixed sites status (Nextclade) |
| | Integer | Total clustered SNPs (Nextclade) |
| | Float | SNP clusters score (Nextclade) |
| | String | SNP clusters status (Nextclade) |
| | Integer | Total frameshifts (Nextclade) |
| | Float | Frameshifts score (Nextclade) |
| | String | Frameshifts status (Nextclade) |
| | Integer | Total stop codons (Nextclade) |
| | Float | Stop codons score (Nextclade) |
| | String | Stop codons status (Nextclade) |
4. Coverage and Regions
| Column | Type | Description |
|---|---|---|
| | Float | Genome coverage (0.0 to 1.0) |
| | String | Coverage of each CDS (format: “gene1: 0.98, gene2: 1.0”) |
| | String | Coverage of target regions defined in |
| | String | Coverage of target gene defined in |
| | String | List of target regions (separated by \|) |
| | String | Main target gene name |
5. Nucleotide Mutations
| Column | Type | Description |
|---|---|---|
| | Integer | Total nucleotide substitutions |
| | Integer | Total nucleotide deletions |
| | Integer | Total nucleotide insertions |
| | Integer | Total frameshift mutations |
| | Integer | Total missing nucleotides (N’s or gaps) |
| | Integer | Total non-ACGTN characters |
| | String | List of substitutions (format: gene:pos:ref>alt) |
| | String | List of deletions |
| | String | List of insertions |
| | String | List of frameshifts |
| | Float | Alignment score |
6. Amino Acid Mutations
| Column | Type | Description |
|---|---|---|
| | Integer | Total amino acid substitutions |
| | Integer | Total amino acid deletions |
| | Integer | Total amino acid insertions |
| | Integer | Total unknown amino acids |
| | String | List of amino acid substitutions |
| | String | List of amino acid deletions |
| | String | List of amino acid insertions |
7. Private Mutations (Detailed)
| Column | Type | Description |
|---|---|---|
| | Integer | Total private substitutions |
| | Integer | Total known/cataloged private mutations |
| | Integer | Total uncataloged private mutations |
| | Integer | Total reversions (mutations that revert to ancestral reference) |
Note on output formats:
TSV/CSV: All columns are strings or numeric values
JSON: Columns like cdsCoverage, cdsCoverageQuality, and targetRegionsCoverage are formatted as arrays of objects for easier programmatic parsing
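
In the TSV/CSV outputs, the per-CDS coverage is a single string in the documented format “gene1: 0.98, gene2: 1.0”. The following is a minimal sketch of turning such a cell into a dictionary. The function name `parse_cds_coverage` is hypothetical and not part of ViralQC, and the sketch assumes every entry has the form `name: value`.

```python
def parse_cds_coverage(cell: str) -> dict[str, float]:
    """Parse a cell like 'gene1: 0.98, gene2: 1.0' into {gene: coverage}.

    Hypothetical helper, not part of ViralQC. Assumes each comma-separated
    entry is 'name: value'; an empty cell yields an empty dict.
    """
    coverage: dict[str, float] = {}
    for item in cell.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the last colon so the numeric value is always on the right.
        gene, _, value = item.rpartition(":")
        coverage[gene.strip()] = float(value)
    return coverage


print(parse_cds_coverage("gene1: 0.98, gene2: 1.0"))
```

With the JSON output this step should not be needed, since those columns are already arrays of objects.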
Target Regions Files
sequences_target_regions.bed
seq1 94 2419 C,prM,E
seq2 0 10735 genome
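
The example lines above can be read with a few lines of Python. This is a sketch under assumptions: it treats the first three columns as standard BED fields (sequence name, 0-based start, exclusive end) and assumes the fourth column holds a comma-separated list of region names, as the example suggests. `TargetRegion` and `parse_bed_line` are hypothetical names, not part of ViralQC.

```python
from typing import NamedTuple


class TargetRegion(NamedTuple):
    seq_name: str
    start: int          # 0-based, inclusive (BED convention)
    end: int            # exclusive (BED convention)
    regions: list[str]  # assumed: comma-separated names in column 4

    @property
    def length(self) -> int:
        # Half-open interval, so the length is simply end - start.
        return self.end - self.start


def parse_bed_line(line: str) -> TargetRegion:
    """Parse one line of sequences_target_regions.bed (hypothetical helper)."""
    line = line.rstrip("\n")
    # Prefer tab-separated fields; fall back to any whitespace.
    fields = line.split("\t") if "\t" in line else line.split()
    seq_name, start, end, name = fields[:4]
    return TargetRegion(seq_name, int(start), int(end), name.split(","))


for raw in ["seq1\t94\t2419\tC,prM,E", "seq2\t0\t10735\tgenome"]:
    region = parse_bed_line(raw)
    print(region.seq_name, region.length, region.regions)
```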
sequences_target_regions.fasta
Extracted sequences from regions meeting quality criteria.
Annotation Files
gff_files/
Contains GFF3 annotation files produced by Nextclade for each virus dataset. Each file covers all samples analyzed for that virus (identified by numeric IDs).
| File | Description |
|---|---|
| | Multi-sample GFF from the standard Nextclade run |
| | Multi-sample GFF from the BLAST-based generic Nextclade run |
gff_files/per_sample/
One GFF file per sample, automatically split from the combined output above. The numeric sequence ID used internally by ViralQC is replaced with the original sample header throughout the file (in the sequence-region directive, in column 1, and in the feature attributes).
Filename format:
{id}_{sample_name}.gff
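
To map a per-sample file back to its sample, the filename can be split at the first underscore. This is a sketch that assumes the `{id}` part is numeric and contains no underscore, while the sample name may contain underscores or dots. `split_per_sample_name` is a hypothetical helper, not part of ViralQC.

```python
from pathlib import Path


def split_per_sample_name(path: str) -> tuple[int, str]:
    """Split '<id>_<sample_name>.<ext>' into (id, sample_name).

    Hypothetical helper. Assumes the id is numeric; everything after the
    first underscore is treated as the sample name.
    """
    stem = Path(path).stem  # drops only the final extension (.gff or .tbl)
    sample_id, _, sample_name = stem.partition("_")
    return int(sample_id), sample_name


print(split_per_sample_name("gff_files/per_sample/12_sample_A.gff"))
```

The same function applies to any file that follows this naming pattern, regardless of extension.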
tbl_files/
Contains 5-column feature table (TBL) annotation files produced by Nextclade for each virus dataset. The TBL format is compatible with NCBI submission tools.
| File | Description |
|---|---|
| | Multi-sample TBL from the standard Nextclade run |
| | Multi-sample TBL from the BLAST-based generic Nextclade run |
tbl_files/per_sample/
One TBL file per sample, automatically split from the combined output above.
The >Feature <id> header in each block is replaced with the original sample
name.
Filename format:
{id}_{sample_name}.tbl
Note: Files in per_sample/ are empty (zero-byte) for samples where Nextclade was executed without a reference GFF file.
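
Because empty files are expected in this case, downstream scripts may want to skip them before parsing. The following sketch lists the zero-byte files in a `per_sample/` directory. `find_empty_annotations` is a hypothetical helper, and the path in the usage comment is only an example.

```python
from pathlib import Path


def find_empty_annotations(per_sample_dir: str) -> list[str]:
    """Return the sorted names of zero-byte files in a per_sample/ directory.

    Hypothetical helper, not part of ViralQC.
    """
    return sorted(
        p.name
        for p in Path(per_sample_dir).iterdir()
        if p.is_file() and p.stat().st_size == 0
    )


# Example usage (path is illustrative):
# find_empty_annotations("my_results/outputs/gff_files/per_sample")
```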