Dataset Configuration
ViralQC uses a configuration file called datasets.yml to define which viruses and datasets are available for analysis. This file is located at viralqc/config/datasets.yml.
datasets.yml File Structure
The file has two main sections:
nextclade_data: Datasets hosted in the official Nextclade repository
github: Custom datasets hosted on GitHub
nextclade_data:
  virus-identifier:
    dataset: "dataset/path"
    tag: "dataset-version"
    virus_name: "Virus Name"
    # ... other parameters
github:
  virus-identifier:
    repository: "user/repository"
    dataset: "dataset/path"
    tag: "development"
    virus_name: "Virus Name"
    # ... other parameters
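The two-section layout above can be walked with a few lines of Python. This is a minimal sketch, not ViralQC's own loader: `CONFIG` is a hypothetical stand-in for the dictionary you would get after parsing datasets.yml (for example with a YAML library), and `iter_datasets` and the identifier `other-virus-identifier` are illustrative names.

```python
from typing import Any, Dict, Iterator, Optional, Tuple

# Hypothetical stand-in for the parsed contents of datasets.yml;
# the values are the placeholders from the template above.
CONFIG: Dict[str, Optional[Dict[str, Dict[str, Any]]]] = {
    "nextclade_data": {
        "virus-identifier": {
            "dataset": "dataset/path",
            "tag": "dataset-version",
            "virus_name": "Virus Name",
        },
    },
    "github": {
        "other-virus-identifier": {
            "repository": "user/repository",
            "dataset": "dataset/path",
            "tag": "development",
            "virus_name": "Virus Name",
        },
    },
}

# The two top-level sections described in this document.
SECTIONS = ("nextclade_data", "github")


def iter_datasets(config: Dict[str, Any]) -> Iterator[Tuple[str, str, Dict[str, Any]]]:
    """Yield (section, virus_id, params) for every configured dataset."""
    for section in SECTIONS:
        # A section may be missing or empty (None after YAML parsing).
        for virus_id, params in (config.get(section) or {}).items():
            yield section, virus_id, params


for section, virus_id, params in iter_datasets(CONFIG):
    print(f"{section}: {virus_id} -> {params['dataset']} @ {params['tag']}")
```

Keeping the section name alongside each entry makes it easy to tell later whether a dataset comes from the Nextclade repository or from GitHub.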
Nextclade Datasets
Nextclade datasets are official or community datasets available through the nextclade_data repository.
nextclade_data:
  denv1:
    dataset: "community/v-gen-lab/dengue/denv1"
    tag: "2025-04-02--19-11-08Z"
    virus_name: "Dengue virus type 1"
    virus_tax_id: 11053
    virus_species: "Orthoflavivirus denguei"
    virus_species_tax_id: 3052464
    segment: "Unsegmented"
    ncbi_id: "NC_001477.1"
    target_gene: "E"
    target_regions: ["C", "prM", "E"]
    private_mutation_total_threshold: 70
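To inspect the same dataset outside ViralQC, the dataset and tag fields can be passed to the Nextclade CLI. The sketch below assumes they correspond to the `--name` and `--tag` options of `nextclade dataset get`; how ViralQC itself downloads datasets is not described here. The function name `nextclade_get_command` and the output directory are hypothetical.

```python
from typing import Dict, List


def nextclade_get_command(entry: Dict[str, object], output_dir: str) -> List[str]:
    """Build a `nextclade dataset get` argument list from a nextclade_data entry."""
    return [
        "nextclade", "dataset", "get",
        "--name", str(entry["dataset"]),   # assumed mapping: dataset -> --name
        "--tag", str(entry["tag"]),        # assumed mapping: tag -> --tag
        "--output-dir", output_dir,
    ]


# Values copied from the denv1 entry above; the output directory is illustrative.
denv1 = {
    "dataset": "community/v-gen-lab/dengue/denv1",
    "tag": "2025-04-02--19-11-08Z",
}
command = nextclade_get_command(denv1, "datasets/denv1")
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where Nextclade is installed.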
GitHub Datasets
Custom datasets can be hosted in GitHub repositories.
github:
  zikav:
    repository: "dezordi/nextclade_data_workflows"
    dataset: "zikaV/dataset"
    tag: "development"
    virus_name: "Zika virus"
    virus_tax_id: 64320
    virus_species: "Orthoflavivirus zikaense"
    virus_species_tax_id: 3048459
    segment: "Unsegmented"
    ncbi_id: "NC_035889.1"
    target_gene: "E"
    target_regions: ["C", "prM", "E"]
    private_mutation_total_threshold: 40
Configuration Parameters
| Parameter | Type | Description |
|---|---|---|
| dataset | String | Dataset path in nextclade_data or GitHub repositories |
| repository | String | GitHub repository name (for example, "dezordi/nextclade_data_workflows") |
| tag | String | Dataset version/tag or repository branch |
| virus_name | String | Full virus name |
| virus_tax_id | Integer | Virus taxonomic ID in NCBI Taxonomy |
| virus_species | String | Viral species name |
| virus_species_tax_id | Integer | Species taxonomic ID |
| segment | String | Segment name (use “Unsegmented” for non-segmented viruses) |
| ncbi_id | String | Reference genome accession in NCBI |
| target_gene | String | Target gene/CDS name |
| target_regions | List | List of target genes/CDS |
| private_mutation_total_threshold | Integer | Private mutation threshold for quality control |
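The types listed above can be checked before a new entry is added to datasets.yml. The following is a minimal sketch under stated assumptions: it works on an already-parsed entry (a plain dictionary), only checks the types of the parameters that are present, and makes no claim about which parameters ViralQC treats as required. `EXPECTED_TYPES` and `check_entry` are hypothetical names.

```python
from typing import Any, Dict, List

# Expected Python types after YAML parsing, taken from the parameter table.
EXPECTED_TYPES: Dict[str, type] = {
    "dataset": str,
    "repository": str,
    "tag": str,
    "virus_name": str,
    "virus_tax_id": int,
    "virus_species": str,
    "virus_species_tax_id": int,
    "segment": str,
    "ncbi_id": str,
    "target_gene": str,
    "target_regions": list,
    "private_mutation_total_threshold": int,
}


def check_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of type problems found in one dataset entry."""
    problems = []
    for key, value in entry.items():
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            continue  # parameters not listed in the table are left unchecked
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# Values copied from the zikav example in this document.
zikav = {
    "repository": "dezordi/nextclade_data_workflows",
    "dataset": "zikaV/dataset",
    "tag": "development",
    "virus_name": "Zika virus",
    "virus_tax_id": 64320,
    "virus_species": "Orthoflavivirus zikaense",
    "virus_species_tax_id": 3048459,
    "segment": "Unsegmented",
    "ncbi_id": "NC_035889.1",
    "target_gene": "E",
    "target_regions": ["C", "prM", "E"],
    "private_mutation_total_threshold": 40,
}
print(check_entry(zikav))
```

A common mistake this catches is quoting a numeric field, for example writing virus_tax_id: "64320", which YAML parses as a string instead of an integer.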
Note
For non-segmented viruses (e.g., Dengue, Zika, SARS-CoV-2), use "Unsegmented" for the segment field.
For segmented viruses, specify the segment name (e.g., "HA", "NA", "L", "M", "S").