Dataset Configuration

ViralQC uses a configuration file called datasets.yml to define which viruses and datasets are available for analysis. This file is located at viralqc/config/datasets.yml.

datasets.yml File Structure

The file has two main sections:

  1. nextclade_data: Datasets hosted in the official Nextclade repository

  2. github: Custom datasets hosted on GitHub

nextclade_data:
  virus-identifier:
    dataset: "dataset/path"
    tag: "dataset-version"
    virus_name: "Virus Name"
    # ... other parameters

github:
  virus-identifier:
    repository: "user/repository"
    dataset: "dataset/path"
    tag: "development"
    virus_name: "Virus Name"
    # ... other parameters
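
To illustrate how the two sections relate, here is a minimal Python sketch that walks both and lists every dataset. It is not ViralQC's own loading code. The `config` dictionary is a hand-written stand-in for what a YAML parser (for example PyYAML's `yaml.safe_load`) would return for this file, and `iter_datasets` is a hypothetical helper name.

```python
# Hypothetical in-memory equivalent of a parsed datasets.yml. Values are taken
# from the examples on this page; most parameters are omitted for brevity.
config = {
    "nextclade_data": {
        "denv1": {
            "dataset": "community/v-gen-lab/dengue/denv1",
            "tag": "2025-04-02--19-11-08Z",
            "virus_name": "Dengue virus type 1",
        },
    },
    "github": {
        "zikav": {
            "repository": "dezordi/nextclade_data_workflows",
            "dataset": "zikaV/dataset",
            "tag": "development",
            "virus_name": "Zika virus",
        },
    },
}

SECTIONS = ("nextclade_data", "github")


def iter_datasets(cfg):
    """Yield (section, identifier, entry) for every dataset in both sections."""
    for section in SECTIONS:
        # A section may be absent or empty (parsed as None); treat both as {}.
        for identifier, entry in (cfg.get(section) or {}).items():
            yield section, identifier, entry


listing = [(s, i, e["virus_name"]) for s, i, e in iter_datasets(config)]
for row in listing:
    print(row)
```

Keeping the section name next to each entry matters because only entries under github carry a repository field.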

Nextclade Datasets

Nextclade datasets are official or community datasets available through the nextclade_data repository.

nextclade_data:
  denv1:
    dataset: "community/v-gen-lab/dengue/denv1"
    tag: "2025-04-02--19-11-08Z"
    virus_name: "Dengue virus type 1"
    virus_tax_id: 11053
    virus_species: "Orthoflavivirus denguei"
    virus_species_tax_id: 3052464
    segment: "Unsegmented"
    ncbi_id: "NC_001477.1"
    target_gene: "E"
    target_regions: ["C", "prM", "E"]
    private_mutation_total_threshold: 70

GitHub Datasets

Custom datasets can be hosted in GitHub repositories. Each entry adds a repository field in user/repo format alongside the dataset path.

github:
  zikav:
    repository: "dezordi/nextclade_data_workflows"
    dataset: "zikaV/dataset"
    tag: "development"
    virus_name: "Zika virus"
    virus_tax_id: 64320
    virus_species: "Orthoflavivirus zikaense"
    virus_species_tax_id: 3048459
    segment: "Unsegmented"
    ncbi_id: "NC_035889.1"
    target_gene: "E"
    target_regions: ["C", "prM", "E"]
    private_mutation_total_threshold: 40

Configuration Parameters

Parameter                         Type     Description
--------------------------------  -------  ---------------------------------------------------------
dataset                           String   Dataset path in nextclade_data or GitHub repositories
repository                        String   GitHub repository name in user/repo format (for github)
tag                               String   Dataset version/tag or repository branch
virus_name                        String   Full virus name
virus_tax_id                      Integer  Virus taxonomic ID in NCBI Taxonomy
virus_species                     String   Viral species name
virus_species_tax_id              Integer  Species taxonomic ID
segment                           String   Segment name (use "Unsegmented" for non-segmented viruses)
ncbi_id                           String   Reference genome accession in NCBI
target_gene                       String   Target gene/CDS name
target_regions                    List     List of target genes/CDS
private_mutation_total_threshold  Integer  Private mutation threshold for quality control
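
The types listed above can be checked mechanically before a new entry is added. The following Python sketch is one way to do that and is not part of ViralQC. `PARAM_TYPES` and `check_entry` are hypothetical names, and the sketch assumes that every listed parameter is required and that repository applies only to the github section, neither of which is stated explicitly here.

```python
# Expected Python type for each parameter, following the table above.
PARAM_TYPES = {
    "dataset": str,
    "repository": str,
    "tag": str,
    "virus_name": str,
    "virus_tax_id": int,
    "virus_species": str,
    "virus_species_tax_id": int,
    "segment": str,
    "ncbi_id": str,
    "target_gene": str,
    "target_regions": list,
    "private_mutation_total_threshold": int,
}


def check_entry(section, entry):
    """Return a list of problems found in one dataset entry (empty if none)."""
    problems = []
    for name, expected in PARAM_TYPES.items():
        # Assumption: `repository` only applies to the github section.
        if name == "repository" and section != "github":
            continue
        # Assumption: every other parameter is treated as required.
        if name not in entry:
            problems.append(f"missing parameter: {name}")
            continue
        value = entry[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# The denv1 entry from the Nextclade Datasets example on this page.
denv1 = {
    "dataset": "community/v-gen-lab/dengue/denv1",
    "tag": "2025-04-02--19-11-08Z",
    "virus_name": "Dengue virus type 1",
    "virus_tax_id": 11053,
    "virus_species": "Orthoflavivirus denguei",
    "virus_species_tax_id": 3052464,
    "segment": "Unsegmented",
    "ncbi_id": "NC_001477.1",
    "target_gene": "E",
    "target_regions": ["C", "prM", "E"],
    "private_mutation_total_threshold": 70,
}

print(check_entry("nextclade_data", denv1))
print(check_entry("nextclade_data", dict(denv1, virus_tax_id="11053")))
```

The second call shows a common slip: quoting a taxonomic ID in YAML turns it into a string, which this check reports as a type mismatch.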

Note

For non-segmented viruses (e.g., Dengue, Zika, SARS-CoV-2), use "Unsegmented" for the segment field. For segmented viruses, specify the segment name (e.g., "HA", "NA", "L", "M", "S").
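
Code that reads the configuration can rely on this convention to tell the two cases apart. The Python sketch below assumes the comparison is an exact, case-sensitive match on "Unsegmented". `is_segmented` is a hypothetical helper, and the second entry uses a placeholder identifier rather than a real dataset.

```python
UNSEGMENTED = "Unsegmented"  # value documented in the note above


def is_segmented(entry):
    """True when the entry describes one segment of a segmented virus."""
    return entry["segment"] != UNSEGMENTED


# Hypothetical, heavily trimmed entries: only the segment field is shown.
examples = {
    "zikav": {"segment": "Unsegmented"},
    "some-virus-ha": {"segment": "HA"},  # placeholder identifier, not a real dataset
}

segmented = [name for name, entry in examples.items() if is_segmented(entry)]
print(segmented)
```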