How to Add New Datasets

Adding a Nextclade Dataset

If the dataset is available in the official Nextclade repository:

Step 1: Identify the Dataset

nextclade dataset list

Step 2: Edit datasets.yml

Add a new entry in viralqc/config/datasets.yml:

nextclade_data:
  my-new-virus:
    dataset: "complete/dataset/path"
    tag: "2025-XX-XX--XX-XX-XXZ"
    virus_name: "Full Virus Name"
    virus_tax_id: 123456
    virus_species: "Species Name"
    virus_species_tax_id: 789012
    segment: "Unsegmented"
    ncbi_id: "NC_XXXXXX.X"
    target_gene: "gene_name"
    target_regions: ["gene1", "gene2"]
    private_mutation_total_threshold: 50

Step 3: Obtain Taxonomic Information

Consult NCBI Taxonomy for:

  • virus_tax_id: Virus taxonomic ID

  • virus_species_tax_id: Species taxonomic ID

Step 4: Test

vqc get-nextclade-datasets --cores 2

Adding a GitHub Dataset

For custom datasets hosted on GitHub:

Step 1: Prepare Repository

Your repository must contain:

your-repository/
└── dataset/path/
    ├── reference.fasta
    ├── genome_annotation.gff3
    ├── tree.json (optional)
    ├── pathogen.json
    └── sequences.fasta (optional)

Step 2: Add to datasets.yml

github:
  my-custom-virus:
    repository: "your-user/your-repository"
    dataset: "path/within/repo"
    tag: "main"
    virus_name: "My Custom Virus"
    virus_tax_id: 123456
    virus_species: "Species Name"
    virus_species_tax_id: 789012
    segment: "Unsegmented"
    ncbi_id: "NC_XXXXXX.X"
    target_gene: "gene1"
    target_regions: ["gene1", "gene2"]
    private_mutation_total_threshold: 40

Step 3: Test

vqc get-nextclade-datasets --cores 2

Verify dataset downloaded to datasets/external_datasets/my-custom-virus/.