# How to Add New Datasets

## Adding a Nextclade Dataset

If the dataset is available in the official Nextclade repository:

### Step 1: Identify the Dataset

```bash
nextclade dataset list
```

### Step 2: Edit datasets.yml

Add a new entry in `viralqc/config/datasets.yml`:

```yaml
nextclade_data:
  my-new-virus:
    dataset: "complete/dataset/path"
    tag: "2025-XX-XX--XX-XX-XXZ"
    virus_name: "Full Virus Name"
    virus_tax_id: 123456
    virus_species: "Species Name"
    virus_species_tax_id: 789012
    segment: "Unsegmented"
    ncbi_id: "NC_XXXXXX.X"
    target_gene: "gene_name"
    target_regions: ["gene1", "gene2"]
    private_mutation_total_threshold: 50
```

### Step 3: Obtain Taxonomic Information

Consult [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy) for:

- `virus_tax_id`: Virus taxonomic ID
- `virus_species_tax_id`: Species taxonomic ID

### Step 4: Test

```bash
vqc get-nextclade-datasets --cores 2
```

---

## Adding a GitHub Dataset

For custom datasets hosted on GitHub:

### Step 1: Prepare Repository

Your repository must contain:

```
your-repository/
└── dataset/path/
    ├── reference.fasta
    ├── genome_annotation.gff3
    ├── tree.json (optional)
    ├── pathogen.json
    └── sequences.fasta (optional)
```

### Step 2: Add to datasets.yml

```yaml
github:
  my-custom-virus:
    repository: "your-user/your-repository"
    dataset: "path/within/repo"
    tag: "main"
    virus_name: "My Custom Virus"
    virus_tax_id: 123456
    virus_species: "Species Name"
    virus_species_tax_id: 789012
    segment: "Unsegmented"
    ncbi_id: "NC_XXXXXX.X"
    target_gene: "gene1"
    target_regions: ["gene1", "gene2"]
    private_mutation_total_threshold: 40
```

### Step 3: Test

```bash
vqc get-nextclade-datasets --cores 2
```

Verify that the dataset was downloaded to `datasets/external_datasets/my-custom-virus/`.
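
The final verification can go one step further than confirming the directory exists. The sketch below checks that a dataset directory contains the three files listed as required in the repository layout (`reference.fasta`, `genome_annotation.gff3`, `pathogen.json`). The function name `check_dataset_dir` is hypothetical and not part of `vqc`. It also assumes that the downloaded directory mirrors the repository layout, which this guide does not confirm.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of vqc: check that a dataset directory
# contains the files this guide lists as required.
# Assumption: the downloaded dataset keeps the same file names as the repository.
check_dataset_dir() {
  local dir="$1"
  local missing=0
  local f
  # tree.json and sequences.fasta are optional, so they are not checked.
  for f in reference.fasta genome_annotation.gff3 pathogen.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Returns 0 when all required files are present, 1 otherwise.
  return "$missing"
}
```

After sourcing the function, running `check_dataset_dir datasets/external_datasets/my-custom-virus` prints any missing file to stderr and returns a non-zero status. The same check can be run against the local clone of your repository before adding the entry to `datasets.yml`.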