Preparing NCBI Submissions

viralQC includes a prepare-ncbi-submission command to organize sequences and generate metadata CSVs formatted for NCBI submission.

The command contains two sub-commands:

virus: Groups and organizes sequences by virus (or by type/segment).
sample: Groups sequences by virus as well, but only for the sample IDs you select.

Both sub-commands work on the output files generated by vqc run and accept the following options:

--results: The main ViralQC results file (e.g. results.tsv).
--sequences-vqc: The target-region FASTA file (e.g. sequences_target_regions.fasta) generated by vqc run. It contains the target sequences processed by ViralQC and takes priority over the original input.
--sequences-input: The original input FASTA file you passed to vqc run. (Note: The command uses the sequence from --sequences-vqc whenever one exists. If a sequence was filtered or dropped by ViralQC (for lack of quality information) but you still want to submit it, it is pulled from --sequences-input.)
--output-prefix (optional, default ncbi_submission): Prefix used for the generated output directories.
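
The priority rule between the two FASTA files can be pictured with a small Python sketch. This is only an illustration of the behaviour described above, not ViralQC's actual code: the function name pick_sequences and the in-memory dicts standing in for the two FASTA files are hypothetical.

```python
def pick_sequences(wanted_ids, vqc_seqs, input_seqs):
    """Return {id: sequence}, preferring ViralQC target sequences.

    vqc_seqs and input_seqs are plain dicts (sequence ID -> sequence string),
    used here as hypothetical stand-ins for --sequences-vqc and --sequences-input.
    """
    chosen = {}
    for seq_id in wanted_ids:
        if seq_id in vqc_seqs:
            # The processed target region takes priority.
            chosen[seq_id] = vqc_seqs[seq_id]
        elif seq_id in input_seqs:
            # Fall back to the original input sequence.
            chosen[seq_id] = input_seqs[seq_id]
        # IDs found in neither file are simply left out of this sketch.
    return chosen


vqc = {"S001": "ACGT"}
original = {"S001": "NNACGTNN", "S002": "TTGA"}
selected = pick_sequences(["S001", "S002", "S003"], vqc, original)
print(selected)
```

Here S001 comes from the ViralQC output, S002 falls back to the original input, and S003 is absent from both.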

Grouping by Virus (`virus`)

When using the virus sub-command, the sequences are bundled into directories specific to the virus or segment.

vqc prepare-ncbi-submission virus [SUBCOMMAND] \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta

Supported Viruses

For Dengue, Influenza, Norovirus, and SARS-CoV-2, NCBI has specific submission requirements. Each requires specific columns in the metadata.csv file, and these subcommands format the metadata CSV accordingly.

SARS-CoV-2

vqc prepare-ncbi-submission virus sars-cov-2 \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv

Organizes all SARS-CoV-2 sequences into ncbi_submission_SARS-CoV-2/.

Dengue

vqc prepare-ncbi-submission virus dengue \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv

Organizes sequences by type, creating directories such as ncbi_submission_Dengue1/, ncbi_submission_Dengue2/, etc., depending on the serotypes identified in the ViralQC analysis.

Influenza

vqc prepare-ncbi-submission virus influenza \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv

Organizes sequences by type with subdirectories for each segment, creating directories like ncbi_submission_InfluenzaA/HA/, ncbi_submission_InfluenzaA/NA/, ncbi_submission_InfluenzaB/HA/, etc.

Norovirus

vqc prepare-ncbi-submission virus norovirus \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv

Organizes sequences by genogroup, creating subdirectories such as ncbi_submission_Norovirus/GI/, ncbi_submission_Norovirus/GII/, etc., depending on the genogroups identified in the ViralQC analysis. Supported genogroups are GI through GVI.

Custom Viruses

vqc prepare-ncbi-submission virus custom \
    --virus-name "Respiratory syncytial virus A" \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv

Organizes all sequences matching the given --virus-name into a single directory. Non-standard viruses automatically get the [Organism=...] qualifier added to their FASTA headers based on the virus_species identified. The name provided must match the value in the virus field of the results file.

If the virus has annotated segments (e.g., S, M, L), you can pass the --split-by-segments flag to organize sequences into per-segment subdirectories (e.g., ncbi_submission_Oropouche_virus/S/, ncbi_submission_Oropouche_virus/M/, ncbi_submission_Oropouche_virus/L/). Each subdirectory will contain its own sequences.fasta, metadata.tsv, annotation.tbl, and log files. Without this flag, all sequences are placed in a single flat directory.

Additionally, for custom viruses, you can pass --tbl-dir pointing to a folder with per-sample .tbl annotation files. These will be concatenated into a single annotation.tbl file alongside the FASTA sequences. If you omit this option, the command tries to locate each .tbl file using the tbl_path field in the results file.

Grouping by Sample (`sample`)

If you prefer to organize submissions just for specific samples, or for all samples independently of the virus, use the sample sub-command.

vqc prepare-ncbi-submission sample \
    --sample <sample_id_1> \
    --sample <sample_id_2> \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv

You can pass multiple --sample options, or provide a text file with one ID per line via --sample-ids samples.txt.

To process all samples present in the results file, simply use:

vqc prepare-ncbi-submission sample --sample all ...

To process the samples listed in a file of sample IDs, use the --sample-ids option:

vqc prepare-ncbi-submission sample --sample-ids samples.txt ...

This creates one directory per virus (e.g., ncbi_submission_Dengue1/). Sequences that were skipped or lacked required data (most commonly .tbl files) are reported in a [prefix]_skipped.tsv file.
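
When scripting around the --sample-ids file, it can help to parse the same one-ID-per-line format in Python, for example to pass the list on to the Python API. The helper below is a hypothetical sketch; skipping blank lines and stripping whitespace are assumptions, not documented behaviour of the command.

```python
import io


def read_sample_ids(handle):
    """Return the non-empty, stripped lines of a text stream (one sample ID per line)."""
    return [line.strip() for line in handle if line.strip()]


# io.StringIO stands in for an opened samples.txt file.
ids = read_sample_ids(io.StringIO("S001\nS002\n\n  S003  \n"))
print(ids)
```

With a real file, the same function can be called as read_sample_ids(open("samples.txt")).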

Metadata CSV

For all submission commands, you can (or must, for predefined viruses) provide an input metadata CSV file via --metadata.

Input Columns

The input CSV can contain the following columns. Their necessity depends on the virus type being processed:

Required Columns

Column	Description	Dengue, Influenza & Norovirus	SARS-CoV-2	Custom Viruses
`Sequence_ID`	Must match the `seqName` exactly as it appears in the results file and FASTA headers. Must be less than 25 characters long.	Required	Required	Required
`geo_loc_name`	The geographical location of the sample (e.g., Country).	Required	Required	Optional
`host`	The natural host of the virus (e.g., `Homo sapiens`). Do not use special characters.	Required	Required	Optional
`isolate`	The isolate name or identifier string.	Required	Required	Optional
`collection-date`	The date the sample was collected, typically using the `YYYY-MM-DD` format.	Required	Required	Optional
`isolation-source`	The source material of the sample (e.g., `Serum`, `Swab`).	Required	Ignored	Optional

Note: For custom viruses, providing a --metadata file is optional. If you do provide one, only Sequence_ID must be present; the other columns are included in the output only if you add them.
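
Before running the command, the required columns from the table above can be checked with a short script. This is a minimal sketch, not part of ViralQC: the function check_metadata and the group names "standard", "sars-cov-2" and "custom" are hypothetical, and the rules are copied from the table (including the Sequence_ID length limit).

```python
import csv
import io

# Required columns per virus group, taken from the table above.
REQUIRED = {
    "standard": ["Sequence_ID", "geo_loc_name", "host", "isolate",
                 "collection-date", "isolation-source"],  # Dengue, Influenza, Norovirus
    "sars-cov-2": ["Sequence_ID", "geo_loc_name", "host", "isolate",
                   "collection-date"],
    "custom": ["Sequence_ID"],
}


def check_metadata(handle, kind):
    """Return a list of human-readable problems found in a metadata CSV stream."""
    problems = []
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED[kind] if col not in header]
    if missing:
        problems.append(f"missing columns: {', '.join(missing)}")
    for row in reader:
        seq_id = row.get("Sequence_ID") or ""
        if len(seq_id) >= 25:
            problems.append(f"{seq_id}: Sequence_ID must be shorter than 25 characters")
    return problems


sample_csv = (
    "Sequence_ID,geo_loc_name,host,isolate,collection-date\n"
    "S001,Brazil,Homo sapiens,isolate/S001/2024,2024-01-01\n"
)
print(check_metadata(io.StringIO(sample_csv), "sars-cov-2"))
print(check_metadata(io.StringIO(sample_csv), "standard"))
```

The example file passes for SARS-CoV-2 but is reported as missing isolation-source for Dengue, Influenza, or Norovirus.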

Optional Columns — Standard Viruses (Dengue, Influenza, Norovirus, SARS-CoV-2)

Any of the following INSDC source modifiers can be added to the metadata CSV. If present, they will be automatically included in the output metadata.csv:

Column (CSV header)	Description
`altitude`	Altitude in metres above or below sea level where the sample was collected.
`collected_by`	Name of the person who collected the sample.
`culture_collection`	Institution code and culture ID (format: `inst:coll:id`).
`haplotype`	Haplotype of the organism.
`lab_host`	Laboratory host used to propagate the organism.
`lat_lon`	Latitude and longitude in decimal degrees (e.g., `15.77 S 47.93 W`).
`note`	Any additional free-text information about the sequence.
`segment`	Name of the viral or phage segment sequenced.
`sex`	Sex of the organism from which the sequence was obtained.
`specimen_voucher`	Institutional identifier for the source specimen.
`strain`	Strain of the organism.
`tissue_type`	Type of tissue from which the sequence was obtained.

Optional Columns — Custom Viruses

For custom viruses, the output metadata is a tab-delimited file whose column names follow the BankIt Title_Case convention. The following columns can be added to the input CSV using the standard lower-case names; they are renamed automatically in the output:

Input column	Output column	Description
`altitude`	`Altitude`	Altitude in metres.
`bio_material`	`Bio_material`	Biological material identifier.
`breed`	`Breed`	Named breed (usually for domesticated mammals).
`cell_line`	`Cell_line`	Cell line from which the sequence was obtained.
`cell_type`	`Cell_type`	Type of cell.
`clone`	`Clone`	Clone from which the sequence was obtained.
`collected_by`	`Collected_by`	Person who collected the sample.
`culture_collection`	`Culture_collection`	Culture collection identifier.
`dev_stage`	`Dev_stage`	Developmental stage of the organism.
`ecotype`	`Ecotype`	Named ecotype.
`fwd_primer_name`	`Fwd_primer_name`	Name of forward PCR primer.
`fwd_primer_seq`	`Fwd_primer_seq`	Sequence of forward PCR primer.
`genotype`	`Genotype`	Genotype of the organism.
`haplogroup`	`Haplogroup`	Haplogroup of the organism.
`haplotype`	`Haplotype`	Haplotype of the organism.
`lab_host`	`Lab_host`	Laboratory host used to propagate the organism.
`lat_lon`	`Lat_Lon`	Latitude and longitude in decimal degrees.
`note`	`Note`	Free-text additional information.
`rev_primer_name`	`Rev_primer_name`	Name of reverse PCR primer.
`rev_primer_seq`	`Rev_primer_seq`	Sequence of reverse PCR primer.
`segment`	`Segment`	Viral or phage segment sequenced.
`serotype`	`Serotype`	Serological variety.
`serovar`	`Serovar`	Serological variety (prokaryote).
`sex`	`Sex`	Sex of the organism.
`specimen_voucher`	`Specimen_voucher`	Specimen voucher identifier.
`strain`	`Strain`	Strain of the organism.
`sub_species`	`Sub_species`	Subspecies.
`tissue_lib`	`Tissue_lib`	Tissue library.
`tissue_type`	`Tissue_type`	Type of tissue.
`variety`	`Variety`	Variety of the organism.

Note: Columns not listed above are silently ignored and will not appear in the output metadata file.
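
The renaming in the table above follows a simple pattern: the first letter is capitalized, with lat_lon → Lat_Lon as the one exception. The sketch below reproduces that mapping for the optional columns only; the names bankit_name and rename_optional are hypothetical and are not ViralQC functions.

```python
# Optional input columns for custom viruses, as listed in the table above.
OPTIONAL_COLUMNS = {
    "altitude", "bio_material", "breed", "cell_line", "cell_type", "clone",
    "collected_by", "culture_collection", "dev_stage", "ecotype",
    "fwd_primer_name", "fwd_primer_seq", "genotype", "haplogroup", "haplotype",
    "lab_host", "lat_lon", "note", "rev_primer_name", "rev_primer_seq",
    "segment", "serotype", "serovar", "sex", "specimen_voucher", "strain",
    "sub_species", "tissue_lib", "tissue_type", "variety",
}


def bankit_name(column):
    """Map a lower-case input column to the output name shown in the table."""
    if column == "lat_lon":
        return "Lat_Lon"  # the only entry with a second capital letter
    return column[:1].upper() + column[1:]


def rename_optional(row):
    """Rename supported optional columns; anything else is dropped from the result."""
    return {bankit_name(k): v for k, v in row.items() if k in OPTIONAL_COLUMNS}


row = {"strain": "BeAn19991", "lat_lon": "15.77 S 47.93 W", "favourite_colour": "blue"}
print(rename_optional(row))
```

The unsupported favourite_colour column disappears from the result, mirroring the note above about ignored columns. Required columns such as Sequence_ID are handled separately and are not covered by this sketch.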

Output Format

The prepare-ncbi-submission command will take your input CSV and generate a final metadata file inside each submission directory. For predefined viruses (SARS-CoV-2, Dengue, Influenza, Norovirus), a comma-delimited metadata.csv is generated. For custom viruses, a tab-delimited metadata.tsv is generated instead. The tool automatically enriches this file with taxonomic and typing data derived from the viralQC results:

Influenza: Adds serotype (e.g., H1N1 or H3N2) extracted directly from the classification.
Dengue: Adds genotype (e.g., 1, 2, 3, or 4) and serotype (the detailed clade assignment).
Norovirus: Adds genotype (e.g., GII, GII.17) extracted from the virus classification. No serotype column is included.
SARS-CoV-2: Removes isolation-source and serotype, as they are not typically included in SARS-CoV-2 NCBI submissions.
Custom Viruses: Renames columns to NCBI-compatible names: geo_loc_name → Country (geo_loc_name), host → Host, isolate → Isolate, collection-date → Collection_date, isolation-source → Isolation_source. No Organism column is added (the organism information is included in the FASTA headers as [Organism=...]).

FASTA Headers and Annotations

FASTA headers are carefully managed during organization:

Sequences failing quality thresholds (e.g. length < 150 nt, or N content ≥ 50%) are excluded from the FASTA. The reason for each exclusion is recorded in the summary.txt file inside the relevant output directory.
Case-insensitive duplicate sequence IDs are automatically detected and removed. NCBI treats sequence IDs as case-insensitive — for example, SEQ001 and seq001 are considered identical by NCBI and would cause an upload error. When a clash is detected, the second occurrence is dropped and the reason is logged in summary.txt:
```
dropped: 1 sequence(s)
  - seq001: Case-insensitive duplicate of 'SEQ001' (NCBI treats IDs as case-insensitive)
```
Unsafe characters in sequence names (non-ASCII or pipes) are sanitized to underscores for NCBI compatibility, with translations logged in renamed_headers.tsv.
Spaces and brackets are preserved correctly, allowing standard NCBI feature qualifiers like [Organism=...] to work as intended for non-standard viruses.
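
The header rules above can be approximated in a few lines of Python. This is an illustrative sketch, not ViralQC's implementation: the function names screen and sanitize are hypothetical, the thresholds are the example values quoted above, and the order of the checks (quality first, then duplicates) is an assumption.

```python
import re


def screen(records, min_len=150, max_n_frac=0.5):
    """Filter (seq_id, sequence) pairs; return (kept, dropped) with drop reasons."""
    kept, dropped, seen = [], {}, {}
    for seq_id, seq in records:
        n_frac = seq.upper().count("N") / len(seq) if seq else 1.0
        if len(seq) < min_len:
            dropped[seq_id] = f"length {len(seq)} < {min_len}"
        elif n_frac >= max_n_frac:
            dropped[seq_id] = f"N content {n_frac:.0%} >= {max_n_frac:.0%}"
        elif seq_id.lower() in seen:
            # NCBI treats IDs as case-insensitive, so keep only the first spelling.
            dropped[seq_id] = f"Case-insensitive duplicate of '{seen[seq_id.lower()]}'"
        else:
            seen[seq_id.lower()] = seq_id
            kept.append((seq_id, seq))
    return kept, dropped


def sanitize(name):
    """Replace pipes and non-ASCII characters with underscores; keep spaces and brackets."""
    return re.sub(r"[^\x00-\x7F]|\|", "_", name)


records = [
    ("SEQ001", "A" * 200),
    ("seq001", "C" * 200),              # clashes with SEQ001
    ("short", "ACGT"),                  # below the length threshold
    ("gappy", "N" * 150 + "A" * 100),   # 60% N
]
kept, dropped = screen(records)
print([seq_id for seq_id, _ in kept])
print(dropped)
print(sanitize("S\u00e3o|Paulo [Organism=Oropouche virus]"))
```

Only SEQ001 survives in this example, and the sanitized header keeps its spaces and the [Organism=...] qualifier intact.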

Batch Splitting

NCBI limits submissions to 3,000 sequences per file. When a virus group exceeds this limit, the sequences.fasta and metadata files are automatically split into numbered batches:

sequences.1.fasta, metadata.1.csv (or metadata.1.tsv for custom viruses) — first 2,999 sequences
sequences.2.fasta, metadata.2.csv (or metadata.2.tsv) — next 2,999 sequences
…and so on.

If a group has 2,999 or fewer sequences, the files are written normally without any suffix.
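
The batching rule can be sketched as follows. The function batch_plan is hypothetical and only mirrors the file naming and batch size described above; it returns index ranges rather than writing files.

```python
def batch_plan(n_sequences, batch_size=2999, metadata_ext="csv"):
    """Return [(fasta_name, metadata_name, start, stop)] for one virus group."""
    if n_sequences <= batch_size:
        # Small groups keep the plain file names, without a numeric suffix.
        return [("sequences.fasta", f"metadata.{metadata_ext}", 0, n_sequences)]
    plan = []
    for number, start in enumerate(range(0, n_sequences, batch_size), start=1):
        stop = min(start + batch_size, n_sequences)
        plan.append(
            (f"sequences.{number}.fasta", f"metadata.{number}.{metadata_ext}", start, stop)
        )
    return plan


for fasta, meta, start, stop in batch_plan(6000, metadata_ext="tsv"):
    print(fasta, meta, stop - start)
```

For 6,000 sequences this yields three batches of 2,999, 2,999 and 2 sequences.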

Python API

The preparation logic is also available as an importable Python class, PrepareSubmission, for use in third-party scripts or pipelines. Instead of reading metadata from a CSV file, the class accepts a list of Python dicts.

Installation

The class is available after installing viralqc as a package. No additional dependencies are required.

from viralqc.core import PrepareSubmission

Constructor

PrepareSubmission(
    viralqc_results,      # Path – ViralQC results file (.tsv, .csv or .json)
    viralqc_target_seq,   # Path – sequences_target_regions.fasta produced by vqc run
    viralqc_input_seq,    # Path – original input FASTA passed to vqc run
    samples_metadata,     # list[dict] – sample metadata (see below)
    output_prefix="ncbi_submission",  # str  – prefix for output directories
    split_by_segments=False,          # bool – split custom viruses by segment
    tbl_dir=None,                     # Path|None – folder with per-sample .tbl files
)

Each dict in samples_metadata uses the following keys:

Key	Description	Standard viruses	Custom viruses
`sample_id`	Must match `seqName` exactly. Max 24 characters.	Required	Required
`country`	Geographic location (maps to `geo_loc_name`).	Required	Optional
`host`	Natural host (e.g. `Homo sapiens`).	Required	Optional
`isolate`	Isolate name or identifier.	Required	Optional
`collection-date`	Collection date (`YYYY-MM-DD`).	Required	Optional
`isolation-source`	Source material (e.g. `Serum`).	Required	Optional
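
If the metadata already exists as a CSV in the format used by the command line, it can be converted into this list of dicts. The sketch below assumes that Sequence_ID corresponds to sample_id and geo_loc_name to country, which is inferred from the two tables rather than stated explicitly; the helper name rows_to_samples_metadata is hypothetical, and whether extra optional keys are accepted by the class is not confirmed here.

```python
import csv
import io

# Assumed correspondence between CSV headers and dict keys (inferred from the tables).
CSV_TO_API = {"Sequence_ID": "sample_id", "geo_loc_name": "country"}


def rows_to_samples_metadata(handle):
    """Convert a metadata CSV stream into a list of dicts keyed for samples_metadata."""
    return [
        {CSV_TO_API.get(key, key): value for key, value in row.items()}
        for row in csv.DictReader(handle)
    ]


csv_text = (
    "Sequence_ID,geo_loc_name,host,isolate,collection-date,isolation-source\n"
    "S001,Brazil,Homo sapiens,isolate/S001/2024,2024-01-01,Serum\n"
)
samples_metadata = rows_to_samples_metadata(io.StringIO(csv_text))
print(samples_metadata)
```

The resulting list can then be passed as the samples_metadata argument of the constructor.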

Methods

`run_virus(virus="all", virus_name=None)`

Prepares submission packages grouped by virus type. Equivalent to vqc prepare-ncbi-submission virus <subcommand>.

virus: "all" (default), "sars-cov-2", "dengue", "influenza", "norovirus", or "custom".
virus_name: required when virus="custom".

`run_sample(samples=["all"])`

Prepares packages for specific samples or all samples. Equivalent to vqc prepare-ncbi-submission sample.

samples: a list of sample IDs, or ["all"] to process every sample.

Return Value

Both methods return a list of dicts, one entry per generated output directory:

[
    {
        "SARS-CoV-2": {
            "sequences":  [Path("ncbi_submission_SARS-CoV-2/sequences.fasta")],
            "metadata":   [Path("ncbi_submission_SARS-CoV-2/metadata.csv")],
            "log":         Path("ncbi_submission_SARS-CoV-2/summary.txt"),
        }
    },
    {
        "Oropouche virus": {
            "sequences":  [Path("ncbi_submission_Oropouche_virus/sequences.fasta")],
            "metadata":   [Path("ncbi_submission_Oropouche_virus/metadata.tsv")],
            "log":         Path("ncbi_submission_Oropouche_virus/summary.txt"),
            "annotation": [Path("ncbi_submission_Oropouche_virus/annotation.tbl")],
        }
    },
]

The "annotation" key is only present for custom viruses that have TBL files.

For viruses organized into subdirectories (Influenza segments, Norovirus genogroups, custom viruses with split_by_segments=True), each subdirectory produces its own entry. The label uses a "Type/Subgroup" format, e.g. "InfluenzaA/HA" or "Norovirus/GII".
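
Because the labels may carry a subgroup, downstream scripts may want to split them and flatten the nested structure into rows. The following sketch works on a hand-written value shaped like the return value documented above; the function flatten is hypothetical and the paths are illustrative only.

```python
from pathlib import Path


def flatten(results):
    """Turn the nested return value into (group, subgroup, kind, path) tuples."""
    rows = []
    for entry in results:
        for label, files in entry.items():
            # "InfluenzaA/HA" -> ("InfluenzaA", "HA"); "SARS-CoV-2" -> ("SARS-CoV-2", None)
            group, _, subgroup = label.partition("/")
            subgroup = subgroup or None
            for kind in ("sequences", "metadata", "annotation"):
                for path in files.get(kind, []):  # "annotation" may be absent
                    rows.append((group, subgroup, kind, path))
            rows.append((group, subgroup, "log", files["log"]))
    return rows


example = [
    {
        "InfluenzaA/HA": {
            "sequences": [Path("ncbi_submission_InfluenzaA/HA/sequences.fasta")],
            "metadata": [Path("ncbi_submission_InfluenzaA/HA/metadata.csv")],
            "log": Path("ncbi_submission_InfluenzaA/HA/summary.txt"),
        }
    },
]
rows = flatten(example)
for row in rows:
    print(row)
```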

Examples

Process all viruses found in the results file

from pathlib import Path
from viralqc.core import PrepareSubmission

ps = PrepareSubmission(
    viralqc_results=Path("results.tsv"),
    viralqc_target_seq=Path("sequences_target_regions.fasta"),
    viralqc_input_seq=Path("sequences.fasta"),
    samples_metadata=[
        {
            "sample_id": "S001",
            "country": "Brazil",
            "host": "Homo sapiens",
            "isolate": "isolate/S001/2024",
            "collection-date": "2024-01-01",
            "isolation-source": "Serum",
        },
        {
            "sample_id": "S002",
            "country": "Colombia",
            "host": "Homo sapiens",
            "isolate": "isolate/S002/2024",
            "collection-date": "2024-02-15",
            "isolation-source": "Nasopharyngeal swab",
        },
    ],
)

results = ps.run_virus()  # process all virus groups
for entry in results:
    for virus_label, files in entry.items():
        print(f"{virus_label}:")
        for seq_path in files["sequences"]:
            print(f"  sequences → {seq_path}")
        for meta_path in files["metadata"]:
            print(f"  metadata  → {meta_path}")
        if "annotation" in files:
            for ann_path in files["annotation"]:
                print(f"  annotation → {ann_path}")

Process only specific sample IDs

results = ps.run_sample(samples=["S001"])

Process a custom virus with segment splitting

ps = PrepareSubmission(
    viralqc_results=Path("results.tsv"),
    viralqc_target_seq=Path("sequences_target_regions.fasta"),
    viralqc_input_seq=Path("sequences.fasta"),
    samples_metadata=[...],
    split_by_segments=True,
)
results = ps.run_virus(virus="custom", virus_name="Oropouche virus")
# Produces ncbi_submission_Oropouche_virus/S/, /M/, /L/ subdirectories