Preparing NCBI Submissions

viralQC includes a prepare-ncbi-submission command to organize sequences and generate metadata CSVs formatted for NCBI submission.

The command contains two sub-commands:

  1. virus: Groups and organizes sequences by virus (or types/segments).

  2. sample: Organizes sequences by virus, but only for the sample IDs you select.

Both commands require the output files generated by a viralQC run:

  • --results: The main ViralQC results file (e.g. results.tsv).

  • --sequences-vqc: The target FASTA file (e.g. sequences_target_regions.fasta) generated by viralQC run. It contains the target sequences processed by ViralQC and takes priority over the original input.

  • --sequences-input: The original input FASTA file you passed to viralQC run. Sequences found in --sequences-vqc take priority; a sequence that ViralQC filtered out or dropped (for lack of quality information) but that you still want to submit is pulled from --sequences-input instead.

  • --output-prefix (optional, default ncbi_submission): Prefix used for the generated output directories.
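The priority rule between the two FASTA inputs can be pictured with a minimal sketch. This is not ViralQC's implementation: parse_fasta and merge_with_priority are hypothetical helpers, and matching records on the first whitespace-delimited token of the header is an assumption.

```python
def parse_fasta(text):
    """Parse FASTA text into an ordered {sequence_id: sequence} dict."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited token.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def merge_with_priority(vqc_records, input_records):
    """Prefer the ViralQC target sequence; fall back to the original input."""
    merged = dict(input_records)
    merged.update(vqc_records)  # entries from --sequences-vqc win
    return merged


vqc = parse_fasta(">S001\nACGT\n")
original = parse_fasta(">S001\nNNACGTNN\n>S002\nGGCC\n")
merged = merge_with_priority(vqc, original)
print(merged)  # S001 is taken from the VQC file, S002 from the original input
```

In this sketch S002 is absent from the VQC file, so it is the only record that falls back to the original input.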

Grouping by Virus (virus)

When using the virus sub-command, the sequences are bundled into directories specific to the virus or segment.

vqc prepare-ncbi-submission virus [SUBCOMMAND] \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta

Supported Viruses

NCBI has dedicated submission requirements for Dengue, Influenza, Norovirus, and SARS-CoV-2. Each expects particular columns in the metadata.csv file, and the matching sub-command formats the metadata CSV accordingly.

SARS-CoV-2

vqc prepare-ncbi-submission virus sars-cov-2 \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv

Organizes all SARS-CoV-2 sequences into ncbi_submission_SARS-CoV-2/.

Dengue

vqc prepare-ncbi-submission virus dengue \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv

Organizes sequences by type, creating directories such as ncbi_submission_Dengue1/, ncbi_submission_Dengue2/, etc., depending on the serotypes identified in the ViralQC analysis.

Influenza

vqc prepare-ncbi-submission virus influenza \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv

Organizes sequences by type with subdirectories for each segment, creating directories like ncbi_submission_InfluenzaA/HA/, ncbi_submission_InfluenzaA/NA/, ncbi_submission_InfluenzaB/HA/, etc.

Norovirus

vqc prepare-ncbi-submission virus norovirus \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv

Organizes sequences by genogroup, creating subdirectories such as ncbi_submission_Norovirus/GI/, ncbi_submission_Norovirus/GII/, etc., depending on the genogroups identified in the ViralQC analysis. Supported genogroups are GI through GVI.

Custom Viruses

vqc prepare-ncbi-submission virus custom \
    --virus-name "Respiratory syncytial virus A" \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv

Organizes all sequences matching the given --virus-name into a single directory. Non-standard viruses automatically get the [Organism=...] qualifier added to their FASTA headers, based on the virus_species identified. The name you provide must match the value in the virus field of the results file.

If the virus has annotated segments (e.g., S, M, L), you can pass the --split-by-segments flag to organize sequences into per-segment subdirectories (e.g., ncbi_submission_Oropouche_virus/S/, ncbi_submission_Oropouche_virus/M/, ncbi_submission_Oropouche_virus/L/). Each subdirectory will contain its own sequences.fasta, metadata.tsv, annotation.tbl, and log files. Without this flag, all sequences are placed in a single flat directory.

Additionally, for custom viruses, you can pass --tbl-dir pointing to a folder with per-sample .tbl annotation files. These are concatenated into a single annotation.tbl file placed alongside the FASTA sequences. If you omit this option, the command tries to locate each .tbl file through the tbl_path field in the results file.

Grouping by Sample (sample)

If you prefer to organize submissions just for specific samples, or for all samples independently of the virus, use the sample sub-command.

vqc prepare-ncbi-submission sample \
    --sample <sample_id_1> \
    --sample <sample_id_2> \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv

You can pass multiple --sample options, or provide a text file with one ID per line via --sample-ids samples.txt.

To process all samples present in the results file, simply use:

vqc prepare-ncbi-submission sample --sample all ...

To process the samples listed in a text file of sample IDs, use the --sample-ids option:

vqc prepare-ncbi-submission sample --sample-ids samples.txt ...

This creates one directory per virus (e.g., ncbi_submission_Dengue1/). If a sequence was skipped or lacked data (commonly .tbl files), it is reported in a [prefix]_skipped.tsv file.
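The text file expected by --sample-ids (one ID per line) can be produced from any Python list of IDs. The sketch below is only an illustration: write_sample_ids is a hypothetical helper, and dropping blank or repeated IDs is a choice made here, not documented ViralQC behaviour.

```python
import tempfile
from pathlib import Path


def write_sample_ids(sample_ids, path):
    """Write unique, non-empty sample IDs to a text file, one per line."""
    seen = set()
    unique_ids = []
    for sample_id in sample_ids:
        sample_id = sample_id.strip()
        if sample_id and sample_id not in seen:
            seen.add(sample_id)
            unique_ids.append(sample_id)
    Path(path).write_text("\n".join(unique_ids) + "\n", encoding="utf-8")
    return unique_ids


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "samples.txt"
    written = write_sample_ids(["S001", "S002", "S001", " "], out)
    content = out.read_text(encoding="utf-8")

print(written)
```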

Metadata CSV

For all submission commands, you can (or must, for predefined viruses) provide an input metadata CSV file via --metadata.

Input Columns

The input CSV can contain the following columns. Their necessity depends on the virus type being processed:

Required Columns

| Column | Description | Dengue, Influenza & Norovirus | SARS-CoV-2 | Custom Viruses |
| --- | --- | --- | --- | --- |
| Sequence_ID | Must match the seqName exactly as it appears in the results file and FASTA headers. Must be less than 25 characters long. | Required | Required | Required |
| geo_loc_name | The geographical location of the sample (e.g., Country). | Required | Required | Optional |
| host | The natural host of the virus (e.g., Homo sapiens). Do not use special characters. | Required | Required | Optional |
| isolate | The isolate name or identifier string. | Required | Required | Optional |
| collection-date | The date the sample was collected, typically using the YYYY-MM-DD format. | Required | Required | Optional |
| isolation-source | The source material of the sample (e.g., Serum, Swab). | Required | Ignored | Optional |

Note: For custom viruses, the --metadata file itself is optional. If you provide one, only Sequence_ID must be present; any other data columns you add are included in the output.
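Before running the command, it can help to check an input CSV against the requirements in the table above. The following is a minimal sketch under stated assumptions: check_metadata and the REQUIRED_COLUMNS grouping are hypothetical names, and the messages are not the ones ViralQC prints.

```python
import csv
import io

COMMON = ["Sequence_ID", "geo_loc_name", "host", "isolate", "collection-date"]
REQUIRED_COLUMNS = {
    "standard": COMMON + ["isolation-source"],  # Dengue, Influenza, Norovirus
    "sars-cov-2": COMMON,
    "custom": ["Sequence_ID"],
}


def check_metadata(csv_text, virus_kind):
    """Return a list of problems found in a metadata CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [
        f"missing column: {name}"
        for name in REQUIRED_COLUMNS[virus_kind]
        if name not in header
    ]
    if "Sequence_ID" in header:
        for row_number, row in enumerate(reader, start=1):
            seq_id = row["Sequence_ID"] or ""
            # Sequence_ID must be less than 25 characters long.
            if len(seq_id) >= 25:
                problems.append(
                    f"row {row_number}: Sequence_ID is {len(seq_id)} characters long"
                )
    return problems


example = (
    "Sequence_ID,geo_loc_name,host,isolate,collection-date\n"
    "S001,Brazil,Homo sapiens,isolate/S001/2024,2024-01-01\n"
)
print(check_metadata(example, "sars-cov-2"))  # []
print(check_metadata(example, "standard"))    # ['missing column: isolation-source']
```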

Optional Columns — Standard Viruses (Dengue, Influenza, Norovirus, SARS-CoV-2)

Any of the following INSDC source modifiers can be added to the metadata CSV. If present, they will be automatically included in the output metadata.csv:

| Column (CSV header) | Description |
| --- | --- |
| altitude | Altitude in metres above or below sea level where the sample was collected. |
| collected_by | Name of the person who collected the sample. |
| culture_collection | Institution code and culture ID (format: inst:coll:id). |
| haplotype | Haplotype of the organism. |
| lab_host | Laboratory host used to propagate the organism. |
| lat_lon | Latitude and longitude in decimal degrees (e.g., 15.77 S 47.93 W). |
| note | Any additional free-text information about the sequence. |
| segment | Name of the viral or phage segment sequenced. |
| sex | Sex of the organism from which the sequence was obtained. |
| specimen_voucher | Institutional identifier for the source specimen. |
| strain | Strain of the organism. |
| tissue_type | Type of tissue from which the sequence was obtained. |

Optional Columns — Custom Viruses

For custom viruses, the output metadata is a tab-delimited file whose column names follow the BankIt Title_Case convention. You can add the following columns to the input CSV under their standard lower-case names; they are renamed automatically in the output:

| Input column | Output column | Description |
| --- | --- | --- |
| altitude | Altitude | Altitude in metres. |
| bio_material | Bio_material | Biological material identifier. |
| breed | Breed | Named breed (usually for domesticated mammals). |
| cell_line | Cell_line | Cell line from which the sequence was obtained. |
| cell_type | Cell_type | Type of cell. |
| clone | Clone | Clone from which the sequence was obtained. |
| collected_by | Collected_by | Person who collected the sample. |
| culture_collection | Culture_collection | Culture collection identifier. |
| dev_stage | Dev_stage | Developmental stage of the organism. |
| ecotype | Ecotype | Named ecotype. |
| fwd_primer_name | Fwd_primer_name | Name of forward PCR primer. |
| fwd_primer_seq | Fwd_primer_seq | Sequence of forward PCR primer. |
| genotype | Genotype | Genotype of the organism. |
| haplogroup | Haplogroup | Haplogroup of the organism. |
| haplotype | Haplotype | Haplotype of the organism. |
| lab_host | Lab_host | Laboratory host used to propagate the organism. |
| lat_lon | Lat_Lon | Latitude and longitude in decimal degrees. |
| note | Note | Free-text additional information. |
| rev_primer_name | Rev_primer_name | Name of reverse PCR primer. |
| rev_primer_seq | Rev_primer_seq | Sequence of reverse PCR primer. |
| segment | Segment | Viral or phage segment sequenced. |
| serotype | Serotype | Serological variety. |
| serovar | Serovar | Serological variety (prokaryote). |
| sex | Sex | Sex of the organism. |
| specimen_voucher | Specimen_voucher | Specimen voucher identifier. |
| strain | Strain | Strain of the organism. |
| sub_species | Sub_species | Subspecies. |
| tissue_lib | Tissue_lib | Tissue library. |
| tissue_type | Tissue_type | Type of tissue. |
| variety | Variety | Variety of the organism. |

Note: Columns not listed above are silently ignored and will not appear in the output metadata file.
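The renaming pattern in the custom-virus table (first letter upper-cased, with Lat_Lon as the one exception) and the dropping of unlisted columns can be sketched as follows. to_bankit_name and rename_row are hypothetical helpers written for illustration; ViralQC may well use an explicit lookup table instead.

```python
# The only listed output name that plain first-letter capitalization gets wrong.
SPECIAL_NAMES = {"lat_lon": "Lat_Lon"}


def to_bankit_name(column):
    """Map a lower-case input column to its Title_Case output name."""
    if column in SPECIAL_NAMES:
        return SPECIAL_NAMES[column]
    return column[:1].upper() + column[1:]


def rename_row(row, allowed):
    """Rename the allowed columns of one metadata row and drop the rest."""
    return {
        to_bankit_name(name): value
        for name, value in row.items()
        if name in allowed
    }


row = {"lat_lon": "15.77 S 47.93 W", "strain": "X1", "favourite_colour": "blue"}
renamed = rename_row(row, allowed={"lat_lon", "strain"})
print(renamed)  # {'Lat_Lon': '15.77 S 47.93 W', 'Strain': 'X1'}
```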

Output Format

The prepare-ncbi-submission command will take your input CSV and generate a final metadata file inside each submission directory. For predefined viruses (SARS-CoV-2, Dengue, Influenza, Norovirus), a comma-delimited metadata.csv is generated. For custom viruses, a tab-delimited metadata.tsv is generated instead. The tool automatically enriches this file with taxonomic and typing data derived from the viralQC results:

  • Influenza: Adds serotype (e.g., H1N1 or H3N2) extracted directly from the classification.

  • Dengue: Adds genotype (e.g., 1, 2, 3, or 4) and serotype (the detailed clade assignment).

  • Norovirus: Adds genotype (e.g., GII, GII.17) extracted from the virus classification. No serotype column is included.

  • SARS-CoV-2: Removes isolation-source and serotype, as they are not typically included in SARS-CoV-2 NCBI submissions.

  • Custom Viruses: Renames columns to NCBI-compatible names: geo_loc_name → Country (geo_loc_name), host → Host, isolate → Isolate, collection-date → Collection_date, isolation-source → Isolation_source. No Organism column is added (the organism information is included in the FASTA headers as [Organism=...]).

FASTA Headers and Annotations

FASTA headers and sequences are checked and cleaned while the files are organized:

  • Sequences failing quality thresholds (e.g. length < 150nt, or N content ≥ 50%) are excluded from the FASTA. The reason for each exclusion is recorded in the summary.txt file inside the relevant output directory.

  • Case-insensitive duplicate sequence IDs are automatically detected and removed. NCBI treats sequence IDs as case-insensitive — for example, SEQ001 and seq001 are considered identical by NCBI and would cause an upload error. When a clash is detected, the second occurrence is dropped and the reason is logged in summary.txt:

    dropped: 1 sequence(s)
      - seq001: Case-insensitive duplicate of 'SEQ001' (NCBI treats IDs as case-insensitive)
    
  • Unsafe characters in sequence names (non-ASCII or pipes) are sanitized to underscores for NCBI compatibility, with translations logged in renamed_headers.tsv.

  • Spaces and brackets are preserved correctly, allowing standard NCBI feature qualifiers like [Organism=...] to work as intended for non-standard viruses.
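Two of the rules above, case-insensitive duplicate removal and sanitizing unsafe characters, can be illustrated with a short sketch. The helper names are hypothetical and the regular expression is an assumption based only on the description (pipes and non-ASCII characters become underscores); the actual implementation may differ.

```python
import re


def drop_case_insensitive_duplicates(seq_ids):
    """Keep the first occurrence of each ID, comparing case-insensitively."""
    first_seen = {}  # case-folded ID -> spelling of the first occurrence
    kept, dropped = [], []
    for seq_id in seq_ids:
        key = seq_id.casefold()
        if key in first_seen:
            dropped.append((seq_id, first_seen[key]))
        else:
            first_seen[key] = seq_id
            kept.append(seq_id)
    return kept, dropped


def sanitize_id(seq_id):
    """Replace pipes and non-ASCII characters with underscores."""
    return re.sub(r"[^\x00-\x7F]|\|", "_", seq_id)


kept, dropped = drop_case_insensitive_duplicates(["SEQ001", "seq001", "S2"])
print(kept)     # ['SEQ001', 'S2']
print(dropped)  # [('seq001', 'SEQ001')]
print(sanitize_id("gi|123|ref"))  # gi_123_ref
```

Spaces and square brackets fall inside the ASCII range and are not matched by the pattern, so a qualifier such as [Organism=...] passes through unchanged.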

Batch Splitting

NCBI limits submissions to 3,000 sequences per file. When a virus group exceeds this limit, the sequences.fasta and metadata files are automatically split into numbered batches:

  • sequences.1.fasta, metadata.1.csv (or metadata.1.tsv for custom viruses) — first 2,999 sequences

  • sequences.2.fasta, metadata.2.csv (or metadata.2.tsv) — next 2,999 sequences

  • …and so on.

If a group has 2,999 or fewer sequences, the files are written normally without any suffix.
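The batch naming described above can be sketched as a small planning function. batch_file_names is a hypothetical helper, and the batch size of 2,999 is taken from the text above; the sketch only computes file names and counts, it does not write any files.

```python
BATCH_SIZE = 2999  # batch size taken from the description above


def batch_file_names(n_sequences, metadata_ext="csv", batch_size=BATCH_SIZE):
    """Return (fasta_name, metadata_name, n_in_batch) for each output batch."""
    if n_sequences <= batch_size:
        # Small groups are written without a numeric suffix.
        return [("sequences.fasta", f"metadata.{metadata_ext}", n_sequences)]
    batches = []
    for index, start in enumerate(range(0, n_sequences, batch_size), start=1):
        count = min(batch_size, n_sequences - start)
        batches.append(
            (f"sequences.{index}.fasta", f"metadata.{index}.{metadata_ext}", count)
        )
    return batches


for names in batch_file_names(6000, metadata_ext="tsv"):
    print(names)
```

With 6,000 sequences this plan yields three batches of 2,999, 2,999 and 2 sequences.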

Python API

The preparation logic is also available as an importable Python class, PrepareSubmission, for use in third-party scripts or pipelines. Instead of reading metadata from a CSV file, the class accepts a list of Python dicts.

Installation

The class is available after installing viralqc as a package. No additional dependencies are required.

from viralqc.core import PrepareSubmission

Constructor

PrepareSubmission(
    viralqc_results,      # Path – ViralQC results file (.tsv, .csv or .json)
    viralqc_target_seq,   # Path – sequences_target_regions.fasta produced by vqc run
    viralqc_input_seq,    # Path – original input FASTA passed to vqc run
    samples_metadata,     # list[dict] – sample metadata (see below)
    output_prefix="ncbi_submission",  # str  – prefix for output directories
    split_by_segments=False,          # bool – split custom viruses by segment
    tbl_dir=None,                     # Path|None – folder with per-sample .tbl files
)

Each dict in samples_metadata uses the following keys:

| Key | Description | Standard viruses | Custom viruses |
| --- | --- | --- | --- |
| sample_id | Must match seqName exactly. Max 24 characters. | Required | Required |
| country | Geographic location (maps to geo_loc_name). | Required | Optional |
| host | Natural host (e.g. Homo sapiens). | Required | Optional |
| isolate | Isolate name or identifier. | Required | Optional |
| collection-date | Collection date (YYYY-MM-DD). | Required | Optional |
| isolation-source | Source material (e.g. Serum). | Required | Optional |
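If your metadata already lives in a CSV written for the command-line interface, it can be converted into the list of dicts expected here. This is a sketch under assumptions: csv_to_samples_metadata is a hypothetical helper, the Sequence_ID to sample_id and geo_loc_name to country mapping is inferred from the two tables in this document, and whether PrepareSubmission accepts keys beyond those listed above is not confirmed.

```python
import csv
import io

# Assumed mapping from CLI metadata CSV headers to the dict keys listed above.
CSV_TO_API_KEY = {"Sequence_ID": "sample_id", "geo_loc_name": "country"}


def csv_to_samples_metadata(csv_text):
    """Convert CLI-style metadata CSV text into a list of metadata dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {CSV_TO_API_KEY.get(column, column): value for column, value in row.items()}
        for row in reader
    ]


csv_text = (
    "Sequence_ID,geo_loc_name,host,isolate,collection-date,isolation-source\n"
    "S001,Brazil,Homo sapiens,isolate/S001/2024,2024-01-01,Serum\n"
)
samples_metadata = csv_to_samples_metadata(csv_text)
print(samples_metadata[0]["sample_id"], samples_metadata[0]["country"])  # S001 Brazil
```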

Methods

run_virus(virus="all", virus_name=None)

Prepares submission packages grouped by virus type. Equivalent to vqc prepare-ncbi-submission virus <subcommand>.

  • virus: "all" (default), "sars-cov-2", "dengue", "influenza", "norovirus", or "custom".

  • virus_name: required when virus="custom".

run_sample(samples=["all"])

Prepares packages for specific samples or all samples. Equivalent to vqc prepare-ncbi-submission sample.

  • samples: a list of sample IDs, or ["all"] to process every sample.

Return Value

Both methods return a list of dicts, one entry per generated output directory:

[
    {
        "SARS-CoV-2": {
            "sequences":  [Path("ncbi_submission_SARS-CoV-2/sequences.fasta")],
            "metadata":   [Path("ncbi_submission_SARS-CoV-2/metadata.csv")],
            "log":         Path("ncbi_submission_SARS-CoV-2/summary.txt"),
        }
    },
    {
        "Oropouche virus": {
            "sequences":  [Path("ncbi_submission_Oropouche_virus/sequences.fasta")],
            "metadata":   [Path("ncbi_submission_Oropouche_virus/metadata.tsv")],
            "log":         Path("ncbi_submission_Oropouche_virus/summary.txt"),
            "annotation": [Path("ncbi_submission_Oropouche_virus/annotation.tbl")],
        }
    },
]

The "annotation" key is only present for custom viruses that have TBL files.

For viruses organized into subdirectories (Influenza segments, Norovirus genogroups, custom viruses with split_by_segments=True), each subdirectory produces its own entry. The label uses a "Type/Subgroup" format, e.g. "InfluenzaA/HA" or "Norovirus/GII".
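Because sub-directory entries carry "Type/Subgroup" labels, a caller may want to regroup the returned list by virus type. The sketch below works on mock data shaped like the return value documented above; group_by_type and mock_files are hypothetical helpers, and the mock paths are illustrative only.

```python
from collections import defaultdict
from pathlib import Path


def group_by_type(results):
    """Group entries by the part of the label before the first '/'."""
    grouped = defaultdict(dict)
    for entry in results:
        for label, files in entry.items():
            # Labels without a '/' get an empty-string subgroup.
            virus_type, _, subgroup = label.partition("/")
            grouped[virus_type][subgroup] = files
    return dict(grouped)


def mock_files(directory):
    """Build a mock entry shaped like the documented return value."""
    base = Path(directory)
    return {
        "sequences": [base / "sequences.fasta"],
        "metadata": [base / "metadata.csv"],
        "log": base / "summary.txt",
    }


mock_results = [
    {"SARS-CoV-2": mock_files("ncbi_submission_SARS-CoV-2")},
    {"InfluenzaA/HA": mock_files("ncbi_submission_InfluenzaA/HA")},
    {"InfluenzaA/NA": mock_files("ncbi_submission_InfluenzaA/NA")},
]
grouped = group_by_type(mock_results)
print(sorted(grouped["InfluenzaA"]))  # ['HA', 'NA']
```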

Examples

Process all viruses found in the results file

from pathlib import Path
from viralqc.core import PrepareSubmission

ps = PrepareSubmission(
    viralqc_results=Path("results.tsv"),
    viralqc_target_seq=Path("sequences_target_regions.fasta"),
    viralqc_input_seq=Path("sequences.fasta"),
    samples_metadata=[
        {
            "sample_id": "S001",
            "country": "Brazil",
            "host": "Homo sapiens",
            "isolate": "isolate/S001/2024",
            "collection-date": "2024-01-01",
            "isolation-source": "Serum",
        },
        {
            "sample_id": "S002",
            "country": "Colombia",
            "host": "Homo sapiens",
            "isolate": "isolate/S002/2024",
            "collection-date": "2024-02-15",
            "isolation-source": "Nasopharyngeal swab",
        },
    ],
)

results = ps.run_virus()  # process all virus groups
for entry in results:
    for virus_label, files in entry.items():
        print(f"{virus_label}:")
        for seq_path in files["sequences"]:
            print(f"  sequences → {seq_path}")
        for meta_path in files["metadata"]:
            print(f"  metadata  → {meta_path}")
        if "annotation" in files:
            for ann_path in files["annotation"]:
                print(f"  annotation → {ann_path}")

Process only specific sample IDs

results = ps.run_sample(samples=["S001"])

Process a custom virus with segment splitting

# Reuses Path and PrepareSubmission imported in the first example.
ps = PrepareSubmission(
    viralqc_results=Path("results.tsv"),
    viralqc_target_seq=Path("sequences_target_regions.fasta"),
    viralqc_input_seq=Path("sequences.fasta"),
    samples_metadata=[...],
    split_by_segments=True,
)
results = ps.run_virus(virus="custom", virus_name="Oropouche virus")
# Produces ncbi_submission_Oropouche_virus/S/, /M/, /L/ subdirectories