Preparing NCBI Submissions
viralQC includes a prepare-ncbi-submission command to organize sequences and generate metadata CSVs formatted for NCBI submission.
The command contains two sub-commands:
virus: Groups and organizes sequences by virus (or types/segments).
sample: Organizes sequences by virus, but only for individual sample IDs.
Both sub-commands require the output files generated by a viralQC run:
--results: The main ViralQC results file (e.g. results.tsv).
--sequences-vqc: The output target FASTA file (e.g. sequences_target_regions.fasta) generated by viralQC run. This file contains target sequences processed by ViralQC and is prioritized.
--sequences-input: The original input FASTA file you passed to viralQC run. (Note: The command prioritizes sequences found in --sequences-vqc. If a sequence was filtered or dropped by ViralQC (for lack of quality information) but you still want to submit it, it will be pulled from --sequences-input.)
--output-prefix (optional, default ncbi_submission): Prefix used for the generated output directories.
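The prioritization rule described for --sequences-vqc and --sequences-input can be pictured with a small sketch. This is not viralQC's code: the function name select_sequences and the use of plain dicts (sequence ID to sequence) in place of parsed FASTA files are assumptions made for illustration.

```python
def select_sequences(wanted_ids, vqc_seqs, input_seqs):
    """Pick each wanted sequence, preferring the ViralQC target FASTA.

    Hypothetical helper: both FASTA files are modelled as dicts that map
    a sequence ID to its sequence string.
    """
    selected, missing = {}, []
    for seq_id in wanted_ids:
        if seq_id in vqc_seqs:
            # Target-region sequence produced by ViralQC takes priority.
            selected[seq_id] = vqc_seqs[seq_id]
        elif seq_id in input_seqs:
            # Fall back to the original input sequence.
            selected[seq_id] = input_seqs[seq_id]
        else:
            missing.append(seq_id)
    return selected, missing


vqc = {"S001": "ACGTACGT"}
original = {"S001": "NNACGTACGTNN", "S002": "TTGACA"}
selected, missing = select_sequences(["S001", "S002", "S003"], vqc, original)
print(selected)
print(missing)
```

Here S001 comes from the ViralQC output, S002 falls back to the original input, and S003 is reported as missing because it appears in neither file.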
Grouping by Virus (virus)
When using the virus sub-command, the sequences are bundled into directories specific to the virus or segment.
vqc prepare-ncbi-submission virus [SUBCOMMAND] \
--results results.tsv \
--sequences-vqc sequences_target_regions.fasta
Supported Viruses
For Dengue, Influenza, Norovirus, and SARS-CoV-2, NCBI has specific submission requirements. Each requires specific columns in the metadata.csv file, and these subcommands format the metadata CSV accordingly.
SARS-CoV-2
vqc prepare-ncbi-submission virus sars-cov-2 \
--results results.tsv \
--sequences-vqc sequences_target_regions.fasta \
--sequences-input original_sequences.fasta \
--metadata input_metadata.csv
Organizes all SARS-CoV-2 sequences into ncbi_submission_SARS-CoV-2/.
Dengue
vqc prepare-ncbi-submission virus dengue \
--results results.tsv \
--sequences-vqc sequences_target_regions.fasta \
--sequences-input original_sequences.fasta \
--metadata input_metadata.csv
Organizes sequences by type, creating directories such as ncbi_submission_Dengue1/, ncbi_submission_Dengue2/, etc., depending on the serotypes identified in the ViralQC analysis.
Influenza
vqc prepare-ncbi-submission virus influenza \
--results results.tsv \
--sequences-vqc sequences_target_regions.fasta \
--sequences-input original_sequences.fasta \
--metadata input_metadata.csv
Organizes sequences by type with subdirectories for each segment, creating directories like ncbi_submission_InfluenzaA/HA/, ncbi_submission_InfluenzaA/NA/, ncbi_submission_InfluenzaB/HA/, etc.
Norovirus
vqc prepare-ncbi-submission virus norovirus \
--results results.tsv \
--sequences-vqc sequences_target_regions.fasta \
--sequences-input original_sequences.fasta \
--metadata input_metadata.csv
Organizes sequences by genogroup, creating subdirectories such as ncbi_submission_Norovirus/GI/, ncbi_submission_Norovirus/GII/, etc., depending on the genogroups identified in the ViralQC analysis. Supported genogroups are GI through GVI.
Custom Viruses
vqc prepare-ncbi-submission virus custom \
--virus-name "Respiratory syncytial virus A" \
--results results.tsv \
--sequences-vqc sequences_target_regions.fasta \
--sequences-input original_sequences.fasta \
--metadata input_metadata.csv
Organizes all sequences matching the given --virus-name into a single directory. Non-standard viruses automatically get the [Organism=...] qualifier added to their FASTA headers based on the virus_species identified. The name provided must match the value in the virus field of the results file.
If the virus has annotated segments (e.g., S, M, L), you can pass the --split-by-segments flag to organize sequences into per-segment subdirectories (e.g., ncbi_submission_Oropouche_virus/S/, ncbi_submission_Oropouche_virus/M/, ncbi_submission_Oropouche_virus/L/). Each subdirectory will contain its own sequences.fasta, metadata.tsv, annotation.tbl, and log files. Without this flag, all sequences are placed in a single flat directory.
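The directory names in the examples above suggest a simple naming pattern, sketched below. The helper submission_dir is hypothetical, and replacing spaces with underscores is an assumption inferred only from the "Oropouche virus" example; the real tool may normalize names differently.

```python
from pathlib import Path


def submission_dir(virus_name, segment=None, prefix="ncbi_submission"):
    """Build an output directory path (hypothetical helper).

    Assumption: spaces in the virus name become underscores, as in the
    ncbi_submission_Oropouche_virus example.
    """
    base = Path(f"{prefix}_{virus_name.replace(' ', '_')}")
    # With a segment, sequences go into a per-segment subdirectory.
    return base / segment if segment else base


for segment in ("S", "M", "L"):
    print(submission_dir("Oropouche virus", segment).as_posix())
print(submission_dir("Oropouche virus").as_posix())
```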
Additionally, for custom viruses, you can pass --tbl-dir pointing to a folder with per-sample .tbl annotation files. These will be concatenated into a single annotation.tbl file alongside the FASTA sequences. If you don't provide this option, the command will try to locate each .tbl file using the tbl_path field in the results file.
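As a rough picture of what concatenating per-sample .tbl files involves, here is a minimal sketch. It is not viralQC's implementation: the function name concatenate_tbl, the file names S001.tbl and S002.tbl, and the choice to process files in sorted name order are all assumptions.

```python
from pathlib import Path
import tempfile


def concatenate_tbl(tbl_dir, output_path):
    """Join every .tbl file in tbl_dir into one file (hypothetical helper).

    Files are read in sorted name order (an assumption) and each part is
    terminated with a newline so records do not run together.
    """
    parts = []
    for tbl_file in sorted(Path(tbl_dir).glob("*.tbl")):
        text = tbl_file.read_text()
        parts.append(text if text.endswith("\n") else text + "\n")
    Path(output_path).write_text("".join(parts))
    return len(parts)


with tempfile.TemporaryDirectory() as tmp:
    tbl_dir = Path(tmp) / "tbl"
    tbl_dir.mkdir()
    # Two toy feature tables; the second lacks a trailing newline.
    (tbl_dir / "S001.tbl").write_text(">Feature S001\n1\t300\tgene\n")
    (tbl_dir / "S002.tbl").write_text(">Feature S002\n1\t450\tgene")
    out = Path(tmp) / "annotation.tbl"
    n_files = concatenate_tbl(tbl_dir, out)
    combined = out.read_text()

print(n_files)
print(combined)
```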
Grouping by Sample (sample)
If you prefer to organize submissions just for specific samples, or for all samples independently of the virus, use the sample sub-command.
vqc prepare-ncbi-submission sample \
--sample <sample_id_1> \
--sample <sample_id_2> \
--results results.tsv \
--sequences-vqc sequences_target_regions.fasta \
--sequences-input original_sequences.fasta \
--metadata input_metadata.csv
You can pass multiple --sample options, or provide a text file with one ID per line via --sample-ids samples.txt.
To process all samples present in the results file, simply use:
vqc prepare-ncbi-submission sample --sample all ...
To process samples from a list of sample IDs stored in a file, use the --sample-ids option:
vqc prepare-ncbi-submission sample --sample-ids samples.txt ...
This creates one directory per virus (e.g., ncbi_submission_Dengue1/). If a sequence was skipped or lacked data (commonly .tbl files), it is reported in a [prefix]_skipped.tsv file.
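When driving the sample sub-command from a script, the repeated --sample options can be assembled programmatically. The sketch below only builds and prints the command line and does not run it. The helpers read_sample_ids and build_sample_command are hypothetical, and skipping blank lines is a choice made here, not documented behavior of --sample-ids.

```python
import shlex


def read_sample_ids(text):
    """Return stripped, non-empty IDs from text with one ID per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_sample_command(sample_ids):
    """Assemble the argument list using only flags shown in this document."""
    cmd = ["vqc", "prepare-ncbi-submission", "sample"]
    for sample_id in sample_ids:
        cmd += ["--sample", sample_id]  # one --sample option per ID
    cmd += [
        "--results", "results.tsv",
        "--sequences-vqc", "sequences_target_regions.fasta",
        "--sequences-input", "original_sequences.fasta",
        "--metadata", "input_metadata.csv",
    ]
    return cmd


ids = read_sample_ids("S001\n\nS002\n")
command = build_sample_command(ids)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run without shell quoting concerns.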
Metadata CSV
For all submission commands, you can (or must, for predefined viruses) provide an input metadata CSV file via --metadata.
Input Columns
The input CSV can contain the following columns. Their necessity depends on the virus type being processed:
Required Columns
| Column | Description | Dengue, Influenza & Norovirus | SARS-CoV-2 | Custom Viruses |
|---|---|---|---|---|
| Sequence_ID | Must match the … | Required | Required | Required |
| geo_loc_name | The geographical location of the sample (e.g., Country). | Required | Required | Optional |
| host | The natural host of the virus (e.g., …). | Required | Required | Optional |
| isolate | The isolate name or identifier string. | Required | Required | Optional |
| collection-date | The date the sample was collected, typically using the … | Required | Required | Optional |
| isolation-source | The source material of the sample (e.g., …). | Required | Ignored | Optional |
Note: For Custom viruses, creating the --metadata file itself is completely optional. If provided, only Sequence_ID must be present, and the other data columns will be included if you add them.
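Before running the command, it can help to check that a metadata CSV carries the columns required for the chosen virus. The sketch below is a hypothetical pre-flight check, not part of viralQC. The column names are an assumption: Sequence_ID comes from the note above, and the others are taken from the input names listed in the Output Format renaming rules.

```python
import csv
import io

# Assumed input column names (see the lead-in); adjust if your CSV differs.
BASE = ["Sequence_ID", "geo_loc_name", "host", "isolate", "collection-date"]
REQUIRED = {
    "dengue": BASE + ["isolation-source"],
    "influenza": BASE + ["isolation-source"],
    "norovirus": BASE + ["isolation-source"],
    "sars-cov-2": BASE,          # isolation-source is ignored for SARS-CoV-2
    "custom": ["Sequence_ID"],   # everything else is optional
}


def missing_columns(csv_text, virus):
    """Return the required columns absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = set(reader.fieldnames or [])
    return [col for col in REQUIRED[virus] if col not in present]


csv_text = (
    "Sequence_ID,geo_loc_name,host,isolate,collection-date\n"
    "S001,Brazil,Homo sapiens,isolate/S001/2024,2024-01-01\n"
)
print(missing_columns(csv_text, "sars-cov-2"))  # → []
print(missing_columns(csv_text, "dengue"))      # → ['isolation-source']
```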
Optional Columns — Standard Viruses (Dengue, Influenza, Norovirus, SARS-CoV-2)
Any of the following INSDC source modifiers can be added to the metadata CSV. If present, they will be automatically included in the output metadata.csv:
| Column (CSV header) | Description |
|---|---|
| | Altitude in metres above or below sea level where the sample was collected. |
| | Name of the person who collected the sample. |
| | Institution code and culture ID (format: …). |
| | Haplotype of the organism. |
| | Laboratory host used to propagate the organism. |
| | Latitude and longitude in decimal degrees (e.g., …). |
| | Any additional free-text information about the sequence. |
| | Name of the viral or phage segment sequenced. |
| | Sex of the organism from which the sequence was obtained. |
| | Institutional identifier for the source specimen. |
| | Strain of the organism. |
| | Type of tissue from which the sequence was obtained. |
Optional Columns — Custom Viruses
For custom viruses the metadata is a tab-delimited file and the column names follow the BankIt Title_Case convention. The following columns can be added to the input CSV using the standard lower-case names — they will be automatically renamed in the output:
| Input column | Output column | Description |
|---|---|---|
| | | Altitude in metres. |
| | | Biological material identifier. |
| | | Named breed (usually for domesticated mammals). |
| | | Cell line from which the sequence was obtained. |
| | | Type of cell. |
| | | Clone from which the sequence was obtained. |
| | | Person who collected the sample. |
| | | Culture collection identifier. |
| | | Developmental stage of the organism. |
| | | Named ecotype. |
| | | Name of forward PCR primer. |
| | | Sequence of forward PCR primer. |
| | | Genotype of the organism. |
| | | Haplogroup of the organism. |
| | | Haplotype of the organism. |
| | | Laboratory host used to propagate the organism. |
| | | Latitude and longitude in decimal degrees. |
| | | Free-text additional information. |
| | | Name of reverse PCR primer. |
| | | Sequence of reverse PCR primer. |
| | | Viral or phage segment sequenced. |
| | | Serological variety. |
| | | Serological variety (prokaryote). |
| | | Sex of the organism. |
| | | Specimen voucher identifier. |
| | | Strain of the organism. |
| | | Subspecies. |
| | | Tissue library. |
| | | Type of tissue. |
| | | Variety of the organism. |
Note: Columns not listed above are silently ignored and will not appear in the output metadata file.
Output Format
The prepare-ncbi-submission command will take your input CSV and generate a final metadata file inside each submission directory. For predefined viruses (SARS-CoV-2, Dengue, Influenza, Norovirus), a comma-delimited metadata.csv is generated. For custom viruses, a tab-delimited metadata.tsv is generated instead. The tool automatically enriches this file with taxonomic and typing data derived from the viralQC results:
Influenza: Adds serotype (e.g., H1N1 or H3N2) extracted directly from the classification.
Dengue: Adds genotype (e.g., 1, 2, 3, or 4) and serotype (the detailed clade assignment).
Norovirus: Adds genotype (e.g., GII, GII.17) extracted from the virus classification. No serotype column is included.
SARS-CoV-2: Removes isolation-source and serotype as they are not typically included in SC2 NCBI submissions.
Custom Viruses: Renames columns to NCBI-compatible names: geo_loc_name → Country (geo_loc_name), host → Host, isolate → Isolate, collection-date → Collection_date, isolation-source → Isolation_source. No Organism column is added (the organism information is included in the FASTA headers as [Organism=...]).
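The custom-virus renaming can be illustrated with a short sketch that converts input rows into tab-delimited text. The rename map is copied from the list above; the function to_custom_tsv is hypothetical, and leaving Sequence_ID unchanged is an assumption, since the source does not say how that column appears in the output.

```python
import csv
import io

# Rename map taken from the custom-virus rules described above.
RENAMES = {
    "geo_loc_name": "Country (geo_loc_name)",
    "host": "Host",
    "isolate": "Isolate",
    "collection-date": "Collection_date",
    "isolation-source": "Isolation_source",
}


def to_custom_tsv(rows):
    """Render a list of dicts as TSV text with renamed headers.

    Assumption: columns without an entry in RENAMES keep their name.
    """
    if not rows:
        return ""
    columns = list(rows[0])
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([RENAMES.get(col, col) for col in columns])
    for row in rows:
        writer.writerow([row.get(col, "") for col in columns])
    return out.getvalue()


rows = [
    {
        "Sequence_ID": "S001",
        "geo_loc_name": "Brazil",
        "host": "Homo sapiens",
        "collection-date": "2024-01-01",
    }
]
tsv_text = to_custom_tsv(rows)
print(tsv_text)
```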
FASTA Headers and Annotations
FASTA headers are carefully managed during organization:
Sequences failing quality thresholds (e.g. length < 150nt, or N content ≥ 50%) are excluded from the FASTA. The reason for each exclusion is recorded in the summary.txt file inside the relevant output directory.
Case-insensitive duplicate sequence IDs are automatically detected and removed. NCBI treats sequence IDs as case-insensitive: for example, SEQ001 and seq001 are considered identical by NCBI and would cause an upload error. When a clash is detected, the second occurrence is dropped and the reason is logged in summary.txt:
dropped: 1 sequence(s) - seq001: Case-insensitive duplicate of 'SEQ001' (NCBI treats IDs as case-insensitive)
Unsafe characters in sequence names (non-ASCII or pipes) are sanitized to underscores for NCBI compatibility, with translations logged in renamed_headers.tsv.
Spaces and brackets are preserved correctly, allowing standard NCBI feature qualifiers like [Organism=...] to work as intended for non-standard viruses.
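The header rules above can be combined into one screening pass, sketched below. This is an illustration rather than viralQC's implementation: the function names, the order in which the checks run, and the shape of the returned lists are assumptions. The thresholds (150 nt, 50% N) and the pipe/non-ASCII rule come from the text.

```python
import re

MIN_LENGTH = 150       # sequences shorter than this are excluded
MAX_N_FRACTION = 0.5   # sequences with N content >= 50% are excluded


def sanitize(seq_id):
    """Replace pipes and non-ASCII characters with underscores."""
    return re.sub(r"[^\x00-\x7F]|\|", "_", seq_id)


def screen(records):
    """Screen (seq_id, sequence) pairs; return (kept, dropped, renamed)."""
    kept, dropped, renamed = [], [], []
    seen = {}  # lower-cased ID -> first ID kept with that spelling
    for seq_id, seq in records:
        clean = sanitize(seq_id)
        if clean != seq_id:
            renamed.append((seq_id, clean))
        n_fraction = seq.upper().count("N") / len(seq) if seq else 1.0
        if len(seq) < MIN_LENGTH:
            dropped.append((clean, "length < 150 nt"))
        elif n_fraction >= MAX_N_FRACTION:
            dropped.append((clean, "N content >= 50%"))
        elif clean.lower() in seen:
            first = seen[clean.lower()]
            dropped.append((clean, f"Case-insensitive duplicate of '{first}'"))
        else:
            seen[clean.lower()] = clean
            kept.append((clean, seq))
    return kept, dropped, renamed


records = [
    ("SEQ001", "ACGT" * 50),            # 200 nt, kept
    ("seq001", "ACGT" * 50),            # clashes with SEQ001
    ("short", "ACGT"),                  # 4 nt, too short
    ("gappy", "N" * 150 + "A" * 100),   # 60% N
    ("sample|\u00e9", "ACGT" * 50),     # pipe and non-ASCII character
]
kept, dropped, renamed = screen(records)
print([seq_id for seq_id, _ in kept])
print(dropped)
print(renamed)
```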
Batch Splitting
NCBI limits submissions to 3,000 sequences per file. When a virus group exceeds this limit, the sequences.fasta and metadata files are automatically split into numbered batches:
sequences.1.fasta, metadata.1.csv (or metadata.1.tsv for custom viruses): first 2,999 sequences
sequences.2.fasta, metadata.2.csv (or metadata.2.tsv): next 2,999 sequences
…and so on.
If a group has 2,999 or fewer sequences, the files are written normally without any suffix.
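The batch naming scheme can be worked through with a small sketch. The helper batch_filenames is hypothetical and only computes file names and per-batch counts using the 2,999-sequence batch size stated above; it does not write any files.

```python
BATCH_SIZE = 2999  # sequences per batch, as stated above


def batch_filenames(n_sequences, metadata_ext="csv", batch_size=BATCH_SIZE):
    """Return (fasta_name, metadata_name, count) tuples for a virus group."""
    if n_sequences <= batch_size:
        # Small groups are written without a numeric suffix.
        return [("sequences.fasta", f"metadata.{metadata_ext}", n_sequences)]
    batches = []
    starts = range(0, n_sequences, batch_size)
    for index, start in enumerate(starts, start=1):
        count = min(batch_size, n_sequences - start)
        batches.append(
            (f"sequences.{index}.fasta", f"metadata.{index}.{metadata_ext}", count)
        )
    return batches


print(batch_filenames(2999))
for batch in batch_filenames(6500, metadata_ext="tsv"):
    print(batch)
```

For 6,500 sequences this yields three batches of 2,999, 2,999, and 502 sequences.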
Python API
The preparation logic is also available as an importable Python class, PrepareSubmission, for use in third-party scripts or pipelines. Instead of reading metadata from a CSV file, the class accepts a list of Python dicts.
Installation
The class is available after installing viralqc as a package. No additional dependencies are required.
from viralqc.core import PrepareSubmission
Constructor
PrepareSubmission(
    viralqc_results,                  # Path – ViralQC results file (.tsv, .csv or .json)
    viralqc_target_seq,               # Path – sequences_target_regions.fasta produced by vqc run
    viralqc_input_seq,                # Path – original input FASTA passed to vqc run
    samples_metadata,                 # list[dict] – sample metadata (see below)
    output_prefix="ncbi_submission",  # str – prefix for output directories
    split_by_segments=False,          # bool – split custom viruses by segment
    tbl_dir=None,                     # Path|None – folder with per-sample .tbl files
)
Each dict in samples_metadata uses the following keys:
| Key | Description | Standard viruses | Custom viruses |
|---|---|---|---|
| sample_id | Must match … | Required | Required |
| country | Geographic location (maps to …). | Required | Optional |
| host | Natural host (e.g. …). | Required | Optional |
| isolate | Isolate name or identifier. | Required | Optional |
| collection-date | Collection date (…). | Required | Optional |
| isolation-source | Source material (e.g. …). | Required | Optional |
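If your metadata already lives in a CSV prepared for the command line, it can be converted into the list of dicts the class expects. The sketch below is a hypothetical adapter. The mapping of Sequence_ID to sample_id and geo_loc_name to country is an assumption based on the key names used in the examples in this document; verify it against your installed version.

```python
import csv
import io

# Assumed correspondence between CSV headers and Python API keys.
CSV_TO_KEY = {"Sequence_ID": "sample_id", "geo_loc_name": "country"}


def csv_to_samples_metadata(csv_text):
    """Convert CSV text into a list of dicts, dropping empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {CSV_TO_KEY.get(col, col): value for col, value in row.items() if value}
        for row in reader
    ]


csv_text = (
    "Sequence_ID,geo_loc_name,host,isolate,collection-date,isolation-source\n"
    "S001,Brazil,Homo sapiens,isolate/S001/2024,2024-01-01,Serum\n"
    "S002,Colombia,Homo sapiens,isolate/S002/2024,2024-02-15,\n"
)
samples_metadata = csv_to_samples_metadata(csv_text)
for sample in samples_metadata:
    print(sample)
```

The second row has an empty isolation-source cell, so that key is simply omitted from its dict.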
Methods
run_virus(virus="all", virus_name=None)
Prepares submission packages grouped by virus type. Equivalent to vqc prepare-ncbi-submission virus <subcommand>.
virus: "all" (default), "sars-cov-2", "dengue", "influenza", "norovirus", or "custom".
virus_name: required when virus="custom".
run_sample(samples=["all"])
Prepares packages for specific samples or all samples. Equivalent to vqc prepare-ncbi-submission sample.
samples: a list of sample IDs, or ["all"] to process every sample.
Return Value
Both methods return a list of dicts, one entry per generated output directory:
[
    {
        "SARS-CoV-2": {
            "sequences": [Path("ncbi_submission_SARS-CoV-2/sequences.fasta")],
            "metadata": [Path("ncbi_submission_SARS-CoV-2/metadata.csv")],
            "log": Path("ncbi_submission_SARS-CoV-2/summary.txt"),
        }
    },
    {
        "Oropouche virus": {
            "sequences": [Path("ncbi_submission_Oropouche_virus/sequences.fasta")],
            "metadata": [Path("ncbi_submission_Oropouche_virus/metadata.tsv")],
            "log": Path("ncbi_submission_Oropouche_virus/summary.txt"),
            "annotation": [Path("ncbi_submission_Oropouche_virus/annotation.tbl")],
        }
    },
]
The "annotation" key is only present for custom viruses that have TBL files.
For viruses organized into subdirectories (Influenza segments, Norovirus genogroups, custom viruses with split_by_segments=True), each subdirectory produces its own entry. The label uses a "Type/Subgroup" format, e.g. "InfluenzaA/HA" or "Norovirus/GII".
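Because labels may carry a "Type/Subgroup" suffix, downstream code often wants to regroup entries by virus type. The sketch below shows one way to do that. The helper index_by_type is hypothetical, and the results list is hand-written to mirror the documented shape; it is not real output from the tool.

```python
from pathlib import Path


def index_by_type(results):
    """Group entries by virus type, splitting 'Type/Subgroup' labels.

    Entries without a subgroup are stored under the key None.
    """
    grouped = {}
    for entry in results:
        for label, files in entry.items():
            virus_type, _, subgroup = label.partition("/")
            grouped.setdefault(virus_type, {})[subgroup or None] = files
    return grouped


def files_for(directory, metadata_name="metadata.csv"):
    """Build a files dict in the documented shape (illustrative data only)."""
    base = Path(directory)
    return {
        "sequences": [base / "sequences.fasta"],
        "metadata": [base / metadata_name],
        "log": base / "summary.txt",
    }


results = [
    {"InfluenzaA/HA": files_for("ncbi_submission_InfluenzaA/HA")},
    {"InfluenzaA/NA": files_for("ncbi_submission_InfluenzaA/NA")},
    {"SARS-CoV-2": files_for("ncbi_submission_SARS-CoV-2")},
]
grouped = index_by_type(results)
for virus_type, subgroups in grouped.items():
    print(virus_type, sorted(str(key) for key in subgroups))
```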
Examples
Process all viruses found in the results file
from pathlib import Path
from viralqc.core import PrepareSubmission
ps = PrepareSubmission(
    viralqc_results=Path("results.tsv"),
    viralqc_target_seq=Path("sequences_target_regions.fasta"),
    viralqc_input_seq=Path("sequences.fasta"),
    samples_metadata=[
        {
            "sample_id": "S001",
            "country": "Brazil",
            "host": "Homo sapiens",
            "isolate": "isolate/S001/2024",
            "collection-date": "2024-01-01",
            "isolation-source": "Serum",
        },
        {
            "sample_id": "S002",
            "country": "Colombia",
            "host": "Homo sapiens",
            "isolate": "isolate/S002/2024",
            "collection-date": "2024-02-15",
            "isolation-source": "Nasopharyngeal swab",
        },
    ],
)
results = ps.run_virus()  # process all virus groups
for entry in results:
    for virus_label, files in entry.items():
        print(f"{virus_label}:")
        for seq_path in files["sequences"]:
            print(f"  sequences → {seq_path}")
        for meta_path in files["metadata"]:
            print(f"  metadata → {meta_path}")
        # "annotation" is only present for custom viruses with TBL files
        if "annotation" in files:
            for ann_path in files["annotation"]:
                print(f"  annotation → {ann_path}")
Process only specific sample IDs
results = ps.run_sample(samples=["S001"])
Process a custom virus with segment splitting
ps = PrepareSubmission(
    viralqc_results=Path("results.tsv"),
    viralqc_target_seq=Path("sequences_target_regions.fasta"),
    viralqc_input_seq=Path("sequences.fasta"),
    samples_metadata=[...],
    split_by_segments=True,
)
results = ps.run_virus(virus="custom", virus_name="Oropouche virus")
# Produces ncbi_submission_Oropouche_virus/S/, /M/, /L/ subdirectories