# Preparing NCBI Submissions

`viralQC` includes a `prepare-ncbi-submission` command to organize sequences and generate metadata files formatted for NCBI submission.

The command contains two sub-commands:

1. `virus`: Groups and organizes sequences by virus (or by type/segment).
2. `sample`: Organizes sequences by virus, but only for the selected sample IDs.

Both sub-commands work on the output files generated by a `viralQC run` and accept the following options:

* `--results`: The main ViralQC results file (e.g. `results.tsv`).
* `--sequences-vqc`: The output target FASTA file (e.g. `sequences_target_regions.fasta`) generated by `viralQC run`. This file contains the target sequences processed by ViralQC and is prioritized.
* `--sequences-input`: The original input FASTA file you passed to `viralQC run`.
  *(Note: The command prioritizes sequences found in `--sequences-vqc`. If a sequence was filtered or dropped by ViralQC (for lack of quality information) but you still want to submit it, it will be pulled from `--sequences-input`.)*
* `--output-prefix` (optional, default `ncbi_submission`): Prefix used for the generated output directories.

## Grouping by Virus (`virus`)

When using the `virus` sub-command, the sequences are bundled into directories specific to the virus or segment.

```bash
vqc prepare-ncbi-submission virus [SUBCOMMAND] \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta
```

### Supported Viruses

For Dengue, Influenza, Norovirus, and SARS-CoV-2, NCBI has specific submission requirements. Each requires specific columns in the `metadata.csv` file, and these subcommands format the metadata CSV accordingly.

#### SARS-CoV-2

```bash
vqc prepare-ncbi-submission virus sars-cov-2 \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv
```

Organizes all SARS-CoV-2 sequences into `ncbi_submission_SARS-CoV-2/`.
#### Dengue

```bash
vqc prepare-ncbi-submission virus dengue \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv
```

Organizes sequences by type, creating directories such as `ncbi_submission_Dengue1/`, `ncbi_submission_Dengue2/`, etc., depending on the serotypes identified in the ViralQC analysis.

#### Influenza

```bash
vqc prepare-ncbi-submission virus influenza \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv
```

Organizes sequences by type with subdirectories for each segment, creating directories like `ncbi_submission_InfluenzaA/HA/`, `ncbi_submission_InfluenzaA/NA/`, `ncbi_submission_InfluenzaB/HA/`, etc.

#### Norovirus

```bash
vqc prepare-ncbi-submission virus norovirus \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv
```

Organizes sequences by genogroup, creating subdirectories such as `ncbi_submission_Norovirus/GI/`, `ncbi_submission_Norovirus/GII/`, etc., depending on the genogroups identified in the ViralQC analysis. Supported genogroups are GI through GVI.

#### Custom Viruses

```bash
vqc prepare-ncbi-submission virus custom \
    --virus-name "Respiratory syncytial virus A" \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv
```

Organizes all sequences matching the given `--virus-name` into a single directory. Non-standard viruses automatically get the `[Organism=...]` qualifier added to their FASTA headers based on the `virus_species` identified. The name provided must match the value in the `virus` field of the results file.
If the virus has annotated segments (e.g., S, M, L), you can pass the `--split-by-segments` flag to organize sequences into per-segment subdirectories (e.g., `ncbi_submission_Oropouche_virus/S/`, `ncbi_submission_Oropouche_virus/M/`, `ncbi_submission_Oropouche_virus/L/`). Each subdirectory will contain its own `sequences.fasta`, `metadata.tsv`, `annotation.tbl`, and log files. Without this flag, all sequences are placed in a single flat directory.

Additionally, for custom viruses, you can pass `--tbl-dir` pointing to a folder with per-sample `.tbl` annotation files. These will be concatenated into a single `annotation.tbl` file alongside the FASTA sequences. If you don't provide this option, the command will try to find each `.tbl` file based on the `tbl_path` field in the results file.

## Grouping by Sample (`sample`)

If you prefer to organize submissions only for specific samples, or for all samples regardless of the virus, use the `sample` sub-command.

```bash
vqc prepare-ncbi-submission sample \
    --sample [SAMPLE_ID_1] \
    --sample [SAMPLE_ID_2] \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv
```

You can pass multiple `--sample` options, or provide a text file with one ID per line via `--sample-ids samples.txt`.

To process **all** samples present in the results file, simply use:

```bash
vqc prepare-ncbi-submission sample --sample all ...
```

To process the samples listed in a file of sample IDs, use the `--sample-ids` option:

```bash
vqc prepare-ncbi-submission sample --sample-ids samples.txt ...
```

This creates one directory per virus (e.g., `ncbi_submission_Dengue1/`). If a sequence was skipped or lacked data (commonly `.tbl` files), it is reported in a `[prefix]_skipped.tsv` file.

## Metadata CSV

For all submission commands, you can (or must, for predefined viruses) provide an input metadata CSV file via `--metadata`.

### Input Columns

The input CSV can contain the following columns.
Their necessity depends on the virus type being processed:

#### Required Columns

| Column | Description | Dengue, Influenza & Norovirus | SARS-CoV-2 | Custom Viruses |
|--------|-------------|-------------------------------|------------|----------------|
| `Sequence_ID` | Must match the `seqName` exactly as it appears in the results file and FASTA headers. **Must be less than 25 characters long.** | **Required** | **Required** | **Required** |
| `geo_loc_name` | The geographical location of the sample (e.g., Country). | **Required** | **Required** | Optional |
| `host` | The natural host of the virus (e.g., `Homo sapiens`). Do not use special characters. | **Required** | **Required** | Optional |
| `isolate` | The isolate name or identifier string. | **Required** | **Required** | Optional |
| `collection-date` | The date the sample was collected, typically using the `YYYY-MM-DD` format. | **Required** | **Required** | Optional |
| `isolation-source` | The source material of the sample (e.g., `Serum`, `Swab`). | **Required** | *Ignored* | Optional |

*Note: For custom viruses, providing the `--metadata` file is entirely optional. If it is provided, only `Sequence_ID` must be present; the other data columns are included if you add them.*

#### Optional Columns: Standard Viruses (Dengue, Influenza, Norovirus, SARS-CoV-2)

Any of the following INSDC source modifiers can be added to the metadata CSV. If present, they will be automatically included in the output `metadata.csv`:

| Column (CSV header) | Description |
|---------------------|-------------|
| `altitude` | Altitude in metres above or below sea level where the sample was collected. |
| `collected_by` | Name of the person who collected the sample. |
| `culture_collection` | Institution code and culture ID (format: `inst:coll:id`). |
| `haplotype` | Haplotype of the organism. |
| `lab_host` | Laboratory host used to propagate the organism. |
| `lat_lon` | Latitude and longitude in decimal degrees (e.g., `15.77 S 47.93 W`). |
| `note` | Any additional free-text information about the sequence. |
| `segment` | Name of the viral or phage segment sequenced. |
| `sex` | Sex of the organism from which the sequence was obtained. |
| `specimen_voucher` | Institutional identifier for the source specimen. |
| `strain` | Strain of the organism. |
| `tissue_type` | Type of tissue from which the sequence was obtained. |

#### Optional Columns: Custom Viruses

For custom viruses, the output metadata is a tab-delimited file and the column names follow the BankIt Title_Case convention. The following columns can be added to the input CSV using the standard lower-case names; they will be automatically renamed in the output:

| Input column | Output column | Description |
|---|---|---|
| `altitude` | `Altitude` | Altitude in metres. |
| `bio_material` | `Bio_material` | Biological material identifier. |
| `breed` | `Breed` | Named breed (usually for domesticated mammals). |
| `cell_line` | `Cell_line` | Cell line from which the sequence was obtained. |
| `cell_type` | `Cell_type` | Type of cell. |
| `clone` | `Clone` | Clone from which the sequence was obtained. |
| `collected_by` | `Collected_by` | Person who collected the sample. |
| `culture_collection` | `Culture_collection` | Culture collection identifier. |
| `dev_stage` | `Dev_stage` | Developmental stage of the organism. |
| `ecotype` | `Ecotype` | Named ecotype. |
| `fwd_primer_name` | `Fwd_primer_name` | Name of forward PCR primer. |
| `fwd_primer_seq` | `Fwd_primer_seq` | Sequence of forward PCR primer. |
| `genotype` | `Genotype` | Genotype of the organism. |
| `haplogroup` | `Haplogroup` | Haplogroup of the organism. |
| `haplotype` | `Haplotype` | Haplotype of the organism. |
| `lab_host` | `Lab_host` | Laboratory host used to propagate the organism. |
| `lat_lon` | `Lat_Lon` | Latitude and longitude in decimal degrees. |
| `note` | `Note` | Free-text additional information. |
| `rev_primer_name` | `Rev_primer_name` | Name of reverse PCR primer. |
| `rev_primer_seq` | `Rev_primer_seq` | Sequence of reverse PCR primer. |
| `segment` | `Segment` | Viral or phage segment sequenced. |
| `serotype` | `Serotype` | Serological variety. |
| `serovar` | `Serovar` | Serological variety (prokaryote). |
| `sex` | `Sex` | Sex of the organism. |
| `specimen_voucher` | `Specimen_voucher` | Specimen voucher identifier. |
| `strain` | `Strain` | Strain of the organism. |
| `sub_species` | `Sub_species` | Subspecies. |
| `tissue_lib` | `Tissue_lib` | Tissue library. |
| `tissue_type` | `Tissue_type` | Type of tissue. |
| `variety` | `Variety` | Variety of the organism. |

> **Note:** Columns not listed above are silently ignored and will not appear in the output metadata file.

### Output Format

The `prepare-ncbi-submission` command takes your input CSV and generates a final metadata file inside each submission directory. For predefined viruses (SARS-CoV-2, Dengue, Influenza, Norovirus), a comma-delimited `metadata.csv` is generated. For custom viruses, a tab-delimited `metadata.tsv` is generated instead.

The tool automatically enriches this file with taxonomic and typing data derived from the `viralQC` results:

* **Influenza**: Adds `serotype` (e.g., `H1N1` or `H3N2`) extracted directly from the classification.
* **Dengue**: Adds `genotype` (e.g., `1`, `2`, `3`, or `4`) and `serotype` (the detailed clade assignment).
* **Norovirus**: Adds `genotype` (e.g., `GII`, `GII.17`) extracted from the virus classification. No `serotype` column is included.
* **SARS-CoV-2**: Removes `isolation-source` and `serotype`, as they are not typically included in SARS-CoV-2 NCBI submissions.
* **Custom Viruses**: Renames columns to NCBI-compatible names: `geo_loc_name` → `Country (geo_loc_name)`, `host` → `Host`, `isolate` → `Isolate`, `collection-date` → `Collection_date`, `isolation-source` → `Isolation_source`.
No `Organism` column is added (the organism information is included in the FASTA headers as `[Organism=...]`).

## FASTA Headers and Annotations

FASTA headers are carefully managed during organization:

* Sequences failing quality thresholds (e.g. length < 150 nt, or N content ≥ 50%) are excluded from the FASTA. The reason for each exclusion is recorded in the `summary.txt` file inside the relevant output directory.
* **Case-insensitive duplicate sequence IDs are automatically detected and removed.** NCBI treats sequence IDs as case-insensitive: for example, `SEQ001` and `seq001` are considered identical by NCBI and would cause an upload error. When a clash is detected, the **second** occurrence is dropped and the reason is logged in `summary.txt`:

  ```
  dropped: 1 sequence(s)
    - seq001: Case-insensitive duplicate of 'SEQ001' (NCBI treats IDs as case-insensitive)
  ```

* Unsafe characters in sequence names (non-ASCII or pipes) are sanitized to underscores for NCBI compatibility, with translations logged in `renamed_headers.tsv`.
* Spaces and brackets are preserved, allowing standard NCBI feature qualifiers like `[Organism=...]` to work as intended for non-standard viruses.

## Batch Splitting

NCBI limits submissions to 3,000 sequences per file. To stay below this limit, a virus group with more than 2,999 sequences has its `sequences.fasta` and metadata files automatically split into numbered batches of up to 2,999 sequences each:

* `sequences.1.fasta`, `metadata.1.csv` (or `metadata.1.tsv` for custom viruses): first 2,999 sequences
* `sequences.2.fasta`, `metadata.2.csv` (or `metadata.2.tsv`): next 2,999 sequences
* …and so on.

If a group has 2,999 or fewer sequences, the files are written normally without any suffix.

## Python API

The preparation logic is also available as an importable Python class, `PrepareSubmission`, for use in third-party scripts or pipelines. Instead of reading metadata from a CSV file, the class accepts a list of Python dicts.
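If your metadata already lives in a CSV that follows the CLI column names, it can be turned into that list of dicts with the standard library. The sketch below is only an illustration: `metadata_csv_to_dicts` and `CSV_TO_API_KEYS` are hypothetical names, and the two renames (`Sequence_ID` to `sample_id`, `geo_loc_name` to `country`) are inferred from the `samples_metadata` key table in this section rather than taken from the package. Whether extra columns are accepted by the class is not confirmed here; they are simply passed through.

```python
import csv
from pathlib import Path

# Assumed mapping from CLI-style CSV headers to the dict keys described in the
# samples_metadata table; this is an inference, not part of the viralqc package.
CSV_TO_API_KEYS = {
    "Sequence_ID": "sample_id",
    "geo_loc_name": "country",
}


def metadata_csv_to_dicts(csv_path):
    """Read a metadata CSV and return one dict per row using API-style keys."""
    rows = []
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            converted = {}
            for column, value in row.items():
                # DictReader uses a None key/value for ragged rows; skip those.
                if column is None or value is None:
                    continue
                value = value.strip()
                if not value:
                    continue  # leave empty cells out instead of passing ""
                converted[CSV_TO_API_KEYS.get(column, column)] = value
            rows.append(converted)
    return rows
```

The returned list could then be passed as the `samples_metadata` argument when constructing `PrepareSubmission`.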
### Installation

The class is available after installing `viralqc` as a package. No additional dependencies are required.

```python
from viralqc.core import PrepareSubmission
```

### Constructor

```python
PrepareSubmission(
    viralqc_results,                  # Path – ViralQC results file (.tsv, .csv or .json)
    viralqc_target_seq,               # Path – sequences_target_regions.fasta produced by vqc run
    viralqc_input_seq,                # Path – original input FASTA passed to vqc run
    samples_metadata,                 # list[dict] – sample metadata (see below)
    output_prefix="ncbi_submission",  # str – prefix for output directories
    split_by_segments=False,          # bool – split custom viruses by segment
    tbl_dir=None,                     # Path|None – folder with per-sample .tbl files
)
```

Each dict in `samples_metadata` uses the following keys:

| Key | Description | Standard viruses | Custom viruses |
|-----|-------------|-----------------|----------------|
| `sample_id` | Must match `seqName` exactly. **Max 24 characters.** | **Required** | **Required** |
| `country` | Geographic location (maps to `geo_loc_name`). | **Required** | Optional |
| `host` | Natural host (e.g. `Homo sapiens`). | **Required** | Optional |
| `isolate` | Isolate name or identifier. | **Required** | Optional |
| `collection-date` | Collection date (`YYYY-MM-DD`). | **Required** | Optional |
| `isolation-source` | Source material (e.g. `Serum`). | **Required** | Optional |

### Methods

#### `run_virus(virus="all", virus_name=None)`

Prepares submission packages grouped by virus type. Equivalent to `vqc prepare-ncbi-submission virus [SUBCOMMAND]`.

- `virus`: `"all"` (default), `"sars-cov-2"`, `"dengue"`, `"influenza"`, `"norovirus"`, or `"custom"`.
- `virus_name`: required when `virus="custom"`.

#### `run_sample(samples=["all"])`

Prepares packages for specific samples or all samples. Equivalent to `vqc prepare-ncbi-submission sample`.

- `samples`: a list of sample IDs, or `["all"]` to process every sample.
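Before calling either method, it can help to check the `samples_metadata` entries against the key table above. The following is a minimal sketch, not part of `viralqc`: `check_samples_metadata` is a hypothetical helper, and it assumes the rules stated in this document (required keys per virus category, a 24-character limit on `sample_id`, and NCBI treating IDs as case-insensitive).

```python
# Keys marked Required for standard viruses in the samples_metadata table.
REQUIRED_STANDARD = (
    "sample_id",
    "country",
    "host",
    "isolate",
    "collection-date",
    "isolation-source",
)
MAX_ID_LENGTH = 24  # documented maximum length of sample_id


def _is_blank(value):
    """True when a value is missing or only whitespace."""
    return value is None or not str(value).strip()


def check_samples_metadata(samples_metadata, custom_virus=False):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    seen = {}  # case-folded ID -> first spelling encountered
    required = ("sample_id",) if custom_virus else REQUIRED_STANDARD
    for index, entry in enumerate(samples_metadata):
        raw_id = entry.get("sample_id")
        sample_id = "" if _is_blank(raw_id) else str(raw_id).strip()
        label = sample_id or f"entry #{index}"
        for key in required:
            if _is_blank(entry.get(key)):
                problems.append(f"{label}: missing required key '{key}'")
        if len(sample_id) > MAX_ID_LENGTH:
            problems.append(
                f"{label}: sample_id is longer than {MAX_ID_LENGTH} characters"
            )
        if sample_id:
            folded = sample_id.casefold()
            if folded in seen:
                problems.append(
                    f"{label}: case-insensitive duplicate of '{seen[folded]}'"
                )
            else:
                seen[folded] = sample_id
    return problems
```

Returning a list of messages instead of raising on the first problem lets a pipeline report every issue in one pass before any files are written.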
### Return Value

Both methods return a list of dicts, one entry per generated output directory:

```python
[
    {
        "SARS-CoV-2": {
            "sequences": [Path("ncbi_submission_SARS-CoV-2/sequences.fasta")],
            "metadata": [Path("ncbi_submission_SARS-CoV-2/metadata.csv")],
            "log": Path("ncbi_submission_SARS-CoV-2/summary.txt"),
        }
    },
    {
        "Oropouche virus": {
            "sequences": [Path("ncbi_submission_Oropouche_virus/sequences.fasta")],
            "metadata": [Path("ncbi_submission_Oropouche_virus/metadata.tsv")],
            "log": Path("ncbi_submission_Oropouche_virus/summary.txt"),
            "annotation": [Path("ncbi_submission_Oropouche_virus/annotation.tbl")],
        }
    },
]
```

The `"annotation"` key is only present for custom viruses that have TBL files.

For viruses organized into subdirectories (Influenza segments, Norovirus genogroups, custom viruses with `split_by_segments=True`), each subdirectory produces its own entry. The label uses a `"Type/Subgroup"` format, e.g. `"InfluenzaA/HA"` or `"Norovirus/GII"`.

### Examples

#### Process all viruses found in the results file

```python
from pathlib import Path

from viralqc.core import PrepareSubmission

ps = PrepareSubmission(
    viralqc_results=Path("results.tsv"),
    viralqc_target_seq=Path("sequences_target_regions.fasta"),
    viralqc_input_seq=Path("sequences.fasta"),
    samples_metadata=[
        {
            "sample_id": "S001",
            "country": "Brazil",
            "host": "Homo sapiens",
            "isolate": "isolate/S001/2024",
            "collection-date": "2024-01-01",
            "isolation-source": "Serum",
        },
        {
            "sample_id": "S002",
            "country": "Colombia",
            "host": "Homo sapiens",
            "isolate": "isolate/S002/2024",
            "collection-date": "2024-02-15",
            "isolation-source": "Nasopharyngeal swab",
        },
    ],
)

results = ps.run_virus()  # process all virus groups

for entry in results:
    for virus_label, files in entry.items():
        print(f"{virus_label}:")
        for seq_path in files["sequences"]:
            print(f"  sequences → {seq_path}")
        for meta_path in files["metadata"]:
            print(f"  metadata → {meta_path}")
        if "annotation" in files:
            for ann_path in files["annotation"]:
                print(f"  annotation → {ann_path}")
```

#### Process only specific sample IDs

```python
results = ps.run_sample(samples=["S001"])
```

#### Process a custom virus with segment splitting

```python
ps = PrepareSubmission(
    viralqc_results=Path("results.tsv"),
    viralqc_target_seq=Path("sequences_target_regions.fasta"),
    viralqc_input_seq=Path("sequences.fasta"),
    samples_metadata=[...],
    split_by_segments=True,
)

results = ps.run_virus(virus="custom", virus_name="Oropouche virus")
# Produces ncbi_submission_Oropouche_virus/S/, /M/, /L/ subdirectories
```
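
#### Post-process the returned structure

Downstream scripts often want a flat view of the return value and a quick check that every listed file exists. The sketch below relies only on the documented shape of the return value (a list of single-key dicts with `sequences`, `metadata`, `log` and an optional `annotation` entry); `flatten_packages` and `missing_files` are hypothetical helper names and are not provided by `viralqc`.

```python
from pathlib import Path


def flatten_packages(results):
    """Turn the list of {label: files} dicts into one flat row per directory."""
    rows = []
    for entry in results:
        for label, files in entry.items():
            rows.append(
                {
                    "label": label,
                    "sequences": [Path(p) for p in files.get("sequences", [])],
                    "metadata": [Path(p) for p in files.get("metadata", [])],
                    # "annotation" is optional, so default to an empty list.
                    "annotation": [Path(p) for p in files.get("annotation", [])],
                    "log": files.get("log"),
                }
            )
    return rows


def missing_files(rows):
    """List every referenced path that does not exist on disk."""
    missing = []
    for row in rows:
        candidates = row["sequences"] + row["metadata"] + row["annotation"]
        if row["log"] is not None:
            candidates.append(Path(row["log"]))
        missing.extend(path for path in candidates if not path.exists())
    return missing
```

For instance, `missing_files(flatten_packages(results))` would return an empty list when every path reported by `run_virus()` or `run_sample()` is present on disk.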