# Preparing NCBI Submissions

`viralQC` includes a `prepare-ncbi-submission` command to organize sequences and generate metadata CSVs formatted for NCBI submission.

The command contains two sub-commands:
1. `virus`: Groups and organizes sequences by virus (or types/segments).
2. `sample`: Organizes sequences by virus, but only for the sample IDs you select.

Both sub-commands require the output files generated by `viralQC run`:
* `--results`: The main ViralQC results file (e.g. `results.tsv`).
* `--sequences-vqc`: The output target FASTA file (e.g. `sequences_target_regions.fasta`) generated by `viralQC run`. This file contains target sequences processed by ViralQC and is prioritized.
* `--sequences-input`: The original input FASTA file you passed to `viralQC run`.
  *(Note: The command prioritizes sequences found in `--sequences-vqc`. If ViralQC filtered out a sequence (for lack of quality information) but you still want to submit it, the sequence is pulled from `--sequences-input` instead.)*
* `--output-prefix` (optional, default `ncbi_submission`): Prefix used for the generated output directories.
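
The prioritization rule described in the note above can be sketched in Python as follows. This is only an illustration, not the tool's actual implementation: `read_fasta` and `pick_sequences` are hypothetical helpers, and the sketch assumes that the sequence ID is the first whitespace-delimited token of each FASTA header.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text into {sequence_id: sequence} (ID = first header token)."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def pick_sequences(wanted_ids, vqc_records, input_records):
    """Prefer the ViralQC target sequence; fall back to the original input."""
    picked, missing = {}, []
    for seq_id in wanted_ids:
        if seq_id in vqc_records:
            picked[seq_id] = ("sequences-vqc", vqc_records[seq_id])
        elif seq_id in input_records:
            picked[seq_id] = ("sequences-input", input_records[seq_id])
        else:
            missing.append(seq_id)
    return picked, missing


# Toy data: S001 exists in both files, S002 only in the original input.
vqc = read_fasta(StringIO(">S001\nACGT\nAC\n"))
original = read_fasta(StringIO(">S001\nNNACGTACNN\n>S002\nGGGTTT\n"))
picked, missing = pick_sequences(["S001", "S002", "S003"], vqc, original)
```

Here `S001` is taken from the ViralQC FASTA, `S002` falls back to the original input, and `S003` ends up in `missing` because it appears in neither file.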

## Grouping by Virus (`virus`)

When using the `virus` sub-command, the sequences are bundled into directories specific to the virus or segment.

```bash
vqc prepare-ncbi-submission virus [SUBCOMMAND] \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta
```

### Supported Viruses

NCBI has dedicated submission requirements for Dengue, Influenza, Norovirus, and SARS-CoV-2. Each virus requires its own set of columns in the `metadata.csv` file, and the corresponding subcommand formats the metadata CSV accordingly.

#### SARS-CoV-2
```bash
vqc prepare-ncbi-submission virus sars-cov-2 \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv
```

Organizes all SARS-CoV-2 sequences into `ncbi_submission_SARS-CoV-2/`.

#### Dengue
```bash
vqc prepare-ncbi-submission virus dengue \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv
```

Organizes sequences by type, creating directories such as `ncbi_submission_Dengue1/`, `ncbi_submission_Dengue2/`, etc., depending on the serotypes identified in the ViralQC analysis.

#### Influenza
```bash
vqc prepare-ncbi-submission virus influenza \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv
```

Organizes sequences by type with subdirectories for each segment, creating directories like `ncbi_submission_InfluenzaA/HA/`, `ncbi_submission_InfluenzaA/NA/`, `ncbi_submission_InfluenzaB/HA/`, etc.

#### Norovirus
```bash
vqc prepare-ncbi-submission virus norovirus \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv
```

Organizes sequences by genogroup, creating subdirectories such as `ncbi_submission_Norovirus/GI/`, `ncbi_submission_Norovirus/GII/`, etc., depending on the genogroups identified in the ViralQC analysis. Supported genogroups are GI through GVI.

#### Custom Viruses
```bash
vqc prepare-ncbi-submission virus custom \
    --virus-name "Respiratory syncytial virus A" \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv
```

Organizes all sequences matching the given `--virus-name` into a single directory. Non-standard viruses automatically get the `[Organism=...]` qualifier added to their FASTA headers based on the `virus_species` identified. The name you provide must match the value in the `virus` field of the results file.

If the virus has annotated segments (e.g., S, M, L), you can pass the `--split-by-segments` flag to organize sequences into per-segment subdirectories (e.g., `ncbi_submission_Oropouche_virus/S/`, `ncbi_submission_Oropouche_virus/M/`, `ncbi_submission_Oropouche_virus/L/`). Each subdirectory will contain its own `sequences.fasta`, `metadata.tsv`, `annotation.tbl`, and log files. Without this flag, all sequences are placed in a single flat directory.

Additionally, for custom viruses, you can pass `--tbl-dir` pointing to a folder with per-sample `.tbl` annotation files. These will be concatenated into a single `annotation.tbl` file alongside the FASTA sequences. If you don't provide this option, the command will try to locate each `.tbl` file using the `tbl_path` field in the results file.

## Grouping by Sample (`sample`)

If you prefer to organize submissions just for specific samples, or for all samples independently of the virus, use the `sample` sub-command.

```bash
vqc prepare-ncbi-submission sample \
    --sample <sample_id_1> \
    --sample <sample_id_2> \
    --results results.tsv \
    --sequences-vqc sequences_target_regions.fasta \
    --sequences-input original_sequences.fasta \
    --metadata input_metadata.csv
```
You can pass multiple `--sample` options, or provide a text file with one ID per line via `--sample-ids samples.txt`. 

To process **all** samples present in the results file, simply use:
```bash
vqc prepare-ncbi-submission sample --sample all ...
```

To process the sample IDs listed in a text file, use the `--sample-ids` option:
```bash
vqc prepare-ncbi-submission sample --sample-ids samples.txt ...
```

This creates one directory per virus (e.g., `ncbi_submission_Dengue1/`). If a sequence was skipped or lacked data (commonly `.tbl` files), it is reported in a `[prefix]_skipped.tsv` file.

## Metadata CSV

For all submission commands, you can (or must, for predefined viruses) provide an input metadata CSV file via `--metadata`.

### Input Columns

The input CSV can contain the following columns. Whether each one is required depends on the virus type being processed:

#### Required Columns

| Column | Description | Dengue, Influenza & Norovirus | SARS-CoV-2 | Custom Viruses |
|--------|-------------|-------------------------------|------------|----------------|
| `Sequence_ID` | Must match the `seqName` exactly as it appears in the results file and FASTA headers. **Must be less than 25 characters long.** | **Required** | **Required** | **Required** |
| `geo_loc_name` | The geographical location of the sample (e.g., Country). | **Required** | **Required** | Optional |
| `host` | The natural host of the virus (e.g., `Homo sapiens`). Do not use special characters. | **Required** | **Required** | Optional |
| `isolate` | The isolate name or identifier string. | **Required** | **Required** | Optional |
| `collection-date` | The date the sample was collected, typically using the `YYYY-MM-DD` format. | **Required** | **Required** | Optional |
| `isolation-source` | The source material of the sample (e.g., `Serum`, `Swab`). | **Required** | *Ignored* | Optional |

*Note: For custom viruses, the `--metadata` file itself is optional. If you provide one, only `Sequence_ID` must be present; the other columns are included in the output only if you add them.*
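
Before running the command, it can help to check a metadata CSV against the table above. The following is a minimal sketch that uses only the Python standard library. `check_metadata` and the `REQUIRED` mapping are hypothetical names, not part of `viralQC`, and the sketch only checks that the required columns exist and that each `Sequence_ID` is non-empty and shorter than 25 characters.

```python
import csv
from io import StringIO

# Required columns per virus kind, taken from the table above.
REQUIRED = {
    "standard": ["Sequence_ID", "geo_loc_name", "host", "isolate",
                 "collection-date", "isolation-source"],
    "sars-cov-2": ["Sequence_ID", "geo_loc_name", "host", "isolate",
                   "collection-date"],
    "custom": ["Sequence_ID"],
}
MAX_ID_LENGTH = 24  # "less than 25 characters"


def check_metadata(handle, virus_kind):
    """Return a list of human-readable problems found in a metadata CSV."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    problems = [f"missing column: {name}"
                for name in REQUIRED[virus_kind] if name not in columns]
    if "Sequence_ID" not in columns:
        return problems
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        seq_id = (row["Sequence_ID"] or "").strip()
        if not seq_id:
            problems.append(f"row {row_number}: empty Sequence_ID")
        elif len(seq_id) > MAX_ID_LENGTH:
            problems.append(
                f"row {row_number}: Sequence_ID longer than {MAX_ID_LENGTH} characters"
            )
    return problems


example = StringIO(
    "Sequence_ID,geo_loc_name,host,isolate,collection-date\n"
    "S001,Brazil,Homo sapiens,isolate/S001/2024,2024-01-01\n"
    "a_sequence_id_that_is_far_too_long,Brazil,Homo sapiens,x,2024-01-02\n"
)
problems = check_metadata(example, "standard")
```

For this example, `problems` reports the missing `isolation-source` column and the over-long ID on row 3.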

#### Optional Columns — Standard Viruses (Dengue, Influenza, Norovirus, SARS-CoV-2)

Any of the following INSDC source modifiers can be added to the metadata CSV. If present, they will be automatically included in the output `metadata.csv`:

| Column (CSV header) | Description |
|---------------------|-------------|
| `altitude` | Altitude in metres above or below sea level where the sample was collected. |
| `collected_by` | Name of the person who collected the sample. |
| `culture_collection` | Institution code and culture ID (format: `inst:coll:id`). |
| `haplotype` | Haplotype of the organism. |
| `lab_host` | Laboratory host used to propagate the organism. |
| `lat_lon` | Latitude and longitude in decimal degrees (e.g., `15.77 S 47.93 W`). |
| `note` | Any additional free-text information about the sequence. |
| `segment` | Name of the viral or phage segment sequenced. |
| `sex` | Sex of the organism from which the sequence was obtained. |
| `specimen_voucher` | Institutional identifier for the source specimen. |
| `strain` | Strain of the organism. |
| `tissue_type` | Type of tissue from which the sequence was obtained. |

#### Optional Columns — Custom Viruses

For custom viruses, the output metadata is a tab-delimited file whose column names follow the BankIt Title_Case convention. You can add the following columns to the input CSV using the standard lower-case names; they are renamed automatically in the output:

| Input column | Output column | Description |
|---|---|---|
| `altitude` | `Altitude` | Altitude in metres. |
| `bio_material` | `Bio_material` | Biological material identifier. |
| `breed` | `Breed` | Named breed (usually for domesticated mammals). |
| `cell_line` | `Cell_line` | Cell line from which the sequence was obtained. |
| `cell_type` | `Cell_type` | Type of cell. |
| `clone` | `Clone` | Clone from which the sequence was obtained. |
| `collected_by` | `Collected_by` | Person who collected the sample. |
| `culture_collection` | `Culture_collection` | Culture collection identifier. |
| `dev_stage` | `Dev_stage` | Developmental stage of the organism. |
| `ecotype` | `Ecotype` | Named ecotype. |
| `fwd_primer_name` | `Fwd_primer_name` | Name of forward PCR primer. |
| `fwd_primer_seq` | `Fwd_primer_seq` | Sequence of forward PCR primer. |
| `genotype` | `Genotype` | Genotype of the organism. |
| `haplogroup` | `Haplogroup` | Haplogroup of the organism. |
| `haplotype` | `Haplotype` | Haplotype of the organism. |
| `lab_host` | `Lab_host` | Laboratory host used to propagate the organism. |
| `lat_lon` | `Lat_Lon` | Latitude and longitude in decimal degrees. |
| `note` | `Note` | Free-text additional information. |
| `rev_primer_name` | `Rev_primer_name` | Name of reverse PCR primer. |
| `rev_primer_seq` | `Rev_primer_seq` | Sequence of reverse PCR primer. |
| `segment` | `Segment` | Viral or phage segment sequenced. |
| `serotype` | `Serotype` | Serological variety. |
| `serovar` | `Serovar` | Serological variety (prokaryote). |
| `sex` | `Sex` | Sex of the organism. |
| `specimen_voucher` | `Specimen_voucher` | Specimen voucher identifier. |
| `strain` | `Strain` | Strain of the organism. |
| `sub_species` | `Sub_species` | Subspecies. |
| `tissue_lib` | `Tissue_lib` | Tissue library. |
| `tissue_type` | `Tissue_type` | Type of tissue. |
| `variety` | `Variety` | Variety of the organism. |

> **Note:** Columns not listed above are silently ignored and will not appear in the output metadata file.
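
The renaming in the table above amounts to capitalizing the first letter of each name, with `lat_lon` → `Lat_Lon` as the one exception. A minimal sketch of that mapping is shown below. The names `CUSTOM_OPTIONAL`, `output_name`, and `rename_optional_columns` are hypothetical, and the sketch covers only the optional columns, not the required ones.

```python
# Optional input columns accepted for custom viruses (from the table above).
CUSTOM_OPTIONAL = {
    "altitude", "bio_material", "breed", "cell_line", "cell_type", "clone",
    "collected_by", "culture_collection", "dev_stage", "ecotype",
    "fwd_primer_name", "fwd_primer_seq", "genotype", "haplogroup",
    "haplotype", "lab_host", "lat_lon", "note", "rev_primer_name",
    "rev_primer_seq", "segment", "serotype", "serovar", "sex",
    "specimen_voucher", "strain", "sub_species", "tissue_lib",
    "tissue_type", "variety",
}
# The only output name that is not a plain first-letter capitalization.
SPECIAL_CASES = {"lat_lon": "Lat_Lon"}


def output_name(column):
    """Map a lower-case input column to its output column name."""
    return SPECIAL_CASES.get(column, column.capitalize())


def rename_optional_columns(row):
    """Keep only the listed optional columns, renamed; drop everything else."""
    return {output_name(key): value
            for key, value in row.items() if key in CUSTOM_OPTIONAL}


row = {"lat_lon": "15.77 S 47.93 W", "strain": "BR-01", "favourite_colour": "blue"}
renamed = rename_optional_columns(row)
```

The unlisted `favourite_colour` column is dropped without an error, mirroring the behavior described in the note above.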

### Output Format

The `prepare-ncbi-submission` command will take your input CSV and generate a final metadata file inside each submission directory. For predefined viruses (SARS-CoV-2, Dengue, Influenza, Norovirus), a comma-delimited `metadata.csv` is generated. For custom viruses, a tab-delimited `metadata.tsv` is generated instead. The tool automatically enriches this file with taxonomic and typing data derived from the `viralQC` results:

* **Influenza**: Adds `serotype` (e.g., `H1N1` or `H3N2`) extracted directly from the classification.
* **Dengue**: Adds `genotype` (e.g., `1`, `2`, `3`, or `4`) and `serotype` (the detailed clade assignment).
* **Norovirus**: Adds `genotype` (e.g., `GII`, `GII.17`) extracted from the virus classification. No `serotype` column is included.
* **SARS-CoV-2**: Removes `isolation-source` and `serotype` as they are not typically included in SC2 NCBI submissions.
* **Custom Viruses**: Renames columns to NCBI-compatible names: `geo_loc_name` → `Country (geo_loc_name)`, `host` → `Host`, `isolate` → `Isolate`, `collection-date` → `Collection_date`, `isolation-source` → `Isolation_source`. No `Organism` column is added (the organism information is included in the FASTA headers as `[Organism=...]`).

## FASTA Headers and Annotations

FASTA headers are carefully managed during organization:
* Sequences failing quality thresholds (e.g., length < 150 nt or N content ≥ 50%) are excluded from the FASTA. The reason for each exclusion is recorded in the `summary.txt` file inside the relevant output directory.
* **Case-insensitive duplicate sequence IDs are automatically detected and removed.** NCBI treats sequence IDs as case-insensitive — for example, `SEQ001` and `seq001` are considered identical by NCBI and would cause an upload error. When a clash is detected, the **second** occurrence is dropped and the reason is logged in `summary.txt`:
  ```
  dropped: 1 sequence(s)
    - seq001: Case-insensitive duplicate of 'SEQ001' (NCBI treats IDs as case-insensitive)
  ```
* Unsafe characters in sequence names (non-ASCII or pipes) are sanitized to underscores for NCBI compatibility, with translations logged in `renamed_headers.tsv`.
* Spaces and brackets are preserved correctly, allowing standard NCBI feature qualifiers like `[Organism=...]` to work as intended for non-standard viruses.
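
The rules above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the tool's code: `sanitize_id` and `filter_records` are hypothetical names, the order in which the checks run is a guess, and a sequence that fails a quality check is assumed not to count as a "first occurrence" for duplicate detection.

```python
import re

MIN_LENGTH = 150       # sequences shorter than this are excluded
MAX_N_FRACTION = 0.5   # sequences with N content >= 50% are excluded


def sanitize_id(seq_id):
    """Replace pipes and non-ASCII characters with underscores."""
    return re.sub(r"[^\x00-\x7F]|\|", "_", seq_id)


def filter_records(records):
    """Split (seq_id, sequence) pairs into kept records and (seq_id, reason) drops."""
    kept, dropped, seen = [], [], {}
    for seq_id, sequence in records:
        n_fraction = sequence.upper().count("N") / len(sequence) if sequence else 1.0
        key = seq_id.lower()  # NCBI compares IDs case-insensitively
        if len(sequence) < MIN_LENGTH:
            dropped.append((seq_id, f"length < {MIN_LENGTH} nt"))
        elif n_fraction >= MAX_N_FRACTION:
            dropped.append((seq_id, "N content >= 50%"))
        elif key in seen:
            dropped.append((seq_id, f"Case-insensitive duplicate of '{seen[key]}'"))
        else:
            seen[key] = seq_id
            kept.append((sanitize_id(seq_id), sequence))
    return kept, dropped


records = [
    ("SEQ001", "ACGT" * 50),
    ("seq001", "ACGT" * 50),              # clashes with SEQ001
    ("lab|S\u00e3o", "ACGT" * 50),        # pipe and non-ASCII character
    ("short", "ACGT"),                    # only 4 nt
    ("gappy", "N" * 100 + "ACGT" * 25),   # exactly 50% N
]
kept, dropped = filter_records(records)
```

In this example only `SEQ001` and the sanitized `lab_S_o` are kept; the other three records are dropped with a reason that could be written to a log such as `summary.txt`.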

## Batch Splitting

NCBI limits submissions to 3,000 sequences per file. To stay under this limit, when a virus group has more than 2,999 sequences, the `sequences.fasta` and metadata files are automatically split into numbered batches of at most 2,999 sequences each:

* `sequences.1.fasta`, `metadata.1.csv` (or `metadata.1.tsv` for custom viruses) — first 2,999 sequences
* `sequences.2.fasta`, `metadata.2.csv` (or `metadata.2.tsv`) — next 2,999 sequences
* …and so on.

If a group has 2,999 or fewer sequences, the files are written normally without any suffix.
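
The naming scheme can be illustrated with a short Python sketch. `batch_file_names` is a hypothetical helper that only computes file names and batch sizes following the rules above; it does not write any files.

```python
BATCH_SIZE = 2999  # maximum number of sequences per output file


def batch_file_names(n_sequences, metadata_ext="csv", batch_size=BATCH_SIZE):
    """Return [(fasta_name, metadata_name, n_in_batch), ...] for one virus group."""
    if n_sequences <= batch_size:
        # Small groups keep the plain, unsuffixed file names.
        return [("sequences.fasta", f"metadata.{metadata_ext}", n_sequences)]
    batches = []
    for index, start in enumerate(range(0, n_sequences, batch_size), start=1):
        count = min(batch_size, n_sequences - start)
        batches.append(
            (f"sequences.{index}.fasta", f"metadata.{index}.{metadata_ext}", count)
        )
    return batches


small = batch_file_names(2999)
large = batch_file_names(7000, metadata_ext="tsv")
```

A group of 7,000 custom-virus sequences would produce three batches: two with 2,999 sequences and a final one with the remaining 1,002.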

## Python API

The preparation logic is also available as an importable Python class, `PrepareSubmission`, for use in third-party scripts or pipelines. Instead of reading metadata from a CSV file, the class accepts a list of Python dicts.

### Installation

The class is available after installing `viralqc` as a package. No additional dependencies are required.

```python
from viralqc.core import PrepareSubmission
```

### Constructor

```python
PrepareSubmission(
    viralqc_results,      # Path – ViralQC results file (.tsv, .csv or .json)
    viralqc_target_seq,   # Path – sequences_target_regions.fasta produced by vqc run
    viralqc_input_seq,    # Path – original input FASTA passed to vqc run
    samples_metadata,     # list[dict] – sample metadata (see below)
    output_prefix="ncbi_submission",  # str  – prefix for output directories
    split_by_segments=False,          # bool – split custom viruses by segment
    tbl_dir=None,                     # Path|None – folder with per-sample .tbl files
)
```

Each dict in `samples_metadata` uses the following keys:

| Key | Description | Standard viruses | Custom viruses |
|-----|-------------|-----------------|----------------|
| `sample_id` | Must match `seqName` exactly. **Max 24 characters.** | **Required** | **Required** |
| `country` | Geographic location (maps to `geo_loc_name`). | **Required** | Optional |
| `host` | Natural host (e.g. `Homo sapiens`). | **Required** | Optional |
| `isolate` | Isolate name or identifier. | **Required** | Optional |
| `collection-date` | Collection date (`YYYY-MM-DD`). | **Required** | Optional |
| `isolation-source` | Source material (e.g. `Serum`). | **Required** | Optional |
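
If your metadata already lives in a CSV shaped like the `--metadata` input, you can convert it into the list of dicts described above. The sketch below assumes that `Sequence_ID` corresponds to `sample_id` and `geo_loc_name` to `country`, as the table suggests. `csv_to_samples_metadata` is a hypothetical helper, not part of `viralqc`, and it keeps only the six documented keys.

```python
import csv
from io import StringIO

# CSV header -> key expected in each samples_metadata dict.
CSV_TO_API_KEY = {
    "Sequence_ID": "sample_id",
    "geo_loc_name": "country",
    "host": "host",
    "isolate": "isolate",
    "collection-date": "collection-date",
    "isolation-source": "isolation-source",
}


def csv_to_samples_metadata(handle):
    """Convert rows of a --metadata style CSV into a list of dicts."""
    samples = []
    for row in csv.DictReader(handle):
        sample = {
            api_key: row[column].strip()
            for column, api_key in CSV_TO_API_KEY.items()
            if row.get(column)  # skip absent or empty cells
        }
        samples.append(sample)
    return samples


example_csv = StringIO(
    "Sequence_ID,geo_loc_name,host,isolate,collection-date,isolation-source\n"
    "S001,Brazil,Homo sapiens,isolate/S001/2024,2024-01-01,Serum\n"
    "S002,Colombia,Homo sapiens,isolate/S002/2024,2024-02-15,\n"
)
samples_metadata = csv_to_samples_metadata(example_csv)
```

The resulting `samples_metadata` list has the shape expected by the `samples_metadata` constructor argument. Empty cells are omitted rather than passed as empty strings.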

### Methods

#### `run_virus(virus="all", virus_name=None)`

Prepares submission packages grouped by virus type. Equivalent to `vqc prepare-ncbi-submission virus <subcommand>`.

- `virus`: `"all"` (default), `"sars-cov-2"`, `"dengue"`, `"influenza"`, `"norovirus"`, or `"custom"`.
- `virus_name`: required when `virus="custom"`.

#### `run_sample(samples=["all"])`

Prepares packages for specific samples or all samples. Equivalent to `vqc prepare-ncbi-submission sample`.

- `samples`: a list of sample IDs, or `["all"]` to process every sample.

### Return Value

Both methods return a list of dicts, one entry per generated output directory:

```python
[
    {
        "SARS-CoV-2": {
            "sequences":  [Path("ncbi_submission_SARS-CoV-2/sequences.fasta")],
            "metadata":   [Path("ncbi_submission_SARS-CoV-2/metadata.csv")],
            "log":         Path("ncbi_submission_SARS-CoV-2/summary.txt"),
        }
    },
    {
        "Oropouche virus": {
            "sequences":  [Path("ncbi_submission_Oropouche_virus/sequences.fasta")],
            "metadata":   [Path("ncbi_submission_Oropouche_virus/metadata.tsv")],
            "log":         Path("ncbi_submission_Oropouche_virus/summary.txt"),
            "annotation": [Path("ncbi_submission_Oropouche_virus/annotation.tbl")],
        }
    },
]
```

The `"annotation"` key is only present for custom viruses that have TBL files.

For viruses organized into subdirectories (Influenza segments, Norovirus genogroups, custom viruses with `split_by_segments=True`), each subdirectory produces its own entry. The label uses a `"Type/Subgroup"` format, e.g. `"InfluenzaA/HA"` or `"Norovirus/GII"`.
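
Because subdirectory entries share a `"Type/Subgroup"` label, it can be convenient to regroup the returned list by its top-level type. The following sketch works on a hand-written list in the documented shape rather than on real output. `group_by_virus_type` is a hypothetical helper, and it assumes that a `/` in a label always separates type from subgroup.

```python
from collections import defaultdict
from pathlib import Path


def group_by_virus_type(results):
    """Map each top-level type (text before '/') to {subgroup: files}."""
    grouped = defaultdict(dict)
    for entry in results:
        for label, files in entry.items():
            virus_type, _, subgroup = label.partition("/")
            # Labels without a subgroup are stored under the key None.
            grouped[virus_type][subgroup or None] = files
    return dict(grouped)


def files_for(directory, metadata_name="metadata.csv"):
    """Build one entry's file dict in the documented shape (illustrative only)."""
    base = Path(directory)
    return {
        "sequences": [base / "sequences.fasta"],
        "metadata": [base / metadata_name],
        "log": base / "summary.txt",
    }


example_results = [
    {"InfluenzaA/HA": files_for("ncbi_submission_InfluenzaA/HA")},
    {"InfluenzaA/NA": files_for("ncbi_submission_InfluenzaA/NA")},
    {"SARS-CoV-2": files_for("ncbi_submission_SARS-CoV-2")},
]
grouped = group_by_virus_type(example_results)
```

With this grouping, `grouped["InfluenzaA"]` holds the `HA` and `NA` entries, while `grouped["SARS-CoV-2"]` has a single entry under the key `None`.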

### Examples

#### Process all viruses found in the results file

```python
from pathlib import Path
from viralqc.core import PrepareSubmission

ps = PrepareSubmission(
    viralqc_results=Path("results.tsv"),
    viralqc_target_seq=Path("sequences_target_regions.fasta"),
    viralqc_input_seq=Path("sequences.fasta"),
    samples_metadata=[
        {
            "sample_id": "S001",
            "country": "Brazil",
            "host": "Homo sapiens",
            "isolate": "isolate/S001/2024",
            "collection-date": "2024-01-01",
            "isolation-source": "Serum",
        },
        {
            "sample_id": "S002",
            "country": "Colombia",
            "host": "Homo sapiens",
            "isolate": "isolate/S002/2024",
            "collection-date": "2024-02-15",
            "isolation-source": "Nasopharyngeal swab",
        },
    ],
)

results = ps.run_virus()  # process all virus groups
for entry in results:
    for virus_label, files in entry.items():
        print(f"{virus_label}:")
        for seq_path in files["sequences"]:
            print(f"  sequences → {seq_path}")
        for meta_path in files["metadata"]:
            print(f"  metadata  → {meta_path}")
        if "annotation" in files:
            for ann_path in files["annotation"]:
                print(f"  annotation → {ann_path}")
```

#### Process only specific sample IDs

```python
results = ps.run_sample(samples=["S001"])
```

#### Process a custom virus with segment splitting

```python
ps = PrepareSubmission(
    viralqc_results=Path("results.tsv"),
    viralqc_target_seq=Path("sequences_target_regions.fasta"),
    viralqc_input_seq=Path("sequences.fasta"),
    samples_metadata=[...],
    split_by_segments=True,
)
results = ps.run_virus(virus="custom", virus_name="Oropouche virus")
# Produces ncbi_submission_Oropouche_virus/S/, /M/, /L/ subdirectories
```