# Developer Guide This guide provides detailed documentation for the Python scripts used within the ViralQC pipeline. It is intended for developers who want to understand the internal logic, implementation choices, and execution flow of the tool. The scripts are located in `viralqc/scripts/python/` and are orchestrated by Snakemake workflows. ## Dataset Management These scripts are used by the `get-nextclade-datasets` and `get-blast-database` commands to download and prepare reference data. ### get_github_dataset.py **Purpose**: Downloads specific dataset folders from a GitHub repository without using the GitHub API (to avoid rate limits). **Execution**: Called by `get_public_datasets.smk` when processing `github` configured viruses. **Implementation Details**: - **No API Usage**: Instead of using the GitHub API, it downloads the repository archive via `https://codeload.github.com/.../zip/refs/heads/main`. - **Selective Extraction**: It streams the zip file and only extracts files that match the requested `dataset-path`. - **Structure Flattening**: It handles the stripping of the root folder (e.g., `repo-main/`) to place files directly in the target directory. ### jsonl_to_gff.py **Purpose**: Converts NCBI Datasets JSONL annotation reports into GFF3 format, ensuring compatibility with Nextclade. **Execution**: Called by `get_blast_database.smk` after downloading NCBI RefSeq data. **Key Functions**: - `clean_cds_name(cds_name)`: Sanitizes CDS names by removing special characters, truncating to 20 characters, and standardizing formatting. This is crucial because Nextclade can fail with complex or overly long gene names. - `jsonl_to_gff(...)`: - **Validation**: Checks if CDS lengths are multiples of 3. If not, the accession is marked as invalid and excluded. - **Grouping**: Groups split CDS entries (e.g., joined genes) by name to create single gene features. - **Gene vs CDS**: If CDS data is missing, it attempts to create a gene feature using the full genome length (if divisible by 3). ### get_minimizer_index.py **Purpose**: Generates a minimizer index JSON file from reference FASTA files, allowing Nextclade to map sequences to external (github-hosted) datasets. **Execution**: Called by `get_public_datasets.smk` for GitHub-hosted datasets. **Implementation Details**: - **Origin**: This is a simplified adaptation of the `minimizer.py` script from the Nextclade project. - **Customization**: The `fasta_read` function was modified to inject the dataset name into the sequence record annotations. This ensures that the generated index correctly associates sequence minimizers with their corresponding local dataset paths. ## Analysis Pipeline These scripts are executed during the main `vqc run` workflow to process sequences, identify viruses, and assess quality. ### format_nextclade_sort.py **Purpose**: Processes the output of `nextclade sort` to link identified datasets with their local file paths and identify sequences that didn't match any dataset. **Execution**: Run immediately after `nextclade sort` in `run_analysis.smk`. **Key Functions**: - `map_datasets_to_local_paths(...)`: Reads the YAML configuration to build a mapping between remote dataset names (e.g., from Nextclade) and local storage paths. - `format_nextclade_output(...)`: Merges "local" and "external" sort results. It adds a `localDataset` column pointing to the directory of the identified virus. - `write_unmapped_sequences(...)`: Extracts sequences that have no assigned dataset (`localDataset` is NaN) and writes their names to `unmapped_sequences.txt` for subsequent BLAST analysis. ### blast_wrapper.py **Purpose**: A wrapper around the `blastn` command to safely handle FASTA headers containing spaces. **Execution**: Run by the `blast` rule in `run_analysis.smk` for sequences that were not identified by Nextclade. **Implementation Details**: - **Sanitization**: BLAST can truncate headers at the first space, leading to ID mismatches. This script checks for spaces in headers. - **Renaming Flow**: 1. If spaces are found, it generates a temporary FASTA where sequences are renamed to simple indices (1, 2, 3...). 2. It saves a mapping file (`mapping.tsv`) linking indices to original headers. 3. Runs BLAST with the renamed file. 4. Restores the original headers in the BLAST output TSV using the mapping file. ### reorder_cds.py **Purpose**: Reorders the `cdsCoverage` string in Nextclade's TSV output to match the gene order defined in the GFF file. **Execution**: Run after every `nextclade run` execution. **Logic**: - Nextclade output genes in an alphabetical order and omit genes with zero coverage. - This script reads the GFF to establish the canonical gene order (`start` position). - It parses the existing `cdsCoverage` (format `Gene:Cov,...`), reorders it, and inserts `Gene:0.0` for any missing genes. This ensures consistent column ordering for downstream processing. ### post_process_nextclade.py **Purpose**: The central aggregation script that combines results from Nextclade, BLAST, and generic analyses into a final report. It calculates categorical quality metrics (grades A-D) and produces the final output (TSV, CSV, or JSON). **Execution**: The final step of the `post_process_nextclade` rule. **Memory Management & Generators**: This script is engineered to process massive datasets (e.g., millions of sequences) with a constant memory footprint for CSV/TSV outputs. 1. **Lazy Loading with Generators**: The `format_dfs` function is implemented as a Python **Generator**. Instead of returning a list of all DataFrames (which would load all files into RAM), it `yields` one processed DataFrame at a time. * **Logic**: It iterates through the list of input files. For each file, it reads the data, enriches it with metadata, optimizes types, yields it to the consumer, and **immediately** deletes the reference and forces garbage collection. 2. **Streamed Writing**: The `write_combined_df` (and its helper `_write_csv_tsv_output`) consumes this generator. It iterates over the generator, writing each yielded chunk to disk immediately using `mode='a'` (append). * **Result**: At any given moment, only the data from a single input file exists in memory. * **JSON Limitation**: For JSON output (`_write_json_output`), the script *must* accumulate all data to form a valid JSON array. However, it still employs garbage collection to discard intermediate processing artifacts as soon as they are appended to the main list. 3. **Explicit Garbage Collection**: Python's automatic GC might not trigger fast enough when dealing with large, tight loops of data loading. Explicit `del` combined with `gc.collect()` calls are strategically placed to ensure memory is released back to the OS before allocating the next chunk. **Core Functions**: * `format_dfs(files, config_file, blast_metadata_df)`: The primary generator. It determines if a result file belongs to a known virus (with dataset configured in YAML) or is a generic run (Nextclade executed with references informed by BLAST analysis). It calls the appropriate processing logic (`_process_with_virus_info` or `_process_generic_run`) and yields the result. * `load_blast_metadata(metadata_path)`: Loads the BLAST database metadata and normalizes column names (e.g., `accession` -> `virus`) to ensure consistency with Nextclade outputs. * `optimize_dataframe_memory(df)`: Analyzes DataFrame columns. If a string column (like `virus`, `clade`, `dataset`) has a low cardinality (number of unique values < 50% of total rows), it converts the column to `category` dtype. This drastically reduces RAM usage. * `add_qualities(df, virus_info)`: Applies the quality scoring logic row-by-row. It invokes the helper `_compute_metrics_qualities` and then uses the `get_*_quality` functions to assign grades. * `add_coverages(df, virus_info)`: Parses and Formats the `cdsCoverage` string. It also calculates coverage for specific target regions and adds them as new columns (`targetRegionsCoverage`, `targetGeneCoverage`). * `format_sc2_clade(df, dataset_name)`: Contains specific logic for SARS-CoV-2. Since Nextclade outputs Pango lineages in a specific column (`Nextclade_pango`), this function maps it to the standard `clade` column for consistency. * `create_unmapped_df(unmapped_sequences, blast_results, blast_metadata_df)`: Handles sequences that failed both Nextclade identification and BLAST. It reads the raw `unmapped_sequences.txt` and creates a DataFrame labeled as "Unclassified". * `write_combined_df(df_iterator, output_file, output_format, ...)`: The main dispatcher. It takes the `format_dfs` generator (iterator) and directs it to the appropriate writing backend (CSV/TSV or JSON). **Quality Control Functions**: * `get_genome_quality(scores)`: Aggregates individual metric scores into a final Genome Quality. It sums the scores (A=4, B=3, C=2, D=1) and normalizes to a 24-point scale to assign the final grade. * `get_target_regions_quality(...)`: Determines the quality of specific target regions. Logic: If the whole genome is A/B, return empty (implied good). Otherwise, calculate mean coverage of the target regions to assign a grade. * `get_cds_cov_quality(...)`: Checks every CDS against coverage thresholds to assign A/B/C/D grades per gene. * `get_missing_data_quality(coverage)`: Scored based on thresholds (0.9, 0.75, 0.5). * `get_private_mutations_quality(total, threshold)`: Scored based on deviation from a defined threshold. * `get_qc_quality(total)`: General scoring for count-metrics (0=A, 1=B, 2=C, >2=D). ### extract_target_regions.py **Purpose**: Extracts the genomic coordinates of "good quality" regions for downstream usage (e.g., consensus generation or primer design). **Execution**: Run by `extract_target_regions` rule after post-processing. **Selection Logic**: The `check_target_regions` function determines the best region to extract based on quality: 1. **Genome**: If the overall `genomeQuality` is A or B, the entire genome is selected. 2. **Target Region**: Else, if `targetRegionsQuality` is A or B, the specific target regions are selected. 3. **Target Gene**: Else, if `targetGeneQuality` is A or B, the target gene is selected. **Coordinate Mapping**: - Uses `get_regions` to look up the start/end coordinates of the selected feature (gene or full genome) in the GFF file. - Outputs a BED file compatible with `seqtk subseq`.