Viral Sequence Feature Extraction Tool


We use paired-end sequencing as an example to explain in detail how to use ViSFE for feature calculation:

Download All Demo Data

 

1. Input File Preparation

Before running, you need to prepare 4 files: reference sequence file, codon file, sample file, and sequencing file. The first three files are stored in the ./public/reference/ path, and the sequencing file is stored in the ./data/fastqraw/ path.

The metadata.txt file containing the sample name

SRR8574911
SRR8574890
SRR8574897
...

The ref_seq.fasta file containing the reference sequence

>ref_seq
ATGGAACACGACCTTGAGAGGGGCCCACCGGGCCCGCGACGGCCCCCTCGAGGACCCCCC
...

The sample.fastq file containing the sequencing data

@DJB775P1:248:D0MDGACXX:7:1202:12362:49613
TGCTTACTCTGCGTTGATACCACTGCTTAGATCGGAAGAGCACACGTCTGAA
+
JJJJJIIJJJJJJHIHHHGHFFFFFFCEEEEEDBD?DDDDDDBDDDABDDCA
...

2. Data quality control and preprocessing

For the raw sequencing data, FastQC software is used to perform quality assessment, analyze the base quality, sequencing depth and other information of each sample, and generate a pre-quality control report; Subsequently, Trim Galore is used to remove adapters and low-quality sequences to obtain clean sequences, and then FastQC is run to generate a post-quality control report.

Pre-quality control files:

./data/fastqc/

Clean files:

./data/clean/

Post-quality control files:

./data/fastqc_clean/

Users can check the quality control report and adjust the appropriate parameters for data preprocessing.

3. Sequence alignment and splicing

The Burrows-Wheeler Aligner (BWA) software is used to build an index file for the viral reference genome. The EBV NC.007605.1 sequence with a length of 171,823 bp is selected as the viral reference sequence. The Fast Length Adjustment of Short Reads (FLASH) software is used to merge the quality-controlled short sequences from Read1 and Read2, generating longer fragment sequences.

Reference sequence index files:

./public/reference/

Paired-end sequencing sequence splicing files:

./data/fastqbind/

Sequence alignment

Sequence alignment results of successful splicing: ./result/bwasam/
Sequence alignment results of failed splicing: ./result/unbindsam/

4. Processing comparison results and merging

For successfully merged reads, BWA is then used to align them with the EBV reference sequence to extract the aligned sequences. For those unmerged reads from Read1 and Read2, BWA is used to align Read1 and Read2 separately with the reference sequence, and SAMtools software is subsequently conducted to extract the sequences that aligned at both ends. If there were duplicate regions in the sequences aligned at both ends, the duplicates are removed and the sequences are merged into one. On the contrary, Read1 and Read2 sequences are kept. A custom Python script was performed to integrate the processed sequences from each sample into a sequence matrix.

Extraction and processing of the comparison results

Successful splicing sequence: ./result/cleansample
Failed splicing sequence: ./result/unbindcleansample

Result merging

Merged file: ./result/finaldata/
Read counts for each site: ./result/finaldata/count/

5. Feature calculation

The sequence feature calculation uses mutation and entropy features at the nucleotide level. The mutation feature represents the mutation rate compared with the reference sequence, and the calculation formula is as follows:
The entropy feature represents the Shannon entropy of each nucleotide position and is calculated as follows:
For the i-th sample, j enumerates its sequence depthi, and nijk represents the nucleotide at position k of the j-th sequence of the i-th sample

Results files

Feature results: ./feature/data/raw: mutation, count, entropy_nuc, entropy_slide, entropy_group
Number of reads contained in each site: ./feature/data/each

Limiting the sequencing depth

By limiting the sequencing depth, it is easier to build the optimal machine learning model later: ./feature/data/file

Merging all results

./feature/data/sampledata

The final merged feature matrix is shown below:

Users can directly use this data for subsequent machine learning model training and prediction, screening of differential sites, and calculation of correlation with clinical indicators.
Sample Group NC_site Mutation Shannon_entropy
8 NPC 14 9.92950054612253e-05 0.0014636622713465
3 NPC 9 8.858965272856131e-05 0.0013204405924585
11 non_NPC 253 1.0 0.0240075762881998
15 non_NPC 317 1.0 0.0073675609537136
2 non_NPC 38 1.0 0.0188142080168351
11 non_NPC 317 1.0 0.0
1 non_NPC 317 1.0 0.0033416393172456
2 non_NPC 317 1.0 0.0
16 non_NPC 317 1.0 0.0
16 non_NPC 253 1.0 0.0291534793859807