Before starting the project, we need to prepare the sequencing files in fastq format in the path ./data/fastqraw . We also need the reference sequence file and the sample information file in ./public/ .
Setting the file path:
project="/home/feature_extraction4seq/"#change to your path
fq=${project}/data/fastqraw/
public=${project}/public/
result=${project}/result/
Batch decompression of paired-end sequencing data files is performed as follows
cd ${project}/data/fastqraw/
tail -n +2 ${public}/sample/sampleinfo.txt | cut -f 1 | while read id
do
gunzip ${project}/data/fastqraw/${id}_2.fastq.gz
gunzip ${project}/data/fastqraw/${id}_1.fastq.gz
done
or:
cd ${project}/data/fastqraw/
for i in *.fastq.gz;
do
gunzip $i
done
The original sequencing data was subjected to quality control using FastQC software, including assessing base quality, sequencing depth, GC content, and adapter content.
outdir=${project}//data/fastqc
log=${project}//data/QClog
mkdir $outdir
mkdir $log
for i in $fq/*.fastq;
do
sample=${i%_clean*}
sample=${sample##*/}
fastqc -o $outdir $i > $log/$sample.log
done;
If the results of quality control are unsatisfactory, further removal of sequence adapters is required. The trim_galore software was used to remove adapter sequences from both read1 and read2 of the paired-end sequencing data. The results are saved at ./data/clean/. The quality control results after adapter trimming are saved in ./data/fastqc_lean.
mkdir ${fq}/clean/
outdir_clean=${project}//data/fastqc_clean
mkdir $outdir_clean
tail -n +2 ${public}/sample/sampleinfo.txt | cut -f 1 | while read id
do
# QC
#fastqc ${fq}/${id}_1.fastq ${fq}/${id}_2.fastq -o
# cut adapters
trim_galore --cores 2 -q 20 \
--phred33 --stringency 3 --length 20 -e 0.1 \
--paired ${fq}/${id}_1.fastq ${fq}/${id}_2.fastq \
--gzip -o ${project}/data/clean/
#QC again
fastqc ${project}/data/clean/${id}_1_val_1.fq.gz ${project}/data/clean/${id}_2_val_2.fq.gz -o $outdir_clean
done
The Burrows-Wheeler Aligner (BWA) software was used to build an index file for the reference sequence.
cd ${public}/reference/
bwa index -a bwtsw ref_seq.fasta
The Fast Length Adjustment of Short Reads (FLASH) software was used to merge the quality-controlled short sequences from Read1 and Read2, generating longer fragment sequences. If the quality control results after adapter trimming are improved, the path \${fq}\${id}_1.fastq and \${fq}\${id}_1.fastq can be changed accordingly.
mkdir ${project}/data/fastqbind/
cd ${project}/data/fastqbind/
tail -n +2 ${public}/sample/sampleinfo.txt | cut -f 1 | while read id
do
flash ${fq}${id}_1.fastq ${fq}${id}_2.fastq -p 33 -r 250 -f 500 -s 100 -o ${id}
done
cd ${project}
For successfully merged reads, BWA was then used to align them with the reference sequence to extract the aligned sequences. For those unmerged reads from Read1 and Read2, BWA was used to align Read1 and Read2 separately with the reference sequence, and SAMtools software was subsequently conducted to extract the sequences that aligned at both ends. If there were duplicate regions in the sequences aligned at both ends, the duplicates were removed and the sequences were merged into one. On the contrary, Read1 and Read2 sequences were kept.
BWA for successfully merged reads, and the results are saved in ./result/bwasam/:
mkdir $result
mkdir $result/bwasam/
#cd ${project}/data/fastqbind/
tail -n +2 ${public}/sample/sampleinfo.txt | cut -f 1 | while read id
do
bwa mem -M ${public}/reference/ref_seq.fasta ${project}/data/fastqbind/${id}.extendedFrags.fastq > $result/bwasam/${id}.sam
done
BWA for unmerged reads, and the results are saved in ./result/unbindsam/:
mkdir $result/unbindsam/
#cd ${project}/data/fastqbind/
tail -n +2 ${public}/sample/sampleinfo.txt | cut -f 1 | while read id
do
echo $id
bwa mem -M ${public}/reference/ref_seq.fasta ${project}/data/fastqbind/${id}.notCombined_1.fastq ${project}/data/fastqbind/${id}.notCombined_2.fastq > $result/unbindsam/${id}_unbind.sam
done
This process will generate multiple intermediate files. The final results, which include the mapping information of reads to the reference sequence for each sample, will be saved in ./result/finaldata/seq/.
# Extract the bwa results from *./result/bwasam/*
python3 deal.py
# Extract the bwa results from *./result/unbindsam/*. If the folder *unbindsam* is empty, this step can be skipped.
python3 dealunbind.py
python3 againunbind.py
#Integration of the extraction results
python3 dealall.py
For sequence feature calculation, the nucleotide-level mutation and Shannon entropy algorithms were selected. The results will separately output feature matrices for sequencing depths of [10,30,50,75,100,150,200,250,300,350,400,450,500], saved in ./feature/data/sampledata.
mkdir $project/feature/
cd $project/feature/
#Calculate sequence features for each sample
python3 feature.py
#Merge sequence features from all samples.
python3 bind.py