Using ViMRT

Welcome to the ViMRT docs!

Introduction

ViMRT is a text-mining tool and search engine for automated virus mutation recognition. Built on natural language processing, it combines rule patterns with regular expression patterns to cover the different written forms of virus mutations found in the literature. It can also quickly and accurately search for virus mutation-related information, including virus genes and related diseases.

Getting Started with ViMRT

System Python

Before we start, a quick note: using the system-wide installation of Python is not recommended. It often causes problems, and modifying it is a little risky. If you find yourself prepending sudo to any ViMRT commands, take a step back and consider Python virtual environments or conda instead (see below).

Installing Python

To see if you have Python installed, run python --version on the command line. ViMRT needs Python version 2.7+, 3.7+ or 3.8+.

We recommend using virtual environments to manage your Python installation. Our favourite is conda, a cross-platform tool to manage Python environments. You can find installation instructions for Miniconda in the conda documentation.

Once conda is installed, you can create a Python environment with the following commands:

conda create --name py3.8 python=3.8
conda activate py3.8

You may want to add the conda activate py3.8 line to your .bashrc file so that the environment is activated every time you open a terminal.
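
To confirm that the interpreter on your PATH comes from an environment rather than the system installation, a quick check such as the one below can help. This is a minimal sketch and not part of ViMRT; the helper name in_isolated_env is our own.

```python
import os
import sys


def in_isolated_env():
    """Return True inside a venv/virtualenv or an active conda environment."""
    # venv/virtualenv: sys.prefix differs from the base interpreter prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    in_venv = sys.prefix != base_prefix or hasattr(sys, "real_prefix")
    # conda sets CONDA_DEFAULT_ENV for any active environment (including "base").
    in_conda = bool(os.environ.get("CONDA_DEFAULT_ENV"))
    return in_venv or in_conda


print("Python %d.%d at %s" % (sys.version_info[0], sys.version_info[1], sys.executable))
print("Isolated environment:", in_isolated_env())
```

If the second line prints False, the commands in this guide would run against the system Python.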

Installing Python packages

Then install the Python packages required to run the code as follows:

pip3 install -r requirements.txt

Virus mutation recognition

The recognition of virus mutation is mainly divided into two independent modules:
I. Optimize the recognition result of tmVar by rule patterns
II. Develop regular expression patterns to recognize virus mutation

Downloading BioC format

Each input file should follow the BioC format (available for PubMed abstracts and PMC full-text articles). The user can download BioC files by running the code below; if BioC files are already available, this step can be skipped.

python Bio_download.py -i [input] -o [output] -s [sources]

input: the user can provide the input file listing PMIDs, such as PMIDlist.txt.
output: the user can provide the output folder path.
sources: the user can choose the output file source: PubMed | PMC | PubMed_PMC, which will obtain the PubMed abstracts, the PMC full-text articles, or both, respectively.

Example: python Bio_download.py -i PMIDlist.txt -o ./tmVar/tmvar_input -s PubMed
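
The PMID input file can be prepared with a few lines of Python. The sketch below assumes one PMID per line; the layout Bio_download.py actually expects is not stated here, so compare against the bundled PMIDlist.txt. Both clean_pmids and write_pmid_list are hypothetical helpers, not part of ViMRT.

```python
import re


def clean_pmids(pmids):
    """Return unique, purely numeric PMIDs as strings, in first-seen order."""
    seen = set()
    cleaned = []
    for pmid in pmids:
        pmid = str(pmid).strip()
        # PMIDs are positive integers; skip anything else.
        if not re.fullmatch(r"[1-9]\d*", pmid):
            continue
        if pmid not in seen:
            seen.add(pmid)
            cleaned.append(pmid)
    return cleaned


def write_pmid_list(pmids, path):
    """Write the cleaned PMIDs to `path`, one per line (assumed layout)."""
    cleaned = clean_pmids(pmids)
    with open(path, "w") as handle:
        for pmid in cleaned:
            handle.write(pmid + "\n")
    return cleaned
```

For example, write_pmid_list(["36342236"], "PMIDlist.txt") would create a one-line file that can be passed to the -i option.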

Identifying mutation by ViMRT

It includes three steps:
1. Optimizing results of tmVar by rule patterns
2. Identifying mutation by regular expression patterns
3. Integrating the results of rule and regex

*Note: step 1 and step 2 can be run independently.

1. Optimizing results of tmVar by rule patterns

In this step, the user first needs to download tmVar from its official website and use it to identify mutations. tmVar runs only in a Windows or Linux environment. Please see the instructions included in the zip files for more details. The main command is as follows:

java -Xmx5G -Xms5G -jar tmVar.jar [input] [output]

input: the user can provide the input folder path.
output: the user can provide the output folder path.

*Note: each input file and output file of tmVar should follow the PubTator format or the BioC format. If the input files are PMC full-text articles, the output files are available only in BioC format.
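
For readers unfamiliar with the PubTator format, the sketch below parses it into a dictionary. It assumes the usual layout: PMID|t|title and PMID|a|abstract lines, followed by tab-separated annotation lines (PMID, start, end, mention, type, identifier). The sample record is made up for illustration, and parse_pubtator is a hypothetical helper, not ViMRT's own parser.

```python
def parse_pubtator(text):
    """Parse PubTator-formatted text into a dict keyed by PMID."""
    docs = {}

    def get_doc(pmid):
        return docs.setdefault(pmid, {"title": "", "abstract": "", "annotations": []})

    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1] in ("t", "a"):
            key = "title" if parts[1] == "t" else "abstract"
            get_doc(parts[0])[key] = parts[2]
            continue
        fields = line.split("\t")
        if len(fields) >= 5:
            get_doc(fields[0])["annotations"].append({
                "start": int(fields[1]),
                "end": int(fields[2]),
                "mention": fields[3],
                "type": fields[4],
                "identifier": fields[5] if len(fields) > 5 else "",
            })
    return docs


# Made-up sample record (the PMID 12345 is a placeholder).
SAMPLE = (
    "12345|t|G145R escape mutant of hepatitis B virus\n"
    "12345|a|We studied the G145R substitution.\n"
    "12345\t0\t5\tG145R\tProteinMutation\tp|SUB|G|145|R\n"
)
```

Calling parse_pubtator(SAMPLE) returns one document whose annotations list holds the G145R mention with its character offsets.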

Then, the user can optimize the recognition results of tmVar by running a Python script:

python ViMRT.py -i [input] -o [output] -v [virus] -f [format] -m rules

input: the input folder path containing the output result files of tmVar.
output: the user can provide the output folder path. The default path is the current path.
virus: the user can provide one virus name. The default parameter is 'Unknown'.
format: the user can choose one input file format: PubTator or BioC. The default parameter is 'BioC'.

Example: python ViMRT.py -i ./tmVar_result/ -o ./ViMRT_result/ -v HBV -f BioC -m rules

*Note: ViMRT has designed specific rules for optimizing the results of tmVar according to the written forms of mutations in the literature for five viruses: HBV, HPV, HIV, EBV and HTLV1. For example, sG145R is optimized to G145R for HBV mutations. If the virus parameter is one of these five virus names, its specific rules are applied when optimizing the results.
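
To illustrate what a virus-specific rule might look like, the sketch below strips a leading "s" gene prefix from HBV mentions, reproducing the sG145R to G145R example. The rule table and the optimize_mention helper are hypothetical; ViMRT's actual rules are not listed in this guide and may differ.

```python
import re

# Hypothetical per-virus rules: (compiled pattern, replacement).
# Only the HBV "s" prefix case is taken from the note above.
VIRUS_RULES = {
    "HBV": [(re.compile(r"^s([A-Z]\d+[A-Z])$"), r"\1")],
}


def optimize_mention(mention, virus="Unknown"):
    """Apply the rules registered for `virus`; unknown viruses pass through."""
    for pattern, replacement in VIRUS_RULES.get(virus, []):
        mention = pattern.sub(replacement, mention)
    return mention


print(optimize_mention("sG145R", "HBV"))
```

With the default virus 'Unknown', no rule is registered and the mention is returned unchanged.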

2. Identifying mutation by regular expression patterns

Based on the development dataset and the false positive results of tmVar, ViMRT has developed regular expression patterns to recognize virus mutations from the original literature.

python ViMRT.py -i [input] -o [output] -v [virus] -f [format] -m regex

input: the input folder path containing the output result files of tmVar or downloaded BioC format files.
output: the user can provide the output folder path. The default path is the current path.
virus: the user can provide one virus name. The default parameter is 'Unknown'.
format: the user can choose one input file format: PubTator or BioC. The default parameter is 'BioC'.

Example: python ViMRT.py -i ./tmVar_result/ -o ./ViMRT_result/ -v HBV -f BioC -m regex

*Note: ViMRT has designed specific regular expression patterns to identify virus mutations according to their written forms in the literature for five viruses: HBV, HPV, HIV, EBV and HTLV1. For example, the HTLV1 M47 mutation will be identified as 'L319R' and 'L320S'. If the virus parameter is one of these five virus names, its specific regular expression patterns are applied during recognition.
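
The idea can be sketched with a generic amino acid substitution pattern plus a virus-specific alias table. Everything here is illustrative: the pattern is a simplified stand-in for ViMRT's real expressions, find_mutations is a hypothetical helper, and only the HTLV1 M47 entry is taken from the note above.

```python
import re

# One-letter amino acid codes; the pattern matches forms such as G145R
# (wild-type residue, position, mutant residue).
AA = "ACDEFGHIKLMNPQRSTVWY"
SUBSTITUTION = re.compile(r"\b([%s])(\d+)([%s])\b" % (AA, AA))

# Named mutants expanded to their substitutions, per virus.
VIRUS_ALIASES = {
    "HTLV1": {"M47": ["L319R", "L320S"]},
}


def find_mutations(sentence, virus="Unknown"):
    """Return substitution mentions, plus expanded aliases for `virus`."""
    found = [match.group(0) for match in SUBSTITUTION.finditer(sentence)]
    for alias, expansion in VIRUS_ALIASES.get(virus, {}).items():
        if re.search(r"\b%s\b" % re.escape(alias), sentence):
            found.extend(expansion)
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))
```

Passing virus="HTLV1" turns a sentence mentioning "the Tax M47 mutant" into the two substitutions, while the default 'Unknown' leaves such aliases unresolved.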

3. Integrating the results of rule and regex

If users recognize mutations separately by rule and regex, they need to merge the results by running ViMRT.py as follows:

python ViMRT.py -i [input] -o [output] -f [format] -c concat

input: the input path containing the identification result files from both rule and regex.
output: the user can provide the output folder path. The default path is the current path.
format: the user can choose one input file format: PubTator or BioC. The default parameter is 'BioC'.

Example: python ViMRT.py -i ./ViMRT_result/ -o ./ViMRT_mutation/ -f BioC -c concat

Alternatively, the user can run ViMRT.py directly to obtain both the optimization results and the regular expression results from the output files of tmVar.

Identifying the mutation by rule and regex

Using tmVar output result files in BioC format:

python ViMRT.py -i ./tmVar_result/ -o ./ViMRT_mutation/ -v HBV -f BioC

This will generate 3 files: BioC_rules.csv, BioC_regex.csv and BioC_rules_regex.xlsx in the "BioC" folder of the output path. The BioC_rules_regex.xlsx file merges the results of BioC_rules.csv and BioC_regex.csv and contains the final mutation recognition results of ViMRT.

Using tmVar output result files in PubTator format:

python ViMRT.py -i ./tmVar_result/ -o ./ViMRT_mutation/ -v HBV -f PubTator

This will generate 3 files: PubTator_rules.csv, PubTator_regex.csv and PubTator_rules_regex.xlsx in the "PubTator" folder of the output path. The PubTator_rules_regex.xlsx file merges the results of PubTator_rules.csv and PubTator_regex.csv and contains the final mutation recognition results of ViMRT.
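
Conceptually, the merged file is the union of the two result sets. The sketch below shows one way to do such a merge with the standard csv module. The column names PMID and Mutation, and the added Source column, are assumptions made for illustration; the real ViMRT output columns may differ.

```python
import csv
import io


def read_rows(handle):
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def merge_results(rules_rows, regex_rows):
    """Union two result sets on (PMID, Mutation), recording where each came from."""
    merged = {}
    for source, rows in (("rules", rules_rows), ("regex", regex_rows)):
        for row in rows:
            key = (row["PMID"], row["Mutation"])
            sources = merged.setdefault(key, [])
            if source not in sources:
                sources.append(source)
    return [
        {"PMID": pmid, "Mutation": mutation, "Source": ";".join(sources)}
        for (pmid, mutation), sources in merged.items()
    ]


# Made-up rows standing in for the two CSV files.
rules = read_rows(io.StringIO("PMID,Mutation\n1,G145R\n1,M204V\n"))
regex = read_rows(io.StringIO("PMID,Mutation\n1,G145R\n2,L319R\n"))
for row in merge_results(rules, regex):
    print(row)
```

A mutation found by both modules appears once, with both sources recorded.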

Virus gene recognition

ViMRT has built a virus gene corpus from NCBI PubMed and the NCBI Gene database, and provides a Python script to identify virus genes in virus mutation sentences.

Virus gene corpus

ViMRT has collected the gene name lists of 7,194 viruses. Users can also add their own gene names to the gene_vocabulary.txt file in the genecorpus folder. In addition, ViMRT eliminates likely identification errors caused by short virus gene names (e.g., the S gene of HBV) using the gene_vocabulary_error.txt file in the genecorpus folder. Users can also add errors to this file according to their own needs.

Identifying virus gene

python Gene_Recognize.py -i [input] -o [output] -v [virus]

input: the user can provide input file (gene_example.txt). The input file format: PMID+"|pmid|"+sentence
output: the user can provide output folder path. The default path is the current path.
virus: the user can choose virus name, such as HBV, HBV;HPV, etc. The default parameter is "fullvirus"

Example: python Gene_Recognize.py -i ./gene_example.txt -o ./ViMRT_gene/ -v HBV

*Note: we recommend selecting one virus name, because the default parameter matches the genes of all viruses in turn, which takes a long time.
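
The input format and the role of the two vocabulary files can be sketched as follows. This is an illustration only: parse_line and find_genes are hypothetical helpers, and treating the error file as a list of names to skip is an assumption about how Gene_Recognize.py uses it.

```python
import re


def parse_line(line):
    """Split one input line of the form PMID|pmid|sentence."""
    pmid, separator, sentence = line.rstrip("\n").partition("|pmid|")
    if not separator:
        raise ValueError("expected PMID|pmid|sentence, got: %r" % line)
    return pmid, sentence


def find_genes(sentence, vocabulary, errors=()):
    """Return vocabulary entries that occur as whole tokens in the sentence."""
    hits = []
    for gene in vocabulary:
        if gene in errors:
            continue  # assumed: names in the error list are never reported
        # Lookarounds prevent matching inside a longer word (e.g. S in preS1).
        if re.search(r"(?<!\w)%s(?!\w)" % re.escape(gene), sentence):
            hits.append(gene)
    return hits


pmid, sentence = parse_line("36342236|pmid|The preS1 and X genes of HBV were mutated.")
print(pmid, find_genes(sentence, ["preS1", "X", "S"], errors={"S"}))
```

Looping over a single virus's vocabulary keeps the number of regular expression searches small, which is consistent with the recommendation to select one virus.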

Disease recognition

ViMRT uses Stanza, a Python NLP library for many human languages, to identify diseases in virus mutation sentences. For Stanza usage, refer to its GitHub page.

Installing stanza

pip install stanza

Downloading stanza disease models

If users are running the stanza pipeline for the first time, they need to download stanza disease models:

import stanza 
stanza.download('en', package='mimic', processors={'ner': 'bc5cdr'}, verbose=False)
stanza.download('en', package='mimic', processors={'ner': 'ncbi_disease'}, verbose=False)

Identifying disease

Disease corpus

ViMRT has built a disease corpus from the CTD database and uses a Python script to optimize the results of stanza. Users can also add their own disease names to the disease_vocabulary.txt file and add new errors to the disease_vocabulary_error.txt file in the diseasecorpus folder to remove identification errors made by stanza.

Identifying and optimizing disease

python Disease_Recognize.py -i [input] -o [output]

input: the user can provide input file (disease_example.txt). The input file format: PMID+"|pmid|"+sentence.
output: the user can provide output folder path. The default path is the current path.

Example: python Disease_Recognize.py -i ./disease_example.txt -o ./ViMRT_disease/ -v HBV
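
The optimization step can be pictured as a post-filter on the stanza output: drop mentions listed in the error file, then add vocabulary terms that stanza missed. The sketch below shows this under the assumption that both files are simple lists of names; optimize_diseases is a hypothetical helper, not the actual logic of Disease_Recognize.py.

```python
import re


def optimize_diseases(sentence, ner_mentions, vocabulary, errors):
    """Filter NER mentions by an error list and add missed vocabulary terms."""
    errors_lower = {error.lower() for error in errors}
    # Step 1: drop mentions that are known false positives.
    kept = [m for m in ner_mentions if m.lower() not in errors_lower]
    seen = {m.lower() for m in kept}
    # Step 2: add vocabulary terms found in the sentence (case-insensitive).
    lowered = sentence.lower()
    for term in vocabulary:
        key = term.lower()
        if key in seen or key in errors_lower:
            continue
        if re.search(r"(?<!\w)%s(?!\w)" % re.escape(key), lowered):
            kept.append(term)
            seen.add(key)
    return kept


sentence = "The G145R mutation was associated with hepatocellular carcinoma and escape."
print(optimize_diseases(sentence, ["escape"], ["Hepatocellular carcinoma"], ["escape"]))
```

In this made-up example, the false positive "escape" is removed and the vocabulary term is recovered.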

Our Paper

For a more detailed introduction, please read our article:
ViMRT: a text-mining tool and search engine for automated virus mutation recognition.
Bioinformatics. 2023 Jan 1;39(1):btac721 (PMID: 36342236).
