The Virus Multi-Semantic Annotation Dataset is an annotation dataset covering 8 viruses, with 38 external semantic types and 5 internal semantic types.

In-depth research into the characteristics of high-risk oncogenic viruses is important for the early prevention and control of related cancers and for the development of effective vaccines. Viral carcinogenesis involves numerous risk factors, including viral genomic variation, lifestyle, and environmental factors. Using natural language processing combined with expert knowledge, we built a large-scale, semantically rich corpus of viral carcinogenic factors from the literature on 8 oncogenic viruses (HPV, HIV, EBV, MCV, HCV, HTLV-1, HBV, and KSHV). The corpus comprises 551,706 abstracts and 2,973,169 entities. A semantic filter was developed to improve the performance of entity recognition. In addition, we collected transcriptomic data related to oncogenic viruses and performed differential gene expression analysis, feature gene identification, and immune microenvironment analysis on them.
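
The description does not say how the semantic filter works. As a rough illustration only, one common way to filter recognized entities is to keep those whose semantic type is in an allowed set and whose recognizer score clears a threshold. The sketch below assumes that reading; the `Entity` class, the `semantic_filter` function, the type names, and the threshold are all hypothetical and are not taken from the dataset or its pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A candidate entity produced by some entity recognizer (hypothetical schema)."""
    text: str
    semantic_type: str
    score: float


def semantic_filter(entities, allowed_types, min_score=0.5):
    """Keep entities whose semantic type is allowed and whose score is high enough.

    This is an assumed, simplified filter, not the dataset authors' method.
    """
    allowed = {t.lower() for t in allowed_types}
    return [
        e for e in entities
        if e.semantic_type.lower() in allowed and e.score >= min_score
    ]


# Made-up candidates; the type names are placeholders, not the dataset's
# 38 external or 5 internal semantic types.
candidates = [
    Entity("HPV16", "Virus", 0.97),
    Entity("smoking", "Lifestyle", 0.81),
    Entity("figure", "Unknown", 0.40),
    Entity("TP53", "Gene", 0.30),
]

kept = semantic_filter(candidates, {"Virus", "Lifestyle", "Gene"})
print([e.text for e in kept])  # prints ['HPV16', 'smoking']
```

Here "figure" is dropped because its type is not in the allowed set, and "TP53" is dropped because its score falls below the threshold. A real filter would likely draw on the dataset's own semantic types and on expert-curated rules.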