Virtual Screening with Machine Learning: A Substitute for HTS?

Undoubtedly, HTS (high-throughput screening) is a powerful method for library screening.1) Pharmaceutical companies have built and expanded their own libraries through internal synthesis and purchases from other institutes. In a huge physical library, however, quality control and storage costs are persistent problems. Hits from stored library samples often fail to reproduce their activity when the compound is repurified or resynthesized. We all have to pay attention to the stability of library compounds and keep their quality at a reasonable level.

As a compound screening method, virtual screening offers an opportunity to identify a set of bioactive molecules rapidly and cost-efficiently. The value of synthetically accessible virtual libraries is increasing in this age of big data utilization. We would like to mention that the PepMetics® virtual library consists of a readily accessible set of compounds based on our original scaffold.
The real issue with a virtual library is the accuracy and reliability of virtual screening for drug discovery, especially when the target is a new protein or other biomolecule. Owing to the improvement of machine learning methodologies,

Here we introduce a novel and flexible virtual screening platform named TAME-VS (Target-driven machine learning-enabled virtual screening), reported by groups at Harvard University.2) As its name suggests, the platform requires you to define a protein target by its UniProt ID. Its applicability was assessed retrospectively against 10 selected proteins. The paper also describes its potential in the early-to-middle stages of drug discovery.

TAME-VS consists of 7 modules, running from the input of the target ID of interest through to data processing. Let us briefly walk through the workflow of this virtual screening platform.

1. Target expansion
2. Compound retrieval
3. Vectorization
4. Machine learning model training
5. Virtual screening
6. Post-VS analysis
7. Data processing

Target expansion performs an extensive sequence homology search with BLAST so that similar proteins can be targeted simultaneously.
Compound retrieval extracts reported active and inactive compounds from ChEMBL, with a default cutoff of 1 µM for Ki, IC50 and EC50. Their records, including chemical identifiers such as SMILES strings and InChI keys, are grouped into active and inactive categories.
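
The grouping step of compound retrieval can be sketched in a few lines of Python. This is only an illustration, not the TAME-VS implementation: the record layout (`smiles`, `type`, `value_nm`), the function name `label_by_potency`, and the choice to count a value exactly at the cutoff as active are all assumptions made for the example. Only the 1 µM default and the three activity types come from the text above.

```python
# Minimal sketch of active/inactive grouping by a potency cutoff.
# Record layout and function name are hypothetical, not the TAME-VS API.

CUTOFF_NM = 1000.0                      # 1 uM expressed in nM
ACTIVITY_TYPES = {"Ki", "IC50", "EC50"}  # activity types named in the text


def label_by_potency(records, cutoff_nm=CUTOFF_NM):
    """Return (actives, inactives) as lists of SMILES strings.

    Records with an activity type outside ACTIVITY_TYPES are skipped.
    Assumption: a value equal to the cutoff is treated as active.
    """
    actives, inactives = [], []
    for rec in records:
        if rec["type"] not in ACTIVITY_TYPES:
            continue  # e.g. Kd or % inhibition is ignored in this sketch
        if rec["value_nm"] <= cutoff_nm:
            actives.append(rec["smiles"])
        else:
            inactives.append(rec["smiles"])
    return actives, inactives


# Made-up records purely for illustration.
example = [
    {"smiles": "CCO", "type": "IC50", "value_nm": 50.0},
    {"smiles": "CCC", "type": "EC50", "value_nm": 25000.0},
]
actives, inactives = label_by_potency(example)
print(len(actives), "active,", len(inactives), "inactive")
```

Passing a different `cutoff_nm` shows how the balance between the active and inactive classes shifts, which matters because that balance feeds directly into model training.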

Vectorization computes chemical fingerprints of four types: Morgan, AtomPair, Topological Torsion, and MACCS. The user can select one or more of them as needed.

Machine learning model training trains supervised models, random forest (RF) and multilayer perceptron (MLP), on the selected chemical fingerprints. TAME-VS provides default parameters for both RF and MLP and visualizes the training results as ROC (receiver operating characteristic) curves so that the quality and reliability of the models can be evaluated and compared quickly.
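
To make the ROC-based comparison concrete, the area under the ROC curve (AUC) can be computed directly from labels and predicted scores. The sketch below uses the pairwise definition of AUC (the probability that a randomly chosen active is scored above a randomly chosen inactive, with ties counted as one half). The function name `roc_auc` and the scores for the two models are invented for illustration and are not taken from the paper or from the TAME-VS code.

```python
# Minimal sketch: ROC AUC from labels (1 = active, 0 = inactive) and scores.
# Pairwise (Mann-Whitney) formulation; O(n_pos * n_neg), fine for small sets.


def roc_auc(labels, scores):
    """Return the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # active ranked above inactive
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels and scores from two models.
labels = [1, 1, 0, 0, 1, 0]
rf_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
mlp_scores = [0.9, 0.8, 0.1, 0.3, 0.6, 0.2]

print("RF  AUC:", round(roc_auc(labels, rf_scores), 3))
print("MLP AUC:", round(roc_auc(labels, mlp_scores), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of actives from inactives, so the number gives a quick single-figure summary of each curve.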

Virtual screening applies the trained model to the input virtual library compounds. The prediction results are provided together with scores for drug-likeness and physicochemical properties.
Post-VS analysis supports evaluation of the virtual screening results. In this module the predicted scores are visualized and can be exported as a CSV file together with the calculated properties: QED (quantitative estimate of drug-likeness)3), MW (molecular weight), LogP, number of hydrogen bond donors, and number of rotatable bonds.

Data processing reports the top virtual hits by ensemble ranking calculation. The top 1% of virtual hit compounds are selected from the virtual screening results and summarized automatically.
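
One simple way to picture an ensemble ranking is to average each compound's rank across the trained models and keep the best 1%. The sketch below does exactly that. Note that rank averaging is an assumption chosen for illustration (the text above does not specify the formula TAME-VS uses), and the names `rank_descending` and `ensemble_top_hits` and the scores are hypothetical.

```python
# Minimal sketch of ensemble ranking by mean rank across models.
# Assumption: rank averaging; TAME-VS may combine model outputs differently.
import math


def rank_descending(scores):
    """Map compound id -> rank (1 = highest score).

    Ties are broken by insertion order because sorted() is stable.
    """
    ordered = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    return {cid: i + 1 for i, cid in enumerate(ordered)}


def ensemble_top_hits(model_scores, fraction=0.01):
    """Return the top `fraction` of compounds by mean rank over all models.

    model_scores: dict of model name -> {compound id: predicted score};
    every model is assumed to have scored the same set of compounds.
    """
    ranks = [rank_descending(s) for s in model_scores.values()]
    compounds = list(ranks[0])
    mean_rank = {c: sum(r[c] for r in ranks) / len(ranks) for c in compounds}
    n_hits = max(1, math.ceil(len(compounds) * fraction))  # at least one hit
    return sorted(compounds, key=lambda c: mean_rank[c])[:n_hits]


# Made-up scores for a 200-compound library from two models.
rf = {f"cpd{i}": 1.0 - i / 200 for i in range(200)}
mlp = dict(rf)
mlp["cpd0"], mlp["cpd1"] = rf["cpd1"], rf["cpd0"]  # models disagree on the top two

print(ensemble_top_hits({"RF": rf, "MLP": mlp}))
```

With 200 compounds and the default fraction, two compounds are returned. Averaging ranks rather than raw scores avoids having to calibrate the RF and MLP outputs against each other.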

On the surface, the advantage of TAME-VS is its utility as a comprehensive, target-oriented virtual screening platform. But it also offers modular utility because the seven modules work independently. For instance, you can use the vectorization module to convert a compound list into the four chemical fingerprints, or the compound retrieval module to find reported molecules and assay results for a list of target proteins. The platform therefore works not only as a virtual screening platform in its own right but also as support for your own virtual screening platform.

The authors’ approach to platform construction enables us to optimize the virtual screening methodology according to the nature of the virtual library. It could be applied to virtual library expansion in a retrospective manner in combination with deep generative models. It will be interesting to try using TAME-VS in a modular manner to raise prediction accuracy and productivity in drug discovery.

