Descriptor importance of HIV-1 protease crystal structures for QSAR using random forest
Gene M. Ko1, email@example.com, A. Srinivas Reddy2, firstname.lastname@example.org, Sunil Kumar3, email@example.com, and Rajni Garg1, firstname.lastname@example.org. (1) Computational Science Research Center, San Diego State University, 5500 Campanile Drive, San Diego, CA 92182-1245, (2) Electrical and Computer Engineering Department, San Diego State University, 5500 Campanile Drive, C/O Sunil Kumar, San Diego, CA 92182-1309, (3) Electrical and Computer Engineering Department, San Diego State University, 5500 Campanile Drive, San Diego, CA 92182-1309
Random forest (RF) is a machine learning classifier consisting of a collection of unpruned classification trees, each grown on a bootstrap sample of the data with a random subset of features considered at every split. Unlike many other machine learning techniques, RF provides an importance measure for every variable in the dataset. We studied the crystal structures of 62 HIV-1 protease binding pockets deposited in the Protein Data Bank, each complexed with one of the nine FDA-approved protease inhibitors. A quantitative understanding of the nature of these binding pockets would help guide the design of novel HIV-1 protease inhibitors. Descriptors were computed for the binding pocket of each crystal structure, yielding 462 constitutional, topological, geometric, electrostatic, and quantum mechanical descriptors that can be used to derive a quantitative structure-activity relationship (QSAR). With the default number of descriptors sampled at each split (mtry) of 21, the optimal number of trees (ntree) was determined to be 334, giving an out-of-bag error of 45.2%. Varying mtry while keeping 334 trees consistently placed the same descriptors in the top-ranked group of features, which indicates that the importance ranking is stable. The top-ranked descriptors will be used to derive a QSAR model for bioactivity prediction.
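The reported mtry of 21 and the out-of-bag error estimate both follow from simple arithmetic on the dataset dimensions, which the short sketch below illustrates. It assumes the common classification default of mtry = floor(sqrt(p)); the abstract does not name the software used, and the function names here are hypothetical and chosen only for illustration.

```python
import math
import random


def default_mtry(n_descriptors: int) -> int:
    """Assumed classification default: floor(sqrt(p)) descriptors per split."""
    return math.isqrt(n_descriptors)


def expected_oob_fraction(n_samples: int) -> float:
    """Probability that a given sample is left out of one bootstrap sample.

    Each of the n draws misses a particular sample with probability 1 - 1/n,
    so the sample is out-of-bag with probability (1 - 1/n) ** n.
    """
    return (1.0 - 1.0 / n_samples) ** n_samples


def bootstrap_split(n_samples: int, rng: random.Random):
    """Draw one bootstrap sample; return (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


if __name__ == "__main__":
    n_structures = 62    # crystal structures in the study
    n_descriptors = 462  # descriptors per binding pocket

    print("default mtry:", default_mtry(n_descriptors))
    print("expected OOB fraction per tree:",
          round(expected_oob_fraction(n_structures), 3))

    # Fixed seed is an illustrative choice so the split is repeatable.
    in_bag, oob = bootstrap_split(n_structures, random.Random(0))
    print("OOB structures for one tree:", len(oob), "of", n_structures)
```

With 62 structures, roughly a third of them are expected to be out-of-bag for any single tree, so each structure is scored only by the trees that never saw it. This is what lets the out-of-bag error serve as an internal validation estimate without a separate test set.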