Dong-Sheng Cao

SUMMARY: Sequence-derived structural and physicochemical features have been frequently used for analysing and predicting structural, functional, expression and interaction profiles of proteins and peptides. To facilitate extensive studies of proteins and peptides, we developed a freely available, open-source Python package called protein in Python (propy) for ...
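As an illustration of what a sequence-derived feature can look like, the sketch below computes amino acid composition, one of the simplest descriptors of this kind. It is plain Python and does not use propy; the function name `amino_acid_composition` and the example sequence are assumptions made for illustration only.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a sequence.

    Hypothetical helper for illustration; not part of propy.
    Non-standard characters are ignored when computing fractions.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Made-up example sequence: 3 x A, 1 x C, 1 x D, 1 x K.
composition = amino_acid_composition("ACDAAK")
print(composition["A"])  # fraction of alanine
```

The result is a fixed-length vector of 20 numbers regardless of sequence length, which is what makes descriptors of this type convenient inputs for predictive models.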
Data mining is a technique for extracting hidden knowledge from data. Among the several data mining methods, classification is especially useful for decision making in medical diagnosis. In this study, a hybrid approach, a CART decision tree classifier combined with feature selection and a boosting ensemble method, has been considered to evaluate the ...
Traditional Chinese medicine (TCM) has unique therapeutic effects for complex chronic diseases. However, owing to the lack of an effective systematic approach, research progress on its effective substances and pharmacological mechanisms of action has been very slow. In this paper, by incorporating network biology, bioinformatics and chemoinformatics methods, ...
Amino acid sequence-derived structural and physicochemical descriptors are extensively utilized for the research of structural, functional, expression and interaction profiles of proteins and peptides. We developed protr, a comprehensive R package for generating various numerical representation schemes of proteins and peptides from amino acid ...
A crucial step in building a high-performance QSAR/QSPR model is the detection of outliers. Detecting outliers in a multivariate point cloud is not trivial, especially when several outliers coexist. The classical identification methods do not always identify them, because they are based on the sample mean and covariance matrix ...
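The weakness of mean-based outlier detection when several outliers coexist can be seen in a small univariate sketch. This is not the method of the paper, whose description is truncated here; it only contrasts a classical z-score with a robust score based on the median and the median absolute deviation (MAD). The data values, the threshold of 3, and the function names are assumptions chosen for illustration.

```python
import statistics


def classical_outliers(values, threshold=3.0):
    """Flag points whose z-score (sample mean and standard deviation) is large."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


def robust_outliers(values, threshold=3.0):
    """Flag points using the median and the scaled MAD instead of mean and std."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    scale = 1.4826 * mad  # makes the MAD comparable to a std under normality
    return [i for i, v in enumerate(values) if abs(v - median) / scale > threshold]


# Made-up data: six ordinary points and two coexisting outliers.
data = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 50.0, 52.0]

classical = classical_outliers(data)
robust = robust_outliers(data)
print("classical:", classical)
print("robust:   ", robust)
```

In this example the two large values inflate the sample mean and standard deviation enough to mask each other, so the classical score flags nothing, while the median-based score is barely affected by them.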
In chemoinformatics and bioinformatics, one of the main computational challenges in predictive modeling is to find a suitable way to effectively represent the molecules under investigation, such as small molecules, proteins and even complex interactions. To solve this problem, we developed a freely available R/Bioconductor package, ...
The aqueous solubility of drug compounds plays a very important role in drug research and development. In this study, we have collected 225 diverse drug-like molecules with accurate aqueous solubility. Three commonly used methods, namely partial least squares (PLS), back-propagation network (BPN) and support vector regression (SVR), were employed to model ...
To build a credible model for given chemical, biological or clinical data, it may be helpful to first gain better insight into the data itself before modeling, and then to present statistically stable results derived from a large number of sub-models established on a single dataset with the aid of Monte Carlo sampling (MCS). In the present ...
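The idea of deriving many sub-models from one dataset by Monte Carlo sampling can be sketched as repeated random train/test splits. The example below is a minimal illustration under stated assumptions, not the procedure of the paper: the synthetic data, the simple least-squares line, the 70% training fraction, and all function names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def monte_carlo_submodels(xs, ys, n_models=200, train_fraction=0.7, seed=0):
    """Fit one sub-model per random split; collect slopes and test errors."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_train = int(train_fraction * len(xs))
    slopes, test_errors = [], []
    for _ in range(n_models):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = statistics.mean(
            (ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test
        )
        slopes.append(slope)
        test_errors.append(mse)
    return slopes, test_errors


# Synthetic data (assumed for illustration): y = 2x + 1 plus Gaussian noise.
noise = random.Random(42)
xs = [float(x) for x in range(20)]
ys = [2.0 * x + 1.0 + noise.gauss(0.0, 1.0) for x in xs]

slopes, test_errors = monte_carlo_submodels(xs, ys)
print("mean slope:", statistics.mean(slopes))
print("slope spread (std):", statistics.stdev(slopes))
print("mean test MSE:", statistics.mean(test_errors))
```

Looking at the spread of the slopes and test errors across sub-models, rather than at a single fit, gives a rough sense of how stable the result is with respect to the choice of training samples.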
BACKGROUND: Molecular descriptors and fingerprints have been routinely used in QSAR/SAR analysis, virtual drug screening, compound search/ranking, drug ADME/T prediction and other drug discovery processes. Since the calculation of such quantitative representations of molecules may require substantial computational skill and effort, several tools have been ...