Fastq_clean: An optimized pipeline to clean the Illumina sequencing data with quality control
An optimized pipeline, Fastq_clean, is presented to clean DNA-seq and RNA-seq data from Illumina sequencers; it can also be used to clean NGS data from other sequencers, but needs some modification to reach the best performance.
On the prediction of DNA-binding proteins only from primary sequences: A deep learning approach
A deep-learning-based method to identify DNA-binding proteins from primary sequences alone. It uses two stages of convolutional neural network to detect the function domains of protein sequences, a long short-term memory neural network to identify their long-term dependencies, and a binary cross entropy to evaluate the quality of the neural networks.
Multi-scale encoding of amino acid sequences for predicting protein interactions using gradient boosting decision tree
A method for predicting protein interactions that makes full use of the physicochemical characteristics of amino acids, using a gradient boosting decision tree and a multi-scale feature representation scheme; it might be a useful tool for future proteomics studies.
Deep Neural Network Based Predictions of Protein Interactions Using Primary Sequences
A deep neural network framework (DNN-PPI) for predicting PPIs using features learned automatically only from protein primary sequences has remarkable generalization and is a promising tool for identifying protein interactions.
A Model Stacking Framework for Identifying DNA Binding Proteins by Orchestrating Multi-View Features and Classifiers
A model stacking framework that orchestrates multi-view features and classifiers (MSFBinder) is used to investigate how to integrate and evaluate loosely coupled models for predicting DNA-binding proteins; the results suggest that the framework can orchestrate various predictive models flexibly with good performance.
Learning Bayesian Network Structure from Distributed Homogeneous Data
An algorithm, parallel three-phase dependency analysis (P-TPDA), for learning the structure of a Bayesian network from distributed homogeneous datasets, each of which has the same variables.
On the PAC-Bayes Bound Calculation based on Reproducing Kernel Hilbert Space
The PAC-Bayes risk bound, which combines Bayesian theory and structural risk minimization for stochastic classifiers, has been considered a framework for deriving some of the tightest generalization bounds.
BAAQ: An Infrastructure for Application Integration and Knowledge Discovery in Bioinformatics
This paper addresses two issues in building grid applications in bioinformatics: how to smoothly compose an analysis workflow using heterogeneous resources and how to efficiently discover and re-use available resources in the grid community.
An Improved Algorithm for K-anonymity
An improved algorithm based on OLA (Optimal Lattice Anonymization) that introduces the concept of support from data mining and augments the generalization hierarchy with support information, so that there is no need to scan the entire data table repeatedly and all k-anonymous nodes are found more efficiently.
On Service Discovery for Online Data Mining Trails
A service discovery system is proposed to tackle complex data mining requirements in a grid-enabled online data mining environment; it adopts state-of-the-art open standards to represent data mining algorithms and models.