Prediction of protein solubility in Escherichia coli using logistic regression

@article{Diaz2010PredictionOP,
  title={Prediction of protein solubility in Escherichia coli using logistic regression},
  author={Armando A. Diaz and Emanuele Tomba and Reese Lennarson and Rex Richard and Miguel J. Bagajewicz and R. G. Harrison},
  journal={Biotechnology and Bioengineering},
  year={2010},
  volume={105}
}
In this article we present a new and more accurate model for the prediction of the solubility of proteins overexpressed in the bacterium Escherichia coli. The model uses the statistical technique of logistic regression. To build this model, 32 parameters that could potentially correlate well with solubility were used. In addition, the protein database was expanded compared to those used previously. We tested several different implementations of logistic regression with varied results. The best… 
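As a rough illustration of the technique named in the abstract, the sketch below fits a binomial logistic regression to a handful of synthetic points using plain batch gradient descent. It is not the authors' model: the two descriptors, the toy values, and the function names `fit_logistic` and `predict_proba` are assumptions made for illustration only, and the 32 parameters used in the article are not reproduced here.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, x):
    # w[0] is the bias term; w[1:] are the feature weights.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the mean log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi  # derivative of log-loss w.r.t. the logit
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


# Synthetic toy data (assumption): two made-up, pre-scaled descriptors per protein.
# Label 1 = soluble, 0 = insoluble. These are NOT values from the article.
X = [[0.9, 0.2], [0.8, 0.1], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
y = [1, 1, 1, 0, 0, 0]

weights = fit_logistic(X, y)
for features in ([0.85, 0.15], [0.15, 0.85]):
    print(features, round(predict_proba(weights, features), 3))
```

The output of `predict_proba` is a probability between 0 and 1; thresholding it (for example at 0.5) yields the binary soluble/insoluble call. In practice a library implementation with regularization and cross-validation would normally replace this hand-rolled loop.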
Predicting the solubility of recombinant proteins in Escherichia coli.
TLDR
A statistical model that uses binomial logistic regression for predicting the solubility of heterologous proteins expressed in E. coli in either soluble or insoluble form is described.
Solubility-Weighted Index: fast and accurate prediction of protein solubility
TLDR
It is discovered that global structural flexibility, which can be modeled by normalized B-factors, accurately predicts the solubility of 12 216 recombinant proteins expressed in Escherichia coli, and a new predictor is called the ‘Solubility-Weighted Index’ (SWI).
Solubility-Weighted Index: fast and accurate prediction of protein solubility
TLDR
The ‘SoDoPE’ (Soluble Domain for Protein Expression), a web interface that allows users to choose a protein region of interest for predicting and maximising both protein expression and solubility, is developed and outperforms many existing protein solubility prediction tools.
Protein solubility is controlled by global structural flexibility
TLDR
This work has discovered that global structural flexibility, which can be modeled by normalised B-factors, accurately predicts the solubility of 12,216 recombinant proteins expressed in Escherichia coli.
SoluProt: prediction of soluble protein expression in Escherichia coli
TLDR
A new tool for sequence-based prediction of soluble protein expression in E. coli, SoluProt, was created using the gradient boosting machine technique with the TargetTrack database as a training set, and its accuracy and AUC exceeded those of a suite of alternative solubility prediction tools.
Prediction of soluble heterologous protein expression levels in Escherichia coli from sequence-based features and its potential in biopharmaceutical process development
TLDR
The potential utility of this emergent technology to increase the efficiency of BD strategies and thereby to reduce the cost of establishing a process for soluble protein expression is critically examined.
Codon usage clusters correlation: towards protein solubility prediction in heterologous expression systems in E. coli
TLDR
A strong positive correlation between solubility and the degree of conservation of codon usage clusters is observed for two independent datasets and supports the notion that codon usage may dictate translation rate and modulate co-translational folding.
Improve Protein Solubility and Activity based on Machine Learning Models
TLDR
It is demonstrated that an optimization methodology based on a machine learning prediction model can effectively predict which peptide tags can improve protein solubility quantitatively and provides a valuable tool for understanding the correlation between amino acid sequence and protein solubility and for engineering protein biocatalysts.
Develop machine learning based predictive models for engineering protein solubility
TLDR
A novel approach that predicted protein solubility in continuous numerical values instead of binary ones was implemented, which enabled researchers to choose proteins with higher predicted solubility for experimental validation, while binary values fail to distinguish proteins with the same value.
Develop machine learning-based regression predictive models for engineering protein solubility
TLDR
A novel approach that predicted protein solubility in continuous numerical values instead of binary ones was implemented, and can be used as a template for analysis of other expression and solubility datasets.

References

Predicting the Solubility of Recombinant Proteins in Escherichia coli
TLDR
The cause of inclusion body formation in Escherichia coli grown at 37°C is studied using statistical analysis of the composition of 81 proteins that do and do not form inclusion bodies, using composition-derived parameters as the basis for the prediction.
Prediction of protein solubility in E. coli
TLDR
This work presents a framework that creates models of solubility from sequence information from the primary protein sequences of the genes to be synthesized, and provides the biologist with a comprehensive comparison between different learning algorithms, and insightful feature analysis.
A support vector machine-based method for predicting the propensity of a protein to be soluble or to form inclusion body on overexpression in Escherichia coli
TLDR
Six physicochemical properties together with residue and dipeptide compositions have been used to develop a support vector machine-based classifier to predict the overexpression status in E. coli, and it performs reasonably well in predicting the propensity of a protein to be soluble or to form inclusion bodies.
Understanding the relationship between the primary structure of proteins and its propensity to be soluble on overexpression in Escherichia coli
TLDR
Thermostability, in vivo half-life, Asn, Thr, and Tyr content, and tripeptide composition of a protein are correlated to the propensity of a protein to be soluble on overexpression in E. coli.
New fusion protein systems designed to give soluble expression in Escherichia coli.
Three native E. coli proteins-NusA, GrpE, and bacterioferritin (BFR)-were studied in fusion proteins expressed in E. coli for their ability to confer solubility on a target insoluble protein at the
Production of soluble mammalian proteins in Escherichia coli: identification of protein features that correlate with successful expression
TLDR
Analysis of the protein features identified here will help predict which mammalian proteins and domains can be successfully expressed in E. coli as soluble product and also which are best targeted for a eukaryotic expression system.
Kinetic partitioning of protein folding and aggregation
TLDR
Dissection of the protein into six peptides corresponding to different regions of the sequence indicates that the kinetic partitioning between aggregation and folding can be attributed to the intrinsic conformational preferences of the denatured polypeptide chain.
Toward High-Resolution de Novo Structure Prediction for Small Proteins
TLDR
The prediction of protein structure from amino acid sequence is a grand challenge of computational molecular biology and the primary bottleneck to consistent high-resolution prediction appears to be conformational sampling.
Recombinant protein folding and misfolding in Escherichia coli
The past 20 years have seen enormous progress in the understanding of the mechanisms used by the enteric bacterium Escherichia coli to promote protein folding, support protein translocation and
Formation of Soluble Recombinant Proteins in Escherichia Coli is Favored by Lower Growth Temperature
TLDR
Lysates of non-transformed E. coli grown at either temperature rendered initially soluble human recombinant IFN-α2 insoluble at 37° but not at 0° or 30°C, and insolubilization was not abolished by nuclease treatment.