Background: Protein secondary structure prediction is a fundamental task in computational biology, basic to many bioinformatics workflows, with a diverse collection of tools currently available. An approach from machine learning with the potential to capitalize on such a collection is ensemble prediction, which runs multiple predictors and combines their predictions into a single output.
Accuracy estimation: Accuracy estimators take an output of the prediction procedure and map it to an estimated accuracy of the prediction. Ssylla provides accuracy estimators for state-of-the-art tools PSIPRED, JPred, SSpro, Porter, DeepCNF, and Nnessy.
Hybrid approach: Ssylla also provides a hybrid approach, which combines the template-based tool Nnessy with the template-free tool Porter. This hybrid approach takes the output prediction from Nnessy and compares the estimated accuracy of the prediction to a threshold. If this estimated accuracy exceeds the threshold, Nnessy's prediction is returned. If not, Porter's prediction is returned.
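The hybrid decision rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Ssylla's actual code or interface: the function name `hybrid_predict`, its parameters, and the representation of a prediction as a string of state letters are all hypothetical, and the two predictors and the accuracy estimator are passed in as plain callables standing in for Nnessy, Porter, and the Nnessy accuracy estimator.

```python
from typing import Callable

# Hypothetical types: a predictor maps a protein sequence to a string of
# secondary-structure states; an estimator maps a prediction to an
# estimated accuracy (assumed here to be a float in [0, 1]).
Predictor = Callable[[str], str]
AccuracyEstimator = Callable[[str], float]


def hybrid_predict(
    sequence: str,
    template_based: Predictor,        # stand-in for Nnessy
    template_free: Predictor,         # stand-in for Porter
    estimate_accuracy: AccuracyEstimator,
    threshold: float,
) -> str:
    """Return the template-based prediction when its estimated accuracy
    exceeds the threshold; otherwise return the template-free prediction."""
    primary = template_based(sequence)
    if estimate_accuracy(primary) > threshold:
        return primary
    # The fallback predictor is only run when the primary one is rejected.
    return template_free(sequence)


if __name__ == "__main__":
    # Toy stand-ins, for illustration only: they do not model real tools.
    def toy_template_based(seq: str) -> str:
        return "H" * len(seq)

    def toy_template_free(seq: str) -> str:
        return "C" * len(seq)

    def confident(pred: str) -> float:
        return 0.9

    def doubtful(pred: str) -> float:
        return 0.5

    print(hybrid_predict("MKVL", toy_template_based, toy_template_free, confident, 0.8))
    print(hybrid_predict("MKVL", toy_template_based, toy_template_free, doubtful, 0.8))
```

In this sketch the comparison is strict, so an estimated accuracy exactly equal to the threshold falls through to the template-free predictor, matching the wording "exceeds the threshold" above. How the threshold is chosen and how the estimator is computed are not specified here and are left outside the sketch.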
Results: On average over standard CASP and PDB benchmarks, the hybrid exceeds the state-of-the-art 3-state accuracy by nearly 4%, and exceeds the 8-state accuracy by more than 8%. In a careful study of ensemble methods, the hybrid is the most accurate.
[Source Code on GitHub]
Citation: Noteworthy uses of Ssylla should cite the following publication: Spencer Krieger and John Kececioglu, "Predicting protein secondary structure by an ensemble through feature-based accuracy estimation," ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2020).
Funding: Research supported by the US National Science Foundation through grant CCF-1617192.