Ssylla: Protein secondary structure prediction by an ensemble leveraging accuracy estimation

Spencer Krieger and John Kececioglu
July, 2020

Overview

Protein secondary structure prediction is a fundamental task in computational biology, basic to many bioinformatics workflows, with a diverse collection of tools currently available. An approach from machine learning with the potential to capitalize on such a collection is ensemble prediction, which combines multiple predictions into a single output prediction. Ssylla combines predictions from several different prediction methods through the use of accuracy estimation. We evaluated several ensemble methods, and the most accurate ensemble, implemented in Ssylla, is a hybrid of the state-of-the-art tools Nnessy and Porter.

Methods

Ssylla contains source code for accuracy estimators for several state-of-the-art methods for protein secondary structure prediction, and for a hybrid ensemble of Nnessy and Porter that surpasses the accuracy of any single tool on standard benchmark datasets. In the same repository, we also provide the evaluation datasets referenced in our paper. Below is a brief description of the methods implemented in Ssylla, including how we score predictions using accuracy estimation and how we combine two tools into a hybrid ensemble. More detailed descriptions are available on pages linked in the menu, as well as in the publication linked below.

Accuracy estimation
Accuracy estimators take an output of the prediction procedure and map it to an estimated accuracy of the prediction. Ssylla provides accuracy estimators for state-of-the-art tools PSIPRED, JPred, SSpro, Porter, DeepCNF, and Nnessy.
Hybrid approach
Ssylla also provides a hybrid approach, which combines the template-based tool Nnessy with the template-free tool Porter. This hybrid approach takes the output prediction from Nnessy and compares the estimated accuracy of the prediction to a threshold. If this estimated accuracy exceeds the threshold, Nnessy's prediction is returned. If not, Porter's prediction is returned.
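The selection rule described above can be sketched in a few lines. This is only an illustrative sketch, not the code shipped with Ssylla: the function name `hybrid_predict`, its parameters, and the representation of a prediction as a string of state letters are all assumptions made for the example, and the two predictors and the accuracy estimator are passed in as plain callables. The real estimator may consume more of the predictor's output than the final state string.

```python
from typing import Callable

# Assumption for this sketch: a prediction is a string with one
# secondary-structure state letter per residue.
Prediction = str


def hybrid_predict(
    sequence: str,
    nnessy_predict: Callable[[str], Prediction],
    porter_predict: Callable[[str], Prediction],
    estimate_nnessy_accuracy: Callable[[Prediction], float],
    threshold: float,
) -> Prediction:
    """Return Nnessy's prediction if its estimated accuracy exceeds
    the threshold; otherwise fall back to Porter's prediction.

    All parameter names are hypothetical stand-ins for the real tools.
    """
    nnessy_prediction = nnessy_predict(sequence)
    if estimate_nnessy_accuracy(nnessy_prediction) > threshold:
        return nnessy_prediction
    return porter_predict(sequence)


if __name__ == "__main__":
    # Toy stand-ins for the real tools, for demonstration only.
    toy_nnessy = lambda seq: "H" * len(seq)
    toy_porter = lambda seq: "C" * len(seq)
    confident = lambda prediction: 0.9
    print(hybrid_predict("ACDE", toy_nnessy, toy_porter, confident, 0.8))
```

In this sketch Porter is only run when Nnessy's prediction is rejected, and an estimated accuracy exactly equal to the threshold falls through to Porter, matching the "exceeds the threshold" wording above.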
Results
On average over standard CASP and PDB benchmarks, the hybrid exceeds the state-of-the-art 3-state accuracy by nearly 4%, and the state-of-the-art 8-state accuracy by more than 8%.
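For readers unfamiliar with the measure, per-residue accuracy is commonly computed as the fraction of residues whose predicted state matches the true state. The sketch below shows that computation for a single protein. The function name `state_accuracy` is hypothetical and the code is not taken from Ssylla; how the reported figures are averaged across proteins and benchmarks is not specified here, so this illustrates only the basic per-protein quantity.

```python
def state_accuracy(predicted: str, true: str) -> float:
    """Fraction of residues whose predicted state equals the true state.

    Works for any state alphabet (3-state or 8-state), as long as both
    strings use the same alphabet and have one letter per residue.
    """
    if len(predicted) != len(true):
        raise ValueError("predicted and true must have the same length")
    if not true:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == t for p, t in zip(predicted, true))
    return matches / len(true)


if __name__ == "__main__":
    # 5 of the 6 residues agree, so the accuracy is 5/6.
    print(state_accuracy("HHHECC", "HHEECC"))
```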

Publication

The methods implemented in Ssylla are described in the following publication, which should be cited in any noteworthy use of Ssylla:

Spencer Krieger and John Kececioglu, “Predicting protein secondary structure by an ensemble through feature-based accuracy estimation.” Proceedings of the 11th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 29, 1-10, 2020.

Source code

Source code for the accuracy estimators for state-of-the-art tools and for the hybrid ensemble of Nnessy and Porter, along with documentation, is available on GitHub.
Source code for our state-of-the-art tool for protein secondary structure prediction, Nnessy, is also available.

Video

The following video was presented at ACM-BCB 2020 and gives more detailed information on the methods implemented in Ssylla: