Accuracy estimation

Overview

Several of our tested ensemble variants, including the most accurate one, hybrid selection, use estimated accuracy to combine the output predictions of the tools in the ensemble. Feature-based accuracy estimation is a technique that was originally developed for estimating the unknown true accuracy of protein multiple sequence alignments. Applying it to protein secondary structure prediction differs in two ways. First, we need an estimate of the accuracy of the prediction at each residue of the protein, rather than a single estimate for an entire alignment. Second, to get a good estimator we need to leverage additional information that is internal to the prediction method, rather than basing the estimate solely on the external structure prediction itself.


Methods

Here we explain the features we use and how we combine these features to get the final accuracy estimate.

Features
Because we use only the output of the prediction procedure as input to our accuracy estimation, we use just two features: residue confidence and template similarity. Tools that provide per-residue confidence values for their structure prediction typically output an integer from {0, 1, ..., 9}, where a larger value reflects greater confidence in the residue’s predicted state. Tools that do not directly provide confidences often output estimated structure class membership probabilities at each residue instead. These confidence values or class probabilities must be normalized so that they can be compared between tools.
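As an illustration of what such a normalization could look like, the sketch below maps both kinds of tool output onto the interval [0, 1]. The function names and the specific rescaling rules (dividing integer confidences by 9, and rescaling the largest class probability so that a uniform distribution maps to 0) are assumptions made for this example; they are not taken from Ssylla.

```python
from typing import Sequence


def normalize_integer_confidence(value: int, max_value: int = 9) -> float:
    """Map an integer confidence in {0, ..., max_value} onto [0, 1].

    Assumption for illustration: a simple linear rescaling.
    """
    if not 0 <= value <= max_value:
        raise ValueError(f"confidence {value} outside 0..{max_value}")
    return value / max_value


def normalize_class_probabilities(probs: Sequence[float]) -> float:
    """Turn class membership probabilities into a confidence in [0, 1].

    Assumption for illustration: take the largest probability (the
    predicted state) and rescale it so that a uniform distribution
    gives 0 and a fully certain prediction gives 1.
    """
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two structure classes")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    top = max(probs) / total  # renormalize in case of rounding
    return (top - 1.0 / k) / (1.0 - 1.0 / k)
```

With both functions producing values on the same scale, a confidence of 9 from one tool and a class probability of 1.0 from another are both treated as maximal confidence.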
For template-based tools, we measure the similarity of the template database match to the query sequence coming from the input protein around the residue position. The similarity between these two portions of amino acid sequence from the template match and the query sequence is measured either by average percent identity (the fraction of amino acids that agree) or by average word distance (the average substitution dissimilarity score between the corresponding amino acids).
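The percent identity variant of template similarity can be sketched as follows. This is one plausible reading only: it assumes the query and template match are given as equal-length aligned strings with "-" for gaps, and that "around the residue position" means a symmetric window. The function name and the half_width parameter are hypothetical.

```python
def windowed_percent_identity(
    query: str, template: str, position: int, half_width: int = 3
) -> float:
    """Fraction of agreeing amino acids in a window around `position`.

    Assumptions for illustration: `query` and `template` are aligned
    strings of equal length, "-" marks a gap, and the window is clipped
    at the ends of the sequence.
    """
    if len(query) != len(template):
        raise ValueError("aligned sequences must have equal length")
    if not 0 <= position < len(query):
        raise ValueError("position outside the sequence")
    start = max(0, position - half_width)
    end = min(len(query), position + half_width + 1)
    # Count columns where both sequences carry the same amino acid.
    matches = sum(
        1
        for q, t in zip(query[start:end], template[start:end])
        if q == t and q != "-"
    )
    return matches / (end - start)
```

The word distance variant would follow the same windowing but average a substitution dissimilarity score over the window instead of counting exact matches.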
Final estimator
The final estimator uses a two-step process: each individual feature is first mapped via a transformation into an initial accuracy estimator, and these initial estimators are then linearly combined into the final accuracy estimator.
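The two-step structure can be sketched as below. The form of the per-feature transformation is not specified here, so the sketch assumes a piecewise-constant lookup (each feature value is mapped to an accuracy associated with its bin); the bin edges, bin accuracies, and weights are all hypothetical placeholders, not values from Ssylla.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def make_binned_transform(
    bin_edges: Sequence[float], bin_accuracies: Sequence[float]
) -> Callable[[float], float]:
    """Step 1: build a transformation from a feature value to an
    initial accuracy estimate.

    Assumption for illustration: sorted `bin_edges` split the feature
    range into len(bin_edges) + 1 bins, each with a fixed accuracy.
    """
    if len(bin_accuracies) != len(bin_edges) + 1:
        raise ValueError("need one accuracy per bin")

    def transform(value: float) -> float:
        return bin_accuracies[bisect_right(bin_edges, value)]

    return transform


def combine_estimates(
    initial_estimates: Sequence[float], weights: Sequence[float]
) -> float:
    """Step 2: linearly combine the initial estimators."""
    if len(initial_estimates) != len(weights):
        raise ValueError("need one weight per initial estimate")
    return sum(w * e for w, e in zip(weights, initial_estimates))


# Hypothetical transformations for the two features.
confidence_to_accuracy = make_binned_transform([0.5], [0.6, 0.9])
similarity_to_accuracy = make_binned_transform([0.3], [0.5, 0.8])

# Per-residue estimate from a normalized confidence of 0.7 and a
# template similarity of 0.2, with hypothetical equal weights.
estimate = combine_estimates(
    [confidence_to_accuracy(0.7), similarity_to_accuracy(0.2)],
    [0.5, 0.5],
)
```

Keeping the two steps separate means each feature's transformation can be fitted on its own, while the weights of the linear combination are chosen afterwards.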


Publication

The methods implemented in Ssylla are described in the following publication, which should be cited in any noteworthy use of Ssylla:
  • Spencer Krieger and John Kececioglu, “Predicting protein secondary structure by an ensemble through feature-based accuracy estimation.” Proceedings of the 11th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), 29, 1-10, 2020.


Video

The following video was presented at ACM-BCB 2020 and gives more detailed information on the accuracy estimators implemented in Ssylla: