  1. What is Rhapsody?

    Rhapsody is a machine learning tool for predicting the impact of amino acid substitutions in proteins. It consists of a random forest classifier trained not only on traditional conservation properties, but also on structural and dynamical properties of the mutation site, localized on the protein's PDB structure, and coevolution properties, extracted from Pfam sequence alignments.

  2. What kind of variants can Rhapsody analyze?

    Rhapsody can provide predictions for Single Amino acid Variants (SAVs) in human proteins for which PDB structures are available.

  3. Why only human SAVs?

    Because Rhapsody derives sequence conservation properties from PolyPhen-2, which is designed to work only for human SAVs.

  4. What are the accepted input formats?

    Rhapsody only accepts SAVs in Uniprot coordinates, with the format:
    <Uniprot ID> <position> <wild-type aa> <mutated aa>
    For instance, mutation Q99R in the human protein GTPase HRas can be queried by submitting the input string "P01112 99 Q R" or "RASH_HUMAN 99 Q R".
    We provide a Uniprot search tool to help with the identification of a sequence's unique accession number. When running an in silico saturation mutagenesis analysis, only the Uniprot sequence identifier (plus, optionally, a specific position) should be provided.
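
    As an illustration of the input format above, here is a minimal Python sketch that splits a query string into its four fields and checks them. The function name parse_sav and the specific validation rules (20 standard one-letter codes, positive position, wild-type different from mutant) are assumptions made for this example, not Rhapsody's actual input checks.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes


def parse_sav(line):
    """Split '<Uniprot ID> <position> <wild-type aa> <mutated aa>' into typed fields.

    Hypothetical helper: the validation rules are assumptions for illustration.
    """
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    identifier, position, wild_type, mutant = fields
    if not (position.isascii() and position.isdigit()) or int(position) < 1:
        raise ValueError(f"position must be a positive integer: {position!r}")
    for residue in (wild_type, mutant):
        if residue not in AMINO_ACIDS:
            raise ValueError(f"not a standard one-letter amino acid: {residue!r}")
    if wild_type == mutant:
        raise ValueError("wild-type and mutated amino acids are identical")
    return identifier, int(position), wild_type, mutant


print(parse_sav("P01112 99 Q R"))      # ('P01112', 99, 'Q', 'R')
print(parse_sav("RASH_HUMAN 99 Q R"))  # ('RASH_HUMAN', 99, 'Q', 'R')
```

    Both forms of the identifier (accession number or entry name) pass through unchanged; the sketch does not check that the identifier exists in Uniprot.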

  5. What does "in silico saturation mutagenesis" mean?

    In silico saturation mutagenesis is a complete scan of all 19 possible amino acid substitutions at every position in a protein sequence. The result is a "saturation mutagenesis table" (see example) that contains predictions for individual mutations and also gives an overview of which parts of the sequence are predicted to be more (or less) sensitive to mutations.
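
    To make the size of such a scan concrete, the following Python sketch enumerates every substitution for a sequence, or for a single position if one is given. The function name saturation_variants, the identifier "TOY" and the four-residue sequence are hypothetical and only serve the example; this is not how Rhapsody itself builds the table.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def saturation_variants(identifier, sequence, position=None):
    """Yield all 19 substitutions per position as '<ID> <pos> <wt> <mut>' strings.

    Hypothetical helper; positions are 1-based, as in Uniprot coordinates.
    """
    if position is None:
        positions = range(1, len(sequence) + 1)
    else:
        if not 1 <= position <= len(sequence):
            raise ValueError(f"position {position} is outside the sequence")
        positions = [position]
    for pos in positions:
        wild_type = sequence[pos - 1]
        if wild_type not in AMINO_ACIDS:
            raise ValueError(f"non-standard residue {wild_type!r} at position {pos}")
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield f"{identifier} {pos} {wild_type} {mutant}"


# Toy sequence of 4 residues: 4 positions x 19 substitutions = 76 variants.
variants = list(saturation_variants("TOY", "MTEY"))
print(len(variants))  # 76
print(variants[0])    # TOY 1 M A
```

    A protein of N residues therefore yields 19 x N predictions, which is why a full scan takes longer than a query for a handful of variants.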

  6. What is a "batch query"?

    A batch query lets you submit a list of individual variants from one or more protein sequences. The list must contain one variant per line, in Uniprot coordinates.

  7. What if there is no PDB structure for a given protein?

    Normally, when queried with a sequence, Rhapsody searches the Protein Data Bank for the "best" (i.e. the largest) structure available. If a structure is not found, the user can manually provide a custom protein structure, by either indicating a PDB code (for instance, of a homologous protein from another organism) or uploading a file in PDB format (e.g. downloaded from the SWISS-MODEL repository of homology models, see ROMK tutorial for an example). This option can also be used to run predictions on a particular protein structure or conformation (see HRAS tutorial for an example). Please note that Rhapsody will automatically align the Uniprot sequence to the PDB sequence and compute predictions only for matching amino acids: if the two sequences are too dissimilar, the resulting predictions might be too sparse.
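
    The effect of the sequence alignment on coverage can be illustrated with a small Python sketch. It uses difflib.SequenceMatcher from the standard library as a stand-in aligner, which is an assumption of this example: the alignment method Rhapsody actually uses is not described here. The function name and both toy sequences are hypothetical.

```python
from difflib import SequenceMatcher


def map_matching_residues(uniprot_seq, pdb_seq):
    """Map 1-based Uniprot positions to 1-based indices in the PDB sequence.

    Only residues inside identical aligned blocks are mapped; everything else
    is left out, mimicking 'predictions only for matching amino acids'.
    Stand-in aligner: difflib, not a substitution-matrix alignment.
    """
    matcher = SequenceMatcher(None, uniprot_seq, pdb_seq, autojunk=False)
    mapping = {}
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            mapping[block.a + offset + 1] = block.b + offset + 1
    return mapping


# Toy example: the structure covers only an internal fragment of the sequence.
uniprot_seq = "MTEYKLVVVGAGGVGKSALT"
pdb_seq = "EYKLVVVGAGGVG"
mapping = map_matching_residues(uniprot_seq, pdb_seq)
print(len(mapping), "of", len(uniprot_seq), "positions mapped")  # 13 of 20 positions mapped
```

    Positions missing from the mapping would receive no structure-based prediction, so a poorly matching custom structure leaves many gaps. Note that the indices refer to the order of residues in the sequence extracted from the structure, not to the residue numbers written in the PDB file.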

  8. What does it mean to include "environmental effects"?

    When computing structural and dynamical features from a PDB structure, Rhapsody by default considers only a single chain (the one with the highest sequence similarity to the given Uniprot sequence) and ignores any other chains present in the PDB file. Sometimes, for instance in the case of multimers or other complexes, the other chains should not be ignored and those properties should be computed for the entire complex. This is done by using a variant of the Elastic Network Model called "environmental ANM" (more precisely, a "sliced" model, see main publication and ROMK tutorial). In short, environmental effects should be included if the chain of interest is part of a "stable" complex (e.g. a multimer), so that its dynamical properties are influenced by the presence of other chains. Please be aware, however, that computing predictions on large complexes takes significantly longer.

  9. What is the difference between "full" and "reduced" classifiers?

    Both "full" and "reduced" classifiers are trained on sequence-, structure- and dynamics-based features. The main difference is that the "full" classifier also includes coevolutionary properties computed on Pfam multiple sequence alignments. If part of a sequence is not covered by a Pfam domain, predictions from the "reduced" classifier are returned instead.
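
    The fallback rule described above can be sketched in a few lines of Python. Representing a missing prediction as None, and the function name combine_predictions, are assumptions of this example rather than details of Rhapsody's implementation.

```python
def combine_predictions(full_scores, reduced_scores):
    """Return (score, source) per variant, preferring the "full" classifier.

    Hypothetical helper: None stands for 'no prediction available', e.g. a
    position not covered by a Pfam domain for the "full" classifier.
    """
    if len(full_scores) != len(reduced_scores):
        raise ValueError("both lists must describe the same variants")
    combined = []
    for full, reduced in zip(full_scores, reduced_scores):
        if full is not None:
            combined.append((full, "full"))
        elif reduced is not None:
            combined.append((reduced, "reduced"))  # backup prediction
        else:
            combined.append((None, None))
    return combined


# Made-up scores for three variants; the second lacks a "full" prediction.
print(combine_predictions([0.9, None, None], [0.8, 0.3, None]))
# [(0.9, 'full'), (0.3, 'reduced'), (None, None)]
```

    Keeping the source label next to each score makes it possible to tell afterwards which predictions came from the backup classifier.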

  10. What is the "full+EVmutation" classifier?

    The "full+EVmutation" classifier adds to its feature set the "epistatic statistical energy difference of mutant", computed by EVmutation and based on coevolution analysis of multiple sequence alignments. Although this feature has been shown to slightly improve the accuracy of predictions (see Rhapsody paper), it is not included by default, so that Rhapsody predictions remain independent of those computed by EVmutation. EVmutation predictions alone are always displayed in the final results along with those from Rhapsody and PolyPhen-2.

  11. What is displayed in the output files?

    1. Rhapsody predictions (simple view): contains "combined" predictions from the "full" and "reduced" Rhapsody classifiers. The latter returns "backup" predictions whenever the primary classifier cannot be applied because the position is not covered by a Pfam domain, which is needed for computing coevolutionary features.
      • Column training info indicates whether a variant was never seen by the classifier (new), thus its prediction can be considered genuine, or was included in the training dataset (known_del or known_neu), thus its prediction cannot be considered unbiased.
      • Column score contains the output from the random forest classifier, a real number between 0 and 1.
      • Column prob. contains a "pathogenicity probability" calculated by applying a non-linear monotonic transformation to the random forest score that eliminates the effect of an imbalanced training dataset (where deleterious labels usually dominate). After this operation, the threshold between neutral and deleterious predictions can be set at 0.5.
      • Column class provides a final classification of variants into neutral and deleterious.
      • The last columns on the right contain predicted scores and classes from PolyPhen-2 and EVmutation.
    2. Rhapsody predictions (detailed view): contains predicted scores, probabilities and classes from both the "full" (main) and "reduced" (aux.) classifiers, as explained above. A left arrow between the two sets of columns indicates that "reduced" predictions replace missing "full" predictions in the "combined" results mentioned above.
    3. PolyPhen-2 output: contains the output from PolyPhen-2 web tool.
    4. PDB mapping: contains the mapping of variants from Uniprot coordinates to PDB structures, if possible. The column on the left contains the input Uniprot coordinates, while the second one provides the most up-to-date sequence IDs, as retrieved from Uniprot.
    5. computed features: lists the values of each feature for all input variants.
    6. log file: reports the detailed log of the submitted job.
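
    As a reading aid for the "training info", "prob." and "class" columns of the simple view, the following Python sketch applies the 0.5 threshold and flags which predictions can be considered unbiased. The row layout (a list of dictionaries), the key names, the function name and the handling of a probability exactly equal to 0.5 are assumptions of this example, and the numbers are made up rather than real Rhapsody output.

```python
def summarize_predictions(rows, threshold=0.5):
    """Return (SAV, class, unbiased) for each row of a simple-view-like table.

    Hypothetical helper. 'unbiased' is True only for variants marked 'new',
    i.e. never seen by the classifier during training.
    """
    summary = []
    for row in rows:
        # Assumption: a probability equal to the threshold counts as deleterious.
        label = "deleterious" if row["prob"] >= threshold else "neutral"
        unbiased = row["training info"] == "new"
        summary.append((row["SAV"], label, unbiased))
    return summary


# Made-up rows for illustration only.
rows = [
    {"SAV": "P01112 99 Q R", "training info": "new", "prob": 0.12},
    {"SAV": "P01112 12 G V", "training info": "known_del", "prob": 0.91},
]
for sav, label, unbiased in summarize_predictions(rows):
    print(sav, label, "unbiased" if unbiased else "seen in training")
```

    Filtering on the unbiased flag is one way to keep only predictions for variants that were not part of the training dataset.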