Class DistanceMatrixCalculator
The methods of this class calculate a DistanceMatrix from a MultipleSequenceAlignment or from other indirect distance information (RMSD).
- Since:
  4.1.1
- Author:
  Aleix Lafita
-
Method Summary
All methods are static, declare the type parameters <C extends Sequence<D>, D extends Compound>, and return org.forester.evoinference.matrix.distance.DistanceMatrix.
- dissimilarityScore(MultipleSequenceAlignment<C, D> msa, SubstitutionMatrix<D> M)
  The dissimilarity score is the additive inverse of the similarity score (sum of scores) between two aligned sequences using a substitution model (Substitution Matrix).
- fractionalDissimilarity(MultipleSequenceAlignment<C, D> msa)
  The fractional dissimilarity (D) is defined as the percentage of sites that differ between two aligned sequences.
- fractionalDissimilarityScore(MultipleSequenceAlignment<C, D> msa, SubstitutionMatrix<D> M)
  The fractional dissimilarity score (Ds) is a relative measure of the dissimilarity between two aligned sequences.
- kimuraDistance(MultipleSequenceAlignment<C, D> msa)
  The Kimura evolutionary distance (d) is a correction of the fractional dissimilarity (D), especially needed for large evolutionary distances.
- pamMLdistance(MultipleSequenceAlignment<C, D> msa)
  The PAM (Point Accepted Mutations) distance is a measure of evolutionary distance in protein sequences.
- percentageIdentity(MultipleSequenceAlignment<C, D> msa)
  BioJava implementation for percentage of identity (PID).
- poissonDistance(MultipleSequenceAlignment<C, D> msa)
  The Poisson (correction) evolutionary distance (d) is a function of the fractional dissimilarity (D), given by: d = -log(1 - D). The gapped positions in the alignment are ignored in the calculation.
- structuralDistance(double[][] rmsdMat, double alpha, double rmsdMax, double rmsd0)
  The structural distance (dS) uses the structural similarity (or dissimilarity) from the structural alignment of two protein structures.
-
Method Details
-
fractionalDissimilarity
public static <C extends Sequence<D>, D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix fractionalDissimilarity(MultipleSequenceAlignment<C, D> msa) throws IOException
The fractional dissimilarity (D) is defined as the percentage of sites that differ between two aligned sequences. The percentage of identity (PID) is the fraction of identical sites between two aligned sequences.
D = 1 - PID
The gapped positions in the alignment are ignored in the calculation. This method is a wrapper to the forester implementation of the calculation: PairwiseDistanceCalculator.calcFractionalDissimilarities(Msa)
- Parameters:
  msa - MultipleSequenceAlignment
- Returns:
  DistanceMatrix
- Throws:
  IOException
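The relation D = 1 - PID can be illustrated with a minimal standalone sketch for a single pair of aligned sequences. This is not the forester implementation: the class name, the method name, and the use of '-' as the gap character are assumptions made for illustration only.

```java
/** Illustrative sketch of D = 1 - PID for one pair of aligned sequences. */
final class FractionalDissimilaritySketch {

    /** Assumption: '-' marks a gap in an aligned sequence. */
    static final char GAP = '-';

    /**
     * Returns the fraction of differing sites, counting only the columns
     * in which neither sequence has a gap.
     */
    static double fractionalDissimilarity(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Aligned sequences must have equal length");
        }
        int compared = 0;
        int identical = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x == GAP || y == GAP) {
                continue; // gapped positions are ignored
            }
            compared++;
            if (x == y) {
                identical++;
            }
        }
        if (compared == 0) {
            throw new IllegalArgumentException("No ungapped columns to compare");
        }
        double pid = (double) identical / compared; // PID as a fraction
        return 1.0 - pid;
    }

    public static void main(String[] args) {
        // Five ungapped columns, four of them identical.
        System.out.println(fractionalDissimilarity("ACD-EF", "ACDGEY"));
    }
}
```

In the example pair, the gapped column is skipped, so one differing site out of five compared columns gives D close to 0.2.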
-
poissonDistance
public static <C extends Sequence<D>, D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix poissonDistance(MultipleSequenceAlignment<C, D> msa) throws IOException
The Poisson (correction) evolutionary distance (d) is a function of the fractional dissimilarity (D), given by:
d = -log(1 - D)
The gapped positions in the alignment are ignored in the calculation. This method is a wrapper to the forester implementation of the calculation: PairwiseDistanceCalculator.calcPoissonDistances(Msa)
- Parameters:
  msa - MultipleSequenceAlignment
- Returns:
  DistanceMatrix
- Throws:
  IOException
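As a rough illustration of the correction, the formula can be evaluated directly for a given D. The sketch below assumes the natural logarithm (Java's Math.log) and uses a hypothetical class and method name; it is not the forester code.

```java
/** Illustrative sketch of the Poisson correction d = -log(1 - D). */
final class PoissonDistanceSketch {

    /**
     * Assumption: log is the natural logarithm.
     * The formula is only defined for 0 <= D < 1.
     */
    static double poissonDistance(double fractionalDissimilarity) {
        if (fractionalDissimilarity < 0.0 || fractionalDissimilarity >= 1.0) {
            throw new IllegalArgumentException("D must be in [0, 1)");
        }
        return -Math.log(1.0 - fractionalDissimilarity);
    }

    public static void main(String[] args) {
        // The corrected distance grows faster than D as D approaches 1.
        for (double d : new double[] {0.1, 0.5, 0.9}) {
            System.out.println(d + " -> " + poissonDistance(d));
        }
    }
}
```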
-
kimuraDistance
public static <C extends Sequence<D>, D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix kimuraDistance(MultipleSequenceAlignment<C, D> msa) throws IOException
The Kimura evolutionary distance (d) is a correction of the fractional dissimilarity (D), especially needed for large evolutionary distances. It is given by:
d = -log(1 - D - 0.2 * D^2)
The equation is derived by fitting the relationship between the evolutionary distance (d) and the fractional dissimilarity (D) according to the PAM model of evolution (it is an empirical approximation for the method pamMLdistance(MultipleSequenceAlignment)). The gapped positions in the alignment are ignored in the calculation. This method is a wrapper to the forester implementation of the calculation: PairwiseDistanceCalculator.calcKimuraDistances(Msa).
- Parameters:
  msa - MultipleSequenceAlignment
- Returns:
  DistanceMatrix
- Throws:
  IOException
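The Kimura formula can be evaluated the same way for a single value of D. The following sketch assumes the natural logarithm and rejects values of D for which the argument of the logarithm is not positive; the class and method names are hypothetical and this is not the forester code.

```java
/** Illustrative sketch of the Kimura correction d = -log(1 - D - 0.2 * D^2). */
final class KimuraDistanceSketch {

    /** Assumption: log is the natural logarithm. */
    static double kimuraDistance(double fractionalDissimilarity) {
        double d = fractionalDissimilarity;
        double arg = 1.0 - d - 0.2 * d * d;
        if (d < 0.0 || arg <= 0.0) {
            // The logarithm is undefined once 1 - D - 0.2 * D^2 reaches zero.
            throw new IllegalArgumentException("D is outside the valid range of the formula");
        }
        return -Math.log(arg);
    }

    public static void main(String[] args) {
        // The extra 0.2 * D^2 term makes the correction larger than -log(1 - D).
        double d = 0.5;
        System.out.println("Kimura:  " + kimuraDistance(d));
        System.out.println("Poisson: " + (-Math.log(1.0 - d)));
    }
}
```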
-
percentageIdentity
public static <C extends Sequence<D>, D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix percentageIdentity(MultipleSequenceAlignment<C, D> msa)
BioJava implementation for percentage of identity (PID). Although the name of the method is percentage of identity, the DistanceMatrix contains the fractional dissimilarity (D), computed as D = 1 - PID.
It is recommended to use the method fractionalDissimilarity(MultipleSequenceAlignment) instead of this one.
- Parameters:
  msa - MultipleSequenceAlignment
- Returns:
  DistanceMatrix
-
fractionalDissimilarityScore
public static <C extends Sequence<D>, D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix fractionalDissimilarityScore(MultipleSequenceAlignment<C, D> msa, SubstitutionMatrix<D> M)
The fractional dissimilarity score (Ds) is a relative measure of the dissimilarity between two aligned sequences. It is calculated as:
Ds = sum_i( (max(M) - M_{a_i,b_i}) / (max(M) - min(M)) ) / L
Where the sum over i runs through all the alignment positions, a_i and b_i are the amino acids (AA) at position i in the first and second aligned sequences, respectively, and L is the total length of the alignment (normalization).
The fractional dissimilarity score (Ds) is in the interval [0, 1], where 0 means that the sequences are identical and 1 that the sequences are completely different.
Gaps do not contribute to the similarity score calculation (gap penalty = 0).
- Parameters:
  msa - MultipleSequenceAlignment
  M - SubstitutionMatrix for similarity scoring
- Returns:
  DistanceMatrix
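A minimal sketch of the Ds formula for one pair of sequences is given here under strong simplifying assumptions: a hypothetical two-letter alphabet {A, B}, a made-up toy score table instead of a real SubstitutionMatrix, and gap-free input. It only illustrates the normalization and is not the BioJava code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of the fractional dissimilarity score (Ds). */
final class FractionalDissimilarityScoreSketch {

    /** Toy symmetric scores for a hypothetical alphabet {A, B}. */
    static final Map<String, Integer> TOY_MATRIX = new HashMap<>();
    static {
        TOY_MATRIX.put("AA", 2);
        TOY_MATRIX.put("BB", 2);
        TOY_MATRIX.put("AB", -1);
        TOY_MATRIX.put("BA", -1);
    }

    /** Assumption: both sequences are aligned, of equal length, and gap-free. */
    static double fractionalDissimilarityScore(String a, String b, Map<String, Integer> m) {
        if (a.length() != b.length() || a.isEmpty()) {
            throw new IllegalArgumentException("Sequences must be non-empty and of equal length");
        }
        int max = Collections.max(m.values());
        int min = Collections.min(m.values());
        if (max == min) {
            throw new IllegalArgumentException("Matrix must contain at least two distinct scores");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length(); i++) {
            String pair = "" + a.charAt(i) + b.charAt(i);
            Integer score = m.get(pair);
            if (score == null) {
                throw new IllegalArgumentException("No score for pair " + pair);
            }
            // Each position contributes a value between 0 and 1.
            sum += (double) (max - score) / (max - min);
        }
        return sum / a.length(); // normalize by the alignment length L
    }

    public static void main(String[] args) {
        System.out.println(fractionalDissimilarityScore("AABB", "AAAB", TOY_MATRIX));
    }
}
```

With this toy table, identical sequences score 0 because every diagonal entry equals max(M); with a real substitution matrix the diagonal entries differ, so that property would depend on the matrix used.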
-
dissimilarityScore
public static <C extends Sequence<D>, D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix dissimilarityScore(MultipleSequenceAlignment<C, D> msa, SubstitutionMatrix<D> M)
The dissimilarity score is the additive inverse of the similarity score (sum of scores) between two aligned sequences using a substitution model (Substitution Matrix). The maximum dissimilarity score is taken to be the maximum similarity score between self-alignments (each sequence against itself). Calculation of the score is as follows:
Ds = maxScore - sum_i(M_{a_i,b_i})
It is recommended to use the method fractionalDissimilarityScore(MultipleSequenceAlignment, SubstitutionMatrix) instead, since in that method the maximum similarity score is relative to the Substitution Matrix rather than to the data set, and the score is normalized by the alignment length (fractional).
Gaps do not contribute to the similarity score calculation (gap penalty = 0).
- Parameters:
  msa - MultipleSequenceAlignment
  M - SubstitutionMatrix for similarity scoring
- Returns:
  DistanceMatrix
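The dependence of maxScore on the data set can be seen in a small sketch. It assumes a hypothetical two-letter alphabet with made-up scores and gap-free sequences; the names are illustrative and the code is not the BioJava implementation.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch of Ds = maxScore - sum_i(M_{a_i,b_i}). */
final class DissimilarityScoreSketch {

    /** Toy scores for a hypothetical alphabet {A, B}: A/A = 2, B/B = 3, mismatch = -1. */
    static int toyScore(char x, char y) {
        if (x != y) {
            return -1;
        }
        return x == 'A' ? 2 : 3;
    }

    /** Sum of scores over all positions of two equal-length, gap-free sequences. */
    static int similarity(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Sequences must have equal length");
        }
        int sum = 0;
        for (int i = 0; i < a.length(); i++) {
            sum += toyScore(a.charAt(i), b.charAt(i));
        }
        return sum;
    }

    /** Maximum self-alignment score over the whole data set. */
    static int maxSelfScore(List<String> sequences) {
        int max = Integer.MIN_VALUE;
        for (String s : sequences) {
            max = Math.max(max, similarity(s, s));
        }
        return max;
    }

    static int dissimilarity(String a, String b, int maxScore) {
        return maxScore - similarity(a, b);
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("AAB", "ABB", "AAA");
        int maxScore = maxSelfScore(data); // depends on which sequences are present
        System.out.println(dissimilarity("AAB", "ABB", maxScore));
    }
}
```

Removing "ABB" from the list would lower maxScore and shift every pairwise value, which is the data-set dependence the recommendation above refers to.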
-
pamMLdistance
public static <C extends Sequence<D>, D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix pamMLdistance(MultipleSequenceAlignment<C, D> msa)
The PAM (Point Accepted Mutations) distance is a measure of evolutionary distance in protein sequences. The PAM unit represents an average substitution rate of 1% per site. The fractional dissimilarity (D) of two aligned sequences is related to the PAM distance (d) by the equation:
D = sum_i( f_i * (1 - M_{i,i}^d) )
Where the sum is over all 20 amino acids (AA), f_i denotes the natural fraction of the given AA, and M is the substitution matrix (in this case the PAM1 matrix).
To calculate the PAM distance between two aligned sequences the maximum likelihood (ML) approach is used, which consists of finding the d that maximizes the function:
L(d) = product_i( f_{a_i} * (1 - M_{a_i,b_i}^d) )
Where the product is over every position i in the alignment, and a_i and b_i are the AA at position i in the first and second aligned sequences, respectively.
- Parameters:
  msa - MultipleSequenceAlignment
- Returns:
  DistanceMatrix
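The relation between D and d can be made concrete with a toy example. The sketch below evaluates D = sum_i( f_i * (1 - M_{i,i}^d) ) for integer d only, reading M^d as the d-th matrix power, on a made-up two-state transition matrix rather than the 20 x 20 PAM1 matrix. It does not perform the maximum likelihood search, and all names and values are assumptions for illustration.

```java
/** Illustrative sketch: expected fractional dissimilarity D for a PAM-like distance d. */
final class PamDissimilaritySketch {

    /** Plain square-matrix product. */
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += a[i][k] * b[k][j];
                }
                c[i][j] = sum;
            }
        }
        return c;
    }

    /** M raised to a non-negative integer power d by repeated multiplication. */
    static double[][] power(double[][] m, int d) {
        int n = m.length;
        double[][] result = new double[n][n];
        for (int i = 0; i < n; i++) {
            result[i][i] = 1.0; // start from the identity matrix
        }
        for (int step = 0; step < d; step++) {
            result = multiply(result, m);
        }
        return result;
    }

    /** D = sum_i( f_i * (1 - (M^d)_{i,i}) ). */
    static double expectedDissimilarity(double[] f, double[][] m, int d) {
        double[][] md = power(m, d);
        double sum = 0.0;
        for (int i = 0; i < f.length; i++) {
            sum += f[i] * (1.0 - md[i][i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Toy values: two equally frequent states, 1% change per unit of distance.
        double[] f = {0.5, 0.5};
        double[][] m = {{0.99, 0.01}, {0.01, 0.99}};
        for (int d : new int[] {1, 10, 100}) {
            System.out.println(d + " -> " + expectedDissimilarity(f, m, d));
        }
    }
}
```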
structuralDistance
public static <C extends Sequence<D>, D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix structuralDistance(double[][] rmsdMat, double alpha, double rmsdMax, double rmsd0)
The structural distance (dS) uses the structural similarity (or dissimilarity) from the structural alignment of two protein structures. It is based on the diffusive model for protein fold evolution (Grishin 1995). The structural deviations are captured as RMS deviations.
dS_ij = (rmsd_max^2 / alpha^2) * ln( (rmsd_max^2 - rmsd_0^2) / (rmsd_max^2 - rmsd_ij^2) )
- Parameters:
  rmsdMat - RMSD matrix for all structure pairs (symmetric matrix)
  alpha - change in CA positions introduced by a single AA substitution (Grishin 1995: 1 Å)
  rmsdMax - estimated RMSD between proteins of the same fold when the percentage of identity is infinitely low (the maximum allowed RMSD of proteins with the same fold). (Grishin 1995: 5 Å)
  rmsd0 - arithmetical mean of squares of the RMSD for identical proteins (Grishin 1995: 0.4 Å)
- Returns:
  DistanceMatrix
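The structural distance formula can be sketched for a single RMSD value as follows. The class and method names are hypothetical, and rejecting RMSD values at or above rmsdMax is a choice made for this sketch; how the BioJava method treats such values is not stated here.

```java
/** Illustrative sketch of the structural distance for one pair of structures. */
final class StructuralDistanceSketch {

    /**
     * dS = (rmsdMax^2 / alpha^2) * ln( (rmsdMax^2 - rmsd0^2) / (rmsdMax^2 - rmsd^2) )
     */
    static double structuralDistance(double rmsd, double alpha, double rmsdMax, double rmsd0) {
        double max2 = rmsdMax * rmsdMax;
        double numerator = max2 - rmsd0 * rmsd0;
        double denominator = max2 - rmsd * rmsd;
        if (denominator <= 0.0) {
            // Sketch-only choice: the logarithm is undefined for rmsd >= rmsdMax.
            throw new IllegalArgumentException("rmsd must be smaller than rmsdMax");
        }
        return (max2 / (alpha * alpha)) * Math.log(numerator / denominator);
    }

    public static void main(String[] args) {
        // Parameter values quoted from Grishin 1995 in the parameter list above.
        double alpha = 1.0;
        double rmsdMax = 5.0;
        double rmsd0 = 0.4;
        for (double rmsd : new double[] {0.4, 1.5, 3.0, 4.5}) {
            System.out.println(rmsd + " -> " + structuralDistance(rmsd, alpha, rmsdMax, rmsd0));
        }
    }
}
```

An RMSD equal to rmsd0 gives a distance of zero, and the distance grows without bound as the RMSD approaches rmsdMax.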
-