public class DistanceMatrixCalculator extends Object
The DistanceMatrixCalculator methods calculate a DistanceMatrix from a MultipleSequenceAlignment or other indirect distance information (RMSD).
Modifier and Type | Method and Description |
---|---|
static <C extends Sequence<D>,D extends Compound> | dissimilarityScore(MultipleSequenceAlignment<C,D> msa, SubstitutionMatrix<D> M): The dissimilarity score is the additive inverse of the similarity score (sum of scores) between two aligned sequences using a substitution model (Substitution Matrix). |
static <C extends Sequence<D>,D extends Compound> | fractionalDissimilarity(MultipleSequenceAlignment<C,D> msa): The fractional dissimilarity (D) is defined as the percentage of sites that differ between two aligned sequences. |
static <C extends Sequence<D>,D extends Compound> | fractionalDissimilarityScore(MultipleSequenceAlignment<C,D> msa, SubstitutionMatrix<D> M): The fractional dissimilarity score (Ds) is a relative measure of the dissimilarity between two aligned sequences. |
static <C extends Sequence<D>,D extends Compound> | kimuraDistance(MultipleSequenceAlignment<C,D> msa): The Kimura evolutionary distance (d) is a correction of the fractional dissimilarity (D) specially needed for large evolutionary distances. |
static <C extends Sequence<D>,D extends Compound> | pamMLdistance(MultipleSequenceAlignment<C,D> msa): The PAM (Point Accepted Mutations) distance is a measure of evolutionary distance in protein sequences. |
static <C extends Sequence<D>,D extends Compound> | percentageIdentity(MultipleSequenceAlignment<C,D> msa): BioJava implementation for percentage of identity (PID). |
static <C extends Sequence<D>,D extends Compound> | poissonDistance(MultipleSequenceAlignment<C,D> msa): The Poisson (correction) evolutionary distance (d) is a function of the fractional dissimilarity (D), given by d = -log(1 - D). The gapped positions in the alignment are ignored in the calculation. |
static <C extends Sequence<D>,D extends Compound> | structuralDistance(double[][] rmsdMat, double alpha, double rmsdMax, double rmsd0): The structural distance (dS) uses the structural similarity (or dissimilarity) from the structural alignment of two protein structures. |
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix fractionalDissimilarity(MultipleSequenceAlignment<C,D> msa) throws IOException
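The fractional dissimilarity (D) is defined as the percentage of sites that differ between two aligned sequences. It is the complement of the percentage of identity (PID):
D = 1 - PID
The gapped positions in the alignment are ignored in the calculation. This method is a wrapper to the forester implementation of the calculation:
PairwiseDistanceCalculator.calcFractionalDissimilarities(Msa)
Parameters:
msa - MultipleSequenceAlignment
Throws:
Exception
IOException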
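As an illustration of D = 1 - PID, here is a minimal sketch for a single pair of aligned rows. It is not the BioJava or forester code: the class and method names are hypothetical, rows are plain strings, '-' is assumed to be the gap symbol, and a column is skipped whenever either row has a gap in it.

```java
// Illustrative sketch only; not the BioJava or forester implementation.
final class FractionalDissimilaritySketch {

    /**
     * Returns D = 1 - PID for two aligned rows of equal length.
     * Assumption: '-' is the gap symbol, and a column is skipped
     * when either row has a gap in it.
     */
    static double fractionalDissimilarity(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        int compared = 0;
        int identical = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x == '-' || y == '-') {
                continue; // gapped positions are ignored
            }
            compared++;
            if (x == y) {
                identical++;
            }
        }
        if (compared == 0) {
            throw new IllegalArgumentException("no ungapped columns to compare");
        }
        double pid = (double) identical / compared; // fraction of identical sites
        return 1.0 - pid;
    }

    public static void main(String[] args) {
        // 4 ungapped columns, 3 of them identical: PID = 0.75, D = 0.25
        System.out.println(fractionalDissimilarity("ACDE-", "ACDQG"));
    }
}
```

The real method fills a full DistanceMatrix with one such value per pair of sequences in the alignment.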
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix poissonDistance(MultipleSequenceAlignment<C,D> msa) throws IOException
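The Poisson (correction) evolutionary distance (d) is a function of the fractional dissimilarity (D), given by:
d = -log(1 - D)
The gapped positions in the alignment are ignored in the calculation. This method is a wrapper to the forester implementation of the calculation:
PairwiseDistanceCalculator.calcPoissonDistances(Msa)
Parameters:
msa - MultipleSequenceAlignment
Throws:
IOException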
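The Poisson correction can be sketched for a single value of D as follows. The class and method names are hypothetical, log is read as the natural logarithm, and the range check is an assumption of this sketch; how forester treats D = 1 is not stated here.

```java
// Illustrative sketch only; not the BioJava or forester implementation.
final class PoissonDistanceSketch {

    /** Returns d = -log(1 - D) for a fractional dissimilarity D in [0, 1). */
    static double poissonDistance(double dissimilarity) {
        if (dissimilarity < 0.0 || dissimilarity >= 1.0) {
            // At D = 1 the logarithm diverges; rejecting it is a choice of this sketch.
            throw new IllegalArgumentException("D must be in [0, 1)");
        }
        return -Math.log(1.0 - dissimilarity);
    }

    public static void main(String[] args) {
        // The correction is small for similar sequences and grows quickly as D approaches 1.
        for (double d : new double[] {0.0, 0.1, 0.5, 0.9}) {
            System.out.printf("D = %.1f  ->  d = %.4f%n", d, poissonDistance(d));
        }
    }
}
```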
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix kimuraDistance(MultipleSequenceAlignment<C,D> msa) throws IOException
The Kimura evolutionary distance (d) is a correction of the fractional dissimilarity (D) specially needed for large evolutionary distances. It is given by:
d = -log(1 - D - 0.2 * D^2)
The equation is derived by fitting the relationship between the evolutionary distance (d) and the fractional dissimilarity (D) according to the PAM model of evolution; it is an empirical approximation for the method pamMLdistance(MultipleSequenceAlignment). The gapped positions in the alignment are ignored in the calculation. This method is a wrapper to the forester implementation of the calculation:
PairwiseDistanceCalculator.calcKimuraDistances(Msa)
Parameters:
msa - MultipleSequenceAlignment
Throws:
IOException
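A minimal sketch of the Kimura correction for a single value of D, with hypothetical class and method names. The argument of the logarithm reaches zero at D ≈ 0.854, so the formula is undefined beyond that point; this sketch rejects such inputs, which is an assumption and not necessarily what forester does.

```java
// Illustrative sketch only; not the BioJava or forester implementation.
final class KimuraDistanceSketch {

    /** Returns d = -log(1 - D - 0.2 * D^2) for a fractional dissimilarity D. */
    static double kimuraDistance(double dissimilarity) {
        double arg = 1.0 - dissimilarity - 0.2 * dissimilarity * dissimilarity;
        if (dissimilarity < 0.0 || arg <= 0.0) {
            // The logarithm is undefined once its argument is no longer positive.
            throw new IllegalArgumentException("D is outside the range where the formula is defined");
        }
        return -Math.log(arg);
    }

    public static void main(String[] args) {
        // Compare with the Poisson correction -log(1 - D): the extra 0.2 * D^2 term
        // makes the Kimura distance larger, most visibly at high D.
        for (double d : new double[] {0.1, 0.3, 0.5, 0.8}) {
            double poisson = -Math.log(1.0 - d);
            System.out.printf("D = %.1f  Poisson = %.4f  Kimura = %.4f%n", d, poisson, kimuraDistance(d));
        }
    }
}
```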
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix percentageIdentity(MultipleSequenceAlignment<C,D> msa)
BioJava implementation for percentage of identity (PID). It is recommended to use the method fractionalDissimilarity(MultipleSequenceAlignment) instead of this one.
Parameters:
msa - MultipleSequenceAlignment
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix fractionalDissimilarityScore(MultipleSequenceAlignment<C,D> msa, SubstitutionMatrix<D> M)
The fractional dissimilarity score (Ds) is a relative measure of the dissimilarity between two aligned sequences. It is calculated as:
Ds = sum_i( (max(M) - M_{a_i,b_i}) / (max(M) - min(M)) ) / L
where the sum over i runs over all alignment positions, a_i and b_i are the amino acids (AA) at position i in the first and second aligned sequences, respectively, and L is the total length of the alignment (normalization).
The fractional dissimilarity score (Ds) is in the interval [0, 1], where 0 means that the sequences are identical and 1 that the sequences are completely different.
Gaps do not contribute to the similarity score calculation (gap penalty = 0).
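The formula above can be sketched for one pair of aligned rows as follows. Everything here is hypothetical and for illustration only: the class and method names, the use of plain strings with '-' as the gap symbol, and the toy scoring function (+1 for a match, -1 for a mismatch) that stands in for a real SubstitutionMatrix. Treating a gapped column as a similarity score of 0 is one possible reading of "gap penalty = 0", not a confirmed detail of the BioJava code.

```java
import java.util.function.IntBinaryOperator;

// Illustrative sketch only; not the BioJava implementation.
final class FractionalDissimilarityScoreSketch {

    /**
     * Returns Ds = sum_i( (maxM - M(a_i, b_i)) / (maxM - minM) ) / L.
     * m is a scoring function standing in for the substitution matrix;
     * maxM and minM are its largest and smallest entries.
     */
    static double score(String a, String b, IntBinaryOperator m, int maxM, int minM) {
        if (a.length() != b.length() || a.isEmpty()) {
            throw new IllegalArgumentException("aligned rows must be non-empty and of equal length");
        }
        if (maxM <= minM) {
            throw new IllegalArgumentException("maxM must be greater than minM");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            // Assumption: a gapped column contributes a similarity score of 0.
            int s = (x == '-' || y == '-') ? 0 : m.applyAsInt(x, y);
            sum += (double) (maxM - s) / (maxM - minM);
        }
        return sum / a.length(); // normalize by the alignment length L
    }

    public static void main(String[] args) {
        // Hypothetical toy matrix: +1 for identical residues, -1 otherwise.
        IntBinaryOperator toy = (x, y) -> x == y ? 1 : -1;
        // Three matches contribute 0 each and one mismatch contributes 1: Ds = 1 / 4
        System.out.println(score("ACGT", "ACGA", toy, 1, -1));
    }
}
```

With this toy matrix, identical rows give 0 and rows that differ at every position give 1, which matches the stated [0, 1] interval.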
Parameters:
msa - MultipleSequenceAlignment
M - SubstitutionMatrix for similarity scoring
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix dissimilarityScore(MultipleSequenceAlignment<C,D> msa, SubstitutionMatrix<D> M)
The dissimilarity score is the additive inverse of the similarity score (sum of scores) between two aligned sequences using a substitution model (Substitution Matrix):
Ds = maxScore - sum_i(M_{a_i,b_i})
It is recommended to use the method fractionalDissimilarityScore(MultipleSequenceAlignment, SubstitutionMatrix) instead, since in that method the maximum similarity score is not relative to the data set, but relative to the Substitution Matrix, and the score is normalized by the alignment length (fractional).
Gaps do not contribute to the similarity score calculation (gap penalty = 0).
Parameters:
msa - MultipleSequenceAlignment
M - SubstitutionMatrix for similarity scoring
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix pamMLdistance(MultipleSequenceAlignment<C,D> msa)
The PAM (Point Accepted Mutations) distance is a measure of evolutionary distance in protein sequences. Under the PAM model, the fractional dissimilarity (D) is related to the PAM distance (d) by:
D = sum_i( f_i * (1 - M_{ii}^d) )
where the sum is over all 20 amino acids (AA), f_i denotes the natural fraction of the given AA, and M is the substitution matrix (in this case the PAM1 matrix).
To calculate the PAM distance between two aligned sequences, the maximum likelihood (ML) approach is used, which consists of finding the d that maximizes the function:
L(d) = product_i( f_{a_i} * (1 - M_{a_i,b_i}^d) )
where the product is over every position i in the alignment, and a_i and b_i are the AA at position i in the first and second aligned sequences, respectively.
Parameters:
msa - MultipleSequenceAlignment
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix structuralDistance(double[][] rmsdMat, double alpha, double rmsdMax, double rmsd0)
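The structural distance (dS) uses the structural similarity (or dissimilarity) from the structural alignment of two protein structures. For each pair of structures i and j:
dS_ij = (rmsd_max^2 / alpha^2) * ln( (rmsd_max^2 - rmsd_0^2) / (rmsd_max^2 - rmsd_ij^2) )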
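A minimal sketch of this formula applied to a symmetric RMSD matrix, returning a plain double[][] in place of a DistanceMatrix. The class name is hypothetical. Two choices are assumptions of the sketch and are not taken from the BioJava code: an RMSD at or above rmsdMax is rejected because the logarithm is undefined there, and negative results (RMSD below rmsd0) are clamped to 0.

```java
// Illustrative sketch only; not the BioJava implementation.
final class StructuralDistanceSketch {

    /** Applies dS_ij = (rmsdMax^2 / alpha^2) * ln((rmsdMax^2 - rmsd0^2) / (rmsdMax^2 - rmsd_ij^2)). */
    static double[][] structuralDistance(double[][] rmsdMat, double alpha, double rmsdMax, double rmsd0) {
        int n = rmsdMat.length;
        double max2 = rmsdMax * rmsdMax;
        double scale = max2 / (alpha * alpha);
        double numerator = max2 - rmsd0 * rmsd0;
        double[][] dist = new double[n][n]; // the diagonal stays at 0
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double r = rmsdMat[i][j];
                if (r >= rmsdMax) {
                    // Assumption: reject values for which the logarithm is undefined.
                    throw new IllegalArgumentException("RMSD must be below rmsdMax");
                }
                double d = scale * Math.log(numerator / (max2 - r * r));
                d = Math.max(d, 0.0); // assumption: clamp pairs with RMSD below rmsd0
                dist[i][j] = d;
                dist[j][i] = d; // keep the result symmetric
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        // Made-up RMSD values in Å, with the Grishin 1995 parameter values
        // alpha = 1, rmsdMax = 5 and rmsd0 = 0.4.
        double[][] rmsd = {
            {0.0, 0.4, 3.0},
            {0.4, 0.0, 2.0},
            {3.0, 2.0, 0.0}
        };
        double[][] dist = structuralDistance(rmsd, 1.0, 5.0, 0.4);
        for (double[] row : dist) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

A pair whose RMSD equals rmsd0 gets distance 0, and the distance grows without bound as the RMSD approaches rmsdMax.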
Parameters:
rmsdMat - RMSD matrix for all structure pairs (symmetric matrix)
alpha - change in CA positions introduced by a single AA substitution (Grishin 1995: 1 Å)
rmsdMax - estimated RMSD between proteins of the same fold when the percentage of identity is infinitely low (the maximum allowed RMSD of proteins with the same fold). (Grishin 1995: 5 Å)
rmsd0 - arithmetical mean of squares of the RMSD for identical proteins (Grishin 1995: 0.4 Å)
Copyright © 2000–2019 BioJava. All rights reserved.