public class DistanceMatrixCalculator extends Object
The DistanceMatrixCalculator methods generate a DistanceMatrix from a MultipleSequenceAlignment or other indirect distance information (RMSD).
Modifier and Type  Method and Description

static <C extends Sequence<D>,D extends Compound> 
dissimilarityScore(MultipleSequenceAlignment<C,D> msa,
SubstitutionMatrix<D> M)
The dissimilarity score is the additive inverse of the similarity score
(sum of scores) between two aligned sequences using a substitution model
(Substitution Matrix).

static <C extends Sequence<D>,D extends Compound> 
fractionalDissimilarity(MultipleSequenceAlignment<C,D> msa)
The fractional dissimilarity (D) is defined as the percentage of sites
that differ between two aligned sequences.

static <C extends Sequence<D>,D extends Compound> 
fractionalDissimilarityScore(MultipleSequenceAlignment<C,D> msa,
SubstitutionMatrix<D> M)
The fractional dissimilarity score (Ds) is a relative measure of the
dissimilarity between two aligned sequences.

static <C extends Sequence<D>,D extends Compound> 
jointSeqStrucDistance(double[][] rmsdMat)
The joint sequence-structure distance (d_{SS}) is a combination
of the sequence-based and the structure-based distances.

static <C extends Sequence<D>,D extends Compound> 
kimuraDistance(MultipleSequenceAlignment<C,D> msa)
The Kimura evolutionary distance (d) is a correction of the fractional
dissimilarity (D) that is needed especially for large evolutionary distances.

static <C extends Sequence<D>,D extends Compound> 
pamMLdistance(MultipleSequenceAlignment<C,D> msa)
The PAM (Point Accepted Mutations) distance is a measure of evolutionary
distance in protein sequences.

static <C extends Sequence<D>,D extends Compound> 
percentageIdentity(MultipleSequenceAlignment<C,D> msa)
BioJava implementation for percentage of identity (PID).

static <C extends Sequence<D>,D extends Compound> 
poissonDistance(MultipleSequenceAlignment<C,D> msa)
The Poisson (correction) evolutionary distance (d) is a function of the
fractional dissimilarity (D), given by:
d = -log(1 - D)
The gapped positions in the alignment are ignored in the calculation.

static <C extends Sequence<D>,D extends Compound> 
structuralDistance(double[][] rmsdMat,
double alpha,
double rmsdMax,
double rmsd0)
The structural distance (d_{S}) uses the structural similarity
(or dissimilarity) from the structural alignment of two protein
structures.

public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix fractionalDissimilarity(MultipleSequenceAlignment<C,D> msa) throws IOException
D = 1 - PID
The gapped positions in the alignment are ignored in the calculation. This method is a wrapper to the forester implementation of the calculation:
PairwiseDistanceCalculator.calcFractionalDissimilarities(Msa)
Parameters:
msa - MultipleSequenceAlignment
Throws:
Exception
IOException
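The relation D = 1 - PID can be illustrated for a single pair of aligned rows. The following is a minimal sketch, not the forester or BioJava implementation: the class and method names are hypothetical, rows are plain strings with '-' as the gap character, and "ignored" is assumed to mean that gapped columns are excluded from both the identity count and the number of compared sites.

```java
class FractionalDissimilaritySketch {

    /**
     * Fraction of differing sites between two aligned rows.
     * Assumption: columns with a gap ('-') in either row are skipped entirely.
     */
    static double fractionalDissimilarity(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        int compared = 0;
        int identical = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x == '-' || y == '-') {
                continue; // gapped position: ignored
            }
            compared++;
            if (x == y) {
                identical++;
            }
        }
        if (compared == 0) {
            return Double.NaN; // assumption: undefined without any ungapped column
        }
        double pid = (double) identical / compared; // fraction of identical sites
        return 1.0 - pid;                           // D = 1 - PID
    }

    public static void main(String[] args) {
        // Five ungapped columns, four of them identical.
        System.out.println(fractionalDissimilarity("ACD-EF", "ACDGEY"));
    }
}
```

A full distance matrix would repeat this calculation for every pair of rows in the alignment.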
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix poissonDistance(MultipleSequenceAlignment<C,D> msa) throws IOException
d = -log(1 - D)
The gapped positions in the alignment are ignored in the calculation. This method is a wrapper to the forester implementation of the calculation:
PairwiseDistanceCalculator.calcPoissonDistances(Msa)
Parameters:
msa - MultipleSequenceAlignment
Throws:
IOException
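As a small worked example of the Poisson correction, the sketch below applies d = -log(1 - D) to a single fractional dissimilarity value, using the natural logarithm. The class and method names are hypothetical, and the range check is an assumption of this sketch rather than documented behavior of the forester implementation.

```java
class PoissonDistanceSketch {

    /** Poisson-corrected distance for one fractional dissimilarity D in [0, 1]. */
    static double poissonDistance(double fractionalDissimilarity) {
        if (fractionalDissimilarity < 0.0 || fractionalDissimilarity > 1.0) {
            // Assumption: values outside [0, 1] are rejected.
            throw new IllegalArgumentException("D must lie in [0, 1]");
        }
        // d = -log(1 - D); grows without bound as D approaches 1.
        return -Math.log(1.0 - fractionalDissimilarity);
    }

    public static void main(String[] args) {
        for (double d : new double[] {0.0, 0.1, 0.5, 0.9}) {
            System.out.println("D = " + d + " -> d = " + poissonDistance(d));
        }
    }
}
```

For small D the corrected distance stays close to D itself; the correction matters as D grows.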
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix kimuraDistance(MultipleSequenceAlignment<C,D> msa) throws IOException
d = -log(1 - D - 0.2 * D^{2})
The equation is derived by fitting the relationship between the evolutionary distance (d) and the fractional dissimilarity (D) according to the PAM model of evolution (it is an empirical approximation for the method pamMLdistance(MultipleSequenceAlignment)). The gapped positions in the alignment are ignored in the calculation. This method is a wrapper to the forester implementation of the calculation:
PairwiseDistanceCalculator.calcKimuraDistances(Msa)
Parameters:
msa - MultipleSequenceAlignment
Throws:
IOException
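The Kimura formula d = -log(1 - D - 0.2 * D^{2}) can be evaluated directly for a single value of D, as in the minimal sketch below. The names are hypothetical. The argument of the logarithm reaches zero at roughly D = 0.854, so the formula is undefined beyond that point; returning positive infinity there is an assumption of this sketch, not documented behavior of the forester implementation.

```java
class KimuraDistanceSketch {

    /** Kimura-corrected distance for one fractional dissimilarity D. */
    static double kimuraDistance(double fractionalDissimilarity) {
        double d = fractionalDissimilarity;
        double argument = 1.0 - d - 0.2 * d * d;
        if (argument <= 0.0) {
            // Assumption: treat the undefined region as an infinite distance.
            return Double.POSITIVE_INFINITY;
        }
        return -Math.log(argument);
    }

    public static void main(String[] args) {
        for (double d : new double[] {0.0, 0.2, 0.5, 0.8, 0.9}) {
            System.out.println("D = " + d + " -> d = " + kimuraDistance(d));
        }
    }
}
```

For any D above zero the extra 0.2 * D^{2} term makes this distance larger than the plain -log(1 - D) value.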
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix percentageIdentity(MultipleSequenceAlignment<C,D> msa)
It is recommended to use the method fractionalDissimilarity(MultipleSequenceAlignment) instead of this one.
Parameters:
msa - MultipleSequenceAlignment
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix fractionalDissimilarityScore(MultipleSequenceAlignment<C,D> msa, SubstitutionMatrix<D> M)
Ds = sum_{i}( (max(M) - M_{ai,bi}) / (max(M) - min(M)) ) / L
Where the sum through i runs for all the alignment positions, ai and bi are the AA at position i in the first and second aligned sequences, respectively, and L is the total length of the alignment (normalization).
The fractional dissimilarity score (Ds) is in the interval [0, 1], where 0 means that the sequences are identical and 1 that the sequences are completely different.
Gaps do not have a contribution to the similarity score calculation (gap penalty = 0).
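The normalization in Ds is easier to follow on a toy case. The sketch below is not the BioJava implementation: the three-letter alphabet and the matrix values are made up for illustration, the names are hypothetical, and gap handling is deliberately left out because the description only states that gaps do not contribute to the similarity score.

```java
class FractionalDissimilarityScoreSketch {

    // Hypothetical alphabet and symmetric toy matrix; not a real substitution matrix.
    static final String ALPHABET = "ACD";
    static final int[][] M = {
        { 4, -1, -2},
        {-1,  5, -3},
        {-2, -3,  6}
    };

    /** Ds for two ungapped aligned rows over ALPHABET. */
    static double fractionalDissimilarityScore(String a, String b) {
        if (a.length() != b.length() || a.isEmpty()) {
            throw new IllegalArgumentException("rows must be non-empty and of equal length");
        }
        int max = Integer.MIN_VALUE;
        int min = Integer.MAX_VALUE;
        for (int[] row : M) {
            for (int value : row) {
                max = Math.max(max, value);
                min = Math.min(min, value);
            }
        }
        double sum = 0.0;
        for (int i = 0; i < a.length(); i++) {
            int x = ALPHABET.indexOf(a.charAt(i));
            int y = ALPHABET.indexOf(b.charAt(i));
            if (x < 0 || y < 0) {
                throw new IllegalArgumentException("symbol outside the toy alphabet");
            }
            // Per-position term: (max(M) - M_{ai,bi}) / (max(M) - min(M))
            sum += (double) (max - M[x][y]) / (max - min);
        }
        return sum / a.length(); // normalize by the alignment length L
    }

    public static void main(String[] args) {
        System.out.println(fractionalDissimilarityScore("DDD", "DDD"));
        System.out.println(fractionalDissimilarityScore("AAA", "AAA"));
        System.out.println(fractionalDissimilarityScore("CD", "DC"));
    }
}
```

With the formula as written, a self-comparison gives exactly 0 only when every diagonal entry of the matrix equals max(M); in this toy matrix that holds for D but not for A.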
Parameters:
msa - MultipleSequenceAlignment
M - SubstitutionMatrix for similarity scoring
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix dissimilarityScore(MultipleSequenceAlignment<C,D> msa, SubstitutionMatrix<D> M)
Ds = maxScore - sum_{i}(M_{ai,bi})
It is recommended to use the method fractionalDissimilarityScore(MultipleSequenceAlignment, SubstitutionMatrix), since the maximum similarity score is not relative to the data set, but relative to the Substitution Matrix, and the score is normalized by the alignment length (fractional).
Gaps do not have a contribution to the similarity score calculation (gap penalty = 0).
Parameters:
msa - MultipleSequenceAlignment
M - SubstitutionMatrix for similarity scoring
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix pamMLdistance(MultipleSequenceAlignment<C,D> msa)
D = sum(fi * (1 - M_{ii}^{d}))
Where the sum is for all 20 AA, fi denotes the natural fraction of the given AA and M is the substitution matrix (in this case the PAM1 matrix).
To calculate the PAM distance between two aligned sequences the maximum likelihood (ML) approach is used, which consists of finding the d that maximizes the function:
L(d) = product(f_{ai} * (1 - M_{ai,bi}^{d}))
Where the product is for every position i in the alignment, and ai and bi are the AA at position i in the first and second aligned sequences, respectively.
Parameters:
msa - MultipleSequenceAlignment
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix structuralDistance(double[][] rmsdMat, double alpha, double rmsdMax, double rmsd0)
d_{Sij} = (rmsd_{max}^{2} / alpha^{2}) * ln( (rmsd_{max}^{2} - rmsd_{0}^{2}) / (rmsd_{max}^{2} - rmsd_{ij}^{2}) )
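The formula can be applied element by element to an RMSD matrix. The following is a minimal sketch that returns a plain double[][] instead of a forester DistanceMatrix; the class name is hypothetical. Two choices are assumptions of this sketch and not documented behavior: pairs with rmsd_{ij} at or above rmsd_{max} get an infinite distance, and negative values (which arise when rmsd_{ij} is below rmsd_{0}) are clamped to zero.

```java
class StructuralDistanceSketch {

    /** Pairwise structural distances from a symmetric RMSD matrix. */
    static double[][] structuralDistance(double[][] rmsdMat, double alpha,
                                         double rmsdMax, double rmsd0) {
        int n = rmsdMat.length;
        double max2 = rmsdMax * rmsdMax;
        double prefactor = max2 / (alpha * alpha); // rmsd_max^2 / alpha^2
        double numerator = max2 - rmsd0 * rmsd0;   // rmsd_max^2 - rmsd_0^2
        double[][] dist = new double[n][n];        // diagonal stays at 0
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double denominator = max2 - rmsdMat[i][j] * rmsdMat[i][j];
                double d;
                if (denominator <= 0.0) {
                    d = Double.POSITIVE_INFINITY;  // assumption: beyond rmsd_max
                } else {
                    d = prefactor * Math.log(numerator / denominator);
                }
                d = Math.max(d, 0.0);              // assumption: no negative distances
                dist[i][j] = d;
                dist[j][i] = d;
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        double[][] rmsd = {
            {0.0, 1.0, 3.0},
            {1.0, 0.0, 2.0},
            {3.0, 2.0, 0.0}
        };
        // Values attributed to Grishin 1995 in the parameter descriptions.
        double[][] dist = structuralDistance(rmsd, 1.0, 5.0, 0.4);
        for (double[] row : dist) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

The distance grows faster than the RMSD itself and diverges as rmsd_{ij} approaches rmsd_{max}.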
Parameters:
rmsdMat - RMSD matrix for all structure pairs (symmetric matrix)
alpha - change in CA positions introduced by a single AA substitution (Grishin 1995: 1 Å)
rmsdMax - estimated RMSD between proteins of the same fold when the percentage of identity is infinitely low (the maximum allowed RMSD of proteins with the same fold). (Grishin 1995: 5 Å)
rmsd0 - arithmetical mean of squares of the RMSD for identical proteins (Grishin 1995: 0.4 Å)
public static <C extends Sequence<D>,D extends Compound> org.forester.evoinference.matrix.distance.DistanceMatrix jointSeqStrucDistance(double[][] rmsdMat)
Parameters:
rmsdMat - RMSD matrix for all structure pairs (symmetric matrix)
Copyright © 2000–2017 BioJava. All rights reserved.