Class StructureSequenceMatcher


  • public class StructureSequenceMatcher
    extends Object
    A utility class with methods for matching ProteinSequences with Structures.
    Author:
    Spencer Bliven
    • Method Detail

      • getSubstructureMatchingProteinSequence

        public static Structure getSubstructureMatchingProteinSequence(ProteinSequence sequence,
                                                                       Structure wholeStructure)
        Get a substructure of wholeStructure containing only the Groups that are included in sequence. The resulting structure will contain only ATOM residues; the SEQRES groups will be empty. The Chains of the Structure will be new instances (cloned), but the Groups will not be cloned, so they are shared with wholeStructure.
        Parameters:
        sequence - The input protein sequence
        wholeStructure - The structure from which to take a substructure
        Returns:
        The resulting structure
        Throws:
        StructureException
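        The cloning behaviour described above (new Chains, shared Groups) can be illustrated with a minimal, library-free sketch. Residue and subset are hypothetical stand-ins for Group and this method, not BioJava API, and dropping chains with no selected residue is an assumption of the sketch rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; Residue and subset are hypothetical, not BioJava API.
class SubstructureSketch {

    /** Hypothetical stand-in for a Group. Equality is object identity. */
    static final class Residue {
        final String label;
        Residue(String label) { this.label = label; }
    }

    /**
     * Builds new chain lists holding only the residues in keep.
     * The chain containers are new; the Residue objects are reused.
     */
    static List<List<Residue>> subset(List<List<Residue>> chains, Set<Residue> keep) {
        List<List<Residue>> result = new ArrayList<>();
        for (List<Residue> chain : chains) {
            List<Residue> copy = new ArrayList<>();
            for (Residue r : chain) {
                if (keep.contains(r)) copy.add(r);
            }
            // Assumption of this sketch: chains with no selected residue are dropped.
            if (!copy.isEmpty()) result.add(copy);
        }
        return result;
    }

    public static void main(String[] args) {
        Residue a1 = new Residue("A:1"), a2 = new Residue("A:2"), b1 = new Residue("B:1");
        List<List<Residue>> whole = Arrays.asList(Arrays.asList(a1, a2), Arrays.asList(b1));
        List<List<Residue>> sub = subset(whole, new HashSet<>(Arrays.asList(a2, b1)));
        System.out.println(sub.get(0) != whole.get(0)); // true: the chain list is a new object
        System.out.println(sub.get(0).get(0) == a2);    // true: the residue is the same instance
    }
}
```

        Because the residues are shared, a change made to one through the substructure would also be visible through the original structure.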
      • getProteinSequenceForStructure

        public static ProteinSequence getProteinSequenceForStructure(Structure struct,
                                                                     Map<Integer,Group> groupIndexPosition)
        Generates a ProteinSequence corresponding to the sequence of struct, and maintains a mapping from the sequence back to the original groups. Chains are appended to one another. 'X' is used for heteroatoms.
        Parameters:
        struct - Input structure
        groupIndexPosition - An empty map, which will be populated with (residue index in returned ProteinSequence) -> (Group within struct)
        Returns:
        A ProteinSequence with the full sequence of struct. Chains are concatenated in the same order as in the input structure.
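        The bookkeeping this method performs (concatenate chains, write 'X' for heteroatoms, record which group produced each sequence position) can be sketched without the library. Residue and buildSequence are hypothetical names, and the 0-based indexing of the map is an assumption of the sketch, not something this page states.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; Residue and buildSequence are hypothetical, not BioJava API.
class SequenceIndexSketch {

    /** Hypothetical stand-in for a Group: one-letter code (null for a heteroatom) and a label. */
    static final class Residue {
        final Character aminoAcid;
        final String label;
        Residue(Character aminoAcid, String label) {
            this.aminoAcid = aminoAcid;
            this.label = label;
        }
    }

    /** Concatenates all chains into one string and fills indexToResidue as it goes. */
    static String buildSequence(List<List<Residue>> chains, Map<Integer, Residue> indexToResidue) {
        StringBuilder seq = new StringBuilder();
        for (List<Residue> chain : chains) {
            for (Residue r : chain) {
                // Assumption: positions are 0-based offsets into the concatenated sequence.
                indexToResidue.put(seq.length(), r);
                char code = (r.aminoAcid == null) ? 'X' : r.aminoAcid;
                seq.append(code);
            }
        }
        return seq.toString();
    }

    public static void main(String[] args) {
        List<Residue> chainA = Arrays.asList(
                new Residue('M', "A:1"), new Residue('K', "A:2"), new Residue(null, "A:101"));
        List<Residue> chainB = Arrays.asList(new Residue('G', "B:1"));
        Map<Integer, Residue> index = new LinkedHashMap<>();
        String seq = buildSequence(Arrays.asList(chainA, chainB), index);
        System.out.println(seq);                // MKXG
        System.out.println(index.get(3).label); // B:1
    }
}
```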
      • matchSequenceToStructure

        public static ResidueNumber[] matchSequenceToStructure(ProteinSequence seq,
                                                               Structure struct)
        Given a sequence and the corresponding Structure, get the ResidueNumber for each residue in the sequence.

        Smith-Waterman alignment is used to match the sequences. Residues that are in the sequence but not in the structure, or that are mismatched between sequence and structure, will have a null entry in the result, while residues in the structure but not in the sequence are ignored with a warning.

        Parameters:
        seq - The protein sequence. Should match the sequence of struct very closely.
        struct - The corresponding protein structure
        Returns:
        An array of ResidueNumbers of the same length as seq, containing either the corresponding residue or null.
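        The mapping rule described above can be sketched with a small, self-contained Smith-Waterman alignment over plain strings. This is not the BioJava implementation: the match method, the linear gap penalty, and the scoring values are assumptions chosen for illustration, and the result holds indices into the structure's sequence rather than ResidueNumbers.

```java
import java.util.Arrays;

// Illustrative sketch only; scoring values and method names are assumptions.
class SequenceMatchSketch {

    /**
     * For each position of seq, returns the index of the identical aligned residue
     * in structSeq, or null if the position is unaligned or mismatched.
     */
    static Integer[] match(String seq, String structSeq) {
        final int MATCH = 2, MISMATCH = -1, GAP = -2;
        int n = seq.length(), m = structSeq.length();
        int[][] score = new int[n + 1][m + 1];
        int best = 0, bestI = 0, bestJ = 0;
        // Fill the local-alignment matrix; scores never drop below zero.
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                boolean same = seq.charAt(i - 1) == structSeq.charAt(j - 1);
                int diag = score[i - 1][j - 1] + (same ? MATCH : MISMATCH);
                int up = score[i - 1][j] + GAP;
                int left = score[i][j - 1] + GAP;
                score[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                if (score[i][j] > best) {
                    best = score[i][j];
                    bestI = i;
                    bestJ = j;
                }
            }
        }
        Integer[] result = new Integer[n]; // every entry starts as null
        // Trace back from the best cell until the local alignment ends.
        int i = bestI, j = bestJ;
        while (i > 0 && j > 0 && score[i][j] > 0) {
            boolean same = seq.charAt(i - 1) == structSeq.charAt(j - 1);
            if (score[i][j] == score[i - 1][j - 1] + (same ? MATCH : MISMATCH)) {
                if (same) result[i - 1] = j - 1; // mismatched pairs stay null
                i--;
                j--;
            } else if (score[i][j] == score[i - 1][j] + GAP) {
                i--; // residue present in seq only: stays null
            } else {
                j--; // residue present in the structure only: skipped
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The structure is missing the 'T' at sequence position 2.
        System.out.println(Arrays.toString(match("MKTAYIAK", "MKAYIAK")));
        // [0, 1, null, 2, 3, 4, 5, 6]
    }
}
```

        The sketch skips structure-only residues silently, whereas the method documented here logs a warning for them.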
      • removeGaps

        public static <T> T[][] removeGaps(T[][] gapped)
        Creates a new array consisting of all columns of gapped in which no row contains a null value. Here, "row" refers to the first index and "column" to the second, e.g. gapped[row][column].
        Parameters:
        gapped - A rectangular matrix containing null to mark gaps
        Returns:
        A new array without the columns that contain nulls
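        A minimal sketch of the column filtering described for removeGaps follows. It is written from the description on this page rather than from the BioJava source, so the handling of empty or null input is an assumption.

```java
import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the BioJava source.
class RemoveGapsSketch {

    /** Returns a copy of gapped keeping only the columns in which no row holds null. */
    @SuppressWarnings("unchecked")
    static <T> T[][] removeGaps(T[][] gapped) {
        // Assumption: null and empty inputs are passed through without error.
        if (gapped == null) return null;
        if (gapped.length == 0) return Arrays.copyOf(gapped, 0);
        int width = gapped[0].length;
        // Collect the indices of columns that are gap-free in every row.
        List<Integer> keep = new ArrayList<>();
        for (int col = 0; col < width; col++) {
            boolean gapFree = true;
            for (T[] row : gapped) {
                if (row[col] == null) {
                    gapFree = false;
                    break;
                }
            }
            if (gapFree) keep.add(col);
        }
        // Build the result with the same runtime array types as the input.
        Class<?> rowType = gapped.getClass().getComponentType(); // T[]
        Class<?> cellType = rowType.getComponentType();          // T
        T[][] out = (T[][]) Array.newInstance(rowType, gapped.length);
        for (int r = 0; r < gapped.length; r++) {
            out[r] = (T[]) Array.newInstance(cellType, keep.size());
            for (int k = 0; k < keep.size(); k++) {
                out[r][k] = gapped[r][keep.get(k)];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] gapped = {
            {"A", null, "C", "D"},
            {"A", "B",  "C", null},
        };
        System.out.println(Arrays.deepToString(removeGaps(gapped))); // [[A, C], [A, C]]
    }
}
```

        Reflection is used to allocate the result because Java cannot create a generic array with new T[n][m] directly.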