BioJava:Tutorial:Simple HMMs with BioJava

Tutorial

by David Huen

We will now go through the source of one of the examples in the demos/dp directory: Dice.java (contributed by Samiul Hasan).

Introduction

The program implements the “occasionally dishonest casino” example used in the book “Biological Sequence Analysis” by R. Durbin, S. Eddy, A. Krogh, G. Mitchison.

The resultant HMM looks like
this.

Basically, it conceives a casino with two dice, one fair and one loaded. The fair die lands on any of its sides equal probability while the loaded die yields “6” half the time, all the other sides being of equal probability. These probabilities represent the emission distribution of the fair die state and the loaded die states respectively.

The casino switches between using the fair die and the loaded die periodically. When on the fair die, the probability that the next throw is with the fair die too is 0.95. Similarly, when on the loaded die, the probability of continuing with it is 0.90. These probabilities yield the transition distributions of the states.

The HMM as modelled in the code is slightly modified from the above description with the inclusion of a MagicalState. This state is used to represent the start and end of the states of the model. The transition from the MagicalState to the fair die state occurs with a probability of 0.8 while the transition to the loaded die state occurs with a probability of 0.2. A termination condition was also introduced to allow transitions from the fair die and loaded die states to the Magical state with a probability of 0.01.

Code

The core of the program is the createCasino() method. This creates an instance of the MarkovModel class that implements the model.

  public static MarkovModel createCasino() {
    Symbol[] rolls=new Symbol[6];

    //set up the dice alphabet
    SimpleAlphabet diceAlphabet=new SimpleAlphabet();
    diceAlphabet.setName("DiceAlphabet");

    for(int i=1;i<7;i++) {
      try {
        rolls[i-1]= AlphabetManager.createSymbol((char)('0'+i),""+i,Annotation.EMPTY_ANNOTATION);
        diceAlphabet.addSymbol(rolls[i-1]);
      } catch (Exception e) {
        throw new NestedError(
          e, "Can't create symbols to represent dice rolls"
        );
      }
    }

A Symbol array rolls is created to hold the symbols generated by AlphabetManager to represent the outcomes of the dice. An alphabet is also defined over these symbols.

Next, distributions representing the emission probabilities of the fair die and loaded die states are created (named fairD and loadedD respectively). The die states themselves are then created as SimpleEmissionStates, fairS and loadedS respectively.

You will observe an int array advance with a single value of 1. In a single-head HMM like ours, there is only one generated sequence and in our case, we progress along this sole sequence a single position per transition in the model. In multihead HMMs, there will be multiple sequences generated by the HMM and it is possible that the increment through the different sequences might be different. For example, single-stepping a protein sequence amounts to an increment of three on its corresponding DNA sequence.

int [] advance = { 1 };
Distribution fairD;
Distribution loadedD;
try {
  fairD = DistributionFactory.DEFAULT.createDistribution(diceAlphabet);
  loadedD = DistributionFactory.DEFAULT.createDistribution(diceAlphabet);
} catch (Exception e) {
  throw new NestedError(e, "Can't create distributions");
}
EmissionState fairS = new SimpleEmissionState("fair", Annotation.EMPTY_ANNOTATION, advance, fairD);
EmissionState loadedS = new SimpleEmissionState("loaded", Annotation.EMPTY_ANNOTATION, advance, loadedD);

The HMM is then created with these states:-

    SimpleMarkovModel casino = new SimpleMarkovModel(1, diceAlphabet, "Casino");
    try {
      casino.addState(fairS);
      casino.addState(loadedS);
    } catch (Exception e) {
      throw new NestedError(e, "Can't add states to model");
    }

Next, we need to model the transitions between the states. We do this like so:-

    try {
      casino.createTransition(casino.magicalState(),fairS);
      casino.createTransition(casino.magicalState(),loadedS);
      casino.createTransition(fairS,casino.magicalState());
      casino.createTransition(loadedS,casino.magicalState());
      casino.createTransition(fairS,loadedS);
      casino.createTransition(loadedS,fairS);
      casino.createTransition(fairS,fairS);
      casino.createTransition(loadedS,loadedS);
    } catch (Exception e) {
      throw new NestedError(e, "Can't create transitions");
    }

Note the presence of a MagicalState that is returned by casino.magicalState(). This is inherent to the SimpleMarkovModel class and does not need to be created by the user.

The emission distributions fairD and loadedD we set up earlier need to be initialised. We do that here.

    try {
      for(int i=0;i<rolls.length;i++)   {
        fairD.setWeight(rolls[i],1.0/6.0);
        loadedD.setWeight(rolls[i], 0.1);
      }
      loadedD.setWeight(rolls[5],0.5);
    } catch (Exception e) {
      throw new NestedError(e, "Can't set emission probabilities");
    }

We also need to initialise the transition distributions. Note how this is done: the transition distribution of each state is requested from the model with a getWeights() and then updated with the required values by calling the getWeight() method of that distribution. It is not necessary thereafter to call setWeights() to pass the distribution for a state back to the model. This may seem strange but it is done this way because model object may use unique Distribution classes that cannot be replaced by a generic Distribution class for greater internal efficiency. Every state in the model needs to have its own transition distribution initialised appropriately.

    //set up transition scores.
    try {
      Distribution dist;

      dist = casino.getWeights(casino.magicalState());
      dist.setWeight(fairS, 0.8);
      dist.setWeight(loadedS, 0.2);

      dist = casino.getWeights(fairS);
      dist.setWeight(loadedS,               0.04);
      dist.setWeight(fairS,                 0.95);
      dist.setWeight(casino.magicalState(), 0.01);

      dist = casino.getWeights(loadedS);
      dist.setWeight(fairS,                 0.09);
      dist.setWeight(loadedS,               0.90);
      dist.setWeight(casino.magicalState(), 0.01);
    } catch (Exception e) {
      throw new NestedError(e, "Can't set transition probabilities");
    }

Having completed constructing the MarkovModel, all that remains is to return it to the caller.

    return casino;

Using the MarkovModel

Having created the MarkovModel, we create the corresponding dynamic programming object:-

      DP dp=DPFactory.DEFAULT.createDP(casino);

Now, at last, we have something we can use! To generate a sequence of dice throws with this model, we do:-

      StatePath obs_rolls = dp.generate(300);

      SymbolList roll_sequence = obs_rolls.symbolListForLabel(StatePath.SEQUENCE);

The generate() method generates a path through the model that emits 300 symbols and we turn that path into a SymbolList with the second line.

At this point, it will be worthwhile digressing briefly on the StatePath object. This object embodies an Alignment of the sequences emitted by the DP object. In a multihead object, multiple aligned sequences will be emitted. In our case, only a single sequence is emitted and that is accessed with the label StatePath.SEQUENCE. That sequence turns out to a run of dice throws from our occasionally dishonest casino.

Next, we want to test one of the DP algorithms in the DP object. We want to process the roll_sequence SymbolList we have just generated and use the Viterbi method to predict which die each of the throws might have arisen from.

To do this, we create an array of SymbolLists with only roll_sequence in it - ours is a single-head HMM - and apply the Viterbi algorithm using the model probabilities (ScoreType.PROBABILITY) for the computation (you could have also applied the null-model or log-odds probabilities here). This will yield the state path that has most support from the model and that state path is the model’s prediction of which die a particular result came from.

      SymbolList[] res_array = {roll_sequence};
      StatePath v = dp.viterbi(res_array, ScoreType.PROBABILITY);

All that remains is to print out the generated sequence and the actual state path from which it came (ie. which die) and the HMM estimate of the state path, v, (which is which die a particular throw came from as estimated by the HMM).

      //print out obs_sequence, output, state symbols.
      for(int i = 1; i <= obs_rolls.length()/60; i++) {
        for(int j=i*60; j

The first two print statements print the generated sequence and the actual state path by accessing the obs_rolls StatePath<code> with the <code>StatePath.SEQUENCE and StatePath.STATES respectively. The predicted state path comes from the third print block which accesses the v StatePath.

The output then looks like this:-

544552213525245666363632432522253566166546666666533666543261
fffffffffffflllllllllllfffffffffffflllllllllllllllllllffffff
ffffffffffffffffffffffffffffffffffllllllllllllllllllllffffff

363546253252546524422555242223224344432423341365415551632161
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff

144212242323456563652263346116214136666156616666566421456123
fffffflllfffffffffffffffffffffffflfllllllllllllllllfffffffff
fffffffffffffffffffffffffffffffffffllllllllllllllllfffffffff

346313546514332164351242356166641344615135266642261112465663
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff

The top line is the sequence emitted by our HMM when we made it generate 300 throws. The next is the state from which the throw came (f-fair l-loaded, these are the first letters of the labels “fair” and “loaded” we used when creating the SimpleEmissionState objects that represent the dice). The last is similar but this time from the StatePath v that is the result of the Viterbi algorithm. The performance is pretty on this occasion but it can vary widely!