## BioJava:Tutorial:Simple HMMs with BioJava

Tutorial

by David Huen

We will now go through the source of one of the examples in the `demos/dp` directory: `Dice.java` (contributed by Samiul Hasan).

## Introduction

The program implements the “occasionally dishonest casino” example used in the book “Biological Sequence Analysis” by R. Durbin, S. Eddy, A. Krogh, G. Mitchison.

Basically, it conceives a casino with two dice, one fair and one loaded. The fair die lands on any of its sides equal probability while the loaded die yields “6” half the time, all the other sides being of equal probability. These probabilities represent the emission distribution of the fair die state and the loaded die states respectively.

The casino switches between using the fair die and the loaded die periodically. When on the fair die, the probability that the next throw is with the fair die too is 0.95. Similarly, when on the loaded die, the probability of continuing with it is 0.90. These probabilities yield the transition distributions of the states.

The HMM as modelled in the code is slightly modified from the above description with the inclusion of a MagicalState. This state is used to represent the start and end of the states of the model. The transition from the MagicalState to the fair die state occurs with a probability of 0.8 while the transition to the loaded die state occurs with a probability of 0.2. A termination condition was also introduced to allow transitions from the fair die and loaded die states to the Magical state with a probability of 0.01.

## Code

The core of the program is the `createCasino()` method. This creates an instance of the `MarkovModel` class that implements the model.

``````  public static MarkovModel createCasino() {
Symbol[] rolls=new Symbol[6];

//set up the dice alphabet
SimpleAlphabet diceAlphabet=new SimpleAlphabet();
diceAlphabet.setName("DiceAlphabet");

for(int i=1;i<7;i++) {
try {
rolls[i-1]= AlphabetManager.createSymbol((char)('0'+i),""+i,Annotation.EMPTY_ANNOTATION);
} catch (Exception e) {
throw new NestedError(
e, "Can't create symbols to represent dice rolls"
);
}
}
``````

A `Symbol` array `rolls` is created to hold the symbols generated by `AlphabetManager` to represent the outcomes of the dice. An alphabet is also defined over these symbols.

Next, distributions representing the emission probabilities of the fair die and loaded die states are created (named `fairD` and `loadedD` respectively). The die states themselves are then created as `SimpleEmissionStates`, `fairS` and `loadedS` respectively.

You will observe an `int` array advance with a single value of 1. In a single-head HMM like ours, there is only one generated sequence and in our case, we progress along this sole sequence a single position per transition in the model. In multihead HMMs, there will be multiple sequences generated by the HMM and it is possible that the increment through the different sequences might be different. For example, single-stepping a protein sequence amounts to an increment of three on its corresponding DNA sequence.

``````int [] advance = { 1 };
Distribution fairD;
try {
fairD = DistributionFactory.DEFAULT.createDistribution(diceAlphabet);
} catch (Exception e) {
throw new NestedError(e, "Can't create distributions");
}
EmissionState fairS = new SimpleEmissionState("fair", Annotation.EMPTY_ANNOTATION, advance, fairD);
``````

The HMM is then created with these states:-

``````    SimpleMarkovModel casino = new SimpleMarkovModel(1, diceAlphabet, "Casino");
try {
} catch (Exception e) {
throw new NestedError(e, "Can't add states to model");
}
``````

Next, we need to model the transitions between the states. We do this like so:-

``````    try {
casino.createTransition(casino.magicalState(),fairS);
casino.createTransition(fairS,casino.magicalState());
casino.createTransition(fairS,fairS);
} catch (Exception e) {
throw new NestedError(e, "Can't create transitions");
}
``````

Note the presence of a `MagicalState` that is returned by `casino.magicalState()`. This is inherent to the `SimpleMarkovModel` class and does not need to be created by the user.

The emission distributions `fairD` and `loadedD` we set up earlier need to be initialised. We do that here.

``````    try {
for(int i=0;i<rolls.length;i++)   {
fairD.setWeight(rolls[i],1.0/6.0);
}
} catch (Exception e) {
throw new NestedError(e, "Can't set emission probabilities");
}
``````

We also need to initialise the transition distributions. Note how this is done: the transition distribution of each state is requested from the model with a `getWeights()` and then updated with the required values by calling the `getWeight()` method of that distribution. It is not necessary thereafter to call `setWeights()` to pass the distribution for a state back to the model. This may seem strange but it is done this way because model object may use unique `Distribution` classes that cannot be replaced by a generic `Distribution` class for greater internal efficiency. Every state in the model needs to have its own transition distribution initialised appropriately.

``````    //set up transition scores.
try {
Distribution dist;

dist = casino.getWeights(casino.magicalState());
dist.setWeight(fairS, 0.8);

dist = casino.getWeights(fairS);
dist.setWeight(fairS,                 0.95);
dist.setWeight(casino.magicalState(), 0.01);

dist.setWeight(fairS,                 0.09);
dist.setWeight(casino.magicalState(), 0.01);
} catch (Exception e) {
throw new NestedError(e, "Can't set transition probabilities");
}
``````

Having completed constructing the MarkovModel, all that remains is to return it to the caller.

``````    return casino;
``````

## Using the MarkovModel

Having created the `MarkovModel`, we create the corresponding dynamic programming object:-

``````      DP dp=DPFactory.DEFAULT.createDP(casino);
``````

Now, at last, we have something we can use! To generate a sequence of dice throws with this model, we do:-

``````      StatePath obs_rolls = dp.generate(300);

SymbolList roll_sequence = obs_rolls.symbolListForLabel(StatePath.SEQUENCE);
``````

The `generate()` method generates a path through the model that emits 300 symbols and we turn that path into a `SymbolList` with the second line.

At this point, it will be worthwhile digressing briefly on the `StatePath` object. This object embodies an `Alignment` of the sequences emitted by the `DP` object. In a multihead object, multiple aligned sequences will be emitted. In our case, only a single sequence is emitted and that is accessed with the label `StatePath.SEQUENCE`. That sequence turns out to a run of dice throws from our occasionally dishonest casino.

Next, we want to test one of the DP algorithms in the `DP` object. We want to process the `roll_sequence SymbolList` we have just generated and use the Viterbi method to predict which die each of the throws might have arisen from.

To do this, we create an array of `SymbolLists` with only `roll_sequence` in it - ours is a single-head HMM - and apply the Viterbi algorithm using the model probabilities (`ScoreType.PROBABILITY`) for the computation (you could have also applied the null-model or log-odds probabilities here). This will yield the state path that has most support from the model and that state path is the model’s prediction of which die a particular result came from.

``````      SymbolList[] res_array = {roll_sequence};
StatePath v = dp.viterbi(res_array, ScoreType.PROBABILITY);
``````

All that remains is to print out the generated sequence and the actual state path from which it came (ie. which die) and the HMM estimate of the state path, v, (which is which die a particular throw came from as estimated by the HMM).

``````      //print out obs_sequence, output, state symbols.
for(int i = 1; i <= obs_rolls.length()/60; i++) {
for(int j=i*60; j
``````

The first two print statements print the generated sequence and the actual state path by accessing the ```obs_rolls StatePath<code> with the <code>StatePath.SEQUENCE``` and `StatePath.STATES` respectively. The predicted state path comes from the third print block which accesses the `v` StatePath.

The output then looks like this:-

``````544552213525245666363632432522253566166546666666533666543261
fffffffffffflllllllllllfffffffffffflllllllllllllllllllffffff
ffffffffffffffffffffffffffffffffffllllllllllllllllllllffffff

363546253252546524422555242223224344432423341365415551632161
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff

144212242323456563652263346116214136666156616666566421456123
fffffflllfffffffffffffffffffffffflfllllllllllllllllfffffffff
fffffffffffffffffffffffffffffffffffllllllllllllllllfffffffff

346313546514332164351242356166641344615135266642261112465663
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
``````

The top line is the sequence emitted by our HMM when we made it generate 300 throws. The next is the state from which the throw came (f-fair l-loaded, these are the first letters of the labels “fair” and “loaded” we used when creating the `SimpleEmissionState` objects that represent the dice). The last is similar but this time from the StatePath `v` that is the result of the Viterbi algorithm. The performance is pretty on this occasion but it can vary widely!