Hidden Markov Model

1 Introduction

A Hidden Markov Model (HMM) is used to process sequential data. It models the hidden states underlying the observations. Consider a sequence \(\mathbf{s}=(s_{1},\dots,s_{n})\), where \(s_{i}\) is a single element and \(n\) is variable. An HMM models the probability of generating the observation sequence with hidden units, where each hidden unit corresponds to one observed element. Each hidden unit and each observed element has its own state: the state of a hidden unit is called a hidden state and the state of an observed element is called an observed state. An HMM specifies the probability of each hidden state generating each observed state and the transition probabilities between hidden states.

For example, if \(\mathbf{s}\) is a sentence, \(s_{i}\) can be a word, and the length of the sentence is variable. For a POS tagging task, the hidden states are the POS labels, so the number of hidden states is the number of POS labels. Tagging is then finding, for each word, the hidden state assignment that maximizes the probability of the HMM generating the sentence, which is computed from the probability of each POS label emitting each word and the probability of each POS label following the preceding one.

For simplicity, we discuss the first-order HMM here, in which the state of the current hidden unit depends only on the one before it. A first-order HMM contains two types of parameters:

  • Emission probabilities \(\mathbf{B}\): the probability of each hidden state generating each observed value, where \(\mathbf{B}_{i,j}\) denotes the probability of hidden state \(i\) generating observed state \(j\).
  • Transition probabilities \(\mathbf{P}\): the probability of a hidden state given the preceding hidden states (more than one state in higher-order models), where \(\mathbf{P}_{i,j}\) is the probability of hidden state \(j\) transitioning to hidden state \(i\).
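To make these definitions concrete, here is a tiny hypothetical example in Python/NumPy; the numbers, the prior vector \(\boldsymbol{\pi}\) over hidden states, and all variable names are illustrative assumptions rather than anything specified above.

  import numpy as np

  # Prior probability of each hidden state (assumed; needed later for decoding).
  pi = np.array([0.6, 0.4])

  # Transition matrix: P[i, j] = probability of hidden state j -> hidden state i,
  # so each column sums to 1.
  P = np.array([[0.7, 0.4],
                [0.3, 0.6]])

  # Emission matrix: B[i, j] = probability of hidden state i emitting observed state j,
  # so each row sums to 1.
  B = np.array([[0.5, 0.4, 0.1],
                [0.1, 0.3, 0.6]])

  # An observation sequence, encoded as indices of observed states.
  x = np.array([0, 1, 2])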

2 Inference problems to solve

2.1 Compute the probability of an HMM generating an observation sequence

  • solved by the forward algorithm (the \(\alpha\) algorithm); see the sketch below
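A minimal sketch of the forward (\(\alpha\)) recursion, assuming the hypothetical pi, P, B, and x conventions from the example in the Introduction (in particular that \(\mathbf{P}_{i,j}\) is the probability of state \(j\) moving to state \(i\)):

  import numpy as np

  def forward(pi, P, B, x):
      """Return alpha (T x N) and the probability of the observation sequence x."""
      T, N = len(x), len(pi)
      alpha = np.zeros((T, N))
      # Initialisation: prior times emission of the first observation.
      alpha[0] = pi * B[:, x[0]]
      # Recursion: alpha[t, i] = B[i, x[t]] * sum_j P[i, j] * alpha[t - 1, j]
      for t in range(1, T):
          alpha[t] = B[:, x[t]] * (P @ alpha[t - 1])
      # Summing alpha at the last position gives the sequence probability.
      return alpha, alpha[-1].sum()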

2.2 Compute …

  • solved by the backward algorithm (the \(\beta\) algorithm); see the sketch below
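A matching sketch of the backward (\(\beta\)) recursion under the same assumed conventions; combined with the prior and the first emission it recovers the same sequence probability as the forward pass:

  import numpy as np

  def backward(pi, P, B, x):
      """Return beta (T x N) and the probability of the observation sequence x."""
      T, N = len(x), len(pi)
      beta = np.zeros((T, N))
      # Initialisation: beta at the last position is 1 for every state.
      beta[-1] = 1.0
      # Recursion: beta[t, j] = sum_i P[i, j] * B[i, x[t + 1]] * beta[t + 1, i]
      for t in range(T - 2, -1, -1):
          beta[t] = P.T @ (B[:, x[t + 1]] * beta[t + 1])
      # Same sequence probability as the forward pass.
      return beta, (pi * B[:, x[0]] * beta[0]).sum()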

2.3 Compute the most likely hidden states given an observation sequence

  • inputs: \(\mathbf{P}\), \(\mathbf{B}\), \(\mathbf{x}=(x_{1},\dots,x_{T})\)
  • solved by the Viterbi algorithm (see the sketch after this list)
    1. The probability vector of the first position is the emission probabilities of \(x_{1}\) multiplied elementwise by the prior probabilities of the states: \(\gamma_{1}=\boldsymbol{\pi}\odot\mathbf{b}_{x_{1}}\), where \(\mathbf{b}_{x_{1}}\) denotes the column of \(\mathbf{B}\) for observation \(x_{1}\).
    2. Then for each observation \(x_{i}\) (\(i \ge 2\)) we compute the probability vector of the current position. Expand \(\gamma_{i-1}\) into \(\Gamma_{i-1}=[\gamma_{i-1};\dots;\gamma_{i-1}]\) (one copy per row); then \(\gamma_{i}=\mathbf{b}_{x_{i}}\odot \max_{j}(\mathbf{P}\odot\Gamma_{i-1})\), where the maximum is taken over each row.
    3. Save the preceding state that produces each current state: \(Ptr(i,k) = \operatorname{argmax}_{j}\,\mathbf{P}[k,j]\,\gamma_{i-1}[j]\).
    4. The optimal state sequence is retrieved by following the back pointers from the most probable final state.
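A compact sketch of the four steps above in Python/NumPy, again assuming the hypothetical pi, P, B, and x conventions from the Introduction; broadcasting gamma[t - 1] across the rows of P plays the role of the expanded matrix \(\Gamma_{i-1}\):

  import numpy as np

  def viterbi(pi, P, B, x):
      """Return the most likely hidden state sequence for the observations x."""
      T, N = len(x), len(pi)
      gamma = np.zeros((T, N))
      ptr = np.zeros((T, N), dtype=int)
      # Step 1: prior times emission of the first observation.
      gamma[0] = pi * B[:, x[0]]
      for t in range(1, T):
          # Step 2: scores[k, j] = P[k, j] * gamma[t - 1, j].
          scores = P * gamma[t - 1]
          # Step 3: remember which previous state j maximises each current state k.
          ptr[t] = scores.argmax(axis=1)
          gamma[t] = B[:, x[t]] * scores.max(axis=1)
      # Step 4: follow the back pointers from the most probable final state.
      states = [int(gamma[-1].argmax())]
      for t in range(T - 1, 0, -1):
          states.append(int(ptr[t, states[-1]]))
      return states[::-1]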

3 Apply to the tagging problem

3.1 Parameters

  • we have \(\mathbf{P}\), the transition matrix between POS tags
  • we have \(\mathbf{B}\), which stores the probability of a POS tag emitting a word \(w\). Let \(ind(w)\) be the function that returns the index of word \(w\); then \(\mathbf{B}_{i,j}\) is the probability that POS tag \(i\) emits the word \(w\) with \(ind(w)=j\).
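Tying this back to the sketches above, tagging a sentence then amounts to looking up each word with a hypothetical ind(w) mapping and running the viterbi sketch; the tag set, vocabulary, and reused pi, P, B below are purely illustrative, not probabilities estimated from data:

  import numpy as np

  # Hypothetical tag set and vocabulary matching the 2-state, 3-observation example.
  tags = ["NOUN", "VERB"]
  vocab = {"dogs": 0, "bark": 1, "loudly": 2}

  def ind(w):
      # ind(w): return the index of word w, as defined above.
      return vocab[w]

  sentence = ["dogs", "bark"]
  x = np.array([ind(w) for w in sentence])

  # Reuse pi, P, B from the Introduction example and viterbi from section 2.3.
  best = viterbi(pi, P, B, x)
  print(list(zip(sentence, [tags[k] for k in best])))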

Author: Xiao LIU

Created: 2014-10-29 Wed 18:05
