org.finetracker.simple_lm
Class LanguageModel

java.lang.Object
  extended by org.finetracker.simple_lm.LanguageModel
Direct Known Subclasses:
HTKLanguageModel

public class LanguageModel
extends java.lang.Object

A unigram/bigram language model. This class does not provide a constructor that builds such a model from scratch; it only provides members to store the model, a constructor that reads it from an ARPA file, and a method that writes it out as an ARPA file. To implement a particular bigram scheme, subclass this class.

Author:
Albert Gerritsen

Field Summary
static java.lang.String EOS_MARKER
          Marker used for the [End Of Sentence]-token in the ARPA format
static java.lang.String SOS_MARKER
          Marker used for the [Start Of Sentence]-token in the ARPA format
static java.lang.String UNK_MARKER
          Marker used for the [Unknown Word]-token in the ARPA format
 
Constructor Summary
LanguageModel(java.io.Reader r)
          Constructs a bigram model from an ARPA file that is streamed through the supplied Reader.
 
Method Summary
 double p(java.lang.String word)
          Get the unigram probability for a word
 double p(java.lang.String prevWord, java.lang.String word)
          Get the bigram probability for a pair of words
 void write(java.io.PrintStream out)
          Writes this bigram model to a PrintStream in ARPA format
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SOS_MARKER

public static final java.lang.String SOS_MARKER
Marker used for the [Start Of Sentence]-token in the ARPA format

See Also:
Constant Field Values

EOS_MARKER

public static final java.lang.String EOS_MARKER
Marker used for the [End Of Sentence]-token in the ARPA format

See Also:
Constant Field Values

UNK_MARKER

public static final java.lang.String UNK_MARKER
Marker used for the [Unknown Word]-token in the ARPA format

See Also:
Constant Field Values

Constructor Detail

LanguageModel

public LanguageModel(java.io.Reader r)
Constructs a bigram model from an ARPA file that is streamed through the supplied Reader.

Parameters:
r - The Reader that contains the ARPA file
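Example (a sketch, not taken from this package): the code below wraps a small ARPA text in a StringReader, which is one way to obtain the Reader this constructor expects. The class ArpaReaderExample, its members, and the probability values are hypothetical. The tokens <s> and </s> are the conventional ARPA sentence markers; the actual values of SOS_MARKER and EOS_MARKER are not stated on this page. Whether this parser accepts exactly this layout is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical example class, not part of org.finetracker.simple_lm.
final class ArpaReaderExample {
    private ArpaReaderExample() {}

    /** A tiny bigram model in the general ARPA layout (values are made up). */
    static final String TINY_ARPA = String.join("\n",
            "\\data\\",
            "ngram 1=3",
            "ngram 2=2",
            "",
            "\\1-grams:",
            "-0.4771 <s> -0.3010",
            "-0.4771 hello -0.3010",
            "-0.4771 </s>",
            "",
            "\\2-grams:",
            "-0.3010 <s> hello",
            "-0.3010 hello </s>",
            "",
            "\\end\\",
            "");

    /** Wraps the ARPA text in a Reader, the type the constructor accepts. */
    static Reader openTinyModel() {
        return new StringReader(TINY_ARPA);
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(openTinyModel())) {
            // With the library on the classpath, the Reader would instead be
            // passed on: LanguageModel lm = new LanguageModel(openTinyModel());
            System.out.println(in.readLine());
        }
    }
}
```

For a model stored on disk, a java.io.FileReader (ideally wrapped in a BufferedReader) could be supplied in the same way.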

Method Detail

p

public double p(java.lang.String word)
Get the unigram probability for a word

Parameters:
word - The word whose probability is requested
Returns:
The probability of the word as a base-10 logarithm (log10 space)
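
Example (a sketch, not taken from this package): because the return value is a base-10 logarithm, it has to be exponentiated to recover a linear probability. The helper class Log10Probability and its methods are hypothetical, and the input value is an assumed stand-in for a value returned by p.

```java
// Hypothetical helper, not part of org.finetracker.simple_lm.
final class Log10Probability {
    private Log10Probability() {}

    /** Converts a log10-space probability (as returned by p) to a linear probability. */
    static double toLinear(double log10Probability) {
        return Math.pow(10.0, log10Probability);
    }

    /** Converts a linear probability in (0, 1] to log10 space. */
    static double toLog10(double probability) {
        return Math.log10(probability);
    }

    public static void main(String[] args) {
        // Assumed example value; a real value would come from lm.p("word").
        double logP = -2.0;
        System.out.println(toLinear(logP));
    }
}
```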

p

public double p(java.lang.String prevWord,
                java.lang.String word)
Get the bigram probability for a pair of words

Parameters:
prevWord - The first word of the pair whose probability is requested
word - The second (last) word of the pair whose probability is requested
Returns:
The probability of the pair as a base-10 logarithm (log10 space)
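
Example (a sketch, not taken from this package): log10 bigram probabilities can be summed to score a whole sentence, since adding logarithms corresponds to multiplying probabilities. The class SentenceScorer is hypothetical, and the ToDoubleBiFunction parameter is a stand-in for this method so that the sketch runs without the library. Padding the sentence with start and end markers is an assumption about how the model is meant to be used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper, not part of org.finetracker.simple_lm.
final class SentenceScorer {
    private SentenceScorer() {}

    /**
     * Sums log10 bigram probabilities over a sentence, padding it with the
     * given start and end markers.
     *
     * @param bigramLogP stand-in for LanguageModel.p(prevWord, word)
     */
    static double log10Score(List<String> words, String sos, String eos,
                             ToDoubleBiFunction<String, String> bigramLogP) {
        double total = 0.0;
        String prev = sos;
        for (String word : words) {
            total += bigramLogP.applyAsDouble(prev, word);
            prev = word;
        }
        // Close the sentence with the transition into the end marker.
        return total + bigramLogP.applyAsDouble(prev, eos);
    }

    public static void main(String[] args) {
        // Toy scorer that returns -1.0 for every pair (assumed values).
        ToDoubleBiFunction<String, String> toy = (prev, word) -> -1.0;
        System.out.println(log10Score(Arrays.asList("hello", "world"), "<s>", "</s>", toy));
    }
}
```

With the library on the classpath, the stand-in could presumably be replaced by a lambda such as (prev, word) -> lm.p(prev, word), with SOS_MARKER and EOS_MARKER passed as the markers.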

write

public void write(java.io.PrintStream out)
Writes this bigram model to a PrintStream in ARPA format

Parameters:
out - The PrintStream to which the language model is written
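
Example (a sketch, not taken from this package): since this method takes a PrintStream, the ARPA text can be captured in memory by backing the stream with a ByteArrayOutputStream. The class WriteCapture is hypothetical, and the Consumer parameter is a stand-in for a call to write so that the sketch runs without the library.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.function.Consumer;

// Hypothetical helper, not part of org.finetracker.simple_lm.
final class WriteCapture {
    private WriteCapture() {}

    /** Runs the writer against an in-memory PrintStream and returns what it printed. */
    static String capture(Consumer<PrintStream> writer) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, "UTF-8");
            writer.accept(out);
            out.flush();
            return buffer.toString("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in writer; with the library this could be: capture(lm::write)
        String text = capture(out -> out.print("\\data\\"));
        System.out.println(text);
    }
}
```

To write the model to the console instead, System.out can be passed directly, since it is itself a PrintStream.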