org.finetracker.simple_lm
Class WordCount

java.lang.Object
  extended by org.finetracker.simple_lm.WordCount

public class WordCount
extends java.lang.Object

This class keeps unigram and bigram counts for a corpus. After counting, infrequent words can be marked as unknown (UNK).

Author:
Albert Gerritsen

Field Summary
 java.lang.String EOS_MARKER
           
 NGram EOS_NGRAM
           
 java.lang.String SOS_MARKER
           
 NGram SOS_NGRAM
           
 java.lang.String UNK_MARKER
           
 NGram UNK_NGRAM
           
 
Constructor Summary
WordCount(java.io.Reader r, java.lang.String sos_marker, java.lang.String eos_marker, java.lang.String unk_marker)
          Constructs a WordCount by iterating through the words in the passed Reader.
 
Method Summary
 java.util.Set<java.util.Map.Entry<NGram,java.lang.Integer>> getBGEntries()
          Get the set of all bigram counts
 java.util.Collection<java.lang.Integer> getBGValues()
          Get the collection of all bigram count values
 java.lang.Integer getNGramCount(NGram n)
          Gets a particular NGram count
 java.util.Set<java.util.Map.Entry<NGram,java.lang.Integer>> getUGEntries()
          Get the set of all unigram counts
 java.util.Collection<java.lang.Integer> getUGValues()
          Get the collection of all unigram count values
 boolean hasUnknownWords()
          Check if there is already an UNK-marker in this WordCount
 void markUnknownWords(int unigramFloor)
          Replaces all infrequent words with the UNK-marker and counts the number of occurrences of UNK.
 void write(java.io.PrintStream out)
          Prints the counts stored in this WordCount
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SOS_MARKER

public final java.lang.String SOS_MARKER

EOS_MARKER

public final java.lang.String EOS_MARKER

UNK_MARKER

public final java.lang.String UNK_MARKER

SOS_NGRAM

public final NGram SOS_NGRAM

EOS_NGRAM

public final NGram EOS_NGRAM

UNK_NGRAM

public final NGram UNK_NGRAM
Constructor Detail

WordCount

public WordCount(java.io.Reader r,
                 java.lang.String sos_marker,
                 java.lang.String eos_marker,
                 java.lang.String unk_marker)
Constructs a WordCount by iterating through the words in the passed Reader. The sos_marker and the eos_marker are used as the keys under which the start and end of each sentence are counted.

Parameters:
r - The source from which we should read the corpus
sos_marker - The key for the found [start of sentence]-tokens
eos_marker - The key for the found [end of sentence]-tokens
unk_marker - The key for the found [unk]-tokens
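The counting step can be illustrated with a minimal, standalone sketch. It is not the WordCount implementation: the class name CountSketch, the use of plain String keys instead of NGram, and the assumptions that the corpus has one sentence per line and whitespace-separated tokens are all hypothetical and chosen only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; not the actual WordCount class.
class CountSketch {
    final Map<String, Integer> unigrams = new HashMap<>();
    final Map<String, Integer> bigrams = new HashMap<>();

    // Assumption: one sentence per line, tokens separated by whitespace.
    CountSketch(Reader r, String sosMarker, String eosMarker) {
        try (BufferedReader in = new BufferedReader(r)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                // Surround the sentence with the start and end markers.
                List<String> tokens = new ArrayList<>();
                tokens.add(sosMarker);
                for (String w : line.split("\\s+")) {
                    tokens.add(w);
                }
                tokens.add(eosMarker);
                for (int i = 0; i < tokens.size(); i++) {
                    unigrams.merge(tokens.get(i), 1, Integer::sum);
                    if (i > 0) {
                        // Bigram key: the two words joined by a single space.
                        String key = tokens.get(i - 1) + " " + tokens.get(i);
                        bigrams.merge(key, 1, Integer::sum);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch the markers are counted like ordinary words, so each sentence contributes one count to the start marker and one to the end marker. Whether WordCount does the same is not stated on this page.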
Method Detail

markUnknownWords

public void markUnknownWords(int unigramFloor)
Replaces all infrequent words with the UNK-marker and counts the number of occurrences of UNK.

Parameters:
unigramFloor - The minimum number of times a word should appear
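One way to picture the replacement is the standalone sketch below. It is an illustration under assumptions, not the library's code: UnkSketch, rareWords and markUnknown are hypothetical names, counts are held in plain String-keyed maps (bigram keys being two words joined by a space), and a word is treated as infrequent when its unigram count is strictly below the floor.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of org.finetracker.simple_lm.
class UnkSketch {
    // Collect every word whose unigram count is strictly below the floor.
    static Set<String> rareWords(Map<String, Integer> unigrams, int unigramFloor) {
        Set<String> rare = new HashSet<>();
        for (Map.Entry<String, Integer> e : unigrams.entrySet()) {
            if (e.getValue() < unigramFloor) {
                rare.add(e.getKey());
            }
        }
        return rare;
    }

    // Rewrite each key word by word, replacing rare words with the UNK marker,
    // and merge the counts of keys that become identical. Works for unigram
    // keys ("cat") and for space-joined bigram keys ("the cat").
    static Map<String, Integer> markUnknown(Map<String, Integer> counts,
                                            Set<String> rare,
                                            String unkMarker) {
        Map<String, Integer> out = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            StringBuilder key = new StringBuilder();
            for (String w : e.getKey().split(" ")) {
                if (key.length() > 0) {
                    key.append(' ');
                }
                key.append(rare.contains(w) ? unkMarker : w);
            }
            out.merge(key.toString(), e.getValue(), Integer::sum);
        }
        return out;
    }
}
```

After such a pass, a check in the spirit of hasUnknownWords() reduces to asking whether the unigram map contains the UNK marker as a key.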

hasUnknownWords

public boolean hasUnknownWords()
Check if there is already an UNK-marker in this WordCount

Returns:
true when such an unk_marker is found, false otherwise

write

public void write(java.io.PrintStream out)
Prints the counts stored in this WordCount

Parameters:
out - The PrintStream to which the output should be directed

getNGramCount

public java.lang.Integer getNGramCount(NGram n)
Gets a particular NGram count

Parameters:
n - The NGram we are interested in
Returns:
The requested count

getUGEntries

public java.util.Set<java.util.Map.Entry<NGram,java.lang.Integer>> getUGEntries()
Get the set of all unigram counts

Returns:
the unigram counts

getBGEntries

public java.util.Set<java.util.Map.Entry<NGram,java.lang.Integer>> getBGEntries()
Get the set of all bigram counts

Returns:
the bigram counts

getUGValues

public java.util.Collection<java.lang.Integer> getUGValues()
Get the collection of all unigram count values

Returns:
the unigram values
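A collection of count values such as this one is typically summed to obtain the corpus size, which in turn gives relative frequencies. The sketch below shows that arithmetic on any Collection<Integer>; ValuesSketch and its methods are hypothetical names, and it assumes each value is the count of one distinct unigram.

```java
import java.util.Collection;

// Hypothetical helper for illustration; not part of org.finetracker.simple_lm.
class ValuesSketch {
    // Sum of all counts, i.e. the total number of counted tokens.
    static long totalCount(Collection<Integer> values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Relative frequency of one n-gram: its count divided by the total.
    static double relativeFrequency(int count, Collection<Integer> values) {
        return (double) count / totalCount(values);
    }
}
```

With the real class, the collection returned by getUGValues() could presumably be passed in directly.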

getBGValues

public java.util.Collection<java.lang.Integer> getBGValues()
Get the collection of all bigram count values

Returns:
the bigram values