sjm.parse.tokens
Class Tokenizer

java.lang.Object
  |
  +--sjm.parse.tokens.Tokenizer
Direct Known Subclasses:
Tokenizer

public class Tokenizer
extends java.lang.Object

A tokenizer divides a string into tokens. This class is highly customizable with regard to exactly how this division occurs, but it also has defaults that are suitable for many languages. This class assumes that the character values read from the string lie in the range 0-255. For example, the Unicode value of a capital A is 65, so System.out.println((char)65); prints out a capital A.

The behavior of a tokenizer depends on its character state table. This table is an array of 256 TokenizerState states. The state table decides which state to enter upon reading a character from the input string.

For example, by default, upon reading an 'A', a tokenizer will enter a "word" state. This means the tokenizer will ask a WordState object to consume the 'A', along with the characters after the 'A' that form a word. The state's responsibility is to consume characters and return a complete token.
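The dispatch described above can be sketched in a few lines. This is a minimal, self-contained illustration and not the library's implementation: the State interface, the WORD, SYMBOL and WHITESPACE constants, and the string-valued "tokens" are all hypothetical stand-ins for TokenizerState, its subclasses and Token.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: State, WORD, SYMBOL and WHITESPACE are
// hypothetical stand-ins, not the classes in sjm.parse.tokens.
class DispatchSketch {

    interface State {
        // Consumes characters starting with 'first' and returns the token
        // text, or null when the consumed input should be ignored.
        String nextToken(PushbackReader r, int first) throws IOException;
    }

    // Consumes a run of letters that begins with the character already read.
    static final State WORD = (r, first) -> {
        StringBuilder sb = new StringBuilder().append((char) first);
        int c;
        while ((c = r.read()) >= 0 && Character.isLetter(c)) {
            sb.append((char) c);
        }
        if (c >= 0) {
            r.unread(c); // leave the first non-letter for the next state
        }
        return "word:" + sb;
    };

    // Treats the single character already read as a symbol.
    static final State SYMBOL = (r, first) -> "symbol:" + (char) first;

    // Ignores the character already read.
    static final State WHITESPACE = (r, first) -> null;

    static List<String> tokenize(String s) {
        // One entry per character value 0-255; every entry starts as SYMBOL.
        State[] table = new State[256];
        Arrays.fill(table, SYMBOL);
        Arrays.fill(table, 0, ' ' + 1, WHITESPACE);
        Arrays.fill(table, 'a', 'z' + 1, WORD);
        Arrays.fill(table, 'A', 'Z' + 1, WORD);

        List<String> tokens = new ArrayList<>();
        try (PushbackReader r = new PushbackReader(new StringReader(s))) {
            int first;
            while ((first = r.read()) >= 0) {
                if (first >= table.length) {
                    continue; // this sketch skips values outside 0-255
                }
                // The table picks the state; the state builds the token.
                String token = table[first].nextToken(r, first);
                if (token != null) {
                    tokens.add(token);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }
}
```

Calling DispatchSketch.tokenize("Ab+c d") in this sketch yields four entries: a word "Ab", a symbol "+", and the words "c" and "d". The point of the design is that the tokenizer loop itself knows nothing about words or symbols; it only looks up the first character and delegates.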

The default table sets a SymbolState for every character from 0 to 255, and then overrides this with:

     From    To     State
       0     ' '    whitespaceState
      'a'    'z'    wordState
      'A'    'Z'    wordState
     160     255    wordState
      '0'    '9'    numberState
      '-'    '-'    numberState
      '.'    '.'    numberState
      '"'    '"'    quoteState
     '\''   '\''    quoteState
      '/'    '/'    slashState
 
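The order of the assignments in the table above matters, because later ranges overwrite the initial symbol entries. The following sketch rebuilds the table with plain string labels in place of the real state objects. The class DefaultTableSketch and the helper setRange are hypothetical names used only for illustration, and treating both ends of each range as inclusive is an assumption suggested by rows such as '-' to '-'.

```java
// Illustrative sketch only: string labels stand in for the real state
// objects, and setRange is a hypothetical helper, not a library method.
class DefaultTableSketch {

    // Assigns a label to every character value in the inclusive range
    // [from, to], ignoring values that fall outside the table.
    static void setRange(String[] table, int from, int to, String state) {
        for (int c = from; c <= to; c++) {
            if (c >= 0 && c < table.length) {
                table[c] = state;
            }
        }
    }

    // Builds a 256-entry table in the order the class comment lists.
    static String[] defaultTable() {
        String[] table = new String[256];
        setRange(table, 0, 255, "symbolState"); // the initial default
        setRange(table, 0, ' ', "whitespaceState");
        setRange(table, 'a', 'z', "wordState");
        setRange(table, 'A', 'Z', "wordState");
        setRange(table, 160, 255, "wordState");
        setRange(table, '0', '9', "numberState");
        setRange(table, '-', '-', "numberState");
        setRange(table, '.', '.', "numberState");
        setRange(table, '"', '"', "quoteState");
        setRange(table, '\'', '\'', "quoteState");
        setRange(table, '/', '/', "slashState");
        return table;
    }

    public static void main(String[] args) {
        String[] table = defaultTable();
        for (char c : new char[] {'A', '7', '-', '+', '"', '/'}) {
            System.out.println("'" + c + "' -> " + table[c]);
        }
    }
}
```

Any character that no later row claims, such as '+' or the values 127 to 159, keeps its initial symbolState entry.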
In addition to allowing modification of the state table, this class makes each of the states above available. Some of these states are customizable. For example, wordState allows customization of what characters can be part of a word, after the first character.

Version:
1.0
Author:
Steven J. Metsker

Field Summary
protected  TokenizerState[] characterState
           
protected static int DEFAULT_SYMBOL_MAX
           
protected  NumberState numberState
           
protected  QuoteState quoteState
           
protected  java.io.PushbackReader reader
           
protected  SlashState slashState
           
protected  SymbolState symbolState
           
protected  WhitespaceState whitespaceState
           
protected  WordState wordState
           
 
Constructor Summary
Tokenizer()
          Constructs a tokenizer with a default state table (as described in the class comment).
Tokenizer(java.lang.String s)
          Constructs a tokenizer to read from the supplied string.
 
Method Summary
 java.io.PushbackReader getReader()
          Return the reader this tokenizer will read from.
 Token nextToken()
          Return the next token.
 NumberState numberState()
          Return the state this tokenizer uses to build numbers.
 QuoteState quoteState()
          Return the state this tokenizer uses to build quoted strings.
 void setCharacterState(int from, int to, TokenizerState state)
          Change the state the tokenizer will enter upon reading any character between "from" and "to".
 void setReader(java.io.PushbackReader r)
          Set the reader to read from.
 void setString(java.lang.String s)
          Set the string to read from.
 void setString(java.lang.String s, int symbolMax)
          Set the string to read from.
 SlashState slashState()
          Return the state this tokenizer uses to recognize (and ignore) comments.
 SymbolState symbolState()
          Return the state this tokenizer uses to recognize symbols.
 WhitespaceState whitespaceState()
          Return the state this tokenizer uses to recognize (and ignore) whitespace.
 WordState wordState()
          Return the state this tokenizer uses to build words.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

reader

protected java.io.PushbackReader reader

DEFAULT_SYMBOL_MAX

protected static final int DEFAULT_SYMBOL_MAX

characterState

protected TokenizerState[] characterState

numberState

protected NumberState numberState

quoteState

protected QuoteState quoteState

slashState

protected SlashState slashState

symbolState

protected SymbolState symbolState

whitespaceState

protected WhitespaceState whitespaceState

wordState

protected WordState wordState

Constructor Detail

Tokenizer

public Tokenizer()
Constructs a tokenizer with a default state table (as described in the class comment).

Tokenizer

public Tokenizer(java.lang.String s)
Constructs a tokenizer to read from the supplied string.
Parameters:
s - the string to read from

Method Detail

getReader

public java.io.PushbackReader getReader()
Return the reader this tokenizer will read from.
Returns:
the reader this tokenizer will read from

nextToken

public Token nextToken()
                throws java.io.IOException
Return the next token.
Returns:
the next token
Throws:
java.io.IOException - if there is any problem reading

numberState

public NumberState numberState()
Return the state this tokenizer uses to build numbers.
Returns:
the state this tokenizer uses to build numbers

quoteState

public QuoteState quoteState()
Return the state this tokenizer uses to build quoted strings.
Returns:
the state this tokenizer uses to build quoted strings

setCharacterState

public void setCharacterState(int from,
                              int to,
                              TokenizerState state)
Change the state the tokenizer will enter upon reading any character between "from" and "to".
Parameters:
from - the "from" character
to - the "to" character
state - the state to enter upon reading a character between "from" and "to"

setReader

public void setReader(java.io.PushbackReader r)
Set the reader to read from.
Parameters:
r - the reader to read from

setString

public void setString(java.lang.String s)
Set the string to read from.
Parameters:
s - the string to read from

setString

public void setString(java.lang.String s,
                      int symbolMax)
Set the string to read from.
Parameters:
s - the string to read from
symbolMax - the maximum length of a symbol, which establishes the size of the pushback buffer the tokenizer needs
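
The link between symbol length and buffer size follows from how java.io.PushbackReader behaves: it can hold only as many unread characters as the size given to its constructor. Presumably a state that reads ahead to match a multi-character symbol must be able to push back everything it read when the match fails. The sketch below demonstrates that limit with the standard library alone; PushbackSketch and canUnread are hypothetical names, and nothing here is taken from the tokenizer's source.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

// Illustrative sketch only: shows the java.io.PushbackReader size limit,
// not how the tokenizer itself reads or unreads characters.
class PushbackSketch {

    // Reads n characters from s, pushes them all back, and reports whether
    // a pushback buffer of the given size could hold them.
    static boolean canUnread(String s, int n, int bufferSize) {
        try (PushbackReader r = new PushbackReader(new StringReader(s), bufferSize)) {
            char[] buf = new char[n];
            int read = r.read(buf, 0, n);
            if (read < n) {
                throw new IllegalArgumentException("input shorter than n");
            }
            r.unread(buf, 0, n); // throws IOException if n exceeds the buffer size
            return true;
        } catch (IOException overflow) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canUnread("<=> x", 3, 1));
        System.out.println(canUnread("<=> x", 3, 3));
    }
}
```

With a buffer of size 1 the three characters cannot be pushed back, while a buffer of size 3 accepts them, which is why the longest symbol a tokenizer must recognize bounds the buffer it needs.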

slashState

public SlashState slashState()
Return the state this tokenizer uses to recognize (and ignore) comments.
Returns:
the state this tokenizer uses to recognize (and ignore) comments

symbolState

public SymbolState symbolState()
Return the state this tokenizer uses to recognize symbols.
Returns:
the state this tokenizer uses to recognize symbols

whitespaceState

public WhitespaceState whitespaceState()
Return the state this tokenizer uses to recognize (and ignore) whitespace.
Returns:
the state this tokenizer uses to recognize (and ignore) whitespace

wordState

public WordState wordState()
Return the state this tokenizer uses to build words.
Returns:
the state this tokenizer uses to build words