sjm.parse.tokens
Class Tokenizer

java.lang.Object
  |
  +--sjm.parse.tokens.Tokenizer

Direct Known Subclasses: Tokenizer

public class Tokenizer
extends java.lang.Object

A tokenizer divides a string into tokens. This class is highly customizable with regard to exactly how this division occurs, but it also has defaults that are suitable for many languages. This class assumes that the character values read from the string lie in the range 0-255. For example, the Unicode value of a capital A is 65, so System.out.println((char)65); prints out a capital A.
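The 0-255 assumption can be checked before handing a string to a tokenizer. The following is a minimal sketch in plain Java; `CharRangeSketch`, `TABLE_SIZE`, and `firstOutOfRange` are hypothetical names used only for illustration and are not part of this class. The documentation above does not say what the tokenizer does with characters outside the range, so the sketch only reports where such a character occurs.

```java
class CharRangeSketch {
    // Size of the character state table described in the class comment.
    static final int TABLE_SIZE = 256;

    // Return the index of the first character whose value is 256 or more,
    // or -1 if every character fits in the table.
    static int firstOutOfRange(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= TABLE_SIZE) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println((char) 65);                      // prints A
        System.out.println((int) 'A');                      // prints 65
        System.out.println(firstOutOfRange("caf\u00e9"));   // prints -1 (U+00E9 is 233)
        System.out.println(firstOutOfRange("x \u2192 y"));  // prints 2 (U+2192 is 8594)
    }
}
```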

The behavior of a tokenizer depends on its character state table. This table is an array of 256 TokenizerState states. The state table decides which state to enter upon reading a character from the input string.

For example, by default, upon reading an 'A', a tokenizer will enter a "word" state. This means the tokenizer will ask a WordState object to consume the 'A', along with the characters after the 'A' that form a word. The state's responsibility is to consume characters and return a complete token.
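The idea of a state that consumes characters and hands back one complete token can be sketched with nothing but `java.io`. This is an illustration only: `WordConsumeSketch`, `isWordChar`, `consumeWord`, and `firstWordAndRest` are hypothetical names, the "letters or digits" rule is an assumption, and the real `WordState` may behave differently. The point shown is that the character that ends the word is pushed back so the next state can see it.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class WordConsumeSketch {
    // Assumed rule for this sketch: a word continues while characters are letters or digits.
    static boolean isWordChar(int c) {
        return c >= 0 && Character.isLetterOrDigit(c);
    }

    // Consume a word whose first character has already been read from r.
    static String consumeWord(PushbackReader r, int first) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c = first;
        while (isWordChar(c)) {
            sb.append((char) c);
            c = r.read();
        }
        if (c >= 0) {
            r.unread(c); // give back the character that ended the word
        }
        return sb.toString();
    }

    // Convenience wrapper: returns { word, next unread character or "" at end of input }.
    static String[] firstWordAndRest(String input) {
        try (PushbackReader r = new PushbackReader(new StringReader(input))) {
            String word = consumeWord(r, r.read());
            int next = r.read();
            return new String[] { word, next < 0 ? "" : String.valueOf((char) next) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] parts = firstWordAndRest("Alpha+beta");
        System.out.println(parts[0]); // prints Alpha
        System.out.println(parts[1]); // prints +
    }
}
```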

The default table sets a SymbolState for every character from 0 to 255, and then overrides this with:

     From    To     State
       0     ' '    whitespaceState
      'a'    'z'    wordState
      'A'    'Z'    wordState
     160     255    wordState
      '0'    '9'    numberState
      '-'    '-'    numberState
      '.'    '.'    numberState
      '"'    '"'    quoteState
     '\''   '\''    quoteState
      '/'    '/'    slashState
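The table above can be modeled as a 256-entry array that is first filled with one value and then overridden range by range. The sketch below is self-contained and uses an enum as a stand-in for the state objects; `StateTableSketch`, `Kind`, `set`, and `stateFor` are hypothetical names, and the ranges are assumed to be inclusive at both ends, as the single-character rows suggest.

```java
import java.util.Arrays;

class StateTableSketch {
    // Stand-ins for the state objects named in the table; not the library's types.
    enum Kind { SYMBOL, WHITESPACE, WORD, NUMBER, QUOTE, SLASH }

    final Kind[] table = new Kind[256];

    StateTableSketch() {
        Arrays.fill(table, Kind.SYMBOL);   // default for every character 0..255
        set(0, ' ', Kind.WHITESPACE);
        set('a', 'z', Kind.WORD);
        set('A', 'Z', Kind.WORD);
        set(160, 255, Kind.WORD);
        set('0', '9', Kind.NUMBER);
        set('-', '-', Kind.NUMBER);
        set('.', '.', Kind.NUMBER);
        set('"', '"', Kind.QUOTE);
        set('\'', '\'', Kind.QUOTE);
        set('/', '/', Kind.SLASH);
    }

    // Assign one state to every character code in [from, to], both ends inclusive (assumed).
    void set(int from, int to, Kind kind) {
        for (int c = from; c <= to; c++) {
            if (c >= 0 && c < table.length) {
                table[c] = kind;
            }
        }
    }

    // Look up the state that would handle a token starting with character c (0..255).
    Kind stateFor(int c) {
        return table[c];
    }

    public static void main(String[] args) {
        StateTableSketch t = new StateTableSketch();
        for (char c : "A 7-+\"/".toCharArray()) {
            System.out.println("'" + c + "' -> " + t.stateFor(c));
        }
    }
}
```

Because later assignments overwrite earlier ones, the order of the calls matters only where ranges overlap; in the default table they do not.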

In addition to allowing modification of the state table, this class makes each of the states above available. Some of these states are customizable. For example, wordState allows customization of what characters can be part of a word, after the first character.
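The distinction between the character that starts a word and the characters allowed after it can be sketched with a second lookup table. Everything here is hypothetical: `WordCharsSketch`, `setWordChars`, and `leadingWord` are illustrative names, and the default set of allowed characters (letters and digits) is an assumption rather than the documented behavior of `WordState`.

```java
class WordCharsSketch {
    // wordChar[c] is true when c may appear inside a word after its first character.
    private final boolean[] wordChar = new boolean[256];

    WordCharsSketch() {
        // Assumed defaults for this sketch only.
        setWordChars('a', 'z', true);
        setWordChars('A', 'Z', true);
        setWordChars('0', '9', true);
    }

    // Allow or disallow every character in [from, to] as a word continuation.
    void setWordChars(int from, int to, boolean allowed) {
        for (int c = from; c <= to; c++) {
            if (c >= 0 && c < wordChar.length) {
                wordChar[c] = allowed;
            }
        }
    }

    // Return the word at the start of s, assuming s.charAt(0) already selected the word state.
    String leadingWord(String s) {
        if (s.isEmpty()) {
            return "";
        }
        int end = 1;
        while (end < s.length() && s.charAt(end) < wordChar.length && wordChar[s.charAt(end)]) {
            end++;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        WordCharsSketch w = new WordCharsSketch();
        System.out.println(w.leadingWord("x_1 = 2")); // prints x
        w.setWordChars('_', '_', true);
        System.out.println(w.leadingWord("x_1 = 2")); // prints x_1
    }
}
```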

Version: 1.0
Author: Steven J. Metsker

Field Summary

protected TokenizerState[] characterState


protected static int DEFAULT_SYMBOL_MAX


protected NumberState numberState


protected QuoteState quoteState


protected java.io.PushbackReader reader


protected SlashState slashState


protected SymbolState symbolState


protected WhitespaceState whitespaceState


protected WordState wordState


Constructor Summary

Tokenizer()
          Constructs a tokenizer with a default state table (as described in the class comment).

Tokenizer(java.lang.String s)
          Constructs a tokenizer to read from the supplied string.

Method Summary

java.io.PushbackReader getReader()
          Return the reader this tokenizer will read from.

Token nextToken()
          Return the next token.

NumberState numberState()
          Return the state this tokenizer uses to build numbers.

QuoteState quoteState()
          Return the state this tokenizer uses to build quoted strings.

void setCharacterState(int from, int to, TokenizerState state)
          Change the state the tokenizer will enter upon reading any character between "from" and "to".

void setReader(java.io.PushbackReader r)
          Set the reader to read from.

void setString(java.lang.String s)
          Set the string to read from.

void setString(java.lang.String s, int symbolMax)
          Set the string to read from.

SlashState slashState()
          Return the state this tokenizer uses to recognize (and ignore) comments.

SymbolState symbolState()
          Return the state this tokenizer uses to recognize symbols.

WhitespaceState whitespaceState()
          Return the state this tokenizer uses to recognize (and ignore) whitespace.

WordState wordState()
          Return the state this tokenizer uses to build words.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

reader

protected java.io.PushbackReader reader

DEFAULT_SYMBOL_MAX

protected static final int DEFAULT_SYMBOL_MAX

characterState

protected TokenizerState[] characterState

numberState

protected NumberState numberState

quoteState

protected QuoteState quoteState

slashState

protected SlashState slashState

symbolState

protected SymbolState symbolState

whitespaceState

protected WhitespaceState whitespaceState

wordState

protected WordState wordState

Constructor Detail

Tokenizer

public Tokenizer()

Constructs a tokenizer with a default state table (as described in the class comment).

Tokenizer

public Tokenizer(java.lang.String s)

Constructs a tokenizer to read from the supplied string.

Parameters: s - the string to read from

Method Detail

getReader

public java.io.PushbackReader getReader()

Return the reader this tokenizer will read from.

Returns: the reader this tokenizer will read from

nextToken

public Token nextToken()
                throws java.io.IOException

Return the next token.

Returns: the next token
Throws: java.io.IOException - if there is any problem reading

numberState

public NumberState numberState()

Return the state this tokenizer uses to build numbers.

Returns: the state this tokenizer uses to build numbers

quoteState

public QuoteState quoteState()

Return the state this tokenizer uses to build quoted strings.

Returns: the state this tokenizer uses to build quoted strings

setCharacterState

public void setCharacterState(int from,
                              int to,
                              TokenizerState state)

Change the state the tokenizer will enter upon reading any character between "from" and "to".

Parameters:
    from - the "from" character
    to - the "to" character
    state - the state to enter upon reading a character between "from" and "to"

setReader

public void setReader(java.io.PushbackReader r)

Set the reader to read from.

Parameters: r - the reader to read from

setString

public void setString(java.lang.String s)

Set the string to read from.

Parameters: s - the string to read from

setString

public void setString(java.lang.String s,
                      int symbolMax)

Set the string to read from, sizing the pushback buffer for symbols of up to symbolMax characters.

Parameters:
    s - the string to read from
    symbolMax - the maximum length of a symbol, which establishes the size of the pushback buffer needed
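Why the symbol length determines the pushback buffer size can be seen from the standard `java.io.PushbackReader` alone. The sketch below does not show this class's internal code; `PushbackSizeSketch` and `canPushBack` are hypothetical names. It reads a number of characters, as a state might while testing for a multi-character symbol, and then tries to push them all back.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

class PushbackSizeSketch {
    // Read up to `count` characters from `input`, then try to unread all of them.
    // Returns false if the pushback buffer of `bufferSize` characters is too small.
    static boolean canPushBack(String input, int bufferSize, int count) {
        try (PushbackReader r = new PushbackReader(new StringReader(input), bufferSize)) {
            char[] buf = new char[count];
            int n = r.read(buf, 0, count);
            if (n <= 0) {
                return true; // nothing was read, so there is nothing to push back
            }
            r.unread(buf, 0, n); // throws IOException when n exceeds the free buffer space
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canPushBack("<=x", 1, 2)); // prints false
        System.out.println(canPushBack("<=x", 2, 2)); // prints true
    }
}
```

A reader that must back out of a two-character lookahead therefore needs a buffer of at least two characters, which is the role the symbolMax argument is described as playing.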

slashState

public SlashState slashState()

Return the state this tokenizer uses to recognize (and ignore) comments.

Returns: the state this tokenizer uses to recognize (and ignore) comments

symbolState

public SymbolState symbolState()

Return the state this tokenizer uses to recognize symbols.

Returns: the state this tokenizer uses to recognize symbols

whitespaceState

public WhitespaceState whitespaceState()

Return the state this tokenizer uses to recognize (and ignore) whitespace.

Returns: the state this tokenizer uses to recognize (and ignore) whitespace

wordState

public WordState wordState()

Return the state this tokenizer uses to build words.

Returns: the state this tokenizer uses to build words
