java.lang.Object
  |
  +--sjm.parse.tokens.Tokenizer
A tokenizer divides a string into tokens. This class is
highly customizable with regard to exactly how this division
occurs, but it also has defaults that are suitable for many
languages. This class assumes that the character values read
from the string lie in the range 0-255. For example, the
Unicode value of a capital A is 65, so
    System.out.println((char)65);
prints a capital A.
The behavior of a tokenizer depends on its character state
table. This table is an array of 256 TokenizerState
states. The state table decides which state to
enter upon reading a character from the input
string.
For example, by default, upon reading an 'A', a tokenizer
will enter a "word" state. This means the tokenizer will
ask a WordState
object to consume the 'A',
along with the characters after the 'A' that form a word.
The state's responsibility is to consume characters and
return a complete token.
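The dispatch described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the sjm.parse.tokens source: the State interface, the two inline states, the 4-character pushback buffer, and the use of plain strings as tokens are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a tokenizer state: it receives the character
// that selected it and may read further characters before returning a token.
interface State {
    String nextToken(PushbackReader r, int first) throws IOException;
}

class DispatchSketch {
    // One entry per character value 0-255, as in the class description.
    static final State[] TABLE = new State[256];

    static {
        // A "symbol" state returns the single character it was given.
        State symbol = (r, c) -> String.valueOf((char) c);
        // A "word" state keeps reading letters, then pushes back the
        // first character that does not belong to the word.
        State word = (r, c) -> {
            StringBuilder sb = new StringBuilder().append((char) c);
            int next;
            while ((next = r.read()) != -1 && Character.isLetter(next)) {
                sb.append((char) next);
            }
            if (next != -1) {
                r.unread(next);
            }
            return sb.toString();
        };
        Arrays.fill(TABLE, symbol);
        Arrays.fill(TABLE, 'a', 'z' + 1, word);
        Arrays.fill(TABLE, 'A', 'Z' + 1, word);
    }

    // Read a character, look up its state, and let that state build the token.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        try (PushbackReader r = new PushbackReader(new StringReader(s), 4)) {
            int c;
            while ((c = r.read()) != -1) {
                if (c > 255) {
                    throw new IllegalArgumentException("character outside 0-255: " + c);
                }
                tokens.add(TABLE[c].nextToken(r, c));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Ab+cd")); // prints [Ab, +, cd]
    }
}
```

The point of the sketch is the division of labor: the table only chooses a state, and the chosen state decides how many characters to consume.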
The default table sets a SymbolState for every character from 0 to 255, and then overrides this with:
    From    To      State
    0       ' '     whitespaceState
    'a'     'z'     wordState
    'A'     'Z'     wordState
    160     255     wordState
    '0'     '9'     numberState
    '-'     '-'     numberState
    '.'     '.'     numberState
    '"'     '"'     quoteState
    '\''    '\''    quoteState
    '/'     '/'     slashState
In addition to allowing modification of the state table, this class makes each of the states above available. Some of these states are customizable. For example, wordState allows customization of what characters can be part of a word, after the first character.
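To make the fill order of the default table concrete, here is a small sketch that rebuilds it with state names in place of state objects. The class name, the helper set, and the use of strings are hypothetical. The helper only imitates the documented setCharacterState(from, to, state) signature, with inclusive bounds inferred from rows such as '-' to '-'.

```java
import java.util.Arrays;

class DefaultTableSketch {
    // Build the 256-entry table: symbolState everywhere, then the overrides.
    static String[] defaultTable() {
        String[] table = new String[256];
        Arrays.fill(table, "symbolState");
        set(table, 0, ' ', "whitespaceState");
        set(table, 'a', 'z', "wordState");
        set(table, 'A', 'Z', "wordState");
        set(table, 160, 255, "wordState");
        set(table, '0', '9', "numberState");
        set(table, '-', '-', "numberState");
        set(table, '.', '.', "numberState");
        set(table, '"', '"', "quoteState");
        set(table, '\'', '\'', "quoteState");
        set(table, '/', '/', "slashState");
        return table;
    }

    // Hypothetical helper: assign one state to every character in [from, to].
    static void set(String[] table, int from, int to, String state) {
        for (int c = from; c <= to; c++) {
            table[c] = state;
        }
    }

    public static void main(String[] args) {
        String[] table = defaultTable();
        System.out.println(table['A']); // prints wordState
        System.out.println(table['+']); // prints symbolState
        System.out.println(table['-']); // prints numberState
    }
}
```

Because later assignments overwrite earlier ones, any character not named in an override (for example '+') keeps the symbolState default.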
Field Summary
protected TokenizerState[] | characterState
protected static int | DEFAULT_SYMBOL_MAX
protected NumberState | numberState
protected QuoteState | quoteState
protected java.io.PushbackReader | reader
protected SlashState | slashState
protected SymbolState | symbolState
protected WhitespaceState | whitespaceState
protected WordState | wordState
Constructor Summary
Tokenizer()
    Constructs a tokenizer with a default state table (as described in the class comment).
Tokenizer(java.lang.String s)
    Constructs a tokenizer to read from the supplied string.
Method Summary
java.io.PushbackReader | getReader()
    Return the reader this tokenizer will read from.
Token | nextToken()
    Return the next token.
NumberState | numberState()
    Return the state this tokenizer uses to build numbers.
QuoteState | quoteState()
    Return the state this tokenizer uses to build quoted strings.
void | setCharacterState(int from, int to, TokenizerState state)
    Change the state the tokenizer will enter upon reading any character between "from" and "to".
void | setReader(java.io.PushbackReader r)
    Set the reader to read from.
void | setString(java.lang.String s)
    Set the string to read from.
void | setString(java.lang.String s, int symbolMax)
    Set the string to read from.
SlashState | slashState()
    Return the state this tokenizer uses to recognize (and ignore) comments.
SymbolState | symbolState()
    Return the state this tokenizer uses to recognize symbols.
WhitespaceState | whitespaceState()
    Return the state this tokenizer uses to recognize (and ignore) whitespace.
WordState | wordState()
    Return the state this tokenizer uses to build words.
Methods inherited from class java.lang.Object
Field Detail
protected java.io.PushbackReader reader
protected static final int DEFAULT_SYMBOL_MAX
protected TokenizerState[] characterState
protected NumberState numberState
protected QuoteState quoteState
protected SlashState slashState
protected SymbolState symbolState
protected WhitespaceState whitespaceState
protected WordState wordState
Constructor Detail
public Tokenizer()
public Tokenizer(java.lang.String s)
    s - the string to read from
Method Detail
public java.io.PushbackReader getReader()
public Token nextToken() throws java.io.IOException
    java.io.IOException - if there is any problem reading
public NumberState numberState()
public QuoteState quoteState()
public void setCharacterState(int from, int to, TokenizerState state)
    from - the "from" character
    to - the "to" character
    state - the state to enter upon reading a character between "from" and "to"
public void setReader(java.io.PushbackReader r)
    r - the reader to read from
public void setString(java.lang.String s)
    s - the string to read from
public void setString(java.lang.String s, int symbolMax)
    s - the string to read from
    symbolMax - the maximum length of a symbol, which establishes the size of the pushback buffer we need
public SlashState slashState()
public SymbolState symbolState()
public WhitespaceState whitespaceState()
public WordState wordState()
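The symbolMax argument of setString(String, int) is documented as setting the size of the pushback buffer. The sketch below shows one reason that size matters, using only java.io.PushbackReader. The class, the matchSymbol and tryMatch helpers, and the "<=" symbol are hypothetical, and the code does not claim to reproduce how SymbolState is actually implemented.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

class PushbackSketch {
    // Try to read `symbol` at the current position. On a mismatch, every
    // character read so far is pushed back, so the pushback buffer must be
    // able to hold up to symbol.length() characters.
    static boolean matchSymbol(PushbackReader r, String symbol) throws IOException {
        char[] buf = new char[symbol.length()];
        int n = 0;
        boolean ok = true;
        while (ok && n < buf.length) {
            int c = r.read();
            if (c == -1) {
                ok = false;
                break;
            }
            buf[n] = (char) c;
            ok = buf[n] == symbol.charAt(n);
            n++;
        }
        if (ok) {
            return true;
        }
        r.unread(buf, 0, n); // throws IOException if the buffer is too small
        return false;
    }

    // Attempt a match with a pushback buffer of the given size and report
    // the outcome plus the next character the reader would deliver.
    static String tryMatch(String input, String symbol, int bufferSize) {
        try (PushbackReader r = new PushbackReader(new StringReader(input), bufferSize)) {
            boolean matched = matchSymbol(r, symbol);
            return (matched ? "matched, next=" : "no match, next=") + (char) r.read();
        } catch (IOException e) {
            // With a StringReader source, the only expected failure is pushback overflow.
            return "pushback buffer too small";
        }
    }

    public static void main(String[] args) {
        System.out.println(tryMatch("<=x", "<=", 2)); // prints matched, next=x
        System.out.println(tryMatch("<-x", "<=", 2)); // prints no match, next=<
        System.out.println(tryMatch("<-x", "<=", 1)); // prints pushback buffer too small
    }
}
```

A buffer at least as long as the longest symbol lets a failed match be undone completely, which is consistent with the description of symbolMax above.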