How can I specify which characters StreamTokenizer treats as token delimiters?

Tim Rohaly

StreamTokenizer breaks the input stream into tokens using whitespace as a delimiter. By default, the characters '\u0000' through '\u0020' are considered whitespace. This range includes space, tab, newline, and the other control characters. To add to this list, invoke the method whitespaceChars(int low, int high); all characters with values from low through high, inclusive, will be considered whitespace, in addition to those already set.
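As a minimal sketch of the effect of whitespaceChars(), the following compares tokenizing the same string with and without extra delimiters. The class name WhitespaceCharsSketch and the helper tokens() are hypothetical names chosen for illustration, and the input is read from a StringReader rather than a file to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class WhitespaceCharsSketch {

    // Hypothetical helper: tokenizes input after adding every character
    // in delimiters to the tokenizer's whitespace set.
    static List<String> tokens(String input, String delimiters) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        for (char ch : delimiters.toCharArray()) {
            st.whitespaceChars(ch, ch);
        }
        List<String> result = new ArrayList<>();
        try {
            int t;
            while ((t = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (t == StreamTokenizer.TT_WORD) {
                    result.add(st.sval);
                } else if (t == StreamTokenizer.TT_NUMBER) {
                    result.add(String.valueOf(st.nval));
                } else {
                    // Ordinary character: the token type is the character itself
                    result.add(String.valueOf((char) t));
                }
            }
        } catch (IOException e) {
            // Not expected when reading from a StringReader
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // With the default syntax, ',' comes back as its own token
        System.out.println(tokens("alpha,beta gamma", ""));   // [alpha, ,, beta, gamma]
        // With ',' added as whitespace, it only separates tokens
        System.out.println(tokens("alpha,beta gamma", ","));  // [alpha, beta, gamma]
    }
}
```

The space still acts as a delimiter in the second call, because whitespaceChars() adds to the existing set rather than replacing it.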

You can call whitespaceChars() any number of times - each invocation will add to the list of whitespace characters. The only way to clear out the list is to set those characters to be something other than whitespace - you might use ordinaryChar(int ch), ordinaryChars(int low, int high), wordChars(int low, int high), or resetSyntax() to do this.
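One way to drop the default whitespace characters is to call resetSyntax() and rebuild the syntax table from scratch. The sketch below, which assumes a hypothetical class name CommaOnlySketch and helper fields(), makes the comma the only delimiter and treats the space as part of a word. Note that resetSyntax() also discards the default word, number, quote, and comment settings, so the word characters must be declared again.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class CommaOnlySketch {

    // Hypothetical helper: splits input on commas only, keeping spaces
    // inside the resulting word tokens.
    static List<String> fields(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.resetSyntax();             // every character becomes ordinary
        st.wordChars('a', 'z');       // re-declare letters as word characters
        st.wordChars('A', 'Z');
        st.wordChars(' ', ' ');       // space is now part of a word
        st.whitespaceChars(',', ','); // comma is the only delimiter
        List<String> result = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    result.add(st.sval);
                }
            }
        } catch (IOException e) {
            // Not expected when reading from a StringReader
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("New York,Los Angeles")); // [New York, Los Angeles]
    }
}
```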

The following program is a very simple example of using StreamTokenizer to parse a text file into words, numbers, and characters. The file to be parsed is taken from the first argument; the second argument is a string containing all the characters to use as delimiters.
import java.io.*;

public class TokenizeIt {

    public static void main(String[] args) throws FileNotFoundException,
                                                  IOException {

        FileReader      file = new FileReader(args[0]);
        BufferedReader  in   = new BufferedReader(file);
        StreamTokenizer st   = new StreamTokenizer(in);

        // Add each character of the second argument to the whitespace set
        char[] c = args[1].toCharArray();
        for (int i=0; i<c.length; i++) {
            System.out.println("Whitespace will include '" + c[i] + "'");
            st.whitespaceChars(c[i], c[i]);
        }

        // Read tokens until end of file and report the type of each one
        int tokval;
        while ((tokval = st.nextToken()) != StreamTokenizer.TT_EOF) {
            switch (tokval) {
                case StreamTokenizer.TT_WORD:
                    System.out.println("Word token   \"" + st.sval + "\"");
                    break;
                case StreamTokenizer.TT_NUMBER:
                    System.out.println("Number token \"" + st.nval + "\"");
                    break;
                default:
                    // Any other token type is the character value itself
                    System.out.println("Character    '" + (char) tokval + "'");
                    break;
            }
        }
        in.close();
    }
}

For example, if the input is delimited by commas and colons, you would run this using a command line of the following form, where filename stands for the file to parse:
    java TokenizeIt filename ",:"
