How do I handle #include files or other nested input streams?

Terence Parr

There are a number of ways to handle include files.

  1. At the parser level. I detect the directive in the grammar, extract the filename, and instantiate a new lexer and a new parser to parse that file. Finally, I call the new parser's getAST() method to get the syntax tree it built and stitch that into the current syntax tree.
         myLexer lexer = new myLexer(new FileInputStream(s1.getText()));
         myParser parser = new myParser(lexer);
  2. At the char input stream level. Another approach is to filter the input character stream, maintaining a stack of streams and changing the input state object for the lexer. When your lexer sees a #include or whatever, it pushes the current input state, sets the lexer's input stream to be the include file, sets the token type to Token.SKIP, and lets the rule (the one that matched the #include) return.
  3. In between the parser and lexer. Another way is to use TokenStream objects. See the includeFile example. The TokenStreamSelector object:
    /** A token stream MUX (multiplexor) knows about n token streams
     *  and can multiplex them onto the same channel for use by token
     *  stream consumer like a parser.  This is a way to have multiple
     *  lexers break up the same input stream for a single parser.
     */

    lets you switch between multiple input streams. When you see an #include, create a new lexer just like you've been doing (no parser), and then notify the TokenStreamSelector (push state and point at the new lexer). At the close of the included stream, tell the selector to pop its state. The parser has no idea that all of this is going on. It attaches to the selector, not the lexer :) The parser sees one stream of tokens.

In the old days people would build an input character stream that kept a stack of files and made it look to the lexer like there is only one file.
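To make the "stack of files that looks like one file" idea concrete, here is a minimal sketch in plain Java with no ANTLR involved. StackedReader, pushInclude, and the flatten helper are names I made up for illustration, and the "files" live in an in-memory map rather than on disk. The consumer reads one character at a time, so a pushed include takes effect on the very next read.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: presents a stack of readers as one character stream. */
class StackedReader extends Reader {
    private final Deque<Reader> stack = new ArrayDeque<>();

    StackedReader(Reader main) { stack.push(main); }

    /** Start reading from an included stream; the current one resumes afterwards. */
    void pushInclude(Reader included) { stack.push(included); }

    @Override
    public int read() throws IOException {
        while (!stack.isEmpty()) {
            int c = stack.peek().read();
            if (c != -1) return c;
            stack.pop().close();   // this stream is exhausted: resume the one below
        }
        return -1;                 // every stream is exhausted
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    @Override
    public void close() throws IOException {
        while (!stack.isEmpty()) stack.pop().close();
    }
}

class StackedReaderDemo {
    private static final String DIRECTIVE = "#include ";

    /** Expands lines of the form "#include NAME" using the in-memory map (an assumption). */
    static String flatten(String root, Map<String, String> files) {
        StackedReader in = new StackedReader(new StringReader(files.get(root)));
        StringBuilder out = new StringBuilder();
        StringBuilder line = new StringBuilder();
        try {
            while (true) {
                int c = in.read();
                if (c != '\n' && c != -1) { line.append((char) c); continue; }
                String text = line.toString();
                line.setLength(0);
                if (text.startsWith(DIRECTIVE)) {
                    String name = text.substring(DIRECTIVE.length()).trim();
                    in.pushInclude(new StringReader(files.get(name)));
                    continue;      // the directive itself produces no output
                }
                out.append(text);
                if (c == -1) break;
                out.append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> files = new HashMap<>();
        files.put("main", "a\n#include inc\nc\n");
        files.put("inc", "b\n");
        System.out.print(flatten("main", files));
    }
}
```

Note that this sketch assumes every included text ends with a newline, does no cycle detection, and fails with a NullPointerException on an unknown name. A lexer that buffers ahead would read past the directive before the push happens, which is one reason this approach takes some care.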

You cannot do multiple includes in the lexer itself because the parser pulls tokens out of a lexer, a single token at a time. How could a lexer rule return a stream of tokens from a sublexer? It can't. You need to do it in the parser or in between the two as I'll explain in a second.

You can try having the parser make a new lexer/parser that would go grab the files. This should work unless you have lots of class variables that should be instance variables in the parser or lexer. However, this does not let you do includes that can appear anywhere (you'd have to have a test for #include everywhere in your grammar...yuck).
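Here is a toy sketch of that parser-level approach, again in plain Java rather than ANTLR-generated code. Node and IncludeParser are hypothetical names, the "grammar" is just one statement per line, and the included files come from an in-memory map. It shows both the stitching and the limitation: the include is only recognized where the parser explicitly tests for it (here, at the start of a statement).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tree node: a label plus child nodes. */
class Node {
    final String label;
    final List<Node> children = new ArrayList<>();

    Node(String label) { this.label = label; }

    @Override
    public String toString() {
        return children.isEmpty() ? label : label + children;
    }
}

/** Toy line-oriented parser: each non-blank line is a statement. */
class IncludeParser {
    private static final String DIRECTIVE = "#include ";
    private final Map<String, String> files;   // in-memory "file system" (an assumption)

    IncludeParser(Map<String, String> files) { this.files = files; }

    /** Parses the named text and returns its tree, with included trees stitched in. */
    Node parse(String name) {
        Node root = new Node(name);
        for (String raw : files.get(name).split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            if (line.startsWith(DIRECTIVE)) {
                // Fresh parser instance for the included text; its tree becomes a subtree here.
                String included = line.substring(DIRECTIVE.length()).trim();
                root.children.add(new IncludeParser(files).parse(included));
            } else {
                root.children.add(new Node(line));
            }
        }
        return root;
    }

    public static void main(String[] args) {
        Map<String, String> files = new HashMap<>();
        files.put("main", "a\n#include inc\nc");
        files.put("inc", "b");
        System.out.println(new IncludeParser(files).parse("main"));
    }
}
```

Because each include gets its own parser instance, any state kept in static (class) variables would be shared between the outer and inner parse, which is the problem mentioned above. The sketch does not guard against a file including itself.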

So, the real answer, if you don't like handling the next char stream thing yourself (I understand that concept ;)), is to use the new token stream capabilities. What you want is to create a new lexer for each included file and one for the original and then have a TokenStreamSelector (a multiplexor) handle flipping between the lexers in a stack fashion. The beauty of this is that the parser only sees a single stream of tokens and is none the wiser. You create the first lexer and parser connected via the MUX/selector like this:

      // open a simple stream to the input
      DataInputStream input = new DataInputStream(System.in);

      // attach the P lexer to the input stream
      mainLexer = new PLexer(input);

      // notify selector about starting lexer; name for convenience
      selector.addInputStream(mainLexer, "main");
      selector.select("main"); // start with main P lexer

      // Create parser attached to selector
      parser = new PParser(selector);

      // Parse the input language: P

which looks like:

   Parser - selector - mainLexer
                     - sublexer for first include
                     - subsublexer for nested include

[normally, you only have "Parser - Lexer" for most problems]

When the mainLexer sees an #include, it makes a sublexer, pushes it onto the selector's stack and then does an "abort current token and try again", which is "selector.retry()". This call throws an exception that blows out of the current lexer and forces the selector to get another token, which it does from the newly-pushed sublexer! Cool, eh? All you've done is tell the selector to start pulling tokens from the sublexer. :)
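If the push-and-retry dance is hard to picture, the following self-contained model may help. It is not the ANTLR API: TokenSource, Selector, RetryException, and WordLexer are stand-ins I invented, tokens are just whitespace-separated words, and included "files" come from an in-memory map. One simplification to flag: here the selector itself pops a lexer when it returns null at end of input.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a lexer: one token per call, null at end of input. */
interface TokenSource {
    String nextToken();
}

/** Thrown to say "abandon the current token and ask the selector again". */
class RetryException extends RuntimeException {}

/** Stack-based multiplexer; the consumer pulls from this, never from a lexer directly. */
class Selector implements TokenSource {
    private final Deque<TokenSource> stack = new ArrayDeque<>();

    void push(TokenSource source) { stack.push(source); }

    void retry() { throw new RetryException(); }

    @Override
    public String nextToken() {
        while (!stack.isEmpty()) {
            try {
                String token = stack.peek().nextToken();
                if (token != null) return token;
                stack.pop();   // current lexer is done: resume the one below
            } catch (RetryException e) {
                // A lexer pushed a sublexer and bailed out; loop and read from the new top.
            }
        }
        return null;
    }
}

/** Toy lexer over whitespace-separated words; "#include NAME" switches streams. */
class WordLexer implements TokenSource {
    private final Iterator<String> words;
    private final Selector selector;
    private final Map<String, String> files;

    WordLexer(String text, Selector selector, Map<String, String> files) {
        this.words = Arrays.asList(text.trim().split("\\s+")).iterator();
        this.selector = selector;
        this.files = files;
    }

    @Override
    public String nextToken() {
        if (!words.hasNext()) return null;
        String word = words.next();
        if (word.equals("#include") && words.hasNext()) {
            String name = words.next();
            selector.push(new WordLexer(files.get(name), selector, files));
            selector.retry();   // never returns: the directive yields no token
        }
        return word;
    }
}

class SelectorDemo {
    /** Collects every token the consumer would see, starting from the named text. */
    static List<String> tokens(String root, Map<String, String> files) {
        Selector selector = new Selector();
        selector.push(new WordLexer(files.get(root), selector, files));
        List<String> out = new ArrayList<>();
        for (String t = selector.nextToken(); t != null; t = selector.nextToken()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> files = new HashMap<>();
        files.put("main", "a #include inc d");
        files.put("inc", "b c");
        System.out.println(tokens("main", files));
    }
}
```

The consumer loop in tokens() plays the role of the parser: it only ever calls nextToken() on the selector and never learns that more than one lexer was involved.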

The complete code is in examples/includeFile of the 2.7.0 release.