Is it possible to build a (valid or invalid) sentence generator from a grammar?

Terence Parr

The general answer to getting the set of valid sentences is "no" because there are infinitely long strings, but that only matters in theory. If you limit the input to k-token length sentences, the answer is "yes". It is precisely the same thing as finding the the k-token lookahead sets.

As for invalid sentences, that is also infinite anyway you slice it, however, you can do some interesting (using probabilities) random sentence generation given a particular vocabulary: For each position in the sentence, 1..k, grab a random token from the vocabulary. You can then run the sentences into the parser to see what happens. These sentences, unfortunately, would not be based on the grammar. This is usually is a bad test since the first token or two makes the parser detect an error. To get your valid/invalid sentences, I suggest:

  1. compute the k-lookahead (make a new ANTLR CodeGenerator subclass similar to the incomplete HTMLCodeGenerator I think--if that icky code can be grokked). You have your valid sentences now up to length k.
  2. perturb a single random token or two within a random sample of the sentences--this gives you your almost-valid invalid sentences (the kind of thing humans do).
  3. manually build some common-error scenario sentences (like missing parenthesis).