- What is Jparsec?
- Why another parser framework?
- How do I use Jparsec?
- What is Parser?
- What is a scanner object?
- What is a parser object?
- How does jparsec handle left-recursion?
- Does jparsec support operator precedence grammar?
- What is "Parsers.one()"?
- How is EBNF sequencing represented in jparsec?
- How is EBNF alternation represented in jparsec?
- How is kleen star (zero or more) represented in jparsec?
- How is "one-or-more" represented in jparsec?
- How is "N-or-more" represented in jparsec?
- How is "no more than N" represented in jparsec?
- How is "N-or-more-but-no-more-than-M" represented in jparsec?
- How is "repeat N times" represented in jparsec?
- How is "optional" represented in jparsec?
- How do I say "the string cannot be like ..."?
- How do I do "lookahead"?
- What about terminals?
- What is Pattern?
- How is "EOF" represented in jparsec?
- What is "Monad"?
- What is "combinator"?
- What is "fail"?
- What is "retn"?
- What is "zero"?
- What is "plus"?
- What is "map"?
- What is "bind"?
What is jparsec?
Jparsec is a recursive-descent parser combinator framework written for Java. The name "parsec" (parser combinator) is borrowed from the parsec library of the Haskell programming language.
Using jparsec, grammar can be written declaratively in a EBNF-like fashion. For example: Scanners.isChar('a').many() is equivalent to the EBNF notation "'a'*".
Why another parser framework?
There are excellent parser generator frameworks such as JavaCC, ANTLR already out there. Why re-inventing wheels?
Well, all those frameworks today are parser generators.
- You'll need to learn and write additional BNF, EBNF files to use them.
- It's hard to debug "grammar errors" (where a seemingly innocent mistake in the EBNF file may cost you days to catch it)
-
Limitations to reuse grammar.
"*", "+" are normally pre-built in the framework, because they are the most frequently used grammar constructs. But there are other constructs that may be worth reusing.
What if you want to reuse "repeat 2-n times"? What if "ABAB+A" is a recurring pattern in our grammar?
- Difficulty in supporting dynamic grammar (or context-sensitive grammar) where previous parsing result determines the following grammar.
JParsec is a parser combinator framework where:
- no extra grammar file is required.
- All grammars are written in the Java language as objects. Because of this, debugging of grammar and re-using complex grammars are easier in JParsec.
- Dynamic grammar is supported.
We could create a grammar for "AAA..." where the number of repetition is a variable.
This variable can even be a previous parsing result. - Using dynamic grammar, operator precedence grammar is supported. And the precedence can even be determined dynamically at run-time.
How do I use jparsec?
jparsec is very simple to use. All you have to do is to create a Parser object representing the grammar and run it using Parsers.runParser().
For example:
Parsers.runParser("aaa", Scanners.isChar('a').many(), 1, "test parser);
- The first parameter of runParser is the string containing the text to parse.
- The second parameter of runParser is the Parser object representing the grammar.
- The third parameter of runParser is the width of a "horizontal tab" character. This is needed for accurate error reporting.
- The final parameter of runParser is the module name. You can use anything meaningful. It is also needed for error reportig.
- The return value is an object created by the parser. A calculator, for example, will return the calculated result.
- In case of parse error, a ParserException is thrown which contains all the relevant error message, including line number, column number and module name.
What is Parser?
Class Parser is the most important class in jparsec. It represents a grammar.
Using the rich set of combinators provided in class Parser and Parsers, one can create sophisticated grammar
by combining simpler Parser objects.
A Parser object can run at either a character level or at token level.
- A character level parser is called a scanner. It accepts as input a CharSequence object and operates on it.
- A token level parser is called a parser. It accepts as input an array of token objects and operates on these tokens.
How does jparsec handle left-recursion?
Left-recursion is hard to handle in recursive-descent parser. jparsec is no exception. You have to avoid left-recursion in your grammar. Nontheless, jparsec does provide a few combinators that handles some normal left-recursion for you. For example, a left-associative binary operator is naturally represented in a left-recursive notation in EBNF:
ExprPlus ::= <ExprPlus> '+' <ExprMulDiv>jparsec provides a infixl() combinator that handles left associative binary operator for you:
final Parser plus_op = Scanners.isChar('+') .seq(Parsers.retn(new Map2(){ public Object map(Object o1, Object o2){ return new Integer(((Integer)o1).intValue()+((Integer)o2).intValue()); } }) ); final Parser exprplus = Parsers.infixl(plus_op, exprmuldiv);
- The parser for a binary operator should return a Map2 object.
- Map2 is an interface that's responsible for transforming two objects to a single object. In the example, the Map2 object for the '+' operator transforms two integer objects to the sum.
- Parsers.retn() is a parser that does not consume input but directly return a value.
- Parser.seq() is a sequencing operator. It calls the second parser object after the first is completed. In the example above, the Map2 object is returned after the character '+' is consumed from the input.
- The first parameter of infixl() is the parser for the binary operator. It has to return a Map2 object.
- The second parameter of infixl() is the parser for operand. Two return values of this parser will be transformed by the Map2 object returned by the operator parser.
Does jparsec support operator precedence grammar?
Yes. jparsec has built-in mechanism for you to build operator precedence grammar for expressions.final OperatorTable table = new OperatorTable() .infixl(plus_op, 10) .infixl(minus_op, 10) .infixl(mul_op, 20) .infixl(div_op, 20) .prefix(negate_op, 50); final Parser expr = Expressions.buildExpressionParser(Lexers.integer(), table);
- OperatorTable is a class used to describe all the operators and their precedences.
- infixl(plus_op, 10) is saying that the plus operator is left associative and binary. Its precedence 10.
- prefix(negate_op, 50) is saying that the unary negate operator is prefix. Its prededence is 50, which means it has higher precedence than plus operator.
- The first parameter of buildExpressionParser is the parser for the operand. In this case, a parser that recognizes and returns integer.
- The second parameter of buildExpressionParser is the OperatorTable that describes the operators and the precedences.
- buildExpressionParser builds a parser that parses the expression with the provided operand and operators.
What is "Parsers.one()"
Parsers.one() does not consume any input and always succeed regardless of the input.
How is EBNF sequencing represented in jparsec?
The Parser.seq() function is used to represent the sequencing concept.
A ::= B Cis represented in jparsec as:
Parser a = b.seq(c);Sometimes the return value from the first Parser is still needed, and() can be used in this case:
Parser a = b.and(c, new Map2(){ public Object map(Object o1, Object o2){ ... } });
- The second parameter of and() is a Map2 object that transforms the return value of the first and the second Parser object.
How is EBNF alternation represented in jparsec?
Parser.plus() is used for alternating.
A ::= B | Cis represented in jparsec as:
Parser a = Parsers.plus(b, c);
How is kleen star (zero or more) represented in jparsec?
The many() combinator.
A ::= B*is represented in jparsec as:
Parser a = b.many();
How is "one-or-more" represented in jparsec?
The many1() combinator.
A ::= B+is represented in jparsec as:
Parser a = b.many1();
How is "N-or-more" represented in jparsec?
The many() combinator can optionally accept an int parameter that specifies the minimal number of occurrence.
A ::= B B+is represented in jparsec as:
Parser a = b.many(2);
How is "no more than N" represented in jparsec?
The some() combinator can be used to represent this concept.
A ::= Epsilon | B | B B | B B Bis represented in jparsec as:
Parser a = b.some(3);
How is "N-or-more-but-no-more-than-M" represented in jparsec?
The some() combinator can optinally accept another int parameter that specifies the minimal number of occurrence.
A ::= B B | B B Bis represented in jparsec as:
Parser a = b.some(2, 3);
How is "repeat N times" represented in jparsec?
The repeat() combinator can be used to represent this concept.
A ::= B B Bis represented in jparsec as:
Parser a = b.repeat(3);
How is "optional" represented in jparsec?
The optional() combinator can be used to represent this concept.
A ::= B?is represented in jparsec as:
Parser a = b.optional();
How do I say "the string cannot be like ..."?
The not() combinator can be used to represent this concept.
Parser a = b.not();
How do I do "lookahead"?
jparsec is LL(k). That means look-ahead is sometimes useful to avoid ambiguity. The Parser.lookahead() combinator can be used to do this task.
Parser a = Parsers.plus(b.lookahead(2), c);
- a.lookahead(2) tells Parsers.plus() that a 2-token look-ahead is needed to determine which alternative to go. Thus, ambiguity can be avoided if the first input matches both b and c.
What about terminals?
class Scanners provides terminal parsers that scan a CharSequence object;
class Lexers provides terminal parsers that scan a CharSequence object and return a value corresponding to the recognized character sequence.
For example: Scanners.isChar('a'), Scanners.isString("abc"), Scanners.isLineComment(), Lexers.stringLiteral(), Lexers.integer(), Lexers.decimal() etc.
What is Pattern?
Similar to a scanner, a Pattern object is also responsible for scanning a character sequence based on an expected string pattern.
It is different than a scanner in that no error reporting is available. The pattern either matches or mismatches.
A javax.util.regex.Pattern can be adapted to Pattern.
Pattern can be used to create sophisticated terminal scanner.
How is "EOF" represented in jparsec?
Parsers.eof() can be used for this purpose. It only succeeds when the end of the input is encountered.
It can be used to ensure that the entire input matches a grammar.For example, for input "abc", Scanners.isChar('a') will succeed, but Scanners.isChar('a').seq(Parsers.eof()) will fail because it expects no input after character 'a'.
What is "Monad"?
Theoretically, "Monad" is a functor borrowed from the Category Theory.
Without understanding the mathematical background of it,
today's functional programmers are using monad for various complex tasks.
See All About Monads for further understanding about monad.
Intuitively, it is useful to think of a monad as a strategy for combining computations into more complex computations.
Talking about parsec, the Parsers.retn() and Parser.bind() together constitute a Monad.
What is "combinator"?
Combinators are various functions that combine computation objects to create more complex computations.
Talking about parsec, many functions defined in Parsers, Scanners and Parser class combine Parser objects to create more complex Parser objects, therefore, are combinators.
What is "fail"?
"fail" is an optional monad operation that indicates a failure.
Parsers.fail() creates a Parser object that simply report failure regardless of the input.
Different from Parsers.zero(), fail allows you to specify the error message.
What is "retn"?
Parsers.retn() is the "return" operation of Monad.
This Parser object ignore the input and always return an object as its result.
What is "zero"?
Parsers.zero() is similar to "fail" except it does not allow specification of the error message.
"zero" and "plus" together with the monad operations constitute the "MonadPlus" in Haskell,
which, represents monad computations that may fail in one computation and try an alternative computation.
What is "plus"?
Parsers.plus() creates a Parser object that will try an 2nd Parser object when the 1st Parser object fails without any input consumption.
"plus" and "zero" together with the monad operations constitute the "MonadPlus" in Haskell,
which, represents monad computations that may fail in one computation and try an alternative computation.
What is "map"?
"map" is a combinator function that creates a Parser object that transforms the result of another Parser object.
For example, the following Parser uses "map" combinator to transform the result of parser1 from a string to integer.
Parser return_int = some_parser_returning_string.map( new Map(){ public Object map(Object str_result){ return new Integer(Integer.parseInt((String)str_result)); } } );
What is "bind"?
"bind" is the most important operation for any Monad.
It is responsible for combining subsequent computations.
For example, the following Parser uses "bind" combinator to dynamically determine the next Parser object to run:
Parser return_int = some_parser_returning_string.bind( new Binder(){ public Parser bind(Object str_result){ if(str_result instanceof String){ return new Integer(Integer.parseInt((String)str_result)); } else{ return Parsers.fail("string result expected"); } } } );