Pratt’s parser API

The TDOP (Top Down Operator Precedence) parser implemented in this library is a variant of Pratt’s original parser, built on a class for the parser and on meta-classes for the tokens.

The parser base class includes helper functions for registering token classes, Pratt’s methods and a regexp-based tokenizer builder. Additional methods and attributes help with the development of new parsers. A parser is defined by deriving a class and then following a token registration procedure. These classes are not available at package level, only within the module elementpath.tdop.

Token base class

class Token(parser, value=None)

Token base class for defining a parser based on Pratt’s method.

Each token instance is a list-like object. The number of items in a token is the arity of the operator it represents, and the items are its operands. Nullary operators are used for symbols, names and literals. Tokens with items represent the other operators (unary, binary and so on).

Each token class has a symbol, an lbp (left binding power) value and an rbp (right binding power) value, which are used in the sense described by Pratt’s method. This implementation of Pratt tokens includes two extra attributes, pattern and label, that can be used to simplify the parsing of symbols in a concrete parser.

Parameters:
  • parser – The parser instance that creates the token instance.

  • value – The token value. If not provided, defaults to the token symbol.

Variables:
  • symbol – the symbol of the token class.

  • lbp – Pratt’s left binding power, defaults to 0.

  • rbp – Pratt’s right binding power, defaults to 0.

  • pattern – the regex pattern used for the token class. Defaults to the escaped symbol. Can be customized to match more detailed conditions (e.g. a function with its left round bracket), in order to simplify the related code.

  • label – defines the kind of the token class. Its value is used in representations of the token instance and can be used to restrict code choices without more complicated analysis. The label value can be set as needed by the parser implementation (e.g. ‘function’, ‘axis’, ‘constructor function’ are used by the XPath parsers). In the base parser class it defaults to ‘symbol’, with ‘literal’ and ‘operator’ as possible alternatives. If set to a tuple of values, the token class label becomes a multi-value label, which means the token class can cover multiple roles (e.g. as XPath function or axis). In those cases the definitive role is decided at parse time (in the nud and/or led methods), after the token instance is created.
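The list-like design described above can be illustrated with a minimal sketch. This is not the library’s implementation: the class below is a hypothetical stand-in that omits the parser argument, and the regex() helper is an invented name used only to show a pattern that falls back to the escaped symbol.

```python
import re


class Token(list):
    """Hypothetical stand-in for a Pratt token: its items are the operands."""

    symbol = None       # set by each token class
    lbp = 0             # left binding power
    rbp = 0             # right binding power
    label = 'symbol'
    pattern = None      # None means "use the escaped symbol"

    def __init__(self, value=None):
        super().__init__()
        # The value falls back to the symbol of the token class
        self.value = value if value is not None else self.symbol

    @property
    def arity(self):
        # The arity is simply the number of operands stored in the list
        return len(self)

    @classmethod
    def regex(cls):
        # Invented helper: the pattern defaults to the escaped symbol
        return cls.pattern if cls.pattern is not None else re.escape(cls.symbol)


class Plus(Token):
    symbol = '+'
    lbp = rbp = 40
    label = 'operator'


class Name(Token):
    symbol = '(name)'
    label = 'literal'


left, right = Name('a'), Name('b')
plus = Plus()
plus[:] = left, right   # a binary operator holds two operands
```

In this sketch the nullary tokens left and right have arity 0, while plus has arity 2 once its operands are assigned.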

arity
tree

Returns a tree representation string.

source

Returns the source representation string.

nud()

Pratt’s null denotation method

led(left)

Pratt’s left denotation method

evaluate()

Evaluation method

iter(*symbols)

Returns a generator for iterating the token’s tree.

Helper methods for checking symbols and for error raising:

expected(*symbols, message=None)
unexpected(*symbols, message=None)
wrong_syntax(message=None)
wrong_value(message='invalid value')
wrong_type(message='invalid type')

Parser base class

class Parser

Parser class for implementing a Top-Down Operator Precedence parser.

Variables:
  • symbol_table – a dictionary that stores the token classes defined for the language.

  • token_base_class – the base class for creating language’s token classes.

  • tokenizer – the language tokenizer compiled regexp.

position

Property that returns the current line and column indexes.
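As a rough illustration of how a line and column pair can be derived from a character offset, here is a standalone sketch. The function name, its arguments and the 1-based numbering are assumptions made for this example; they are not taken from the library.

```python
def position(source, offset):
    """Return a (line, column) pair for a character offset.

    Both indexes are 1-based here, which is an assumption of this sketch.
    """
    line = source.count('\n', 0, offset) + 1
    # rfind() returns -1 when the offset is on the first line,
    # so the subtraction still yields a 1-based column.
    last_newline = source.rfind('\n', 0, offset)
    column = offset - last_newline
    return line, column
```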

Parsing methods:

parse(source)

Parses source code written in the formal language. This is the main method to call on a parser instance.

Parameters:

source – The source string.

Returns:

The root of the token tree produced by parsing the source.

advance(*symbols, message=None)

Pratt’s function for advancing to the next token.

Parameters:
  • symbols – Optional arguments tuple. If not empty, one of the provided symbols is expected. If the next token’s symbol differs, the parser raises a parse error.

  • message – Optional custom message for unexpected symbols.

Returns:

The current token instance.

advance_until(*stop_symbols)

Advances until one of the symbols is found or the end of the source is reached, returning the raw source string that precedes the stop point. Useful for raw parsing of comments and references enclosed between specific symbols.

Parameters:

stop_symbols – The symbols that have to be found for stopping advance.

Returns:

The source string chunk enclosed between the initial position and the first stop symbol.
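The behaviour described for advance_until() can be sketched as a plain function over a string. This is only an illustration under stated assumptions: the real method works on the parser’s internal state, whereas the start argument and the returned stop position below are invented for the example.

```python
def advance_until(source, start, *stop_symbols):
    """Return (chunk, stop): the raw text from start up to the first
    stop symbol found, or up to the end of the source if none is found."""
    hits = [
        index
        for index in (source.find(symbol, start) for symbol in stop_symbols)
        if index != -1
    ]
    stop = min(hits) if hits else len(source)
    return source[start:stop], stop


# Raw parsing of a comment enclosed between '(:' and ':)'
source = 'x (: a comment :) y'
start = source.index('(:') + 2
chunk, stop = advance_until(source, start, ':)')
```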

expression(rbp=0)

Pratt’s function for parsing an expression. It calls token.nud() and then keeps advancing while the right binding power is less than the left binding power of the next token, invoking the led() method of each token it advances to.

Parameters:

rbp – right binding power for the expression.

Returns:

left token.
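The loop at the heart of Pratt’s expression function can be sketched in a few lines. The parser below is a standalone toy, not the library’s Parser class: the binding power table, the MiniParser name and the tuple-based tree are all assumptions chosen to keep the example short.

```python
import re

# Hypothetical binding powers, valid for this sketch only
LBP = {'+': 10, '-': 10, '*': 20, '^': 30, '(end)': 0}
RIGHT_ASSOCIATIVE = {'^'}


class MiniParser:
    def __init__(self, source):
        self.tokens = re.findall(r'\d+|[-+*^]', source) + ['(end)']
        self.index = 0

    @property
    def next_token(self):
        return self.tokens[self.index]

    def advance(self):
        token = self.tokens[self.index]
        self.index += 1
        return token

    def nud(self, token):
        # Only number literals can start an expression in this toy grammar
        return int(token)

    def led(self, token, left):
        bp = LBP[token]
        # A right-associative operator parses its right operand with bp - 1,
        # so an operator of the same power can still bind on the right side.
        rbp = bp - 1 if token in RIGHT_ASSOCIATIVE else bp
        return (token, left, self.expression(rbp))

    def expression(self, rbp=0):
        left = self.nud(self.advance())
        # Continue while the next token binds more tightly than rbp
        while rbp < LBP[self.next_token]:
            left = self.led(self.advance(), left)
        return left
```

With these assumed binding powers, MiniParser('1 + 2 * 3').expression() groups the multiplication first, and '2 ^ 3 ^ 2' groups to the right because of the bp - 1 adjustment.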

Helper methods for checking parser status:

is_source_start()

Returns True if the parser is positioned at the start of the source, ignoring the spaces.

is_line_start()

Returns True if the parser is positioned at the start of a source line, ignoring the spaces.

is_spaced(before=True, after=True)

Returns True if the source has an extra space (whitespace, tab or newline) immediately before or after the current position of the parser.

Parameters:
  • before – if True, also considers the extra spaces before the current token symbol.

  • after – if True, also considers the extra spaces after the current token symbol.

Helper methods for building new parsers:

classmethod register(symbol, **kwargs)

Register/update a token class in the symbol table.

Parameters:
  • symbol – The identifier symbol for a new class or an existing token class.

  • kwargs – Optional attributes/methods for the token class.

Returns:

A token class.
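A register/update procedure of this kind can be sketched with dynamic class creation. The code below is a simplified assumption of how such a symbol table might work, not the library’s code: SketchToken, SketchParser and the generated class names are hypothetical.

```python
class SketchToken(list):
    """Hypothetical base class for the token classes of this sketch."""
    symbol = None
    lbp = 0
    rbp = 0
    label = 'symbol'


class SketchParser:
    symbol_table = {}
    token_base_class = SketchToken

    @classmethod
    def register(cls, symbol, **kwargs):
        token_class = cls.symbol_table.get(symbol)
        if token_class is None:
            # New symbol: build a token class on the fly with type()
            kwargs['symbol'] = symbol
            name = f'Token{len(cls.symbol_table)}'
            token_class = type(name, (cls.token_base_class,), kwargs)
            cls.symbol_table[symbol] = token_class
        else:
            # Known symbol: update the attributes of the existing class
            for attr, value in kwargs.items():
                setattr(token_class, attr, value)
        return token_class


plus = SketchParser.register('+', lbp=40, rbp=40, label='operator')
same = SketchParser.register('+', lbp=45)   # updates, does not replace
```

Registering the same symbol twice returns the same class with updated attributes, which is the register/update behaviour this sketch is meant to show.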

classmethod unregister(symbol)

Unregister a token class from the symbol table.

classmethod duplicate(symbol, new_symbol, **kwargs)

Duplicate a token class with a new symbol.

classmethod literal(symbol, bp=0)

Register a token for a symbol that represents a literal.

classmethod nullary(symbol, bp=0)

Register a token for a symbol that represents a nullary operator.

classmethod prefix(symbol, bp=0)

Register a token for a symbol that represents a prefix unary operator.

classmethod postfix(symbol, bp=0)

Register a token for a symbol that represents a postfix unary operator.

classmethod infix(symbol, bp=0)

Register a token for a symbol that represents an infix binary operator.

classmethod infixr(symbol, bp=0)

Register a token for a symbol that represents a right-associative infix binary operator.

classmethod method(symbol, bp=0)

Register a token for a symbol that represents a custom operator or redefine a method for an existing token.

classmethod build()

Builds the parser class. Checks that all declared symbols are defined and builds the regex tokenizer using the symbol-related patterns.

classmethod create_tokenizer(symbol_table)

Returns a regex-based tokenizer built from a symbol table of token classes. The returned tokenizer skips extra spaces between symbols.

A regular expression is created from the symbol table of the parser using a template. The symbols are inserted into the template with the longer symbols first. Symbols and their patterns can’t contain spaces.

Parameters:

symbol_table – a dictionary containing the token classes of the formal language.
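The idea of filling a regex template with the symbols, longer ones first, can be sketched as follows. The template, the group names and the simplified symbol table (a plain dictionary mapping each symbol to its pattern) are assumptions made for this example; the template used by the library is not reproduced here.

```python
import re

# Hypothetical template: skip leading spaces, then match one token
TEMPLATE = (
    r'\s*(?:(?P<number>\d+)|(?P<symbol>{symbols})'
    r'|(?P<name>\w+)|(?P<unknown>\S))'
)


def create_tokenizer(patterns):
    """Build a compiled tokenizer from a {symbol: pattern} mapping."""
    # Longer symbols go first, so that '<=' is tried before '<'
    ordered = sorted(patterns, key=len, reverse=True)
    symbols = '|'.join(patterns[symbol] for symbol in ordered)
    return re.compile(TEMPLATE.format(symbols=symbols))


# Each pattern defaults to the escaped symbol in this sketch
patterns = {s: re.escape(s) for s in ('<', '<=', '+', '(', ')')}
tokenizer = create_tokenizer(patterns)


def tokenize(source):
    return [
        (match.lastgroup, match.group(match.lastgroup))
        for match in tokenizer.finditer(source)
    ]
```

Without the length ordering, the input 'a <= b' would be split into '<' followed by an unknown '=', which is why the longer symbols have to come first.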