Context-Sensitive Tokenization With CongoCC

𝅘𝅥𝅮𝅘𝅥𝅮 I think a good guitar solo sounds so much better within the context of a good song.𝅘𝅥𝅮𝅘𝅥𝅮

Gary Moore

If you want to interpret a programming language or a DSL, there’s no way around using a parser generator. At the heart of FreshMarker is a template parser generated by the CongoCC parser generator. I can’t praise this JavaCC successor from Jon Revusky highly enough. Many new features and completely revised parts make the parser generator much more usable than its seemingly dead predecessor. CongoCC also helps wonderfully with a small current problem.

In my career, I have had to discover and replace many a homemade language. Often, their constructs were processed line by line, with regular expressions evaluating each command. No matter how these languages were originally designed, the constant addition of new features turned the source code into a mess. In addition, the logic used to read the language was not separated from the actual functionality of the language. Eventually, these languages reached a state where no further adjustments were possible. They were stuck, like an old, slightly too large cabinet in the doorway of a room.

CongoCC as a parser generator helps to elegantly circumvent these problems. The grammar and the source code generated from it keep the language definition separate from the parser code. Two CongoCC features help with this separation in particular: the parse tree and the inject mechanism. The parse tree allows the result to be processed further after parsing without losing important information about the language. A helpful pattern here is the visitor pattern. CongoCC’s inject mechanism allows you to customize the classes generated by the parser. You can add methods or fields to a generated class, or declare additional interfaces that it implements.
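To illustrate how a visitor keeps processing logic out of the tree classes, here is a minimal Java sketch. The node types (TreeNode, NumberNode, BinaryNode) and the TreeVisitor interface are hypothetical, hand-written stand-ins for this example. They are not the classes CongoCC generates, which depend on the grammar.

```java
// Hypothetical visitor interface: one method per node type.
interface TreeVisitor<R> {
    R visitNumber(NumberNode node);
    R visitBinary(BinaryNode node);
}

// Hypothetical parse tree node; a real generated tree would look different.
interface TreeNode {
    <R> R accept(TreeVisitor<R> visitor);
}

record NumberNode(int value) implements TreeNode {
    @Override
    public <R> R accept(TreeVisitor<R> visitor) {
        return visitor.visitNumber(this);
    }
}

record BinaryNode(TreeNode left, char operator, TreeNode right) implements TreeNode {
    @Override
    public <R> R accept(TreeVisitor<R> visitor) {
        return visitor.visitBinary(this);
    }
}

// Evaluation lives in a visitor, not in the node classes or the parser.
class EvaluatingVisitor implements TreeVisitor<Integer> {
    @Override
    public Integer visitNumber(NumberNode node) {
        return node.value();
    }

    @Override
    public Integer visitBinary(BinaryNode node) {
        int left = node.left().accept(this);
        int right = node.right().accept(this);
        return switch (node.operator()) {
            case '+' -> left + right;
            case '*' -> left * right;
            case '%' -> left % right;
            default -> throw new IllegalArgumentException("Unknown operator: " + node.operator());
        };
    }
}

// A second operation on the same tree, added without touching the nodes.
class PrintingVisitor implements TreeVisitor<String> {
    @Override
    public String visitNumber(NumberNode node) {
        return Integer.toString(node.value());
    }

    @Override
    public String visitBinary(BinaryNode node) {
        return "(" + node.left().accept(this) + " " + node.operator() + " "
                + node.right().accept(this) + ")";
    }
}

class VisitorDemo {
    public static void main(String[] args) {
        TreeNode tree = new BinaryNode(new NumberNode(2), '+',
                new BinaryNode(new NumberNode(3), '*', new NumberNode(4)));
        System.out.println(tree.accept(new PrintingVisitor()));   // (2 + (3 * 4))
        System.out.println(tree.accept(new EvaluatingVisitor())); // 14
    }
}
```

Both visitors walk the same tree, so a new way of processing the parse result only needs a new visitor class.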

In version 2.5.0, FreshMarker gained lambda support as a new feature. Lambdas can now be passed as parameters to some built-ins and to one operator. Several built-ins for sequences make use of lambdas. However, a problem arose with the built-in filter.

The filter built-in generates a new sequence that contains all elements of the original sequence that meet a criterion. The criterion is passed as a parameter in the form of a lambda.

list?filter(n -> n % 2 == 0)

In this example, a sequence of even numbers is generated because all odd numbers are filtered out.
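For readers who think in Java, the semantics described above correspond roughly to filtering a list with a predicate. The following sketch is only an analogy built on the Java Stream API. The class and method names are made up for illustration, and this is not how FreshMarker implements the built-in.

```java
import java.util.List;
import java.util.function.Predicate;

class SequenceFilterSketch {
    // Returns a new list containing the elements that meet the criterion;
    // the original sequence is left unchanged.
    static <T> List<T> filter(List<T> sequence, Predicate<T> criterion) {
        return sequence.stream().filter(criterion).toList();
    }

    public static void main(String[] args) {
        List<Integer> numbers = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // Same criterion as in the template example: keep the even numbers.
        System.out.println(filter(numbers, n -> n % 2 == 0)); // [2, 4, 6, 8, 10]
    }
}
```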

The first implementation failed because filter is a reserved word in the FreshMarker template language. Reserved words are key elements of the language that distinguish different constructs from one another. The reserved word filter is used in the List Directive.

<#list 1..10 as s filter s % 2 == 0>
${s} * ${s} = ${s*s}
</#list>

There, filter has a similar function to the new built-in: the sequence used in the list directive is filtered before the loop runs. Because filter is a reserved word, however, it currently cannot be used anywhere else.

This is due to FreshMarker’s CongoCC grammar. Here is the rule for the List Directive. It contains a whole range of reserved words: LIST, AS, SORTED, ASCENDING, DESCENDING, WITH, FILTER, OFFSET, and LIMIT.

List #ListInstruction :
    (<FTL_DIRECTIVE_OPEN1>|<FTL_DIRECTIVE_OPEN2>)
    <LIST><BLANK>
    Expression
    <AS>
    <IDENTIFIER>
    [
      [ <SORTED> (<ASCENDING> | <DESCENDING>) ]
      <COMMA> <IDENTIFIER>
    ]
    [ <WITH> <IDENTIFIER> ]
    [ <FILTER> Expression ]
    [ <OFFSET> Expression ]
    [ <LIMIT> Expression ]
    <CLOSE_TAG>
    Block
    DirectiveEnd("list")
;

The FreshMarker template language would already benefit if the reserved words AS, SORTED, ASCENDING, DESCENDING, WITH, FILTER, OFFSET, and LIMIT were only valid within the List Directive. Then they could be used elsewhere as IDENTIFIER tokens.

In fact, CongoCC can do this: the tokens are first deactivated and then activated only at the appropriate place in the grammar.

List #ListInstruction :
    (<FTL_DIRECTIVE_OPEN1>|<FTL_DIRECTIVE_OPEN2>)
    <LIST><BLANK>
    ACTIVATE_TOKENS FILTER, OFFSET, LIMIT, SORTED, WITH, AS
    (
    Expression
    <AS>
    <IDENTIFIER>
    [
      [
        <SORTED>
        ACTIVATE_TOKENS ASCENDING, DESCENDING
        (<ASCENDING> | <DESCENDING>)
      ]
      <COMMA> <IDENTIFIER>
    ]
    [ <WITH> <IDENTIFIER> ]
    [ <FILTER> Expression ]
    [ <OFFSET> Expression ]
    [ <LIMIT> Expression ]
    <CLOSE_TAG>
    Block
    DirectiveEnd("list")
    )
;

The rule has changed only slightly. In the fourth line, the tokens FILTER, OFFSET, LIMIT, SORTED, WITH, and AS are activated; they then apply from the opening parenthesis in the next line to the matching closing parenthesis at the end of the rule. In line 12, the two tokens ASCENDING and DESCENDING are activated, and they apply only to the alternative in line 13. As a result, the names asc and desc can be used as IDENTIFIER everywhere else; only at this one point are they recognized as ASCENDING and DESCENDING. Unfortunately, this is not the case for the other reserved words in the List Directive.
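The effect of activating tokens can be imitated with a small hand-written lexer in Java. This is a toy model under stated assumptions, not CongoCC's generated code: the class name, the keyword spellings, and the rule "ASCENDING and DESCENDING are active for exactly one token after SORTED" are simplifications chosen for this sketch, and the input is just whitespace-separated words.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ContextSensitiveLexerSketch {
    enum Kind { IDENTIFIER, SORTED, ASCENDING, DESCENDING, FILTER }

    // Keyword spellings assumed for this sketch.
    static final Map<String, Kind> KEYWORDS = Map.of(
            "sorted", Kind.SORTED,
            "asc", Kind.ASCENDING,
            "desc", Kind.DESCENDING,
            "filter", Kind.FILTER);

    // A word becomes a keyword token only while that token kind is active;
    // otherwise it falls back to IDENTIFIER.
    static Kind classify(String word, Set<Kind> active) {
        Kind keyword = KEYWORDS.get(word);
        return keyword != null && active.contains(keyword) ? keyword : Kind.IDENTIFIER;
    }

    // Tokenizes whitespace-separated words. The kinds in baseActive are active
    // for the whole input; ASCENDING and DESCENDING are additionally active
    // for the single token that follows SORTED.
    static List<Kind> tokenize(String input, Set<Kind> baseActive) {
        List<Kind> kinds = new ArrayList<>();
        boolean afterSorted = false;
        for (String word : input.trim().split("\\s+")) {
            Set<Kind> active = EnumSet.noneOf(Kind.class);
            active.addAll(baseActive);
            if (afterSorted) {
                active.add(Kind.ASCENDING);
                active.add(Kind.DESCENDING);
            }
            Kind kind = classify(word, active);
            kinds.add(kind);
            afterSorted = kind == Kind.SORTED;
        }
        return kinds;
    }

    public static void main(String[] args) {
        Set<Kind> insideList = EnumSet.of(Kind.SORTED, Kind.FILTER);
        // Prints [IDENTIFIER, SORTED, ASCENDING, IDENTIFIER, FILTER]
        System.out.println(tokenize("asc sorted asc desc filter", insideList));
    }
}
```

In this model, asc and desc are ordinary identifiers except directly after sorted, while filter is still a keyword everywhere within the active scope, which mirrors the remaining limitation for the other reserved words.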

With the existing grammar, one restriction remains: because the activation spans the whole directive, the reserved words FILTER, OFFSET, LIMIT, SORTED, WITH, and AS cannot be used as identifiers anywhere inside the List Directive, including the expressions and the block it contains. We will explore a solution for this another time.
