How to Emulate a Lookbehind in IndexMatic²
January 12, 2022 | IndexMatic²
Although most GREP patterns are supported in IndexMatic², lookbehind is not. This operator allows you to capture a match only if some particular pattern occurs BEFORE it. Under certain circumstances you might need to build an index based on such fine-tuned queries. Let's see how to get around the wall…
First and foremost, remember that IndexMatic² is based on ExtendScript Regular Expressions, a dialect that supports positive and negative LOOKAHEAD assertions, written X(?=Y) and X(?!Y) respectively, but cannot digest the LOOKBEHIND schemes (?<=Y)X and (?<!Y)X to which GREP power users are accustomed.
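As a reminder of what the supported lookahead assertions do, here is a minimal sketch in plain JavaScript (the language family ExtendScript belongs to). The helper name `allMatches` and the sample string are illustrative assumptions, not part of IndexMatic².

```javascript
// Hypothetical helper: collect every full match of a global regex.
function allMatches(re, text) {
  if (!re.global) throw new Error("regex needs the g flag");
  var out = [], m;
  re.lastIndex = 0;
  while ((m = re.exec(text)) !== null) {
    out.push(m[0]);
    if (m[0] === "") re.lastIndex++; // guard against zero-length matches
  }
  return out;
}

var sample = "123|shelf 456 table";

// Positive lookahead X(?=Y): three digits only if a pipe follows.
// The pipe itself is not part of the match.
var followedByPipe = allMatches(/\d\d\d(?=\|)/g, sample);    // ["123"]

// Negative lookahead X(?!Y): three digits only if no pipe follows.
var notFollowedByPipe = allMatches(/\d\d\d(?!\|)/g, sample); // ["456"]
```

In both cases the condition Y sits AFTER the match, which is why these assertions cannot express a prefix condition.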
Yet the prefix of an expression may determine whether it is relevant to your index. Usually, capturing the full expression is enough to solve the problem. For example, the query /Prime Minister/ de facto excludes the isolated expression “Minister”. Thus, the LOOKBEHIND condition only makes sense when you need to capture an expression provided that it follows a certain pattern, but without capturing that pattern.
Positive Lookbehind
Now suppose you are working on a product catalog in which every important name (the one you want to index) is always preceded by a 3-digit number and a pipe, as in 123|shelf. The prefix pattern \d\d\d\| is the condition that makes the word shelf relevant, but you don't want to see that code in your final index.
In GREP syntax it would suffice to use something like (?<=\d\d\d\|)\w+ to grab the name that follows any possible prefix code. But this won't work in ExtendScript.
However, IndexMatic² can practically solve the problem. Just capture the whole pattern and output the relevant part:
/(\d\d\d\|)(\w+)/ => $2
The general form of a “fake lookbehind query” is (BEFORE)(MATCH) => $2, where BEFORE stands for the prefix pattern that is required although hidden, MATCH being the visible part. Thanks to the capturing parentheses, the rule => $2 disregards the content of the first parenthesis (i.e., $1) and outputs only the MATCH component (i.e., $2).
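The effect of the (BEFORE)(MATCH) => $2 scheme can be reproduced in plain JavaScript as a rough sketch. The helper `captureGroup` and the catalog string are hypothetical, and the code only mimics the output rule; it is not how IndexMatic² is implemented.

```javascript
// Hypothetical helper: return capture group `n` of every match
// of a global regex, skipping matches where that group is unset.
function captureGroup(re, text, n) {
  if (!re.global) throw new Error("regex needs the g flag");
  var out = [], m;
  re.lastIndex = 0;
  while ((m = re.exec(text)) !== null) {
    if (m[n] !== undefined) out.push(m[n]);
    if (m[0] === "") re.lastIndex++; // guard against zero-length matches
  }
  return out;
}

var catalog = "123|shelf and 456|table, but not chair or 78|lamp";

// Same pattern as the query above; keeping group 2 plays the role of "=> $2".
var names = captureGroup(/(\d\d\d\|)(\w+)/g, catalog, 2);
// names is ["shelf", "table"]: "chair" has no prefix, "78|" has only two digits.
```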
Note: There is a subtle downside to this. IndexMatic² still reports the page number of the whole match. If BEFORE happens to fall on page 2 and MATCH on page 3, the final index will claim that MATCH has been found on page 2, because the first character of the string that matches /(BEFORE)(MATCH)/ is indeed on that page.
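The offset arithmetic behind this note can be sketched in plain JavaScript. The page layout (`pageOf`) and the sample text are hypothetical, and the sketch does not claim to reproduce how IndexMatic² resolves page numbers internally.

```javascript
// Hypothetical layout: offsets 0..7 are on page 2, page 3 starts at offset 8.
function pageOf(offset) {
  return offset < 8 ? 2 : 3;
}

var text = "see 123|shelf";
var m = /(\d\d\d\|)(\w+)/.exec(text);

// A regex match is located by the first character of the WHOLE match...
var wholeStart = m.index;                 // 4, where "123|" begins
// ...while the visible part ($2) begins right after the hidden prefix ($1).
var visibleStart = m.index + m[1].length; // 8, where "shelf" begins

var reportedPage = pageOf(wholeStart);    // 2
var actualPage = pageOf(visibleStart);    // 3
```

With a true lookbehind the match would start at `visibleStart`, so the discrepancy would not arise.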
Negative Lookbehind
Here we are asked to capture a term as long as it is not preceded by a certain pattern. This sounds like a completely different problem since no positive pattern governs the prefix part. Yet we can still extract the allowed term using the trick below:
(BEFORE MATCH)|(MATCH) => $2
where BEFORE MATCH (1st capturing parenthesis) targets the form that we want to bypass, and MATCH alone (2nd parenthesis) represents a valid match. Thanks to the alternation, the variable $1 is set whenever BEFORE MATCH is found in the text, while $2 only receives a MATCH in the opposite case. Since the query only outputs $2, any context that fills $1 by satisfying the BEFORE MATCH pattern is ignored. So you get, in practice, the effect of a negative lookbehind.
For illustration, let's reverse the condition of our previous example. Our goal is now to index the expressions of the form \w+ unless they are prefixed by \d\d\d\|. We then just have to run the IndexMatic² query
/(\d\d\d\|\w+)|(\w+)/ => $2
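The same query can be tried out in plain JavaScript to see which alternative consumes what. This is only a sketch: the helper `captureGroup` and the sample text are hypothetical, and keeping group 2 merely imitates the => $2 output rule.

```javascript
// Hypothetical helper: return capture group `n` of every match
// of a global regex, skipping matches where that group is unset.
function captureGroup(re, text, n) {
  if (!re.global) throw new Error("regex needs the g flag");
  var out = [], m;
  re.lastIndex = 0;
  while ((m = re.exec(text)) !== null) {
    if (m[n] !== undefined) out.push(m[n]);
    if (m[0] === "") re.lastIndex++; // guard against zero-length matches
  }
  return out;
}

var text = "123|shelf chair 456|table lamp";

// Alternative 1 swallows the prefixed forms into $1;
// alternative 2 only gets the words that are left over.
var unprefixed = captureGroup(/(\d\d\d\|\w+)|(\w+)/g, text, 2);
// unprefixed is ["chair", "lamp"]
```

The order of the alternatives matters here: with (\w+) placed first, the engine would match "123" as a plain word before ever trying the prefixed form.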
Note in passing that the scheme /(X)|(Y)/ => $2 has many other applications. Say you need to identify every occurrence of the word “token”, except those that are in quotes. The easiest way to do that is to capture both forms, making sure the greedy (quoted) one is placed as the first alternative, then to output $2:
/(["“'‘]token["”'’])|(token)/ => $2
Always the same principle: consume the undesired expression in the first block to prevent the second block from capturing it.
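A quick way to check the “token” query is to count, in plain JavaScript, how many matches actually fill the second group. The sample sentence and variable names below are made up for the illustration.

```javascript
var sentence = 'A "token" is quoted, a token is not, and ‘token’ is quoted too.';

// Same pattern as the query above, with the g flag to scan the whole string.
var re = /(["“'‘]token["”'’])|(token)/g;

var quotedCount = 0;   // matches consumed by the first block ($1)
var unquotedCount = 0; // matches that reach the second block ($2)
var m;
while ((m = re.exec(sentence)) !== null) {
  if (m[2] !== undefined) unquotedCount++;
  else quotedCount++;
}
// quotedCount is 2, unquotedCount is 1
```

Only the unquoted occurrence would be output by => $2; the two quoted ones are matched but discarded.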