Let's imagine that the publisher sends you a list of terms to be indexed in the following form (the → character indicates a cross-reference):

Indexing instructions provided by the client.

This list is much shorter than a real one would be, but it captures the patterns you'll be dealing with in practice. First, you need to replace the “A → B” schemas (x-refs) with their equivalent in IndexMatic syntax:

   // A => B

This should be easy as pie for your text editor.

Converting cross-references into IndexMatic queries.
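
If you would rather script this step than use a text editor's find-and-replace, the conversion could be sketched in Python as follows. This is only an illustration: the name convert_xref is hypothetical, and the sketch assumes that each cross-reference occupies a whole line of the form “A → B”.

```python
import re

# Assumption: a cross-reference is a whole line of the form "A → B".
XREF = re.compile(r"^\s*(.+?)\s*→\s*(.+?)\s*$")

def convert_xref(line: str) -> str:
    """Rewrite 'A → B' as '// A => B'; return any other line unchanged."""
    m = XREF.match(line)
    if m is None:
        return line
    return f"// {m.group(1)} => {m.group(2)}"

print(convert_xref("Arouet → Voltaire"))          # // Arouet => Voltaire
print(convert_xref("Newton, Isaac (1643-1727)"))  # Newton, Isaac (1643-1727)
```

Lines that contain no → character pass through untouched, so the function can be applied to the whole list.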

Regarding the actual queries, one fact should jump out at you: in almost all cases, a comma divides the input line into two parts. The part before the comma is then almost always a sufficient key to identify the entry unambiguously in the document, provided that it does not collide with homographs.

For example, we shall assume that Aaron matches the entry “Aaron, Jeff (1936-2001)”, as does Alió-Barràs for “Alió-Barràs, Karl (1961)”, and so on:

Highlighting key components before the comma.

There are, however, three exceptions to the general scheme:

   Diophantos von Alexandria (ca. 250 n. Chr.)
   Russell (1855)
   Voltaire (Arouet, François-Marie) (1694-1778)

In the first two cases, no comma appears, which prevents us from extracting a relevant key. In the last case, the comma only occurs after the opening parenthesis (“Voltaire (Arouet,…”), so splitting at the comma would produce a key that is too long (we would prefer Voltaire alone).

But we notice that in those circumstances, it is the opening parenthesis (preceded by a space character) that plays the role of optimal separator. Combining the two options (PARENTHESIS or COMMA) therefore gives a criterion that always reveals the end of the search key. In regex code, the separator pattern that interests us is expressed as:

   (\s\(|,)

(Read: “a whitespace character followed by an opening parenthesis, or a comma.”) If we mentally apply this separator to the original list, we see (in blue below) all the KEYS that our query list should ultimately look for.

In blue, our actual search keys.
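
To check the separator rule outside IndexMatic, here is a minimal Python sketch. The helper name search_key is hypothetical, and Python's re module stands in for the ExtendScript regex engine, which is assumed to treat this simple pattern the same way.

```python
import re

# The separator: a whitespace character followed by "(", or a comma.
SEPARATOR = re.compile(r"\s\(|,")

def search_key(entry: str) -> str:
    """Return the text before the first separator (the whole entry if there is none)."""
    return SEPARATOR.split(entry, maxsplit=1)[0]

for entry in (
    "Aaron, Jeff (1936-2001)",
    "Diophantos von Alexandria (ca. 250 n. Chr.)",
    "Russell (1855)",
    "Voltaire (Arouet, François-Marie) (1694-1778)",
):
    print(search_key(entry))
# Aaron
# Diophantos von Alexandria
# Russell
# Voltaire
```

The three exceptions listed earlier yield the expected keys because the space-plus-parenthesis alternative is met before any comma.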

Remember this rule. We will only search the document for the elements in blue, but we will associate each match with the complete line as written by the client. Take for example the expression “Alió-Barràs, Karl (1961)”. We actually want our Query List to digest it like this:

   /Alió-Barràs/Ih => Alió-Barràs, Karl (1961)

(To make the regular expression more robust, I added the case-sensitivity and generic-hyphen options, hence the /Ih flags.)


Now that our objective is clear, the whole question is to implement this process by modifying the existing list as little as possible. This is precisely the purpose of the ~format (or ~split) directives.

Although ~split often works very well when it comes to splitting elements of the input line, I shall use ~format here because it gives more control over the incoming patterns and will allow the special cross-reference syntax to be bypassed.

IndexMatic³ Manual: ~format Directive Reference

The first argument to pass to ~format is a regular expression (ExtendScript dialect) that will parse any input line. The idea is to catch the portions that interest us in capturing parentheses, then rearrange them in order to produce the actual query, the one that will be received by IndexMatic's interpreter.

In this case, we want to capture all characters BEFORE the separator pattern (\s\(|,) since this will deliver our search key. But we want to capture this prefix under a double constraint:

    1. On the one hand, using a “non-greedy” quantifier which guarantees a minimal match;

    2. On the other hand, rejecting the slash character (/) so as to completely ignore the form //... which starts comments or cross-references.

The combination of all our criteria results in the regular expression

   /^([^\/]+?)(\s\(|,)/

which can be broken down as follows:

Regex passed to the ~format directive.
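
As a quick sanity check, the same pattern can be tried in Python, whose re module is assumed here to behave like the ExtendScript dialect for these constructs. The sketch annotates each component of the regex, compares it with a greedy variant (GREEDY is a hypothetical counter-example, not something the directive uses), and confirms that lines starting with a slash are rejected.

```python
import re

LAZY = re.compile(r"""
    ^            # start of the input line
    ([^\/]+?)    # group 1: one or more non-slash characters, as few as possible
    (\s\(|,)     # group 2: the separator (whitespace then "(", or a comma)
""", re.VERBOSE)

# Hypothetical greedy variant, shown only for comparison.
GREEDY = re.compile(r"^([^\/]+)(\s\(|,)")

entry = "Ximénès, Augustin-Louis, marquis de (1728-1817)"
print(LAZY.match(entry).group(1))    # Ximénès
print(GREEDY.match(entry).group(1))  # Ximénès, Augustin-Louis, marquis de

# A line starting with "//" cannot match: the first character is a slash.
print(LAZY.match("// Arouet => Voltaire"))  # None
```

The greedy quantifier runs to the last separator in the line, which is why the minimal match is needed to isolate the key.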

We are only interested in the first captured element. In the directive's output scheme, it is written ^1, while the complete line in its initial state is written ^0. The query we want to produce therefore looks like /^1/=>^0, to which we'll add safeguard flags that handle case sensitivity, generic space and generic hyphen: /^1/Ish=>^0.

This results in the complete directive that we will insert at the very beginning of the list (note the :: separator between the regex and the output pattern):

   //~ format /^([^\/]+?)(\s\(|,)/ :: /^1/Ish => ^0

You may spend some time coming up with this command, which of course needs to be adapted to the formatting and punctuation of your own file. But once it is in place, the list instantly works as a pure Query List without any further editing: the directive acts on all lines that follow it (until it encounters a blank line or a stop directive), whether there are 100, 500, 1000 or 5000 items to process.

Note. — We even gave ourselves the ability to accept cross-references among incoming data. This little trick relies on the fact that when the ~format directive finds no match in the current line, it simply keeps the line as is and moves on to the next one.
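
The behavior described above can be modeled in a few lines of Python. This is only a sketch of what the text describes, not IndexMatic's actual implementation: the function apply_format and its template parameter are hypothetical, and the placeholders ^0 and ^1 are simply substituted as text.

```python
import re

PATTERN = re.compile(r"^([^\/]+?)(\s\(|,)")

def apply_format(line: str, template: str = "/^1/Ish => ^0") -> str:
    """Model of the directive: rewrite a matching line, keep any other line as is."""
    m = PATTERN.match(line)
    if m is None:
        return line  # comments and cross-references fall through unchanged
    fields = {"0": line, "1": m.group(1)}  # ^0 = whole line, ^1 = first capture
    return re.sub(r"\^([01])", lambda t: fields[t.group(1)], template)

for line in (
    "Alió-Barràs, Karl (1961)",
    "// Arouet => Voltaire",
    "Russell (1855)",
):
    print(apply_format(line))
# /Alió-Barràs/Ish => Alió-Barràs, Karl (1961)
# // Arouet => Voltaire
# /Russell/Ish => Russell (1855)
```

Running a sample of your own list through such a model is a cheap way to spot entries whose key would come out wrong before loading the list into IndexMatic³.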

So here is the list of queries ready to be loaded into IndexMatic³:

// IndexMatic3 Query List
// ---

//~ format /^([^\/]+?)(\s\(|,)/ :: /^1/Ish => ^0
Aaron, Jeff (1936-2001)
Alió-Barràs, Karl (1961)
// Ambroise => d’Ambroise
// Arouet => Voltaire
Béraud, Paul (père ~) (1705)
Bertholon de Lazare, Jacques (†1800)
Böhr, Walter (1906-1985)
Calmant, ? (†1840?)
Cochrane, Billy (1898-1938)
Dalla Volpi, Danilo (1925-1995)
d’Ambroise, Lucien (1967)
Diophantos von Alexandria (ca. 250 n. Chr.)
Ehner, Mike (1959)
Fermat, Pierre de (1607-1665)
Frézier, Amédée-François (1682-1773)
Guretzky-Cornitz, Bernhard von (1838-1873)
// Henry => Russ
Jänisch, Carl Ferdinand von (1813-1872)
Krebb, Daniel (1947)
La Croix de Castries, Charles E. Gabriel de (1727-1801)
Laplace, Pierre-Simon (1749-1827)
// Linde => van der Linde
Łuczak, Mario (1939-2009)
McKay, Brendan Damian McKay (1951)
Newton, Isaac (1643-1727)
O’Connor, Joseph (1939)
Parejo Casañas, Francesco N. (?) (1890-1983)
Rabelais, François (ca. 1494 [1483?]; †1553)
Russ, William Henry (1833-1866)
Russell (1855)
Schoumoff, Ilja (Stepanowitsch) (1819-1881)
Talleyrand-Perigord, Charles-Maurice de (1754-1838)
Ukers, William Harrison (1873-1945)
van der Linde, Antonius (1833-1897)
Vandermonde, Alexandre-Théophile (1735-1796)
Voltaire (Arouet, François-Marie) (1694-1778)
Witcomb, H. (1846)
Ximénès, Augustin-Louis, marquis de (1728-1817)
//~
 
// End of the directive scope
// Add additional queries if needed

There is probably no situation where ~format will cover all queries in your index. More subtle regular expressions are still needed to sort out homonyms, group spelling variations, and/or optimize search keys. However, directives provide a great batch processing tool when the vast majority of your index entries have a predictable, simple, and consistent structure.