Built-In String Features in IdExtenso
March 29, 2024 | Tips | en
If you are one of the happy users of the IdExtenso scripting framework for InDesign, you may have noticed (or overlooked!) that this enhanced version of ExtendScript provides many additional services, including in the primitive areas of the language. For example, every string immediately gains features like trim(), codePointAt(), and toUTF8(), which aren't available in the core language. Let's take a closer look at this toolbox…
Clean ASCII Uneval
• String.prototype.toSource(?quoteChar) -> string
Returns a safe, quote-enclosed ASCII string in the form "..." (default formatting) such that eval(str.toSource())===str. You can set the quoteChar argument to a single quote character ("'" aka "\x27") to get '...' delimiters instead. This function overrides the original ExtendScript method in order to yield shorter outputs:
Input String | ExtendScript Output | IdExtenso Output |
---|---|---|
"abc" | (new String("abc")) | "abc" |
"\\\r\n\t\v\f\0" | (new String("\\\r\n\t\x0B\f\x00")) | "\\\r\n\t\v\f\0" |
"àbçdé" | (new String("\u00E0b\u00E7d\u00E9")) | "\xE0b\xE7d\xE9" |
When tested on a 23,052-byte JPEG file, IdExtenso's toSource() returns 60,433 characters while the native method needs 89,653 characters. You save almost 30K characters, about a third of the native output!
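To make the idea concrete, here is a minimal sketch in plain JavaScript of how a compact, ASCII-only string literal can be produced. It is not IdExtenso's actual code: the name toSourceSketch and the escaping choices are assumptions made for illustration, chosen so that the results agree with the table above.

```javascript
// Hypothetical helper, NOT the IdExtenso implementation.
// Builds a quote-enclosed, ASCII-only literal that evaluates back to `str`.
function toSourceSketch(str, quoteChar) {
    var q = quoteChar === "'" ? "'" : '"';
    // Short escapes, keyed by character code.
    var SHORT = { 8: '\\b', 9: '\\t', 10: '\\n', 11: '\\v', 12: '\\f', 13: '\\r', 92: '\\\\' };
    var out = q, i, c, ch, next, hex;
    for (i = 0; i < str.length; i++) {
        c = str.charCodeAt(i);
        ch = str.charAt(i);
        if (SHORT.hasOwnProperty(c)) {
            out += SHORT[c];
        } else if (c === 0) {
            // "\0" must not be followed by a digit (it would read as an octal escape).
            next = str.charAt(i + 1);
            out += (next >= '0' && next <= '9') ? '\\x00' : '\\0';
        } else if (ch === q) {
            out += '\\' + q;                 // escape the enclosing quote
        } else if (c >= 0x20 && c < 0x7F) {
            out += ch;                       // printable ASCII is kept as is
        } else {
            hex = c.toString(16).toUpperCase();
            if (c <= 0xFF) {
                out += '\\x' + (hex.length < 2 ? '0' + hex : hex); // 2-digit form
            } else {
                while (hex.length < 4) hex = '0' + hex;
                out += '\\u' + hex;                                 // 4-digit form
            }
        }
    }
    return out + q;
}
```

The saving comes from preferring the two-digit \xHH form and the short escapes (\v, \0) over \uHHHH wherever the character code allows it, and from dropping the (new String(...)) wrapper.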
Trimming, Truncating, Padding
• String.prototype.[trim|ltrim|rtrim]() -> string
These three very popular methods, missing from ExtendScript, allow you to remove spaces at the ends of a string. ltrim() left-trims the string, rtrim() right-trims the string, and trim() applies both left- and right-trimming.
Note. - All space characters available in InDesign and Unicode are targeted, including U+205F MEDIUM MATHEMATICAL SPACE.
    var s = " \t\u2000 Hello World \xA0\u2028";
    alert( s.trim().toSource() );  // => "Hello World"
    alert( s.ltrim().toSource() ); // => "Hello World \xA0\u2028"
    alert( s.rtrim().toSource() ); // => " \t\u2000 Hello World"
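One common way to implement this kind of trimming is a pair of anchored regular expressions built on an explicit character class. The sketch below is an illustration in plain JavaScript, not IdExtenso's code: the function names and the exact list of space characters are assumptions, and the set the framework really targets may differ.

```javascript
// Assumed set of space characters (illustrative; IdExtenso's real list may differ).
var SPACE_CLASS = "[\\t\\n\\v\\f\\r \\xA0\\u1680\\u2000-\\u200A" +
                  "\\u2028\\u2029\\u202F\\u205F\\u3000\\uFEFF]";
var RE_LEADING  = new RegExp("^" + SPACE_CLASS + "+");
var RE_TRAILING = new RegExp(SPACE_CLASS + "+$");

// Hypothetical stand-ins for ltrim(), rtrim() and trim().
function ltrimSketch(s) { return s.replace(RE_LEADING, ""); }
function rtrimSketch(s) { return s.replace(RE_TRAILING, ""); }
function trimSketch(s)  { return rtrimSketch(ltrimSketch(s)); }
```

Listing the characters explicitly, rather than relying on the engine's \s class, keeps the behavior identical across ExtendScript versions whose notion of whitespace may vary.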
• String.prototype.stripSpaces() -> string
Strips all space characters from a string.
    var s = "\tHello World !\u2000";
    alert( s.stripSpaces().toSource() ); // => "HelloWorld!"
• String.prototype.[trunc|ltrunc|rtrunc](size, ?ellip, ?wb) -> string
These methods remove either the MIDDLE (trunc), LEFT (ltrunc), or RIGHT (rtrunc) part of a string according to the maximum size parameter (uint).
- If the string is already shorter than size, it is returned as is. Otherwise, at most size characters are kept.
- By default, ellip (ellipsis) is set to three dots (...), but you can pass any custom string as the second argument.
- The wb argument (boolean, optional) tells whether the result must preserve word boundaries.
    var s = "And this Fyodor Pavlovich began to exploit; that is, he fobbed him off with small sums.";
    var t = s.rtrunc(25, "…", true); // Detect word boundaries
    alert( t ); // => `And this Fyodor…`
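The right-truncation case can be sketched as follows in plain JavaScript. This is only an illustration: rtruncSketch is a hypothetical name, and the sketch assumes that the ellipsis counts toward the size limit and that word boundaries are plain spaces. Both assumptions are consistent with the example above but are not confirmed details of IdExtenso.

```javascript
// Hypothetical right-truncation helper (not IdExtenso's code).
// Assumption: the ellipsis counts toward the `size` limit.
function rtruncSketch(str, size, ellip, wb) {
    if (ellip === undefined) ellip = "...";
    if (str.length <= size) return str;            // short enough: returned unchanged
    var keep = Math.max(0, size - ellip.length);   // room left for actual text
    var cut = str.slice(0, keep);
    if (wb && /\S/.test(str.charAt(keep))) {
        // The cut falls inside a word: back up to the previous space, if any.
        var sp = cut.lastIndexOf(" ");
        if (sp > 0) cut = cut.slice(0, sp);
    }
    // Drop trailing spaces so the ellipsis sticks to the last kept word.
    return cut.replace(/\s+$/, "") + ellip;
}
```

With the sentence used above, a size of 25 and a one-character ellipsis leave 24 characters, which end in the middle of "Pavlovich"; backing up to the previous space gives the same result as the example.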
• String.prototype.[rpad|lpad](size, ?padChar) -> string
Extends the RIGHT (rpad) or LEFT (lpad) of the string using a padding character (space by default) until the length reaches size.
    alert( "abc".rpad(5).toSource() );      // => "abc  "
    alert( "abc".lpad(5, '_').toSource() ); // => "__abc"
Code Point Manager
• String.fromCodePoint(array) -> string
This static method implements ECMAScript's String.fromCodePoint function. Pass in either a simple array of code points (numbers in 0..0x10FFFF), or a list of code points (arguments). The function returns the UTF16-encoded string.
    var s = String.fromCodePoint([0x61, 0x28FF0, 0x62]);
    alert( s.toSource() ); // => "a\uD863\uDFF0b"
• String.prototype.codePointAt(position) -> number
Implements ECMAScript's String.prototype.codePointAt function, which returns the code point (0..0x10FFFF) found at the supplied position (uint). In addition, the function's SIZE property is set to the number of consumed code units (0:None; 1:RegularCharCode; 2:Surrogate).
    alert( "012".codePointAt(1) );             // => 0x31
    alert( "a\uD863\uDFF0b".codePointAt(1) ); // => 0x28FF0
    alert( "a\uD863\uDFF0b".codePointAt(2) ); // => 0xDFF0 (inside the surrogate pair)
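The surrogate arithmetic behind these two functions can be sketched in plain JavaScript as shown below. The helper names are hypothetical and this is not IdExtenso's code. Returning undefined for an out-of-range position follows the ECMAScript function and is an assumption as far as IdExtenso is concerned.

```javascript
// Hypothetical helpers illustrating UTF-16 surrogate arithmetic.
function fromCodePointSketch(codePoints) {
    var out = "", i, cp;
    for (i = 0; i < codePoints.length; i++) {
        cp = codePoints[i];
        if (cp < 0x10000) {
            out += String.fromCharCode(cp);                    // single code unit
        } else {
            cp -= 0x10000;                                     // 20 significant bits remain
            out += String.fromCharCode(0xD800 + (cp >> 10),    // high surrogate: top 10 bits
                                       0xDC00 + (cp & 0x3FF)); // low surrogate: bottom 10 bits
        }
    }
    return out;
}

// Returns the code point at `pos` and stores the number of code units consumed
// in codePointAtSketch.SIZE (0, 1 or 2), mimicking the SIZE property described above.
function codePointAtSketch(str, pos) {
    var hi, lo;
    if (pos < 0 || pos >= str.length) {
        codePointAtSketch.SIZE = 0;
        return undefined;                                      // assumption: nothing to read
    }
    hi = str.charCodeAt(pos);
    if (hi >= 0xD800 && hi <= 0xDBFF && pos + 1 < str.length) {
        lo = str.charCodeAt(pos + 1);
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            codePointAtSketch.SIZE = 2;
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    codePointAtSketch.SIZE = 1;
    return hi;
}
```

For U+28FF0, subtracting 0x10000 leaves 0x18FF0, whose top ten bits are 0x63 and bottom ten bits are 0x3F0. This yields the pair \uD863\uDFF0 shown in the examples.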
UTF8 Converter
• String.fromUTF8(string-or-array) -> string
Given a sequence of valid UTF8 codes (string or array), rebuilds and returns the original UTF16 string.
    var s = String.fromUTF8("\xC3\x80\xC3\x89\xC3\x94");
    // or: String.fromUTF8([0xC3, 0x80, 0xC3, 0x89, 0xC3, 0x94]);
    alert( s );            // => `ÀÉÔ`
    alert( s.toSource() ); // => "\xC0\xC9\xD4"
• String.prototype.toUTF8() -> string
Converts this string (assumed to be in native UTF16) into UTF8. The result is formed of characters whose codes are all <= 0xFF. Keep in mind that the output string is in a “transport format” for encoding purposes: it shouldn't be displayed as such!
    var utf8 = "ÀÉÔ".toUTF8();
    alert( utf8.toSource() ); // => "\xC3\x80\xC3\x89\xC3\x94"
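For readers curious about what such a conversion involves, here is a minimal UTF16-to-UTF8 encoder in plain JavaScript. It is a sketch of the standard encoding scheme under a hypothetical name (toUTF8Sketch), not IdExtenso's implementation. How the framework treats unpaired surrogates is unknown, so this sketch simply encodes them as ordinary 3-byte sequences.

```javascript
// Hypothetical UTF-16 -> UTF-8 encoder; each output character carries one byte (<= 0xFF).
function toUTF8Sketch(str) {
    var out = "", i, c, lo;
    for (i = 0; i < str.length; i++) {
        c = str.charCodeAt(i);
        // Merge a valid surrogate pair into a single code point.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < str.length) {
            lo = str.charCodeAt(i + 1);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                i++;
            }
        }
        if (c < 0x80) {                 // 1 byte:  0xxxxxxx
            out += String.fromCharCode(c);
        } else if (c < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
            out += String.fromCharCode(0xC0 | (c >> 6), 0x80 | (c & 0x3F));
        } else if (c < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += String.fromCharCode(0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F),
                                       0x80 | (c & 0x3F));
        } else {                        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += String.fromCharCode(0xF0 | (c >> 18), 0x80 | ((c >> 12) & 0x3F),
                                       0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Each of the three accented letters in the example lies between U+0080 and U+07FF, which is why each one becomes exactly two bytes.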
Base64 Decoder/Encoder
• String.fromBase64(string-or-array, ?AS_BYTES) -> string
Given a sequence of valid Base64 codes (string or array), reconstructs and outputs the original (JavaScript) string. By default, the resulting bytes are considered UTF8 units and then converted into UTF16. If the boolean flag AS_BYTES is set, the function returns the bytes without performing the UTF8-to-UTF16 conversion.
Note. - B64 codes are ASCII characters in the set A-Za-z0-9+/=.
    var s = String.fromBase64("SW5kaXNjcmlwdHM=");
    alert( s ); // => `Indiscripts`
• String.prototype.toBase64(?AS_BYTES) -> string
Converts this string into Base64 code. The result is always a string formed of B64 characters. By default, the string is regarded as a full UTF16 string, so it is converted into UTF8 bytes and then passed to the B64 converter. If AS_BYTES is set, the method bypasses the UTF16-to-UTF8 conversion and treats each incoming character as a byte. (Thus, if the string contains units greater than 0xFF, only the 8 lowest bits are kept.)
    var b64 = "Indiscripts".toBase64();
    alert( b64 ); // => `SW5kaXNjcmlwdHM=`
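The byte-level half of this process (what the AS_BYTES path amounts to) can be sketched in plain JavaScript as follows. The function name is hypothetical and the code is a generic Base64 encoder, not IdExtenso's. The default path would first convert the string to UTF8 bytes, a step left out here.

```javascript
var B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical encoder: treats each character of `bytes` as one byte
// (only the 8 lowest bits are kept) and emits 4 B64 characters per 3 bytes.
function bytesToBase64Sketch(bytes) {
    var out = "", n = bytes.length, i, b0, b1, b2;
    for (i = 0; i < n; i += 3) {
        b0 = bytes.charCodeAt(i) & 0xFF;
        b1 = i + 1 < n ? bytes.charCodeAt(i + 1) & 0xFF : 0;
        b2 = i + 2 < n ? bytes.charCodeAt(i + 2) & 0xFF : 0;
        out += B64_ALPHABET.charAt(b0 >> 2)
             + B64_ALPHABET.charAt(((b0 & 3) << 4) | (b1 >> 4))
             + (i + 1 < n ? B64_ALPHABET.charAt(((b1 & 15) << 2) | (b2 >> 6)) : "=")
             + (i + 2 < n ? B64_ALPHABET.charAt(b2 & 63) : "=");   // "=" pads short groups
    }
    return out;
}
```

Since "Indiscripts" is pure ASCII, its UTF8 bytes equal its character codes, so both paths give the same result for that input.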
ExtendScript Patches
• String.prototype.indexOf(search, ?pos) -> integer
In older ExtendScript versions, str.indexOf(search) might not work when str contains U+0000 before the match and search has more than one character. This bug is fixed in IdExtenso.
    alert( "\0\0ABC\0XX".indexOf("ABC") );   // => 2 (all versions)
    alert( "\0\0ABC\0XX".indexOf("ABC",3) ); // => -1 (all versions)
• String.prototype.lastIndexOf(search, ?pos) -> integer
In CS4, str.lastIndexOf('\0') wrongly returns the length of the string! This bug is fixed in IdExtenso.
    alert( "abcd".lastIndexOf('\0') ); // => -1 (all versions)
    alert( "\0\0".lastIndexOf('\0') ); // => 1 (all versions)
• String.prototype.split(separator, ?limit) -> array
Although split has been fixed in later versions, the method fails in ExtendScript CS4 whenever U+0000 is involved, and it then yields weird results. This bug is fixed in IdExtenso.
    alert( "aei\0abc\0\0xyz\0".split('\0') );       // => ["aei", "abc", "", "xyz", ""]
    alert( "aei\0abc\0\0xyz".split(/[ab\x00]+/) ); // => ["", "ei", "c", "xyz"]
• String.prototype.charAt(pos) -> string (char)
In JavaScript, charAt can pick a U+0000 character, e.g. "x\0y".charAt(1) returns "\0". But in ExtendScript an empty string is returned whenever charAt should yield "\0". This issue is fixed in IdExtenso.
    var c = "x\0y".charAt(1);
    alert( c.toSource() ); // => "\0"
Miscellaneous
• String.random(len) -> string
This static method produces a random string of length len (default: 4) matching the pattern /[a-z][0-9a-z]*/. It is very useful for generating random IDs.
    alert( String.random() );   // => e.g. `i1x4`
    alert( String.random(16) ); // => e.g. `gj1duwcgsqk9t8fz`
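A generator of this kind takes only a few lines. The sketch below, in plain JavaScript, is a hypothetical stand-in (randomIdSketch) based on Math.random. IdExtenso's own source of randomness and its handling of invalid lengths are not documented here, so those details are assumptions.

```javascript
// Hypothetical generator: first character in a-z, the others in 0-9a-z.
function randomIdSketch(len) {
    var FIRST = "abcdefghijklmnopqrstuvwxyz";
    var REST = "0123456789" + FIRST;
    if (len === undefined) len = 4;   // default length
    if (len < 1) return "";           // assumption: nothing to generate
    var out = FIRST.charAt(Math.floor(Math.random() * FIRST.length));
    while (out.length < len) {
        out += REST.charAt(Math.floor(Math.random() * REST.length));
    }
    return out;
}
```

Forcing a letter in first position makes the result usable as a JavaScript identifier or an XML ID, which is presumably why the pattern starts with [a-z].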
• String.levenDist(string1,string2) -> uint
Measures the difference between two strings string1 and string2 using the Levenshtein distance algorithm. The returned value is an unsigned integer.
    alert( String.levenDist("Indiscripts", "indiscripts") ); // => 1
    alert( String.levenDist("Adobe", "Acrobat") );           // => 4
    alert( String.levenDist("InDesign", "Photoshop") );      // => 8
Note. - A more sophisticated routine, String.levenFilter(...), is also provided. It builds a sub-array of strings based on a reference array, an incoming string and a maximum Levenshtein distance. See the code for further details.
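For reference, the classic dynamic-programming form of the Levenshtein distance fits in a short function. The version below is a generic textbook sketch in plain JavaScript under a hypothetical name (levenDistSketch). IdExtenso's routine may be organized or optimized differently.

```javascript
// Textbook Levenshtein distance using two rolling rows (not IdExtenso's code).
function levenDistSketch(a, b) {
    var m = a.length, n = b.length, prev = [], curr = [], i, j, cost, tmp;
    for (j = 0; j <= n; j++) prev[j] = j;           // distance from "" to b[0..j)
    for (i = 1; i <= m; i++) {
        curr[0] = i;                                // distance from a[0..i) to ""
        for (j = 1; j <= n; j++) {
            cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
            curr[j] = Math.min(prev[j] + 1,         // deletion
                               curr[j - 1] + 1,     // insertion
                               prev[j - 1] + cost); // substitution (or match)
        }
        tmp = prev; prev = curr; curr = tmp;        // swap rows instead of reallocating
    }
    return prev[n];
}
```

The comparison is case-sensitive, which is why "Indiscripts" and "indiscripts" are at distance 1.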
• String.prototype.charSet(?KEEP_ORDER) -> string
Returns (as a string) the set of all characters present in this string. By default, the returned string is UTF16 ordered, unless the KEEP_ORDER flag is true. This function is useful to determine the entire character set that your text data (story, document, etc.) actually requires.
    var s = "Hello_Wonderful_World!";
    alert( s.charSet() );     // => `!HW_deflnoru`
    alert( s.charSet(true) ); // => `Helo_Wndrfu!`
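The underlying idea (collect each character once, then optionally sort) can be sketched in plain JavaScript as below. The name charSetSketch is hypothetical and the code is not taken from IdExtenso. It works on UTF16 code units, so a surrogate pair would be treated as two separate characters.

```javascript
// Hypothetical helper: unique characters of `str`, sorted unless keepOrder is true.
function charSetSketch(str, keepOrder) {
    var seen = {}, chars = [], i, ch;
    for (i = 0; i < str.length; i++) {
        ch = str.charAt(i);
        if (!seen.hasOwnProperty("$" + ch)) {   // "$" prefix avoids clashes with built-in keys
            seen["$" + ch] = true;
            chars.push(ch);
        }
    }
    if (!keepOrder) chars.sort();               // default sort compares UTF-16 code units
    return chars.join("");
}
```

In the sorted form, "!" (U+0021) comes first and uppercase letters precede "_" and the lowercase letters, matching the order of their character codes.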
• String.prototype.unaccent() -> string
Removes the accents from a string. This method supports basic diacritics of the Latin, Greek, Cyrillic and Hebrew alphabets.
Note. - Ligatures like œ or ij ARE NOT converted into digraphs. A more advanced routine might be implemented for that purpose.
    alert( "ÀçĎéĩĵĶńőŕşūŵŷż".unaccent() ); // => `AcDeijKnorsuwyz`
    alert( "ΐΫάέή".unaccent() );           // => `ιΥαεη`
    alert( "ӝӟӥӫӵӛ".unaccent() );          // => `жзиөчә`
• String.prototype.subReplace(what, repl, where, OUTSIDE) -> string
Replaces what (string or RegExp) by repl (string or function) inside or outside the substrings captured by where (RegExp). This method performs replacements only in specific areas determined by a regular expression:
- If OUTSIDE is false or missing, replacements are processed in every substring captured by where (the outside is preserved).
- If OUTSIDE is true, replacements are processed outside the substrings captured by where (the inside is preserved).
The 1st and 2nd parameters are defined as in String.prototype.replace() and have the same meaning and behavior. The regular expression where only delineates the scope of replacement. It may involve multiple substrings if the /g global flag is set; otherwise it will capture at most one matching substring.
    var src = "abc<def><ghi>-mno<stu>";
    var what = /[aeiou]/gi;
    var where = /<[^>]+>/g;
    // Replace vowels with # only in `<...>` areas
    var r = src.subReplace(what, '#', where);
    alert( r ); // => `abc<d#f><gh#>-mno<st#>`
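One way to obtain this behavior is to walk through the matches of where and apply an ordinary replace() either to each match or to the gaps between matches. The plain JavaScript sketch below illustrates that approach under a hypothetical name (subReplaceSketch). It is not IdExtenso's implementation, and because each segment is processed in isolation, anchors such as ^ or $ in what would apply per segment.

```javascript
// Hypothetical scoped replace: `outside` selects the gaps instead of the matches.
function subReplaceSketch(src, what, repl, where, outside) {
    var out = "", last = 0, m, inner, gap;
    where.lastIndex = 0;
    while ((m = where.exec(src)) !== null) {
        gap = src.slice(last, m.index);             // text before this match
        inner = m[0];                               // text captured by `where`
        out += outside ? gap.replace(what, repl) + inner
                       : gap + inner.replace(what, repl);
        last = m.index + inner.length;
        if (!where.global) break;                   // without /g, only one area
        if (inner.length === 0) where.lastIndex++;  // avoid looping on empty matches
    }
    gap = src.slice(last);                          // text after the last match
    return out + (outside ? gap.replace(what, repl) : gap);
}
```

Calling it with the same arguments plus a true fifth argument processes the complementary areas, giving `#bc<def><ghi>-mn#<stu>` for the sample string.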
• String.prototype.asPath() | ...toPath(str) | ...relativePath(str)
These three methods handle POSIX paths based on the slash separator / and the conventional shortcuts .. (double dot) and . (dot). Although they are perfectly usable in your own code, they are primarily intended as internal IdExtenso routines.
• IdExtenso: github.com/indiscripts/IdExtenso
• Implementation of the String extensions
• Sample scripts (for newbies)