Reconsidering Array.sort( ) in ExtendScript/JavaScript — Part 1
May 13, 2015 | Tips | en
The pattern array.sort(compareFunc) is widely used by JavaScript programmers whenever the elements being sorted are not plain strings to be ordered as UTF-16 character sequences. In particular, sorting numbers forces you to provide that stupid custom compareNumbers routine which, in most cases, just looks like function(x,y){return x-y}. And even when you do have to sort actual strings, there are countless situations where the default Unicode code point order is inappropriate. But a question arises: what is the performance cost of using a non-native comparison routine?
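As a quick illustration (a minimal sketch of mine, not part of the article's own tests), the default sort coerces numbers to strings, so a compare function is required to get a numeric order:

// Default sort compares items as strings: "10" comes before "2".
var nums = [10, 2, 1, 33];
nums.sort();
alert( nums.join(', ') );                  // => 1, 10, 2, 33

// A custom compare function restores the numeric order.
nums.sort(function(x,y){ return x-y });
alert( nums.join(', ') );                  // => 1, 2, 10, 33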
Background
Like many other JavaScript engines, ExtendScript is extremely fast at comparing and processing strings in their natural state. In a recent project I had to implement a diff utility for InDesign document contents, and I then realized how fast the basic test string1==string2 is, even with huge sequences. Once a long document—say Artamène, or Cyrus the Great—has been stringified from the DOM, comparing, searching and extracting data cannot be done faster than through pure string features.
You may think—as I did—that splitting strings into more sophisticated structures (arrays, lists, graphs or tries…) can help fight redundancy, or reduce memory and performance issues. But what is true in theory becomes wrong in practice as soon as you leave aside primitive JS tools. And the reason is simple: no matter how powerful your algorithm is, your code remains interpreted while native functions are compiled. So do not defy the power of string1==string2 by introducing some home-made binary-search algorithm. You lose.
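To get a feel for this, here is a minimal timing sketch of my own, not part of the original benchmark, using ExtendScript's $.hiresTimer (which returns the number of microseconds elapsed since it was last accessed):

// Build a large string (about one million characters).
var big1 = Array(1e5+1).join("lorem ipsum"),
    big2 = big1.concat();

$.hiresTimer;                       // reset the timer
var same = (big1 == big2);          // native string comparison
var t = $.hiresTimer;               // elapsed microseconds

alert( "Equal: " + same + "  |  Time (microseconds): " + t );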
Timing
Similar reasoning applies when it comes to Array.prototype.sort. The native behavior of this method is hardcoded in some C++ quicksort routine, highly optimized through Boost or whatever, and ready to run internal pointers over your string array.
However, as ExtendScript has been brilliantly designed—at least in my opinion!—by Michael Daumling, the cost of invoking a custom callback function rather than the hardcoded one does not seem prohibitive at first. Indeed, since comparing strings through s1 < s2 is natively ultra-fast, a custom
function compareStr(s1,s2) { return s1 < s2 ? -1 : +(s1 != s2) }
will run very fast too. Hence, we don't expect to notice a serious gap on the timer when comparing myArray.sort() against myArray.sort(compareStr).
// =====================================
// Native vs. Custom Sort on strings
// using an equivalent compare function
// =====================================

const N = 1e3,            // number of items
      MAGNITUDE = 0xFFFF;

var myStrings = new Array(N),
    aNative, aCustom,
    rNative, rCustom;

// Feed short random strings.
// ---
(function(a,k,i)
    {
    for( i=a.length ; i-- ; a[i]=(k*Math.random()).toString(36) );
    })(myStrings,MAGNITUDE);

// Native process timing.
// ---
aNative = myStrings.concat(); // clone
$.hiresTimer;
aNative.sort();
rNative = $.hiresTimer;

// Custom process timing.
// ---
aCustom = myStrings.concat(); // clone
$.hiresTimer;
aCustom.sort(function(x,y){ return x < y ? -1 : +(x != y) });
rCustom = $.hiresTimer;

// Results.
alert( [
    "Time Ratio: " + (rCustom/rNative).toPrecision(2),
    "Equality check: " + (aNative.toSource()==aCustom.toSource())
    ].join('\r') );
My tests show that the native sort runs two to three times faster in the above context—1,000 items—which is far from negligible. The time ratio drops to about 1.5 when 10,000 strings are inserted (changing N to 1e4), while by contrast it becomes really significant for shorter sets. In other words, the cost of connecting a pretty fast compare routine becomes less and less noticeable as the amount of data increases, because sorting 10,000+ units will anyway cause an internal heat that takes precedence over that ridiculous callback stuff.
But my point is that in everyday cases—and sorting 1,000 strings or fewer sounds to me like an everyday case—the native sort is significantly faster even when the alternative approach is very well optimized. To claim that either way the absolute execution time remains very short is not a relevant argument to me, because three times faster at the scale of a millisecond still means three times faster at the scale of a second, or a minute, when that same routine is called again and again from some loop or any other parent module.
Counting
There is an interesting question behind this: how many times does a comparison occur during a sort? According to experts, and assuming ExtendScript's implementation is more or less based on the Quicksort algorithm, the sort() method should invoke the callback function about N×Log(N) times in the average case. More precisely, this formula only gives an order of magnitude, up to some nearly constant factor.
It's easy to get more concrete results by running the following script with different values of N, from 10 to 100,000 (by increasing powers of 10).
// =====================================
// Counting hits during a sort
// =====================================

const N = 1e3,            // test other values: 10, 100, 1e3, 1e4
      MAGNITUDE = 0xFFFF;

var myStrings = new Array(N);

// Feed short random strings.
// ---
(function(a,k,i)
    {
    for( i=a.length ; i-- ; a[i]=(k*Math.random()).toString(36) );
    })(myStrings,MAGNITUDE);

// Add a counter to the compare function.
// ---
var compareStr = function F(x,y)
    {
    F.count++;
    return x < y ? -1 : +(x != y);
    };

// Initialize the counter, then run.
// ---
compareStr.count = 0;
myStrings.sort(compareStr);

// Results.
alert( "Hits: " + compareStr.count );
It is more convenient to represent the slope on a logarithmic scale. The script has been executed multiple times for each N in order to retrieve relevant average values. Let's call heat the number of hits divided by N. What is meaningful in that ratio? It roughly represents the number of callbacks (i.e. comparisons) that a single item involves on average.
We can see that the heat is remarkably “linear” over this chart, which means, in fact, that the value is remarkably logarithmic relative to N. Each time you multiply the number of array elements by ten, the heat increases by 5 or 6 degrees. Furthermore, in absolute terms you can estimate that, given N, your compare function will run about 5×N×Log(N) times (decimal logarithm). This approximation seems the most acceptable for N around 1,000—leading to ~15,000 hits.
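If you want numbers rather than a chart, here is a quick back-of-the-envelope helper of mine that turns the 5×N×Log(N) rule of thumb into expected hit counts:

// Estimate the expected number of compare callbacks
// from the empirical rule: hits ~ 5 * N * log10(N).
function estimateHits(n)
    {
    return Math.round( 5 * n * Math.LOG10E * Math.log(n) );
    }

alert( [
    "N=100     -> ~" + estimateHits(100),     // ~1,000
    "N=1,000   -> ~" + estimateHits(1e3),     // ~15,000
    "N=10,000  -> ~" + estimateHits(1e4)      // ~200,000
    ].join('\r') );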
Note. — By the way, never attempt to throw an error from your callback function in the hope of breaking the sort process. This would make InDesign crash.
Allow me to rephrase what we have found: let myStrings be an unsorted array of about 1,000 items and compareFunc be your custom compare function; then myStrings.sort(compareFunc) will invoke compareFunc about 15,000 times. Now suppose that compareFunc consumes one millisecond on average: the entire sort will last at least 15 seconds. I bet you get the picture now.
Weighting
Customized compare routines often involve assigning weights to items, so that comparing two items amounts to comparing the related “sort keys.” The task of computing, managing or comparing these weights might be expensive. A very common example is the Unicode Collation Algorithm (UCA), whose main purpose is precisely to perform language-aware string sorting. Another obvious case is sorting items which are not primarily strings, typically objects (resp. arrays) having a dedicated property (resp. index) hosting the sort key.
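For instance, sorting objects by a dedicated key property might look like this minimal sketch of mine (the records and the 'name' property are hypothetical):

// Hypothetical records carrying a 'name' property used as the sort key.
var people = [
    { name: "Zoe",    age: 31 },
    { name: "Albert", age: 54 },
    { name: "Marc",   age: 47 }
    ];

// The compare callback extracts the sort key from each item.
people.sort(function(x,y)
    {
    var kx = x.name, ky = y.name;
    return kx < ky ? -1 : +(kx != ky);
    });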
Note. — ExtendScript implements String.prototype.localeCompare, but as far as I can tell it doesn't do much more than returning this < what ? -1 : +(this != what).
The whole question boils down to this: provided that your compare criteria imply significant calculations for producing and/or retrieving the scalar weight of each item, is it reasonable to integrate those heavy operations inside a callback function that is invoked 5×N×Log(N) times?
I won't dwell on the recommended workaround regarding weight storage—for it is quite well debated everywhere else—but let's put it in a nutshell.
The above diagram illustrates the basic “time consuming” design pattern (!) for simple weights that can be coerced to numbers (and simple string items); a code sketch of that pattern follows below. What should immediately shock you is that getWeight is invoked about 30,000 times for an array of 1,000 elements. Therefore, the associated weight of a single item is, on average, recomputed 30 times! Which means that, within the entire sort duration, the part of the process that strictly regards weight calculation is stupidly stretched by a factor of 30. This issue becomes particularly worrying when computing a key is the warmest room in the house.
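Here is a minimal sketch of that naive pattern, assuming myStrings is the string array from the previous scripts; getWeight is a hypothetical, deliberately expensive key-computing function standing in for whatever your real criteria require:

// Hypothetical expensive weight function (placeholder for real criteria).
function getWeight(s)
    {
    // ...heavy computation based on the string s...
    return s.length;   // dummy result for the sketch
    }

// Naive pattern: getWeight runs twice per comparison,
// i.e. about 2 x 15,000 = 30,000 times for 1,000 items.
myStrings.sort(function(x,y){ return getWeight(x) - getWeight(y); });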
Fortunately, JavaScript provides an out-of-the-box feature that solves this problem: “associative arrays,” that is, those powerful hash tables instantly storing and maintaining string-to-value relationships. This native tool allows us to cache the results of calling getWeight(myString), so that the cached weight is returned whenever it is already available for a given string. All you have to do is initialize an empty cache object at some level of your implementation, which may even sit high above the sorting area, since string/weight pairs can be considered persistent data.
So you get to this picture, where the warm part of getWeight (actual computing) has been isolated:
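In code, the cached variant might look like this sketch (again with the hypothetical getWeight; the cache object and the cachedWeight helper are names of mine):

// Persistent cache, declared above the sorting area.
var cache = {};

// Cached lookup: the heavy part runs at most once per distinct string.
function cachedWeight(s)
    {
    return cache.hasOwnProperty(s) ? cache[s] : ( cache[s] = getWeight(s) );
    }

myStrings.sort(function(x,y){ return cachedWeight(x) - cachedWeight(y); });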
Warning. — In ExtendScript, reflect is a reserved read-only property name for every object. Hence, when arbitrary, uncontrolled keys may show up, you must prevent occurrences of "reflect" from undermining your code. A usual workaround is to prepend a special character, such as '\x01', to any incoming key, and then to use myKey.substr(1) when you need to recover the clean string.
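Applied to the cache sketch above, the prefixing trick might look like this (the '\x01' prefix is the one suggested in the warning; the helper name is mine):

var cache = {};

function cachedWeight(s)
    {
    // Prefix the key so that strings like "reflect" cannot
    // collide with ExtendScript's reserved object properties.
    var k = '\x01' + s;
    return cache.hasOwnProperty(k) ? cache[k] : ( cache[k] = getWeight(s) );
    }

// To enumerate the cache later, strip the prefix from each key:
// for( var k in cache ) { var cleanString = k.substr(1); /* ... */ }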
All we have done so far is optimize the way we handle weights during the compare callback. Other improvements could be made in that field, but that's not the issue here. The issue is, we cannot change the fact that compareFunc will run 15,000 times, so we cannot change the fact that the operation “getting-someone's-weight,” as fast as it is, will be processed 30,000 times within JavaScript's scope—while it actually targets only 1,000 items! In addition, we chose a favorable scenario where comparing weights technically relies on subtracting numbers. (Things are not as simple in collation algorithms…)
But is there no way to precompute values so as to completely eliminate JS overheating?
Comments
Hi Marc,
Thanks for this interesting article. Looking forward to the discussion of your last question.
Interesting article indeed!
Thanks for your support, buddies!
@+
Marc