What Exactly is a Word?
September 04, 2011 | Tips | en
A number of InDesign scripts manipulate words for counting, indexing, or other processing purposes. Given a text container (basically a Story), the InDesign Scripting DOM provides many ways to handle text contents through specialized subclasses of the Text interface: one can easily access insertion points, character ranges, lines, paragraphs, text columns, styled-text chunks, and… words. Although this concept seems pretty straightforward, I tried to understand a bit better what it really means.
A word is not a word!
First of all, an InDesign-oriented word is not a lexical unit. If you create a new text frame and enter a crazy string such as *;§!:-/~_%$». (including the final period), the frame is surprisingly regarded as containing a word, and a single one. That's what is displayed in the Info panel, and this is confirmed by the following test:
    // selected is a text frame that only contains:
    // *;§!:-/~_%$».
    var myTextFrame = app.selection[0];
    alert( myTextFrame.words.length ); // => 1
Note that in this simple case we use myTextFrame.words rather than myTextFrame.parentStory.words, but of course the distinction is crucial when you deal with threaded text frames, or text frames that have overset contents.
Obviously, an empty story has no word, since it has no character. However, do not conclude that a story contains a word as soon as it contains a character. The Info panel is misleading on this point and, in fact, may display wrong counts. For example, if a text frame only contains space characters—say one tab and one simple space—the Info panel will claim:
Characters: 2
Words: 1 // this is wrong!
Lines: 1
Paragraphs: 1
whereas actually myTextFrame.words.length==0.
So, what is the right rule? From the InDesign scripting perspective, a Word is a maximal range of characters that does not contain any word separator (like space, tab, etc.). Words are simply the pieces between these non-word regions. Hence, the whole point is to identify what a word separator is.
To break or not to break (words)
We already know that the space and the tab character are word separators. It is not so easy to find other specimens! General punctuation characters (including comma, colon, semicolon, slash…) are not word separators. The figure below shows some (surprising) examples of character strings which each count as a single word:
Note that special characters such as the footnote marker, the table placeholder, a text variable, or any object anchor, do not break words.
Finally, there are only three kinds of characters that are word separators: white spaces (including tab), break characters, and dashes (excluding hyphens). As I didn't find a complete list in Adobe documentation, I used the following script to identify every word separator:
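    // Requires a selection whose parent story can be overwritten.
    var s = app.selection[0].parentStory,
        c = s.characters[1],   // the middle character of "a_b"
        u = 0, r = [], z = -1, t, OK;

    s.contents = "a_b";

    for( u=0 ; u <= 0xFFFC ; ++u )
    {
        // Some code points cannot be assigned: skip them
        try { OK = 0; c.contents = String.fromCharCode(u); OK = 1; }
        catch(_){}
        if( !OK ) continue;

        // If the story now has more than one word, u is a separator
        if( 1 < s.words.length )
        {
            t = u.toString(16).toUpperCase();
            while( t.length<4 ) t = '0'+t;
            r[++z] = 'U+'+t;
        }
    }

    alert( r.join('\r') );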
Here is the result we obtain in CS4 and CS5:
List of InDesign Word Separators
CODE PT | UNICODE NAME | INDESIGN MEANING |
---|---|---|
U+0008 | <ctrl> BACKSPACE | Right indent tab. |
U+0009 | <ctrl> TAB | Regular tabulation. |
U+000A | <ctrl> LINE FEED | Forced line break. |
U+000D | <ctrl> CARRIAGE RETURN | Reflects several break characters, including paragraph return. |
U+0020 | SPACE | Usual space. |
U+0085 | <ctrl> NEXT LINE | Hidden character. (Behaves like a space.) |
U+00A0 | NO-BREAK SPACE | Nonbreaking space. |
U+1680 | OGHAM SPACE MARK | Hidden character. (Behaves like a space.) |
U+180E | MONGOLIAN VOWEL SEP. | Hidden character. (Behaves like a space.) |
U+2000 | EN QUAD | Hidden character. (Behaves like a space.) |
U+2001 | EM QUAD | Flush space. |
U+2002 | EN SPACE | EN Space. |
U+2003 | EM SPACE | EM Space. |
U+2004 | THREE-PER-EM SPACE | Third Space. |
U+2005 | FOUR-PER-EM SPACE | Quarter Space. |
U+2006 | SIX-PER-EM SPACE | Sixth Space. |
U+2007 | FIGURE SPACE | Figure Space. |
U+2008 | PUNCTUATION SPACE | Punctuation Space. |
U+2009 | THIN SPACE | Thin Space. |
U+200B | ZERO WIDTH SPACE | Discretionary Line Break. |
U+2013 | EN DASH | EN Dash (–). |
U+2014 | EM DASH | EM Dash (—). |
U+2028 | LINE SEPARATOR | Hidden character. (Behaves like a space.) |
U+2029 | PARAGRAPH SEPARATOR | Hidden character. (Behaves like a space.) |
U+202F | NARROW NO-BREAK SPACE | Nonbreaking Space (Fixed Width.) |
U+205F | MEDIUM MATH. SPACE | Hidden character, actually implemented though. |
A curious fact is that Hair Space (U+200A), Non-Joiner (U+200C), End Nested Style Here (U+0003), and Indent To Here (U+0007) do not act as word separators in InDesign.
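To make the rule concrete outside InDesign, here is a minimal sketch in plain JavaScript that splits a string on the code points listed in the table above. It only approximates the observed CS4/CS5 behavior and is not InDesign's own implementation; the names SEPARATORS and splitLikeInDesign are hypothetical.

```javascript
// Sketch only: mirrors the separator table above (CS4/CS5 observations).
// SEPARATORS and splitLikeInDesign are illustrative names, not InDesign APIs.
var SEPARATORS = /[\u0008\u0009\u000A\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u2009\u200B\u2013\u2014\u2028\u2029\u202F\u205F]+/;

// Returns the maximal separator-free runs of `text`.
function splitLikeInDesign(text) {
    var parts = text.split(SEPARATORS), words = [], i;
    for (i = 0; i < parts.length; ++i) {
        // split() yields empty strings at the edges: drop them
        if (parts[i].length) words.push(parts[i]);
    }
    return words;
}

splitLikeInDesign("*;§!:-/~_%$».").length; // => 1 (punctuation does not split)
splitLikeInDesign("\t ").length;           // => 0 (separators only, no word)
splitLikeInDesign("a\u2014b").length;      // => 2 (em dash is a separator)
splitLikeInDesign("a\u200Ab").length;      // => 1 (hair space is not)
```

Special InDesign characters such as footnote markers or anchors are not modeled here; the sketch only covers the separator side of the rule.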
Counting and extracting words
Due to the recursive structure of the document layout components, addressing the entire set of textual entities can be a real headache. The general strategy is to browse every Story from the Document.stories collection. This lets you exhaustively explore text contents at any sub-level of the document hierarchy, since any text is supposed to belong to a story. Well, this is almost true, but there are two critical exceptions: footnote and table contents are managed through special ‘strands’ which are not seen as story containers. That's why this usual word counter misses footnotes and tables:
    // Superficial Word Counter
    // (ignoring footnotes and table cells)
    alert( app.activeDocument.stories.everyItem().words.length );
Given a story, you need to inspect footnotes and table cells separately, and you have to use a recursive algorithm because both Cell and Footnote objects may contain nested tables. Here is a generic utility that implements a deep word count, including footnotes and cells at every level:
    // Deep Word Counter
    // (considering footnotes and tables)
    // IMPORTANT:
    // Like any digit sequence, each number that starts a footnote
    // counts itself as a word--unless you use an empty separator!
    var countWords = function F(/*Story|Cell|Footnote*/every)
    {
        var ret, t;

        every = every || app.activeDocument.stories.everyItem();
        if( !every.isValid ) return 0;

        ret = every.words.length;

        t = every.texts && every.texts.everyItem && every.texts.everyItem();
        if( !t ) return ret;

        // Parenthesized so that the assignment is a valid operand of &&
        t.tables.length && (ret += F( t.tables.everyItem().cells.everyItem() ));
        t.footnotes.length && (ret += F( t.footnotes.everyItem() ));

        t = null;
        return ret;
    };

    alert( "Number of words: " + countWords() );
Note: In the above code, the every parameter is a specifier which may address Story, Cell, or Footnote objects. Thanks to the everyItem() syntax, this specifier can also encapsulate a collective command, so the recursive countWords function never needs to create, manage, and browse JavaScript arrays. Everything is done through the command subsystem, which I think improves the performance of the function.
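The recursion itself can be illustrated without InDesign. The sketch below uses plain objects as a stand-in for the Story/Cell/Footnote hierarchy; the field names (words, tables, cells, footnotes) and the deepWordCount function are assumptions made for this illustration, not the InDesign DOM, and unlike countWords it walks ordinary arrays rather than everyItem() specifiers.

```javascript
// Sketch only: plain objects stand in for Story, Cell and Footnote.
// Each container holds its own word count plus optional nested
// tables (with cells) and footnotes. All names are hypothetical.
function deepWordCount(container) {
    var total = container.words || 0,
        tables = container.tables || [],
        notes = container.footnotes || [],
        cells, i, j;

    // Every cell of every table is itself a container
    for (i = 0; i < tables.length; ++i) {
        cells = tables[i].cells || [];
        for (j = 0; j < cells.length; ++j) total += deepWordCount(cells[j]);
    }

    // Footnotes may in turn contain tables, hence the recursion
    for (i = 0; i < notes.length; ++i) total += deepWordCount(notes[i]);

    return total;
}

var story = {
    words: 10,
    tables: [
        { cells: [
            { words: 2 },
            { words: 3, tables: [ { cells: [ { words: 4 } ] } ] }
        ] }
    ],
    footnotes: [ { words: 5 } ]
};

deepWordCount(story); // => 24 (10 + 2 + 3 + 4 + 5)
```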
Finally, turning our word counter into a word extractor is not too difficult:
    // Deep Word Extractor
    // - considering footnotes and tables
    // - removing duplicates
    // DISCLAIMER:
    // This script is not optimized for long documents!
    var extractWords = function(MIN_LENGTH)
    {
        MIN_LENGTH = MIN_LENGTH || 2;

        var obj = {},
            reSkip = /[\x00-\x1F\uFFFC\uFFFD]/g,
            cleanKeys = function(a)
            {
                var re = reSkip, i = a.length >>> 0, o = obj, k;
                while( i-- )
                {
                    k = a[i].replace(re,'');
                    // Parenthesized so that the assignment is a valid operand of &&
                    (MIN_LENGTH <= k.length) && (o[' '+k] = null);
                }
                re = o = null;
            },
            browse = function(every)
            {
                var t;
                if( !every.isValid ) return;
                every.words.length && cleanKeys( every.words.everyItem().contents );
                t = every.texts && every.texts.everyItem && every.texts.everyItem();
                if( !t ) return;
                t.tables.length && browse( t.tables.everyItem().cells.everyItem() );
                t.footnotes.length && browse( t.footnotes.everyItem() );
                t = null;
            };

        browse( app.activeDocument.stories.everyItem() );
        reSkip = cleanKeys = browse = null;

        // Each key of obj is a unique word prefixed with a space
        var k, z = -1, r = [];
        for( k in obj )
        {
            if( !obj.hasOwnProperty(k) ) continue;
            r[++z] = k.substr(1);
        }
        obj = null;
        return r;
    };

    // TEST
    alert( "Words that contain 5+ characters:\r\r" + extractWords(5).sort().join(' | ') );
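The cleanup and de-duplication step of the extractor can be tried in isolation on a plain array of strings. This is a minimal sketch under that assumption; uniqueWords is a hypothetical name, and the reading of the space-prefixed keys given in the comment is my interpretation rather than something the script states.

```javascript
// Sketch only: same skip pattern and space-prefixed keys as the extractor,
// applied to an ordinary array instead of InDesign word contents.
function uniqueWords(rawWords, minLength) {
    var seen = {}, out = [],
        reSkip = /[\x00-\x1F\uFFFC\uFFFD]/g,
        i, k;

    minLength = minLength || 2;

    for (i = 0; i < rawWords.length; ++i) {
        // Strip control characters and object/replacement placeholders
        k = rawWords[i].replace(reSkip, '');
        if (k.length < minLength) continue;

        // The leading space keeps a word like "hasOwnProperty" from
        // shadowing the built-in method used on the next line.
        if (seen.hasOwnProperty(' ' + k)) continue;
        seen[' ' + k] = true;
        out.push(k);
    }
    return out;
}

uniqueWords(["word", "word", "a", "foot\uFFFCnote", "constructor"], 2);
// => ["word", "footnote", "constructor"]
```

Unlike the extractor, this variant keeps the order of first appearance, which is a convenience of the sketch and not a property of the original script.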
• See also:
— InDesign Special Characters;
— On ‘everyItem()’ – Part 1;
— On ‘everyItem()’ – Part 2.
Comments
I have a feeling you're working on a new script and have just flushed out a new issue…
All right, I'll take a guess… Marcup…?! Make of that what you will
I'd love to tell you that you've hit the bull's-eye, but the truth is less romantic: this is more of a behind-the-scenes post in "back to basics" mode.
That said, the two or three issues flushed out in this article are indeed directly related to several scripts in the works ;-)
See you,
Marc