“Computer programming is an art, because it applies accumulated knowledge to the world, because it requires skill and ingenuity, and especially because it produces objects of beauty. A programmer who subconsciously views himself as an artist will enjoy what he does and will do it better.”
Donald Knuth
Some algorithms are real classics and yet nobody really knows them. Day in, day out, everyone is happy about the beautiful wrapping of their text in the word processor, but why does the text actually look so beautiful spread across the lines?
For line breaks, there is the usual simple ad-hoc approach and then the masterful variant. The basic idea with line breaks is to insert as much text into a line until the line is full. Then the next line is started. The text consists of character strings separated by spaces.
In the simplest form, the text is divided into individual words separated by spaces. As long as the line is not yet filled, i.e. the width of the words and spaces inserted so far does not exceed the line width, further words and spaces are inserted from the text. If the line is filled, the next word is inserted into a new line.
The examples shown here were generated by by my own Java implementations of an ad-hoc algorithm and the Knuth Plass algorithm. The implementation follows the description from the paper by Knuth and Plass from 1981, so this implementation took a little longer. Something similar has already happened to me with the implementation of the Double Array Trie. SVG generated with EchoSVG was chosen as the output medium so that the results can be easily presented here.
We will break the following text paragraph into individual lines. It comes from the paper by Knuth and Plass and illustrates some details of line breaking.
In olden times when wishing still helped one, there lived a king whose daughters were all beautiful;
and the youngest was so beautiful that the sun itself, which has seen so much, was astonished whenever it shone in her face.
Close by the king’s castle lay a great dark forest, and under an old lime-tree in the forest was a well,
and when the day was very warm, the king’s child went out into the forest and sat down by the side of the
cool fountain; and when she was bored she took a golden ball, and threw it up on high and caught it;
and this ball was her favorite plaything.
The first variant of the ad-hoc algorithm takes this text and produces the following left-aligned paragraph from it.
The text is quite frayed on the right-hand side because, for example, there is no more space for the word daughter
in the first line. Longer words are particularly annoying in such short lines of text because they cause premature line breaks.
With a small change to the previously described algorithm and a few small additions in the input text, the result of the line break can be significantly improved. These changes are the splitting of words at hyphens and the insertion of soft hyphen characters (\u00AD
) at positions where word splitting should be permitted.
In olden times when wish\u00ADing still helped one, there lived a king whose daugh\u00ADters were all beau\u00ADti\u00ADful;
and the young\u00ADest was so beau\u00ADti\u00ADful that the sun it\u00ADself, which has seen so much, was aston\u00ADished
when\u00ADever it shone in her face.
Close by the king’s castle lay a great dark for\u00ADest, and un\u00ADder an old lime-tree in the for\u00ADest was a well,
and when the day was very warm, the king’s child went out into the for\u00ADest and sat down by the side of the
cool foun\u00ADtain; and when she was bored she took a golden ball, and threw it up on high and caught it;
and this ball was her favor\u00ADite play\u00ADthing.
Normally, a word splitting algorithm takes care of inserting word splits into the input text. Word hyphenation is a tricky problem because it is not only necessary to check which hyphenations are possible, but also which should be omitted. The slightly modified algorithm generates the following improved line break.
Compared to the first example, this text contains more blue boxes, as plaything
, for example, now consists of two word parts in the last line, between which an invisible soft hyphen is positioned. This soft hyphen is only displayed if it is the last character of the line. As can be seen here at the end of the first line.
The word lime-tree
marked in red is not separated in this example, but this could happen with a different line width. If you want to prevent this, you can use additional Unicode characters. While the algorithm takes the hyphen and the soft hyphen into account, it treats the non-breaking hyphen as a normal character. Something similar happens with the non-breaking space, which looks like a space but is not taken into account when separating words.
Before we go into the special features of the Knuth Plass algorithm, a brief digression on text alignment. You often want the text to be left-aligned, sometimes centered and very rarely right-aligned, but most often justified. The text alignment itself is irrelevant for line breaks. The only question is always where the difference between the actual width of a line and the specified width ends up.
The simplest variant is left-aligned alignment, where the difference comes at the end and there is nothing to do. With centered alignment, we start with an offset of half the difference and with right-aligned alignment, the entire difference is used as an offset. Only justification behaves slightly differently, with it the difference is distributed over the width of the spaces.
How does the Knuth Plass algorithm differ from this ad-hoc algorithm, which already achieves good results? Two things in particular are decisive here. The Knuth Plass algorithm does not use a fixed space, but something called glue. Similar to a space character, it creates a space between words, but this space can be shrinked and expand to a certain extent. This allows lines to be produced that the ad-hoc algorithm cannot consider at all. In addition, the Knuth-Plass algorithm searches through all possible solutions and evaluates them by calculating a badness for each solution.
The ad-hoc algorithm is a greedy algorithm because it works with a local optimization. This local optimization is the filling of the current line and ensures that the algorithm always finds the solution with the least lines. However, these lines are often very unevenly filled. The Knuth Plass algorithm optimizes the line brea to the minimum of the squares of the filling errors of all lines in the paragraph. This allows the algorithm to provide a more pleasant typeface and also copes with lines that are slightly too long.
Another interesting aspect of the Knuth Plass algorithm is its dynamic programming approach. A brute force approach needs O(2^n). However, since the Knuth Plass algorithm collect already calculated successful partial solutions, it reduces to O(n^2) in the worst case and approaches O(n) in the normal case.
Here is a direct comparison of the two algorithms. The Ad-Hoc algorithm on the left and the Knuth Plass algorithm on the right. Both solutions produce paragraphs with the same number of lines. However, the last line is shorter in the Knuth Plass algorithm. The two words could be pushed into the previous line by a more compact and pleasing line breaking in the paragraph. Although the algorithm was published in 1981, it is still up to date and still produces better line breaks in \TeX than many other systems with their algorithms.