Hyphenation in HTML - Part 1

Project has moved

For the most recent documentation and versions of Hyphenator please go to http://code.google.com/p/hyphenator/

Abstract

The first part of this article discusses the problem of word hyphenation in current web browsers. The second part presents a JavaScript script that performs automatic word hyphenation as a possible solution. The script is based on Liang’s thesis and uses hyphenation patterns from the LaTeX distribution.
This document has been hyphenated automatically by the script. Change the size of the window to see the results.

English is not my first language. Please send corrections to . Thank you.

Many thanks to Darren P. Meyer for revising this article.

In its first part, this article covers the problem of hyphenation in web browsers; in its second part, it presents automatic hyphenation as a possible solution. The solution is based on Liang’s approach, uses the German hyphenation patterns from the LaTeX distribution, and is implemented in JavaScript.

text-align:justify; vs. text-align:left;

Text in current browsers is set ragged-right (text-align:left;) by default. With the CSS property text-align:justify; it can be justified, but without hyphenation. This leads to ugly paragraphs with large gaps between words, particularly in narrow columns with long words.

Besides that, justified text on webpages is controversial. Ragged-right text, together with a short line length and generous line spacing, is said to bring better readability because it leads the eye. That is, the eye doesn’t get lost when jumping to a new line.

Generally the distance between text and eye is much greater when reading on screen than when reading a printed page. Furthermore, you rarely use your index finger for orientation while reading on screen. To enhance the readability of on-screen text, it is better to set it ragged-right. Nevertheless, it may be desirable for printed webpages to be justified with hyphenation.

Last but not least, hyphenation may also be useful in ragged-right text — not only to break long words, but also to “calm” the text.

Animation shows the effect of hyphenation

Automatic Hyphenation

Hyphenation is a big issue for professional applications in the print business. Refinements such as optical margin alignment aren’t very important in web design because of the much lower resolution of screens. But the web is far from that point anyway: there is not yet any hyphenation on webpages!

Automatic hyphenation is quite complex because it depends heavily on the particular language. English words are shorter on average than, for example, German words. Other languages behave completely differently; in Thai, there aren’t any spaces between words.

In most text processing applications, dictionary-based (lexical) algorithms are used: each word and its components are looked up in a list that records their hyphenation points. Words and components that can’t be found in the list have to be hyphenated manually. The disadvantages of this approach are the large amount of storage it needs and the fact that unknown words (often compounds) can’t be hyphenated.
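The lookup approach can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: the three-entry word list and the names DICTIONARY and hyphenateByLookup are hypothetical and not taken from any real application.

```javascript
// Hypothetical word list: each known word maps to its hyphenated form.
// A real application would ship a list with many thousands of entries.
const DICTIONARY = new Map([
  ["hyphenation", "hy-phen-ation"],
  ["computer", "com-put-er"],
  ["browser", "brows-er"],
]);

// Return the parts of a word according to the list.
// Unknown words come back as a single, unbroken part.
function hyphenateByLookup(word) {
  const entry = DICTIONARY.get(word.toLowerCase());
  if (entry === undefined) {
    return [word];
  }
  // Cut the original word at the same positions so its case is preserved.
  const parts = [];
  let pos = 0;
  for (const piece of entry.split("-")) {
    parts.push(word.slice(pos, pos + piece.length));
    pos += piece.length;
  }
  return parts;
}

console.log(hyphenateByLookup("Computer").join("-"));   // Com-put-er
console.log(hyphenateByLookup("webbrowser").join("-")); // webbrowser
```

The second call shows the weakness described above: the compound "webbrowser" is not in the list, so no break point is found even though one of its components is.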

Pattern-based algorithms take the opposite approach. They compute patterns from long lists of hyphenated words and later use these patterns to find hyphenation points. The best-known algorithm of this kind is Franklin Mark Liang’s (see Word Hy-phen-a-tion by Com-put-er), which was developed in 1983 and is used in applications such as LaTeX and OpenOffice. Depending on the length of the initial word list, pattern-based algorithms are able to find about 90% of all hyphenation points. However, they cannot tell apart words that are spelled the same but hyphenated differently (such as “recover” vs. “re-cover”); these cases have to be corrected manually.
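How such patterns are applied can be shown with a minimal JavaScript sketch. It assumes patterns in TeX notation, where digits between letters rate a possible break point: the highest digit at each position wins, and an odd result allows a break. The nine patterns below cover only the word "hyphenation". The function names and the minLeft/minRight defaults are assumptions made for this illustration. This is not the script presented in part 2.

```javascript
// A handful of patterns in TeX notation, enough for one example word.
// A real pattern file contains thousands of these per language.
const PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"];

// Split a pattern such as "hy3ph" into its letters ("hyph") and the
// values between them ([0, 0, 3, 0, 0]); missing digits count as 0.
function parsePattern(pattern) {
  let letters = "";
  const values = [0];
  for (const ch of pattern) {
    if (ch >= "0" && ch <= "9") {
      values[letters.length] = Number(ch);
    } else {
      letters += ch;
      values.push(0);
    }
  }
  return { letters, values };
}

const PARSED = PATTERNS.map(parsePattern);

// Return the parts of `word`; breaks are allowed where the winning value is odd.
// minLeft and minRight keep very short fragments from being split off.
function hyphenate(word, minLeft = 2, minRight = 2) {
  // Dots mark the word boundaries, as in TeX patterns.
  const dotted = "." + word.toLowerCase() + ".";
  // points[j] rates the position just before dotted[j].
  const points = new Array(dotted.length + 1).fill(0);

  for (const { letters, values } of PARSED) {
    let at = dotted.indexOf(letters);
    while (at !== -1) {
      values.forEach((value, k) => {
        points[at + k] = Math.max(points[at + k], value);
      });
      at = dotted.indexOf(letters, at + 1);
    }
  }

  const parts = [];
  let start = 0;
  for (let i = minLeft; i <= word.length - minRight; i++) {
    // The position before word[i] is the position before dotted[i + 1].
    if (points[i + 1] % 2 === 1) {
      parts.push(word.slice(start, i));
      start = i;
    }
  }
  parts.push(word.slice(start));
  return parts;
}

console.log(hyphenate("hyphenation").join("-")); // hy-phen-ation
```

For "hyphenation" the competing patterns resolve to the values 3 (before "p") and 5 (before "a"), both odd, while all other positions end up even or zero. A word for which no pattern matches is returned in one piece.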

Using hyphenation on webpages today

The HTML 4.01 Standard says:

In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur.

Those browsers that interpret soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.

In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;)

The Standard doesn’t dictate that browsers have to support the &shy; entity. Nevertheless, most current browsers (Internet Explorer since version 5, Safari 2, Opera since version 7.1, Firefox since version 3) support it. Still, it’s up to the author to put &shy; inside every word.

The disadvantage of using &shy; is that hyphenated words won’t be found by the browser’s search function.
I also assume that search engines don’t index hyphenated words.
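The standard quoted above says that soft hyphens should be ignored when searching. As a rough JavaScript sketch of what that means, the following compares text after removing the soft hyphen character (U+00AD). The function names are made up for this example; it does not describe how any browser implements its search.

```javascript
// The soft hyphen is the Unicode character U+00AD (written &shy; in HTML).
const SOFT_HYPHEN = "\u00AD";

// Remove every soft hyphen from a string.
function stripSoftHyphens(text) {
  return text.split(SOFT_HYPHEN).join("");
}

// Substring search that ignores soft hyphens on both sides.
function containsIgnoringSoftHyphens(haystack, needle) {
  return stripSoftHyphens(haystack).includes(stripSoftHyphens(needle));
}

const text = "auto\u00ADmatic hy\u00ADphen\u00ADation";
console.log(text.includes("hyphenation"));                     // false
console.log(containsIgnoringSoftHyphens(text, "hyphenation")); // true
```

The plain includes() call fails because the invisible soft hyphens are part of the string, which is exactly the effect a naive search shows on hyphenated pages.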

Hyphenation in tomorrow’s web

With CSS 3 (today, only CSS 2.1 is actually usable) it will be possible to control how text is hyphenated. But note that even CSS 3 doesn’t require browsers to support hyphenation.

Conclusion

Although hyphenation isn’t very important on webpages, it may lead to nicer layouts. Currently the author has to hyphenate manually by putting &shy; in the right places. There is no client-side automatic hyphenation.

And now?

Since version 3.0 even Firefox supports &shy;. Automated hyphenation is therefore becoming more and more interesting. Until browsers have their own hyphenation mechanism, hyphenation has to be done in one of three places: during the authoring of the text (by the editor), just before the page is delivered (by a script on the server), or in the user’s browser (by JavaScript).

Using JavaScript on the client side has two big advantages: hyphenation can be turned on and off by the user, and it isn’t seen by search bots.
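Turning hyphenation on and off can be sketched in JavaScript on plain strings. The example assumes a word-splitting function is available; here it is a hypothetical two-word stub called splitWord, standing in for a real hyphenator. In a page, the same functions would be applied to the content of text nodes.

```javascript
const SHY = "\u00AD"; // soft hyphen, invisible unless a line breaks there

// Hypothetical stand-in for a real hyphenator: it knows two words only.
const KNOWN = new Map([
  ["hyphenation", ["hy", "phen", "ation"]],
  ["browser", ["brows", "er"]],
]);
function splitWord(word) {
  return KNOWN.get(word) ?? [word];
}

// "On": insert a soft hyphen at every break point of every word.
function hyphenateText(text, split = splitWord) {
  return text.replace(/\p{L}+/gu, (word) => split(word).join(SHY));
}

// "Off": remove all soft hyphens again.
function dehyphenateText(text) {
  return text.split(SHY).join("");
}

const original = "Client-side hyphenation in the browser.";
const hyphenated = hyphenateText(original);
console.log(hyphenated === original);                  // false
console.log(dehyphenateText(hyphenated) === original); // true
```

Because the soft hyphens are added only in the user’s browser, the text delivered by the server (and read by search bots) stays unchanged. Note that the round trip is only lossless if the original text contained no soft hyphens of its own.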

The second part of this article shows a possible implementation in JavaScript.