For the most recent documentation and versions of Hyphenator, please go to http://code.google.com/p/hyphenator/
The first part of this article discusses the problem of word hyphenation in current web browsers. The second part presents a JavaScript program that hyphenates words automatically as a possible solution. The script is based on Liang's thesis and uses hyphenation patterns from the LaTeX distribution.
This document has been hyphenated automatically by the script. Resize the window to see the results.
English is not my first language. Corrections are welcome. Thank you.
The first part of this article deals with the problems of hyphenation in web browsers; the second part presents automatic hyphenation as a possible solution. It is based on Liang's approach, uses the German hyphenation patterns from the LaTeX distribution, and is implemented in JavaScript.
By default, current browsers set text flush left with a ragged right margin. With the CSS declaration text-align:justify; text can be justified, but without hyphenation. This leads to an ugly appearance with large gaps between words, particularly in narrow columns with long words.
Besides, justified text on web pages is controversial. Ragged-right text combined with a short line length and generous line spacing is said to be more readable because it guides the eye, so that the eye doesn't get lost when jumping to a new line.
In general, the distance between text and eye is much greater when reading on screen than when reading print. Furthermore, you rarely use your index finger for orientation. To improve the readability of screen text, it is therefore better to set it ragged right. Nevertheless, it may be desirable for printed web pages to be justified with hyphenation.
Last but not least, hyphenation can also be useful in ragged-right text, not only to break long words but also to “calm” the text.

Hyphenation is a big issue for professional applications in the print business. Refinements such as optical margin alignment matter little in web design because screens have a much lower resolution. But we are far from such refinements: there is no hyphenation on web pages at all yet!
Automatic hyphenation is quite complex because it depends heavily on the particular language. English words are on average shorter than, for example, German words. Other languages behave completely differently: in Thai, there are no spaces between words.
Most word processing applications use lexical algorithms: each word is looked up in a word list to find its components and hyphenation points. Words and components that are not on the list have to be hyphenated manually. The disadvantages of this approach are the large amount of storage required and the fact that unknown words (often compounds) can't be hyphenated.
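The lexical approach can be sketched in a few lines of JavaScript. This is only an illustration: the three-entry word list and the function name hyphenateByLookup are made up for this example, and a real lexicon would be far larger. The sketch shows both the lookup itself and the weakness described above, namely that an unlisted compound comes back unhyphenated.

```javascript
// Hypothetical mini word list (an assumption for this sketch): each entry
// maps a lowercase word to the character positions where it may be broken.
const WORD_LIST = new Map([
  ["hyphenation", [2, 6, 7]], // hy-phen-a-tion
  ["computer", [3, 6]],       // com-put-er
  ["browser", [5]],           // brows-er
]);

// Look the word up and insert the mark at every listed position.
// Words that are not on the list are returned unchanged.
function hyphenateByLookup(word, mark = "-") {
  const breaks = WORD_LIST.get(word.toLowerCase());
  if (!breaks) return word;
  let result = "";
  let start = 0;
  for (const pos of breaks) {
    result += word.slice(start, pos) + mark;
    start = pos;
  }
  return result + word.slice(start);
}

console.log(hyphenateByLookup("Hyphenation")); // Hy-phen-a-tion
console.log(hyphenateByLookup("webbrowser"));  // webbrowser (compound not listed)
```

Storing break positions rather than syllable strings keeps the capitalization of the input word intact.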
Pattern-based algorithms take the opposite approach. They compute patterns from long lists of hyphenated words and later use these patterns to find hyphenation points. The best-known algorithm of this kind is Franklin Mark Liang's (see Word Hy-phen-a-tion by Com-put-er), which was developed in 1983 and is used in applications like LaTeX and OpenOffice. Depending on the length of the initial word list, pattern-based algorithms are able to find about 90% of all hyphenation points. However, they can confuse words that are spelled alike but hyphenated differently (such as “recover” vs. “re-cover”); such cases have to be corrected manually.
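How such patterns are applied can be sketched as follows. In Liang's scheme a pattern is a short letter sequence with digits between the letters; the word is wrapped in dots, every matching pattern contributes its digits, the highest digit at each position wins, and an odd result marks an allowed break. The nine patterns below are a small illustrative subset in the style of the TeX English pattern file, and all function names as well as the minLeft and minRight defaults are assumptions of this sketch, not the actual Hyphenator code.

```javascript
// Illustrative subset of patterns, just enough to cover "hyphenation".
const RAW_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"];

// Split "hy3ph" into its letters ("hyph") and one level per gap,
// including the gaps before the first and after the last letter.
function parsePattern(pattern) {
  const letters = [];
  const levels = [0];
  for (const ch of pattern) {
    if (ch >= "0" && ch <= "9") {
      levels[levels.length - 1] = Number(ch);
    } else {
      letters.push(ch);
      levels.push(0);
    }
  }
  return { text: letters.join(""), levels };
}

// Build a lookup table from letter sequence to gap levels.
function buildPatternMap(rawPatterns) {
  const map = new Map();
  for (const raw of rawPatterns) {
    const { text, levels } = parsePattern(raw);
    map.set(text, levels);
  }
  return map;
}

// Apply all matching patterns to the word and break at odd levels.
function hyphenateWithPatterns(word, patternMap, mark = "-", minLeft = 2, minRight = 2) {
  const padded = "." + word.toLowerCase() + ".";
  // scores[i] is the level of the gap in front of padded[i].
  const scores = new Array(padded.length + 1).fill(0);
  for (let start = 0; start < padded.length; start++) {
    for (let end = start + 1; end <= padded.length; end++) {
      const levels = patternMap.get(padded.slice(start, end));
      if (!levels) continue;
      for (let k = 0; k < levels.length; k++) {
        scores[start + k] = Math.max(scores[start + k], levels[k]);
      }
    }
  }
  let result = "";
  for (let j = 0; j < word.length; j++) {
    // Keep at least minLeft letters before and minRight letters after a break.
    const allowed = j >= minLeft && word.length - j >= minRight;
    if (allowed && scores[j + 1] % 2 === 1) result += mark;
    result += word[j];
  }
  return result;
}

const PATTERNS = buildPatternMap(RAW_PATTERNS);
console.log(hyphenateWithPatterns("hyphenation", PATTERNS)); // hy-phen-ation
```

With this subset the break in "a-tion" is suppressed: the even level 4 contributed by "hena4" outranks the odd level 1 from "1tio". This shows how even numbers act as inhibitors.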
The HTML 4.01 standard says:
In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur.
Those browsers that interpret soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.
In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;)
The standard doesn't require browsers to support the &shy; entity. Nevertheless, all current browsers (Internet Explorer since version 5, Safari 2, Opera since version 7.1) except Firefox 2.0 support it. (Firefox will support it in the upcoming version 3.0, which is currently in beta.)
Still, it's up to the author to put &shy; into every word.
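To give an idea of what this manual work amounts to, here is a minimal sketch that converts an author-friendly notation into soft hyphens. The "|" notation and both function names are assumptions made for this example only.

```javascript
// U+00AD is the soft hyphen character that the entity &shy; stands for.
const SOFT_HYPHEN = "\u00AD";

// Turn a notation such as "hy|phen|a|tion" into a string that carries
// invisible soft hyphens at the marked positions.
function markBreaks(annotated, marker = "|") {
  return annotated.split(marker).join(SOFT_HYPHEN);
}

// Same idea for HTML source text, using the character entity reference.
function markBreaksAsEntity(annotated, marker = "|") {
  return annotated.split(marker).join("&shy;");
}

console.log(markBreaksAsEntity("hy|phen|a|tion")); // hy&shy;phen&shy;a&shy;tion
```

The first variant suits text that is inserted into a page as characters, the second suits hand-written markup.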
The disadvantage of using &shy; is that hyphenated words won't be found by the browser's search function.
I also assume that search engines don't index hyphenated words.
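The search problem, and the behaviour the standard asks for ("the soft hyphen should always be ignored"), can be illustrated with a short sketch. The function names are hypothetical; the point is only that a naive substring search fails on text containing soft hyphens, while a search that strips them first succeeds.

```javascript
// Remove every soft hyphen (U+00AD) from a string.
function stripSoftHyphens(text) {
  return text.replace(/\u00AD/g, "");
}

// Substring search that ignores soft hyphens in both text and query.
function containsIgnoringSoftHyphens(text, query) {
  return stripSoftHyphens(text).includes(stripSoftHyphens(query));
}

const text = "auto\u00ADmatic hy\u00ADphen\u00ADation";
console.log(text.includes("hyphenation"));                     // false
console.log(containsIgnoringSoftHyphens(text, "hyphenation")); // true
```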
With CSS 3 (today only CSS 2.1 is really usable) it will be possible to control how text is hyphenated. But keep in mind that even CSS 3 doesn't require browsers to support hyphenation.
Although hyphenation isn't very important on web pages, it can lead to nicer layouts. Currently the author has to hyphenate manually by putting &shy; in the right places. There is no automatic client-side hyphenation.
With its upcoming version 3.0, even Firefox will support &shy;, so automatic hyphenation is becoming more and more interesting. Until browsers have their own hyphenation mechanism, hyphenation has to be done either while the text is written (by the editor), just before the page is delivered (by a script on the server), or in the user's browser (by JavaScript).
Using JavaScript on the client side has the big advantages that the user can turn hyphenation on and off and that search bots don't see it.
The second part of this article shows a possible implementation in JavaScript.