45. CONDITIONAL RANDOM FIELDS FOR WORD HYPHENATION

Department: Computer Science & Engineering
Faculty Advisor(s): Charles Elkan

Primary Student
Name: Nikolaos Trogkanis
Email: ntrogkan@ucsd.edu
Phone: 858-228-7110
Grad Year: 2011

Abstract
Finding allowable places in words to insert hyphens is an important practical problem. Given a word in a language such as English or Dutch, the task we investigate is predicting for each letter in the word whether or not it is permissible for the letter to be followed by a hyphen. This means that we tag each letter with either 1, for hyphen allowed following this letter, or 0, for hyphen not allowed after this letter. The goal in performing hyphenation is to predict a sequence of 0/1 values as a function of a sequence of input characters.

The first algorithms for the task were invented in the early 1960s. The algorithm that is used most often nowadays has been essentially unchanged for 25 years. This method is the TeX hyphenation algorithm of Knuth and Liang, which was considered a remarkable achievement in its day. We present now the first hyphenation method that is more accurate than the Knuth/Liang method.

Our method is an application of conditional random fields, a recent major advance in machine learning. We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than 0.1% for correctly allowed hyphens, and less than 0.01% for Dutch. Our experiments show that both the Knuth/Liang method and a leading current commercial alternative have error rates over six times higher for both languages.

« Back to Posters or Search Results