# [tex-live] Hyphenation patterns, Unicode, XeTeX, and language.dat

Jonathan Kew jonathan_kew at sil.org
Thu Aug 17 14:51:35 CEST 2006

(Sorry, long message! See end for specific changes proposed in TL.)

I'd like to explore solutions for the problem of loading all the
various hyphenation patterns in TeX Live when running the XeTeX
engine (and using Unicode-compliant fonts). This relates primarily to
LaTeX, though the techniques here could be used by other formats
(Plain-based or other) too, and may also be helpful as other engines
move towards greater Unicode support.

First, what exactly are the issues? There are a couple of reasons why
a working "xelatex" format cannot be built using the existing
language.dat and pattern files found in TL today:

(1) Patterns are loaded according to a specific font encoding. This
is how TeX works: the hyphenation rules are applied to sequences of
font-specific character codes. In XeTeX, we focus on Unicode as the
current standard for character encoding, but the patterns found in TL
are designed for various 8-bit font encodings used in the traditional
TeX world. Therefore, for correct hyphenation of Unicode text, it
will be necessary to re-encode the patterns to Unicode character
codes (except in cases, such as English, where the 8-bit character
codes used already correspond to Unicode values).
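
To make the re-encoding concrete, here is a minimal sketch for use
under INITEX (the only place \patterns is allowed). The pattern itself
is invented for illustration and is not taken from any file in TL;
slot "A3 is where the T1 layout puts c-caron, and U+010D is the Unicode
value of the same letter. The sketch assumes ifxetex.sty can be found.

--------------------------------------
\input ifxetex.sty
\ifxetex
  % Unicode: c-caron is U+010D
  \lccode"010D="010D
  \patterns{.^^^^010de1}
\else
  % T1 font encoding: c-caron is slot "A3
  \lccode"A3="A3
  \patterns{.^^a3e1}
\fi
--------------------------------------

The rule is the same in both branches; only the character code that
stands for the accented letter differs.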

(2) Some of the pattern files are stored in pure 7-bit ASCII, using
escape sequences where it is necessary to represent non-ASCII
characters; but others are stored in 8-bit encodings such as TeX T1,
T2A, etc. Because XeTeX defaults to reading input text as UTF-8
Unicode, byte values >=128 in such files will be taken as part of
UTF-8 sequences, so special care is needed to interpret such files
correctly.
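
One possible way to handle such a file, sketched here under
assumptions, is to switch XeTeX's input encoding before the pattern
file is opened. \XeTeXdefaultencoding is a XeTeX primitive that applies
to files opened after it is set; the pattern file name below is
hypothetical. Note that reading as "bytes" only stops the UTF-8
decoding: the resulting codes 128-255 still have to be remapped to the
proper Unicode values, as described in (1).

--------------------------------------
\input ifxetex.sty
\ifxetex
  % treat each byte of subsequently opened files as one character code
  \XeTeXdefaultencoding "bytes"
\fi
\input xx8bithyph.tex % hypothetical 8-bit pattern file
\ifxetex
  % go back to UTF-8 for files opened after this point
  \XeTeXdefaultencoding "utf8"
\fi
--------------------------------------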

While a "global" clean-up/harmonization of pattern files, looking at
usage, etc., would be a Good Thing (IMHO), this would clearly be a
long-term project, involving interaction with numerous original
authors or maintainers (some of whom may be difficult to track down,
or have little current interest). I'd like to see this addressed, but
at this time I want to tackle the more immediate problem of making
things work in TeX Live, given the collection of pattern files we
have today.

My current plan, therefore, is to leave the actual pattern files
untouched, and provide "wrapper" files that can load them with
appropriate settings for XeTeX, setting the input text encoding and
remapping character codes to Unicode as needed.

As an example, consider the file "xu-sihyph.tex". (The "xu-" prefix
is intended to suggest XeTeX and Unicode, though as other Unicode
engines become available, this may be extended to support them.)
Details vary for other wrappers, depending on exactly how the pattern
file is written and what character coding it assumes, but the general
idea remains the same.

--------------------------------------
% xu-sihyph.tex
% Wrapper for XeTeX to read sihyph.tex
% Jonathan Kew, 2006-08-17

\begingroup

\input ifxetex.sty
\ifxetex
% Define the accent macro " to expand to the required Unicode characters
\catcode`\"=13
\def"#1{\ifx#1c^^^^010d\else \ifx#1s^^^^0161\else \ifx#1z^^^^017e
\else
\errmessage{Hyphenation pattern file corrupted!}%
\fi\fi\fi}
\catcode`\"=12 % reset catcode so we can read \lccode etc in sihyph.tex
%
\let\PATTERNS=\patterns
\def\patterns{% at the \patterns command in sihyph.tex...
\endgroup % end group to discard definitions from sihyph
\begingroup % and start our own (to match \endgroup in sihyph)
\lefthyphenmin=2 \righthyphenmin=3 % settings from sihyph.tex
\catcode`\"=13 % activate our definition of " from above
\PATTERNS % and then load the real patterns
}
\fi

\input sihyph.tex

\endgroup
\endinput
--------------------------------------

This allows the existing Slovenian patterns to be loaded in XeTeX and
applied to Unicode text. So when creating the xelatex format, we need
to use a version of language.dat that refers to "xu-sihyph.tex"
instead of the original "sihyph.tex", and similarly for many of the
other languages.
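
For illustration, the change in language.dat would be a one-word
substitution per affected language. The language name "slovene" is an
assumption about how the entry is labelled; only the two file names
come from the discussion above.

--------------------------------------
% existing entry:
%   slovene sihyph.tex
% entry pointing at the wrapper instead:
slovene xu-sihyph.tex
--------------------------------------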

However, I want to avoid actually maintaining a second copy of
language.dat for XeTeX (and figuring out where to put it, so that
each engine will load the right one); this seems like a recipe for
confusion, as well as complicating things for texconfig or other
tools. Users should be able to set a *single* collection of language
choices for LaTeX (or other formats), regardless of which TeX engine
they're using at a particular moment.

To allow this, the wrapper file uses ifxetex.sty (from
texmf-dist/tex/generic/ifxetex/) to check whether it is being
processed by XeTeX. If
so, it remaps characters to Unicode as needed, and discards unneeded
definitions from the pattern file; but if read by a standard TeX
engine, it will simply load the old pattern file without changing
anything.
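
The engine test itself can be quite small. The following is a sketch of
the kind of check such a package can perform, not a copy of
ifxetex.sty: it assumes that XeTeX can be recognized by its
\XeTeXrevision primitive, and the conditional name \ifunicodeengine is
hypothetical. It relies on \newif, so Plain or LaTeX macros must
already be loaded.

--------------------------------------
\newif\ifunicodeengine
% \csname makes an undefined name equal to \relax; the group keeps
% that side effect from leaking into the rest of the run
\begingroup\expandafter\expandafter\expandafter\endgroup
\expandafter\ifx\csname XeTeXrevision\endcsname\relax
  \unicodeenginefalse % primitive missing: some other engine
\else
  \unicodeenginetrue  % primitive present: running under XeTeX
\fi
--------------------------------------

A wrapper can then put its Unicode-specific setup between
\ifunicodeengine and \fi, leaving other engines to read the pattern
file unchanged.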

Therefore, it is valid for language.dat to refer to the "xu-" wrapper
file *in all cases*, and the patterns will be loaded in "legacy" mode
(for whatever font encodings they happen to support) by [pdf]tex
engines, and as Unicode by xetex.

** Proposal **

I have begun to write "xu-___.tex" wrapper files for the patterns
currently available in TL (most are trivial), to allow xetex to load
the existing (non-Unicode) files. I suggest that these wrappers go
into texmf/tex/generic/xu-hyphen (as a sibling directory to
texmf/tex/generic/hyphen).

Then we modify the "language.__.dat" files in
texmf/tex/generic/config to refer to the xu- wrapper files (in the
cases where one is
necessary), and the pre-built "language.dat" will change similarly.

The net result will be that standard 8-bit TeX will load exactly the
same patterns as it currently does (it'll just do some extra \input
operations during format creation, but this is insignificant), and
XeTeX will load the same set of patterns, but encoded for use with
Unicode text.

Before actually making changes to something as central as
language.dat, however, I'd like to hear any concerns or objections to
this proposed strategy, or alternative suggestions that could make
things simpler for us all.

Thanks in advance for any and all feedback!

-- JK
