MH & nmh: Email for Users & Programmers

May, 2006

International Character Support

MH 6.8 and later support "international" characters -- that is, non-English characters. This is distinct from the MIME support. It is enabled by the [LOCALE] configuration option (see the Section The -help Switches).

For C programmers, here's a typical change that [LOCALE] makes. This is from the file sbr/gans.c:

    #ifdef LOCALE
            i = (isalpha(i) && isupper(i)) ? tolower(i) : i;
    #else
            if (i >= 'A' && i <= 'Z')
                i += 'a' - 'A';
    #endif

Once you get your POSIX-compliant system set up correctly, MH and the programs it calls will behave more naturally. For example, when you use the vi editor command for "next word," the cursor won't stop in the middle of a word at a non-ASCII character. Programs like grep(1) and sort(1) should understand how to handle the characters in your language.

As with all POSIX internationalization, though, the character support in MH is system-dependent. Don't expect everything to work perfectly. The setup also varies from system to system; check your documentation or ask a local expert. On HP-UX, try the manual pages environ(5), setlocale(3), and hpnls(5). On SunOS, read locale(5) and setlocale(3).

For the most complete setup, you should know about the LANG environment variable. The full syntax for the value of LANG is:

    language[_territory[.codeset]][@modifier]

The brackets [] mark optional parts; don't include the brackets when you set the variable. An example setting of LANG is:

Language is the only parameter that is (almost) consistent across platforms; it is used to find the databases for all the locale categories. HP-UX uses the full syntax (the "modifier" might even be an HP-UX addition). SunOS puts in the language position what other systems put in the codeset. SCO always includes the territory as well.

The locale categories are set by the environment variables LC_COLLATE (string collation and sorting), LC_CTYPE (character classification and conversion, such as "is this character `printable'?"), LC_MONETARY (monetary formatting), LC_NUMERIC (for input and output of numbers), LC_TIME (time conversion), and LC_MESSAGES (messages to the user -- this isn't on all platforms).

For much of your email-related work, you may choose to set only LC_CTYPE. Apart from character handling, this leaves the behavior of most tools unchanged. Another advantage of not setting all categories is that incomplete implementations won't give warning messages when they don't support a particular setting.

If you're trying to choose a good value for those environment variables and no one else in your organization has already found good settings, look in the databases. The databases are usually located under /usr/lib/locale, /usr/share/lib/locale or (in the case of SunOS) /etc/locale.

As an example, the table below shows the settings that Kimmo Suominen (from Finland, working in New York) uses on different platforms:

Table: Sample LANG settings

    Platform   Setting             Comment

    SVR4       finnish             --
    HP-UX      american.iso88591   okay, finnish.iso88591
    SunOS      iso_8859_1          yes, note the underscores
    SCO        english_us.88591    has the territory