# HOWTO: Using UTF-8 on Gentoo (edited)

## gna

HOWTO: Using UTF-8 on Gentoo

Changes

2004-05-10 Added Samba and Converting Filenames sections, edited Setting the Locale, recommended xorg-x11, added link to forums discussion and openi18n testsuites, small cleanup.

Introduction

Every now and again there is a flurry of interest in the forums or the mailing lists from people who want to use UTF-8 on Gentoo. Despite this interest, and some information in the Gentoo Linux Localization Guide, there are still quite a few questions and not much progress in moving Gentoo to a fully UTF-8 based distribution. My hope is that this HOWTO might increase the understanding of UTF-8 and related issues in Gentooland. There is a great deal that I don't know about this topic, and I am hoping that this HOWTO might generate corrections and additional hints that will also improve my understanding of Linux UTF-8 issues.

Locales

The way to use UTF-8 is through locales. To do this you need glibc 2.2 or later, compiled with the nls USE flag set; it's set by default, so you have probably already done this. Locales are the way you specify that various aspects of your system should use local conventions. In this HOWTO we are only looking at the encoding aspect of locales.

If no locale is set then the system uses the default locale which is the C or POSIX locale.

Environment Variables

Locales are set up through environment variables. There is quite a range of locale-related environment variables, and they interact with each other somewhat. Most of them begin with LC_ and so are called the LC_* environment variables. The main ones to be concerned about are LANG, LC_CTYPE and LC_ALL. For a brief explanation of all of them see here, and for all the gory details see here.

LANG is the variable normally used to set the locale. The LC_* variables (except LC_ALL) are used to modify parts of that locale while leaving the rest alone. You might want to do this if you were, say, a German living in Japan who wanted a basically German locale but with dates in the Japanese format. You would do this by setting: 

```
export LANG=de_DE.UTF-8
export LC_TIME=ja_JP.UTF-8
```

 This technique is often used with the LC_CTYPE variable by people who want to be able to input text in another language, especially Chinese, Japanese or Korean.

The LC_ALL variable overrides the values of LANG and the other LC_* variables, so if you set it there is no point in setting the others. Don't set LC_ALL for all users: if a user then wants to mix different locales together they must first unset LC_ALL. Using LANG will cause less confusion.
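The precedence can be sketched in shell. This is only an illustration (the function name is made up), but the lookup order it encodes — LC_ALL first, then the category's own LC_* variable, then LANG, then the "C"/"POSIX" default — is the documented glibc behaviour:

```
#!/bin/sh
# Sketch of the lookup order for one locale category:
# LC_ALL wins, then the category's own LC_* variable,
# then LANG, then the POSIX default. Helper name is made up.
effective_locale() {
    category=$1                                   # e.g. LC_TIME
    value=$(eval "printf '%s' \"\${$category}\"") # value of that LC_* variable
    if [ -n "$LC_ALL" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$value" ]; then
        printf '%s\n' "$value"
    elif [ -n "$LANG" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'POSIX\n'
    fi
}
```

So with LANG=de_DE.UTF-8 and LC_TIME=ja_JP.UTF-8 set, `effective_locale LC_TIME` reports ja_JP.UTF-8, but setting LC_ALL makes every category report the LC_ALL value.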

The syntax for the locale environment variables is  

```
[language[_territory][.codeset][@modifier]]
```

 The brackets mean that that part of the name is optional. However, leaving off various parts can cause problems, so it is better to always include every part except the modifier, which is often not needed.
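The pieces of a locale name can be pulled apart with plain POSIX parameter expansion. A small sketch (the helper name is made up):

```
#!/bin/sh
# Split language[_territory][.codeset][@modifier] into its parts
# using POSIX parameter expansion. Helper name is made up.
explain_locale() {
    loc=$1
    modifier=
    codeset=
    territory=
    case $loc in *@*) modifier=${loc##*@}; loc=${loc%@*} ;; esac
    case $loc in *.*) codeset=${loc##*.}; loc=${loc%.*} ;; esac
    case $loc in *_*) territory=${loc##*_}; loc=${loc%_*} ;; esac
    printf 'language=%s territory=%s codeset=%s modifier=%s\n' \
        "$loc" "$territory" "$codeset" "$modifier"
}
```

For example, `explain_locale de_DE.UTF-8@euro` prints `language=de territory=DE codeset=UTF-8 modifier=euro`.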

When using more than one locale environment variable you must use the same encoding for every one. The shortened forms of locales have a default codeset, and it usually isn't UTF-8, which is one reason why you should always include the codeset part of the locale.

Commands

Two commands that are very helpful for setting up a UTF-8 locale are 

```
locale
```

 and 

```
localedef
```

 These commands are part of the glibc package and have no man pages. Some documentation can be found in glibc-2.3.2.tar.bz2/glibc-2.3.2/localedata/README, assuming you are using version 2.3.2 of glibc. 

You can list all the locales that glibc knows about using 

```
locale -a
```

 You can list your current locale settings using just 

```
locale
```

 The values of locale environment variables that have been explicitly set (e.g. in an export statement, if you are using bash) are listed without double quotes. Those whose value has been inherited from other locale environment variables are shown in double quotes. Thus if you set only LANG=en_AU.UTF-8 and LC_CTYPE=ja_JP.UTF-8 you will get:

```
gna $ locale
LANG=en_AU.UTF-8
LC_CTYPE=ja_JP.UTF-8
LC_NUMERIC="en_AU.UTF-8"
LC_TIME="en_AU.UTF-8"
LC_COLLATE="en_AU.UTF-8"
LC_MONETARY="en_AU.UTF-8"
LC_MESSAGES="en_AU.UTF-8"
LC_PAPER="en_AU.UTF-8"
LC_NAME="en_AU.UTF-8"
LC_ADDRESS="en_AU.UTF-8"
LC_TELEPHONE="en_AU.UTF-8"
LC_MEASUREMENT="en_AU.UTF-8"
LC_IDENTIFICATION="en_AU.UTF-8"
LC_ALL=
```

If the locale you would like to use doesn't already exist then you will need to make it. UTF-8 locales are not made by default, so you will probably need to do this. Say you want an Australian English UTF-8 locale; then you would use the following command:

```
localedef -i en_AU -f UTF-8 en_AU.UTF-8
```

 Documentation for localedef can be found in the Single Unix Specification here. The locales are stored in /usr/lib/locale/locale-archive.
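If you need several UTF-8 locales it can be handy to generate the localedef commands in a loop. A small sketch (the function name and locale list are examples; run the printed commands as root to actually build the locales):

```
#!/bin/sh
# Print one localedef invocation per locale named on the command
# line; pipe the output to sh as root to actually build them.
# Function name is made up.
print_localedef_cmds() {
    for loc in "$@"; do
        printf 'localedef -i %s -f UTF-8 %s.UTF-8\n' "$loc" "$loc"
    done
}
```

`print_localedef_cmds en_AU ja_JP | sh` (as root) would then build both locales.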

At this point you might be wondering why 

```
locale -a
```

 lists the encoding part of the locale in lower case without any hyphens, while these instructions always use UTF-8. The reason is that while glibc understands both forms of the name, many other programs don't; the most common example is X. So it is best to always use UTF-8 in preference to utf8.

Setting the Locale

Where to set the environment variables is the next problem. Ideally every user on the whole system should use the same encoding, i.e. UTF-8. However, using a locale other than the "C" or "POSIX" locale for root is asking for trouble on Gentoo: it can break emerges, as the semantics of scripts etc. can change. See bugs 8680, 38418 and 9988.

The Gentoo way to set environment variables is through /etc/env.d. The Desktop Configuration Guide suggests setting the locale there, but this doesn't support setting a different locale for root. Some suggest changing the locale in /etc/profile instead; it is possible to set the locale differently there for root and other users, as is done with the PATH environment variable. Otherwise you can set them in $HOME/.bashrc, or in $HOME/.xinitrc if you only want the locale set under X.
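One way to express the root-versus-users split in /etc/profile is a small helper like this. It is only a sketch: the function name is made up and en_AU.UTF-8 is an example; substitute your own locale.

```
#!/bin/sh
# /etc/profile sketch: keep root on the safe POSIX locale,
# give everyone else a UTF-8 locale. Names are examples.
locale_for_uid() {
    if [ "$1" -eq 0 ]; then
        echo "C"            # root: avoid breaking emerges
    else
        echo "en_AU.UTF-8"  # ordinary users
    fi
}
LANG=$(locale_for_uid "$(id -u)")
export LANG
```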

X

X has its own locale information, kept in the /usr/X11R6/lib/X11/locale directory. The localedef command only makes glibc locales; it doesn't make X locales. Fortunately XFree86 has quite a few UTF-8 locales already set up, but unfortunately they don't always work as well as they should. If you get a 

```
Gdk-WARNING **: locale not supported by Xlib
```

 error when running GTK/Gnome apps and a 

```
Qt: Locales not supported on X server
```

 error when running Qt/KDE apps, then you should dive into the /usr/X11R6/lib/X11/locale directory. Check that your desired locale is listed in locale.dir. The mapping between shortened names and full names for locales is in locale.alias. If those files are ok then something more fundamental is broken, and I suggest you check out the XFree86 bugzilla or the Xorg equivalent, depending on which one you are using.

If you are having difficulties getting your UTF-8 locale to operate correctly using XFree86 then I recommend switching to xorg-x11. Its support for UTF-8 locales seems to be much better. Gentoo is adopting xorg-x11 in preference to XFree86 but as of mid May 2004 xorg is not yet marked as stable. There are useful instructions on how to upgrade in the Documentation, Tips and Tricks forum.

Dual Booting Issues

If you are dual booting with Windows then you should read the kernel documentation in /usr/src/linux/Documentation/filesystems/ for your filesystems and kernel version to determine the correct options to put in /etc/fstab so that file names are properly converted into UTF-8. For NTFS filesystems you should use the nls=utf8 option; the utf8=&lt;bool&gt; option has been deprecated for quite a while, though some man pages haven't caught up with this yet. For vfat filesystems, on the other hand, you should use the utf8=yes option.
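The relevant /etc/fstab entries might then look like this. The device names, mount points and extra options are only examples; check the kernel documentation for your kernel version:

```
# NTFS partition: nls=utf8 converts filenames to UTF-8
/dev/hda1  /mnt/winxp  ntfs  ro,nls=utf8   0 0
# vfat partition: here the option is utf8=yes instead
/dev/hda5  /mnt/share  vfat  rw,utf8=yes   0 0
```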

Samba

Samba 2.2 doesn't support Unicode, so if you want filenames that use non-ASCII characters to look the same on Windows and on Gentoo with a UTF-8 locale, you need to upgrade to Samba 3. If your Windows clients are Unicode capable (Windows 2000 and XP; probably NT as well, though the Samba documentation isn't clear about it) then Samba 3 will convert the filenames to UTF-8 by default. If you have Windows 95, 98, Me or older clients then you need to set the "dos charset" option in /etc/samba/smb.conf to the appropriate codepage. See Chapter 26 of the Samba-3 HOWTO for more details. If you are upgrading from Samba 2 to Samba 3 you will need to convert the filenames in your existing shares to UTF-8; they will be using the codepage of your Windows clients. If you are using English Windows 2000 or XP then your default codepage is probably 1252.

Converting Filenames

If you are sure all your filenames use only ASCII characters (codepoints < 128) then you don't need to bother as they will already be valid UTF-8. The simplest way to convert the encodings of filenames is to use the convmv utility written by Bjoern Jacke. This utility is in portage and will convert the encodings of a whole directory tree. So if your samba share is in the /samba directory you can do the following:

```
$ ACCEPT_KEYWORDS='~x86' emerge convmv
$ convmv -r -f cp1252 -t utf8 /samba
```

This will run a test and print what it is going to do. To actually do the conversion you need to add the --notest option. There is a man page which is quite helpful.

Resources

There are a number of useful resources about UTF-8 on linux. This pdf file gives a very good introduction. 

Markus Kuhn's UTF-8 for Linux FAQ is well known and helpful but is more oriented towards developers than users. 

Mike Fabian's CJK Support for SuSE is wide ranging, frequently updated and much of what is discussed is generally applicable to UTF-8 on any linux distribution. 

The xfree86 i18n mailing list is very helpful as is the linux-utf8 mailing list. 

There is a lot of detailed information in the Single Unix Specification.

One should not forget OpenI18N.org, which was formerly LinuxI18n. They are setting standards for open source internationalization, and they are also developing i18n test suites for Linux and Unix.

Freedesktop.org have a UTF-8 promotion section.

----------

## Biggles

A very well written howto. 

I'm having trouble creating the locale for unicode though. I create it as root, since when I tried to create it as a normal user it complained about not having access. I don't think it worked quite right though. When I use the locale as root, im-ja-xim-server works fine, but when I use it as my normal user I get this message:

```
im-ja-xim-server
*** WARNING: Your locale (LC_CTYPE=en_NZ.UTF-8;LC_NUMERIC=C;LC_TIME=en_NZ.UTF-8;LC_COLLATE=en_NZ.UTF-8;
LC_MONETARY=en_NZ.UTF-8;LC_MESSAGES=en_NZ.UTF-8;LC_PAPER=en_NZ.UTF-8;
LC_NAME=en_NZ.UTF-8;LC_ADDRESS=en_NZ.UTF-8;LC_TELEPHONE=en_NZ.UTF-8;
LC_MEASUREMENT=en_NZ.UTF-8;LC_IDENTIFICATION=en_NZ.UTF-8) is non-UTF8, nor Japanese!
```

Any ideas? It makes me wonder if the locale got created properly when I did "localedef -i en_NZ -f UTF-8 en_NZ.UTF-8"

Also, are there any Unicode fonts that need to be installed to get proper support for non-ASCII characters?

----------

## gna

 *Quote:*   

> A very well written howto. 

 

Thanks

 *Quote:*   

> I'm having trouble creating the locale for unicode though. I create it as root, since when I tried to create it as a normal user it complained about not having access.

 

Sorry I forgot to mention that you must run localedef as root.

If it works as root but not as a normal user then that sounds like a permissions problem to me. Check the permissions in /usr/lib/locale/locale-archive and /usr/X11R6/lib/X11/locale and subdirectories. 

Also check the X server log file for error messages: /var/log/XFree86.0.log or /var/log/Xorg.0.log.

It's possible that you might need to create the ja_JP.UTF-8 locale.

If none of this works setting LC_CTYPE=ja_JP.UTF-8 might help.

Silly question I know, but you do have XMODIFIERS set correctly, don't you? XMODIFIERS and the locale environment variables need to be set before X starts, not in an xterm window: $HOME/.xinitrc is ok, as are the other places mentioned in the HOWTO.

You will need Japanese fonts. Note that modern fonts can work with multiple encodings. As you are interested in Japanese I particularly recommend Mike Fabian's CJK Support for SuSE document (link in the resources section of the HOWTO). It is more oriented towards Japanese than Chinese and Korean, and it has quite a bit of information about font issues.

----------

## Biggles

All the permissions were set to allow everyone to read for all those files, and there was nothing in the log. I tried creating the ja_JP.UTF-8 locale and setting C_TYPE, and that seemed to fix the problem. Now instead of complaining about locales, im-ja-xim-server freezes X instead. I recall hearing that it needs to be started before X or something, which is probably what's wrong (although putting it in my .xinitrc didn't help), but that's a problem for a different topic.  :Smile: 

----------

## ecatmur

Can I just say, thanks for this. I managed to work it out on my own a few months ago, but this would have saved me loads of time - and if more people posted on these forums in UTF-8, then we'd be able to read pages in mixed scripts without seeing (?) characters all over the place...

One thing you might add: when serving UTF-8 encoded pages from apache, you need to put "AddDefaultEncoding UTF-8" in the relevant config files (/etc/apache2/commonhttpd.conf etc.)

----------

## gna

 *Quote:*   

> One thing you might add: when serving UTF-8 encoded pages from apache, you need to put "AddDefaultEncoding UTF-8" in the relevant config files (/etc/apache2/commonhttpd.conf etc.)

 

I think you must mean "AddDefaultCharset utf-8", which is apparently in Apache 1.3.12 and later. I found one person who claimed this directive was bad because it overrode the charset in the &lt;meta&gt; element.

Do you use php? I came across this, which has some information about php settings for UTF-8.

----------

## AliceDiee

Thank you for this HowTo!!

Everything is working just fine except for those applications compiled against gtk 1.2.x (sylpheed-claws and mplayer in my case).

German "Umlaute" aren't displayed correctly (only in the menus, the rest is ok) until I start them with

```
LC_ALL="de_DE@euro" programname
```

so I don't think it's a problem of the used font, is it?

----------

## gna

When you do this  *Quote:*   

> 
> 
> ```
> LC_ALL="de_DE@euro" programname
> ```
> ...

  you are changing to the default German encoding, which is iso8859-15.

According to the sylpheed-claws FAQ, they have specified a default font with an iso8859-1 encoding, so it isn't surprising that it doesn't work when you change to a UTF-8 locale. I think iso8859-1 is the same as iso8859-15 except for the euro symbol, which is why the umlauts start working again when you switch to an iso8859-15 encoding.

I haven't figured out fonts very well yet, but I think there are still quite a few older-style fonts in X that are tied to only one encoding. It's only the newer formats, e.g. TrueType, that can handle more than one encoding.

----------

## AliceDiee

Yeah, I already noticed that, but those font definitions don't affect the menus! They only change the appearance of the message tree, the overview and the messages themselves.

Anyway it's running now. I had to create a .gtkrc in my home directory with the following content

```
style "gtk-default" {
  fontset = "-*-helvetica-medium-r-normal-*-12-*-*-*-*-*-iso10646-1"
}
class "GtkWidget" style "gtk-default"
```

Choose any other font you like that supports iso10646-1 (maybe with xfontsel)

----------

## revertex

More info here:

http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html

with various example files to test

----------

## gna

 *Quote:*   

> I had to create a .gtkrc in my homedirectory with the following content

 

This is great! After googling for a bit it seems to me that this is definitely the way to go. I have found that Mandrake, SuSE and Fedora all use a similar solution. There are two differences though. They put it in the file /etc/gtk/gtkrc.utf8, which fixes the problem for all users and doesn't cause any problems if a user isn't using a UTF-8 locale. The second difference is that they all specify different fonts. 

My feeling is that the best thing to do is to copy the /etc/gtk/gtkrc.iso-8859-15 file to /etc/gtk/gtkrc.utf8 and then edit it, replacing every occurrence of 8859-15 in the file with 10646-1. This means that everything on your system will look the same as before but be compatible with UTF-8.
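That edit is a one-liner with sed. A sketch, wrapped in a function that reads from stdin (the function name is made up):

```
#!/bin/sh
# Rewrite every 8859-15 font encoding to the Unicode 10646-1 one.
# Function name is made up.
to_utf8_gtkrc() {
    sed 's/8859-15/10646-1/g'
}
```

Then, as root: `to_utf8_gtkrc < /etc/gtk/gtkrc.iso-8859-15 > /etc/gtk/gtkrc.utf8`.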

Your solution looks like it would cause a problem if you went back to an iso-8859-15 locale.

----------

## AliceDiee

 *Quote:*   

> They put it in the file /etc/gtk/gtkrc.utf8 That fixes the problem for all users and doesn't cause any problems if a user isn't using a UTF-8 locale.

 

So far so good, but since I'm using gtk themes it isn't working for me. It seems that this file gets overridden when you customize your look and feel!

----------

## agnitio

Hmm, I don't know if I've missed some basic thing, but I followed this guide and now I can't use non-English characters in terminals; it just results in strange-looking characters (as UTF-8 does when displayed on a non-UTF-8 system). In X everything seems to work fine so far though.

Do I have to set the consolefont to something special or what am I doing wrong?

----------

## AliceDiee

Try this in your /etc/rc.conf

```
CONSOLEFONT="lat9u-16"
```

----------

## agnitio

 *AliceDiee wrote:*   

> Try this in your /etc/rc.conf
> 
> ```
> CONSOLEFONT="lat9u-16"
> ```
> ...

 

Ah, thanks a lot! That did the trick. Now the only thing left is getting it to work in my terminal windows. I googled around a bit and found that aterm and eterm do not support UTF-8; is this correct? Anyway, I tried rxvt: I've tried setting it to fonts that should be supported and I've tried setting the encoding to utf8 (which started rendering Chinese characters), and I can't seem to get it to work right. xterm seems to have it working right away, though on startup I get this message:

```
xterm: Can't execvp "/usr/X11R6"/bin/luit: Filen eller katalogen finns inte
xterm: cannot support your locale.
```

I suppose that path is a bit weird because "luit" is right there in /usr/X11R6/bin/luit.

Thanks for a very nice guide!

----------

## AliceDiee

Maybe you want to give mlterm (multi-language terminal) a chance, it's in portage and supports utf8 and transparency   :Wink:

Last edited by AliceDiee on Fri May 21, 2004 11:45 pm; edited 1 time in total

----------

## rounin

Thanks for mentioning convmv.

Using localedef to convert locales to UTF-8 is pretty easy when you know it, but wouldn't it be better if they were generated by default?

----------

## TecHunter

I just set my locale to UTF-8, but when I start beep-media-player it gets a SIGSEGV... does beep-media-player not have UTF-8 support?

----------

## rounin

I've used Beep Media Player in a UTF-8 locale, so that should work. (Beep does have a tendency to crash, though.)

Are you sure you have the UTF-8 locale that you're trying to use?

----------

## vdboor

This is very cool!! Suddenly a lot of Gnome applications appear to have translated messages..  :Smile: 

Just a question, where could I set this environment variable for a user if he/she logs into KDE with KDM?

I've set the language for KDE-Applications to Dutch in the control panel, but this doesn't change the language of console applications (and GTK apps). ...how can I solve this?

----------

## rounin

export LANG=xx_XX

export LC_ALL=xx_XX 

Put it in the users' .bashrc I guess

----------

## vdboor

The .bashrc won't be read when KDE starts (only when you start a terminal window with bash).

Currently I've hacked a little in the "startkde" script, but it's an ugly hack:

```
if grep -q "Language=nl" ~/.kde/share/config/kdeglobals; then
    export LANG="nl_NL"
fi
```

at least it works :p

----------

## TecHunter

 *rounin wrote:*   

> I've used beemp media player in an UTF-8 locale, so that should work. (Beep does have a tendency to crash, though.)
> 
> Are you sure you have the UTF-8 locale that you're trying to use?

 

this is my locale -a | grep UTF-8 output:

```
en_US.UTF-8
zh_CN.UTF-8
```

I'm going to use zh_CN.UTF-8,

but zh_CN.UTF-8 isn't in the localedef --list-archive output...

----------

## ecatmur

 *AliceDiee wrote:*   

> Try this in your /etc/rc.conf
> 
> ```
> CONSOLEFONT="lat9u-16"
> ```
> ...

 

Hmm, I have to use unicode_start to get the UTF-8 characters to work...

----------

## skyfolly

I wish there were an ebuild for it.

----------

## gna

I think all the packages mentioned in the howto have ebuilds. Can you be a bit more precise about what kind of ebuild?

----------

## skyfolly

 *gna wrote:*   

> I think all the packages mentioned in the howto have ebuilds. Can you be a bit more precise about what kind of ebuild?

 

like Chinese UTF-8 ones. The Chinese UTF-8 locale can not be found in 

```
locale
```

either. Sorry, I am a bloody old newbie and don't know much about it; there is one article on transferring to Chinese UTF-8, but that guy seemed to fail at it too. I am fed up with GB-2312 and Big5.

People have to convert their fonts and locale to use UTF-8, and fonts never display correctly.

----------

## Gatak

I have one problem with UTF-8: I cannot mount a Windows XP share with UTF-8. All extended characters come out very wrong, or are simply missing.

But if I mount a Samba share from WindowsXP, UTF-8 works.

I tried with mount -o iocharset=utf8 with no luck.

EDIT: It works now with:

```
mount -t smbfs  -o iocharset=utf8,codepage=cp850
```

----------

## gna

Actually you are still using samba to mount your Windows XP partition. That is what the 

```
-t smbfs
```

 means. Without samba it should be 

```
-t ntfs
```

 or 

```
-t vfat
```

 depending on whether you are using an ntfs or a fat32 partition for windows.

Also I think you need to have the appropriate code page modules compiled as modules or built into your kernel for mount to be able to use iocharset correctly. See File Systems -> Native Language Support

----------

## Gatak

I think you misunderstood what I wanted to do. I am not mounting a partition, but a Windows share over the network.

```
mount -t smbfs //windowsmachine/share /mnt/win -o username=blah,password=blah,iocharset=utf8,codepage=cp850
```

What is odd is that the codepage option is needed at all. The purpose of Unicode is to provide a single universal character set, so no codepage translations should ever be necessary between applications and systems.

----------

## gna

I have tried this on a Win2k share and am also having similar problems. 

Why did you choose cp850? 

Is cp850 the default codepage on your windows XP?

What is the default nls in your kernel?

thanks

----------

## Gatak

The codepage should be irrelevant when using UTF-8 (Unicode). That is the whole point of Unicode.

My default NLS in the kernel is UTF-8.

Windows XP and Windows 2000 use Unicode for SMB shares, not single-byte codepages. This is why it is so strange that Samba required me to choose one.

cp850 is a "western latin-1" codepage, which is why I tested it. Windows 2000/XP uses codepages for non-Unicode applications only.

Normally a character is stored as 8 bits, which makes 256 different characters possible. Naturally, 256 characters aren't enough to describe all languages and all systems, so codepages were developed so that applications would know what character a specific byte stood for.

If two users were to talk to each other over the net, their systems would need to use the same codepage or characters would end up wrong.

Unicode was developed to remedy this. Unicode is large enough to describe most (all?) languages in the world, so the need for other codepages is removed. The biggest remaining problem is to have full Unicode fonts; the fullest one I know of is Arial Unicode MS, with about 55,000 characters defined.
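The difference between a single-byte codepage and UTF-8 is easy to see from a shell: ASCII characters still take one byte in UTF-8, while the euro sign takes three (written below as its UTF-8 byte sequence in octal, so the example itself stays plain ASCII):

```
#!/bin/sh
# Count the bytes UTF-8 uses for two characters.
printf 'a' | wc -c              # ASCII 'a': 1 byte
printf '\342\202\254' | wc -c   # U+20AC euro sign: 3 bytes
```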

----------

## gna

I agree that it should not be necessary to specify a codepage and, preferably, no iocharset either. That seems to be the way it is intended to work, so the fact that it doesn't is either a bug or a configuration error.

Two more suggestions:

In the kernel configuration check

File Systems -> Network File Systems -> SMB File System support -> Use a default NLS -> utf8

It seems you can specify two default NLS's in the kernel, one for smbfs and one for other stuff.

Also try using the cifs filesystem: just replace smbfs with cifs in your mount command (assuming it is configured in the kernel). cifs doesn't have a codepage option and is supposed to have better international support than smbfs; it is now recommended over smbfs for all but old SMB systems. Documentation is in /usr/src/linux/fs/cifs/README.

If you can't get it to work then it might be good to ask a question on the linux cifs mailing list and/or file a bug report.

----------

## Leo Lausren

 *ecatmur wrote:*   

> Hmm, I have to use unicode_start to get the UTF-8 characters to work...

 

I made a script that echoes the \E%G to the terminals at boot, called /etc/init.d/unicode. It probably needs some work to be of general use.

```
#!/sbin/runscript

conf=/etc/env.d/02locale

# Using devfs?
if [ -e /dev/.devfsd ] || [ -e /dev/.udev -a -d /dev/vc ]; then
    device=/dev/vc/
else
    device=/dev/tty
fi

depend() {
    need localmount
    after keymaps
    before consolefont
}

checkconfig() {
    if [ -r ${conf} ]; then
        . ${conf}
        encoding=
        [ -n "${LC_ALL}" ]      && encoding=${LC_ALL#*.}      && return 0
        [ -n "${LC_MESSAGES}" ] && encoding=${LC_MESSAGES#*.} && return 0
        [ -n "${LANG}" ]        && encoding=${LANG#*.}        && return 0
    fi
    eend 1 "Locale is not configured. Please fix ${conf}"
    return 1
}

start() {
    ebegin "Setting consoles to UTF-8"
    checkconfig || return 1
    if [ "${encoding}" = "UTF-8" -o "${encoding}" = "utf-8" ]; then
        dumpkeys | loadkeys --unicode
        for ((i = 1; i <= "${RC_TTY_NUMBER}"; i++)); do
            echo -ne "\033%G" > ${device}${i}
        done
        eend 0
    else
        eend 1 "UTF-8 is not required"
    fi
}
```

----------

## max4ever

Umm, so if I did this 

```
linuxoid max # cat /etc/env.d/99locale
LANG=it_IT.utf8
LC_CTYPE=it_IT.utf8
```

 does this mean that I can now see any character from any language anywhere in Linux, as long as the terminal or the software supports UTF-8? I'm having problems getting my Linux to show Romanian-specific letters in KDE and mplayer...

----------

## Gatak

Only if the application has a font which includes these characters, and only if the application supports UTF-8.

----------

## max4ever

Hmm, and how can I find out if a font has "support" for those characters? For example I'm having problems with mplayer showing subtitles correctly. Can you suggest some font with UTF-8 support and antialiasing?

----------

## Gatak

You can try to load the font in a character map program. I think there is one in Gnome. It allows you to see which characters exist in the font. Then you have to use that font in mplayer.

But remember, the subtitles that you load in mplayer may not be encoded in UTF-8 but in some other local encoding, and mplayer would need to support that one.

----------

## andrewski

It'd be great if you could post a bit on the various fonts that are necessary to actually "see" UTF-8, i.e. the console font and *term fonts. In all my searching, I haven't been able to figure that one out!

Also, where does CONSOLETRANSLATION from /etc/rc.conf come in?  Perhaps that's necessary to seal the deal, as it were?

Thanks for a nice howto.

----------

## obmun

@andrewski:

Forget about UTF-8 in the console. It won't work completely (compose chars won't work). For more info take a look at this post, where I have some info about console fonts. Essentially you have to use a console font with a Unicode map. It's also good to have a font that makes use of the full 512 available glyphs (and not just one with 256).

CONSOLETRANSLATION tells setfont which translation map to use to translate program output from an 8-bit encoding to the UTF-8 the kernel expects (the kernel always expects to receive Unicode characters) when you're not using UTF-8. If apps are already sending UTF-8, the translation map is not needed, so CONSOLETRANSLATION should be commented out if you're using UTF-8 as your default encoding.

----------

## talon

My major problem in porting my machine to UTF-8 was that all gtk-1 apps didn't display chars correctly. After a long time of experimenting I figured out how to do it right. You have to add the following lines to your ~/.gtkrc.mine:

```
style "gtk-default" {
fontset = "-*-luxi sans-medium-r-normal--10-*-*-*-p-*-iso10646-1,\
-*-luxi sans-medium-r-normal--10-*-*-*-p-*-iso10646-1,\
-*-r-*-iso10646-1,*"
}
class "GtkWidget" style "gtk-default"
```

Replace "luxi sans" with your favorite font and "10" with your preferred size. Even when you work with themes they won't overwrite this file   :Very Happy:  .

----------

## Haqqax

Can anyone shed some light on how to force (or whether it can be done at all) KDE apps to work with Unicode Plane1 characters?

I have been testing a little over the last two days. I managed to create a font with just a few characters encoded in Plane 1 (they start at 0x12000; I am trying to make my Linux support Akkadian cuneiform), installed it, and created a text file and an HTML file with Perl for tests. The HTML has both plain text chars and character entity references.

The only applications that process and display these files correctly are Firefox (it does display cuneiform texts  :Smile:  ) and Thunderbird (I sent a cuneiform e-mail to myself, and when it arrived it was displayed correctly  :Smile:  ). All the other applications, including but not limited to OpenOffice, Konqueror and standard KDE apps, do not parse UTF-8 from Plane 1 correctly (they split one code into 2 chars) and of course do not display the text correctly. I am particularly disappointed by OpenOffice in this matter.

Can my KDE be cured? Does my success with Firefox and Thunderbird mean, that other GTK editors may work equally well?

----------

## gna

Actually this topic is of interest to me too. I know that a lot of applications ignore the supplementary planes. There is a UTF-8 project at freedesktop.org that is trying to make a list of non-Unicode-compliant software; in particular they have a list of Unicode software that doesn't work with the supplementary planes. Unfortunately this list is very short, but if you do find out something please report it here and let us all know. 

What software did you use to make your font? It would be helpful to know so that more people know how to do testing.

thanks

----------

## Haqqax

 *Quote:*   

> What software did you use to make your font?

 

I used FontForge.

I was really surprised (in a positive sense) by this program. I like it very much.

I was not able to successfully set up the encoding for my font from within the user interface, so I just opened the SFD file with Vim and updated the encoding manually:

```
Encoding: unicode4
UnicodeInterp: none
```

It is a new program to me; maybe some other settings are also important. I noticed (by trial and error  :Smile:  ) that if you make a mistake in "Encoding", FontForge will change it to "Custom"

I am  still reading about the file format.

 *Quote:*   

> list of unicode software that doesn't work for the supplementary planes

 

They only list Vim and Emacs? I would say Vim does a better job than the KDE editors and OpenOffice. I wonder whether it would work if I had a proper console font. I can at least see that Vim knows how many characters I have - it displays question marks instead of them, and it has no other choice because I only have a TrueType font for my encoding. OpenOffice 1.1.2 did not get that far. I am upgrading to 1.1.3 today.

I do not have Emacs to test. I think one might try to use Thunderbird's editor to edit these texts (sooner or later other editors will support Plane 1 too); I will investigate this if I have some time  :Smile:  The other solution may be to build a console font and check whether medit can be used for editing. Building an IME for medit is extremely easy. I think this approach would be successful, but it does not meet my goal.

I would like to use cuneiform just like I use Chinese - not to have to do a magic dance with special macros, hack around too much with fonts, or use specialized editors. I want to open all the files in the editors I use for everyday work and input them with the IMEs I normally use.

----------

## numerodix

Ok, so I finally succeeded in getting this to work, my /etc/env.d/02locale now looks like this:

```
LC_CTYPE="no_NO.utf8"
LANG="en_US.utf8"
```

After restarting X (you may want to mention that without restarting it just won't work) I was relieved to find that both Qt and GTK now recognize the character set; filenames display correctly in Konqueror etc. It looks like the apps I use in X are working fine in this respect.
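A quick way to confirm that a given shell actually picked up the new locale (a sketch; the C-locale charmap name in the comment is what glibc reports):

```shell
# in a freshly started terminal after the change:
locale charmap           # should now print UTF-8
# for comparison, force the POSIX default in a subshell:
LC_ALL=C locale charmap  # glibc reports ANSI_X3.4-1968 (plain 7-bit ASCII)
```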

What is still missing is Unicode support in the console, that is, outside of X. I'm not exactly sure what it takes to get filenames to display correctly; sometimes I have to run unicode_start, sometimes it seems to work without it. But input is still not working, that is, the keys æøå. My /etc/rc.conf looks like this:

```
KEYMAP="no-latin1"
CONSOLEFONT="lat0-16"
CONSOLETRANSLATION="8859-1_to_uni"
```

While I use X 98% of the time, it's a little problematic to have this bug whenever anything has to be done from the shell. Any ideas?

[edit] The euro symbol is not working either; whatever I've done, I've never been able to activate it.

----------

## Haqqax

 *gna wrote:*   

> But if you do find out something please report here and let us all know. 

 

Well, I did some additional tests and the results are very good.

I made a test IME for my Akkadian font in SCIM and IT WORKS. I can write Akkadian just like Chinese!

This could be usable in academic projects. If I send you a TTF font and you install it, I gain the ability to send you emails in Akkadian. Thunderbird will display them for you, you can save text files correctly, etc. And with SCIM, you can also write Akkadian back to me. If only there were a word-processing application, it would be so easy to write books, prepare tests for students, etc.

As I said, Firefox works well with Plane 1 (only deleting is a little broken - you have to backspace each character twice, as sometimes happened with Chinese on English systems in the old days, i.e. not all the bytes of the character are deleted at once). So if PostgreSQL is Plane 1 ready (I have not checked yet), we might start to collaborate on some Akkadian data (dictionary, book, text repository - and not only Akkadian) already encoded in the future standard (Unicode has not yet accepted the Sumero-Akkadian cuneiform encoding), just like we can with English. We have everything in place. Even if the encoding finally changes, it would be a matter of minutes to write a script to fix the existing texts. I think I could build such a collaboration platform to a usable state in a week - if someone would donate glyphs for the cuneiform font (I think a starting point might be the fonts created for TeX by Mr Piska, or one might buy fonts from Michael Everson  :Smile:  ).

Well, the only problem now seems to be the lagging language support in Qt and KDE. I am extremely frustrated by this. Can someone write up how non-BMP encodings are supported in GNOME applications?

PS: OpenOffice 1.1.3 is no better than 1.1.2 with support for Plane1 characters.

----------

## Gatak

I think most GNOME applications support Unicode, at least if compiled with accessibility support. In GEdit, for example, I can view all sorts of Unicode characters. I suppose I still need TrueType or OpenType fonts installed that support Unicode.

----------

## Haqqax

 *Gatak wrote:*   

> I think most GNOME applications support Unicode, at least if compiled with accessibility support. In GEdit, for example, I can view all sorts of Unicode characters. I suppose I still need TrueType or OpenType fonts installed that support Unicode.

 

To be clear - there is no problem with the BMP in KDE (Chinese, IPA, Arabic without vowels), so Unicode is supported. I am interested in support for code points beyond 0xFFFF.

----------

## Haqqax

I've got one more question: are Hebrew niqqud and Arabic vowels displayed correctly on your Gentoo boxes? On my box they are displayed, but they are not positioned correctly on their base characters.

And, of course, the Arabic ligatures are broken by the vowels.

Is it working for anyone?

----------

## obmun

@numerodix:

Console and UTF-8? A bad mixture. Take a look at this post, where I analyze the problem. Conclusion? It's a kernel problem.

----------

## numerodix

 *obmun wrote:*   

> @numerodix:
> 
> Console and UTF-8? A bad mixture. Take a look at this post, where I analyze the problem. Conclusion? It's a kernel problem.

 

Yes, thanks, I actually saw that one a little while ago. I can confirm what you said about jagged input; that's the only malfunction I have now.

But on to something else: I've set my locale according to this thread and everything seems to work quite well. One question in relation to KMail: I get an email where the specific Norwegian characters are displayed as boxes. Then I click reply and get a compose window, and now they show up fine. What's the deal? The email header follows.

```
MIME-Version: 1.0
Content-Type: multipart/alternative;
  boundary="----=_NextPart_000_0005_01C4AF97.EF828F50"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.2180
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2180

This is a multi-part message in MIME format.

------=_NextPart_000_0005_01C4AF97.EF828F50
Content-Type: text/plain;
 charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<snip>
```
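As far as I can tell from the header, the body is ISO-8859-1 with quoted-printable transfer encoding, so a character like "ø" arrives as the single byte 0xF8 (=F8 on the wire) and has to be converted to two bytes for my UTF-8 locale before display. A sketch of that conversion:

```shell
# byte 0xF8 is "ø" in ISO-8859-1; in UTF-8 the same character is two bytes
printf '\370' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
# -> c3 b8
```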

----------

## foosh

hmm

----------

## noathustra

This is a great HOWTO and discussion. Well done. Perhaps someone here can help with my Gecko browser locale difficulties. When I use Firefox/Mozilla/Epiphany on certain sites, the browser segfaults. I ran strace on it and found that just before it crashed there were problems finding the file:

/usr/share/locale/en/LC_MESSAGES/libc.mo

I am using en_US.utf for LC_ALL and ja_JP.utf for LC_CTYPE. 

Any ideas how to generate the libc.mo file?

----------

## Phase_

Hey,

I'm having some trouble getting UTF-8 to work properly with GTKmm.

UTF-8 support is there, as can be seen through Glib::get_charset ().

But whenever I try to run locale_to_utf8() with a UTF-8 formatted string, the app just dies with the error "Aborted".

I have tried editing both /etc/gtk/gtkrc.utf8 and ~/.gtkrc; nothing seems to work.

Anyone got a clue?

----------

## SaFrOuT

I don't know if I am that dumb, but after reading this article I can't figure out how to change my encoding from POSIX to UTF-8.

I need to read and write Arabic documents sometimes, since I am an Arabic user.

Can you help me in easier language, please, based on these outputs?

```
home safrout # locale
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
home safrout # locale -a
C
POSIX
ar_EG
ar_EG.cp1256
ar_EG.ibm864
ar_EG.iso88596
ar_EG.utf8
en_US
en_US.cp1252
en_US.iso88591
en_US.utf8
home safrout #
```

I use KDE 3.3.1, kernel 2.6.9-ck1 and XFree86.
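From the locale -a output above, ar_EG.utf8 already exists on my system, so if I read the HOWTO right I would only need to select it - something like this in /etc/env.d/02locale, followed by env-update and logging in again (my guess, untested):

```shell
# /etc/env.d/02locale (sketch - Arabic locale with UTF-8 encoding)
LANG="ar_EG.utf8"
```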

----------

## anderlin

I want to convert all my filenames to UTF-8, and want to use convmv. How do I find out what encoding to encode from?

----------

## >Octoploid<

 *anderlin wrote:*   

> I want to convert all my filenames to UTF-8, and want to use convmv. How do I find out what encoding to encode from?

 

-f iso88591 

should work...

----------

## bmichaelsen

http://gentoo-wiki.com/HOWTO_Make_your_system_use_unicode/utf-8

----------

## MaxDamage

When burning a CD with UTF-8 filenames, the filenames are displayed wrongly when accessing the CD from a Windows machine.

Any way of forcing the program to burn the CD using iso8859-15?

----------

## tomga

 *MaxDamage wrote:*   

> When burning a CD with UTF8 filenames, the filenames are wrongly displayed when accessing the CD from a windows machine.
> 
> Any way of forcing the program to burn the CD using iso8859-15?

 

I've got the same problem. I tried several configurations but found no solution; it seems Windows XP is not capable of reading proper UTF-8.

I also have a problem with KMail and mails in legacy encodings.

When I get a plain (non-UTF-8) mail with German umlauts, I get those little squares instead of the right characters (encoding -> auto).

The only way to avoid this is to change the encoding from auto to ISO-8859-15; then I get all the umlauts in those mails, but all UTF-8 mails are wrong.

Has anyone solved this?

----------

## bi3l

You can try '-input-charset utf8' as an mkisofs option.

----------

## AFCommando

Hi all,

I can't say how much this thread has helped me with using UTF-8. This thread is awesome  :Very Happy: 

Everything is running fine for me except for one thing, and that's viewing Japanese filenames over Samba. If I ssh into my fileserver and run ls, I see them just fine, but if I ls the share I don't see them correctly. My temporary fix, which I find strange, is having

unix charset = CP850

in my smb.conf; for some reason that makes things work. When I mount the share on my desktop I don't need any options for smbmount. Even when I tried iocharset=utf8,codepage=cp850, it didn't work unless I added that unix charset line to my smb.conf. Does anyone have any ideas why this is happening? Also, if I don't have unix charset = CP850, my win2k box displays the filenames correctly, but if that option is there, it doesn't.

Both my desktop and my fileserver have NLS UTF-8 in the kernel, and I set SMBFS NLS to UTF-8 also.

Any help would be appreciated.

----------

## lordello

I'm sorry to post in this old topic, but I want to keep this information together for later research.

When I use a UTF-8 locale, do I need to set the unicode USE flag?

Are there any other flags to set?

Do I need to recompile programs other than those that have the unicode flag?

I know this question may be stupid, but it's not clear to me.

Thanks.

----------

## supermihi

One question: is the @euro modifier needed for UTF-8 locales?

As UTF-8 should be a superset of iso8859-15, could it be that it's not needed? I'm asking because I found that there was no entry for de_DE.UTF-8@euro in /usr/lib/X11/locale/locale.alias, so I still got the Xlib errors.

----------

## carpman

Hello, so how would I go about creating

```
/usr/lib/X11/locale/en_GB.UTF-8
```

I tried

```
localedef -i en_GB -f UTF-8 en_GB.UTF-8
```

I am only getting problems with one app, Tomboy, which gives this error:

```
tomboy
(Tomboy:18041): Gdk-WARNING **: locale not supported by Xlib
(Tomboy:18041): Gdk-WARNING **: cannot set locale modifiers
Binding key '<Alt>F12' for '/apps/tomboy/global_keybindings/show_note_menu'
Binding key '<Alt>F11' for '/apps/tomboy/global_keybindings/open_start_here'
The program 'Tomboy' received an X Window System error.
```

----------

## devil_ua

http://saber.gentoo.org.ua/~devil/unicode-guide.html

----------

## Cintra

Ref. page 1 of this thread:

 *Quote:*   

>  If you get a
> 
> Code:
> 
> Gdk-WARNING **: locale not supported by Xlib
> ...

I have the above warnings in a few places, but what the above quote 'skates past' is the case where, for example, en_DK has no entry in either the locale.dir or locale.alias files... what then?

Any ideas on this?

----------

## bludger

This is an annoying "feature" of Xlib. en_DK does not seem to be supported in Xlib. I guess the only workaround is to map en_DK to some supported locale in the locale.dir file. This is clearly suboptimal, as you probably lose the features of en_DK that you wanted in the first place.

Does anyone know of a way to edit Xlib locales? It must be somewhere in the code. The current locale system probably works fine for en_US or en_GB users, but for the large number of international English-speaking expats it is a pain in the arse. What is needed is a simple way of customising the locales for the system and for Xlib, so that you can have a mix of English language, European currency and European date format (with English words). It is possible to do this for the base system, but there is no way that I know of to customise the Xlib locales.

Does anyone know a way of doing this?

----------

