Non-Latin characters in archive files

I was recently moving files whose names contained non-Latin characters between Linux and Windows. Over Samba (version 3.4), everything worked fine. Yet creating a ZIP or TAR archive on Linux and extracting it on Windows resulted in mangled filenames. As it turns out, this is expected behavior: in their traditional forms, neither ZIP nor TAR records which character encoding the filenames use, so the names are stored as raw bytes. (Newer revisions of both formats can mark names as UTF-8, but that only helps when both the archiver and the extractor support it.) As a result, an archive created on a system that uses UTF-8 (most modern Linux systems) and extracted on Windows ends up with garbage characters in its filenames, because the extractor interprets the UTF-8 bytes in Windows’ code page instead of converting them. The same goes for creating an archive on Windows and extracting it on Linux. The solution is to use a format that stores filenames in Unicode (e.g. 7z) or to move the data with Samba, which converts filenames between encodings automatically.
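To make the mangling concrete, here is a minimal Python sketch, not taken from any particular archiver. The helper names (`mangle`, `repair`, `zip_name_flags`) are hypothetical, and code page 437 is only an assumed stand-in for whatever legacy code page the extracting side uses. The last helper checks whether Python's own `zipfile` module sets the ZIP UTF-8 flag (general purpose bit 11) for a given name.

```python
import io
import zipfile

# General purpose bit 11 in a ZIP entry header: "filename is UTF-8".
UTF8_FLAG = 0x800


def mangle(name, wrong_codec="cp437"):
    """Simulate an extractor that reads UTF-8 filename bytes as a legacy code page.

    cp437 is an assumption for illustration; the real code page depends on
    the Windows locale and on the extraction tool.
    """
    return name.encode("utf-8").decode(wrong_codec)


def repair(mangled, wrong_codec="cp437"):
    """Undo mangle(): recover the original bytes, then decode them as UTF-8."""
    return mangled.encode(wrong_codec).decode("utf-8")


def zip_name_flags(name):
    """Write one empty entry to an in-memory ZIP and read it back.

    Returns the filename as read back and whether the UTF-8 flag was set.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, b"")
    with zipfile.ZipFile(buf) as zf:
        info = zf.infolist()[0]
        return info.filename, bool(info.flag_bits & UTF8_FLAG)
```

Calling `mangle("żółw.txt")` returns a string of unrelated box-drawing and accented characters, which is the kind of garbage that shows up after extraction, and `repair()` reverses it as long as the wrong code page is known and maps every byte. `zip_name_flags("żółw.txt")` should report the flag as set, because `zipfile` marks non-ASCII names as UTF-8; whether the extracting tool honors that flag is a separate question.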
