bash - Portable (cross platform) scripting with unicode filenames

Question

This is driving me crazy. I have the following bash script.

testdir="./test.$$"
echo "Creating a testing directory: $testdir"
mkdir "$testdir"
cd "$testdir" || exit 1

echo "Creating a file word.txt with content á.txt"
echo 'á.txt' > word.txt

fname=$(cat word.txt)
echo "The word.txt contains:$fname"

echo "creating a file $fname with a touch"
touch "$fname"
ls -l

echo "command: bash cycle"
while read -r line
do
    [[ -e "$line" ]] && echo "$line is a file"
done < word.txt

echo "command: find . -name $fname -print"
find . -name "$fname" -print

echo "command: find . -type f -print | grep $fname"
find . -type f -print | grep "$fname"

echo "command: find . -type f -print | fgrep -f word.txt"
find . -type f -print | fgrep -f word.txt

On FreeBSD (and probably on Linux too) it gives this result:

Creating a testing directory: ./test.64511
Creating a file word.txt with content á.txt
The word.txt contains:á.txt
creating a file á.txt with a touch
total 1
-rw-r--r--  1 clt  clt  7  3 júl 12:51 word.txt
-rw-r--r--  1 clt  clt  0  3 júl 12:51 á.txt
command: bash cycle
á.txt is a file
command: find . -name á.txt -print
./á.txt
command: find . -type f -print | grep á.txt
./á.txt
command: find . -type f -print | fgrep -f word.txt
./á.txt

Even on Windows 7 (with Cygwin installed), running the script gives the correct result.

But when I run this script in bash on OS X, I get this:

Creating a testing directory: ./test.32534
Creating a file word.txt with content á.txt
The word.txt contains:á.txt
creating a file á.txt with a touch
total 8
-rw-r--r--  1 clt  staff  0  3 júl 13:01 á.txt
-rw-r--r--  1 clt  staff  7  3 júl 13:01 word.txt
command: bash cycle
á.txt is a file
command: find . -name á.txt -print
command: find . -type f -print | grep á.txt
command: find . -type f -print | fgrep -f word.txt

So only bash found the file á.txt; neither find nor grep did. :(

I asked first on apple.stackexchange, and one answer suggested using iconv to convert the filenames.

$ find . -name $(iconv -f utf-8 -t utf-8-mac <<< á.txt)

While this works on OS X, it is terrible anyway (it means entering an extra command for every UTF-8 string typed into the terminal).

I'm trying to find a general cross-platform bash programming solution. So, the questions are:

Why does bash "find" the file on OS X while find doesn't?

and

How do I write a cross-platform bash script where Unicode filenames are stored in a file?
Is the only solution to write special versions just for OS X using iconv?
Does a portable solution exist for other scripting languages, such as Perl?

PS: Finally, not really a programming question, but I wonder what the rationale is behind Apple's decision to use decomposed filenames, which don't play nicely with UTF-8 on the command line.

EDIT

Simple od.

$ ls | od -bc
0000000   141 314 201 056 164 170 164 012 167 157 162 144 056 164 170 164
           a   ́    **   .   t   x   t  \n   w   o   r   d   .   t   x   t
0000020   012                                                            
          \n

and

$ od -bc word.txt
0000000   303 241 056 164 170 164 012                                    
           á  **   .   t   x   t  \n                                    
0000007

so the

$ while read -r line; do echo "$line" | od -bc; done < word.txt
0000000   303 241 056 164 170 164 012                                    
           á  **   .   t   x   t  \n                                    
0000007

and the output from find is the same as from ls

$ find . -print | od -bc
0000000   056 012 056 057 167 157 162 144 056 164 170 164 012 056 057 141
           .  \n   .   /   w   o   r   d   .   t   x   t  \n   .   /   a
0000020   314 201 056 164 170 164 012                                    
           ́    **   .   t   x   t  \n

So the content of word.txt IS DIFFERENT from the name of the file created from that content. Therefore I still have no explanation for why bash found the file.

score 5 · Accepted Answer

Unicode is hard. Repeat it every time you brush your teeth.

Your á.txt filename contains 5 characters, of which á is the troublesome one. There is more than one way to represent á as a sequence of Unicode code points. There's the precomposed representation, and the decomposed one. Unfortunately most software is not prepared to deal with characters, settling for code points instead (yes most software is cr*p). This means that given precomposed and decomposed representations of the same character, software will not recognize them as the same.

You have a precomposed á, represented as Unicode code point U+00E1 LATIN SMALL LETTER A WITH ACUTE. Windows uses the precomposed representation. Mac filesystems insist on the decomposed representation (well, mostly; utf-8-mac does not decompose certain character ranges, but á is decomposed OK). So on a Mac your á becomes U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT (writing off the top of my head, not having a Mac handy). Linux filesystems accept whatever you throw at them.

If you give find a precomposed á, it will not find a file with a decomposed á in its name, because it's not prepared to deal with this brouhaha.
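To make the mismatch concrete, here is a minimal sketch that builds both spellings of the name from the byte values shown in the question's od dumps and compares them as plain strings. The variable names pre and dec are just illustrative, and this only demonstrates that the two byte sequences differ; it is not a claim about how find is implemented internally.

```bash
# Build both spellings of the name from explicit bytes (octal escapes),
# so the example does not depend on how this text itself was saved.
pre=$(printf '\303\241.txt')    # U+00E1, precomposed: bytes c3 a1
dec=$(printf 'a\314\201.txt')   # U+0061 U+0301, decomposed: bytes 61 cc 81

# A plain string comparison sees two different names.
if [ "$pre" = "$dec" ]; then
    echo "same"
else
    echo "different"
fi

# Dump the bytes of each spelling for inspection.
printf '%s' "$pre" | od -An -tx1
printf '%s' "$dec" | od -An -tx1
```

Both names look identical on screen, but the comparison reports them as different, and the precomposed name is one byte shorter than the decomposed one.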

So what's the solution? There isn't any. If you want to handle Unicode, you have to work around defects of the common tools.

Here's one slightly less ugly workaround. Write a small bash function (using iconv or whatever) that converts a filename to the representation acceptable on the current system, and use it throughout. Let's call it u8:

find . -name "$(u8 "$myfilename")" -print
find . -type f -print | fgrep "$(u8 "$myfilename")"

and so on. Pretty it's not, but it should work.

Oh and I think we all should start sending bug reports for this cr*p. Our software should eventually strive to understand basic human concepts like characters (I'm not even starting to talk about strings). Code points just don't cut it, sorry, even if they're Unicode code points.

score 2 · Accepted Answer

Creating the file with touch and testing its existence with [[ -e "$line" ]] uses the same encoding, so the file gets found.

Testing its existence using find -name and find -print seems to use a different encoding. I suggest piping the output of find -print into a hexdumper (xxd or od -x or similar). This will probably show you which encoding find uses for -print (and the same one is probably used for -name).

The general solution for encoding problems is always: USE JUST ONE ENCODING. In your case you should decide which point is easier to adapt: you can change the encoding when creating the file (touch "$(iconv -f utf-8 -t utf-8-mac <<< á.txt)" or similar), or change what you give to find (the solution already given in your question). Since bash itself seems to cope well with the Unicode filenames and only find seems to have this problem, I would do the necessary conversion there. Maybe there is even a configuration option for the Mac OS version of find that sets which encoding it uses for -name and -print.
