[Python people: My question is at the very end :-)]
I want to use UTF-8 within C string literals for readability and easy maintenance. However, this is not universally portable. My solution is to create a file foo.c.in which gets converted by a small Perl script to a file foo.c that contains \xXX escape sequences instead of bytes greater than or equal to 0x80. For simplicity, I assume that a C string starts and ends on the same line.
This is the Perl code I've written. If a byte >= 0x80 is found, the original string is also emitted as a comment.
use strict;
use warnings;

# Read and write raw bytes so multi-byte UTF-8 sequences pass through untouched.
binmode STDIN,  ':raw';
binmode STDOUT, ':raw';

# Replace every byte >= 0x80 in a string literal's body with a \xXX escape.
# If anything was replaced, append the original text as a C comment.
sub utf8_to_esc
{
    my $string    = shift;
    my $oldstring = $string;
    my $count     = 0;

    $string =~ s/([\x80-\xFF])/$count++; sprintf("\\x%02X", ord($1))/eg;
    $string = '"' . $string . '"';
    $string .= " /* " . $oldstring . " */" if $count;
    return $string;
}

while (<>)
{
    # Find double-quoted string literals (single-line only) and convert them.
    s/"((?:[^"\\]++|\\.)*+)"/utf8_to_esc($1)/eg;
    print;
}
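I run it roughly like this (utf8_to_esc.pl is just the name I happened to give the script):

perl utf8_to_esc.pl foo.c.in > foo.c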
For example, the input
"fööbär"
gets converted to
"f\xC3\xB6\xC3\xB6b\xC3\xA4r" /* fööbär */
Finally, my question: I'm not very good at Perl, and I wonder whether the code can be rewritten in a more elegant (or more 'Perlish') way. I would also appreciate it if someone could point me to similar code written in Python.
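In case it helps to show what I have in mind for the Python side, here is a rough, untested sketch of how I imagine an equivalent might look. It is byte-oriented like the Perl version (to match the ':raw' behaviour), and the regex mirrors the one above, only without the possessive quantifiers, which the plain re module supports only in recent Python versions:

import re
import sys

# Match a double-quoted C string literal on a single line.
string_re = re.compile(rb'"((?:[^"\\]+|\\.)*)"')
high_byte_re = re.compile(rb'[\x80-\xFF]')

def utf8_to_esc(match):
    original = match.group(1)
    # Replace every byte >= 0x80 with a \xXX escape; count the replacements.
    escaped, count = high_byte_re.subn(
        lambda m: b'\\x%02X' % m.group(0)[0], original)
    result = b'"' + escaped + b'"'
    if count:
        result += b' /* ' + original + b' */'
    return result

for line in sys.stdin.buffer:
    sys.stdout.buffer.write(string_re.sub(utf8_to_esc, line))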