python - Good Perl style: How to convert UTF-8 C string literals to \xXX sequences

Question

[Python people: My question is at the very end :-)]

I want to use UTF-8 within C string literals for readability and easy maintenance. However, this is not universally portable. My solution is to create a file foo.c.in, which a small Perl script converts to foo.c so that it contains \xXX escape sequences instead of bytes greater than or equal to 0x80.

For simplicity, I assume that a C string starts and ends in the same line.

This is the Perl code I've created. If a byte >= 0x80 is found, the original string is also emitted as a comment.

use strict;
use warnings;

binmode STDIN, ':raw';
binmode STDOUT, ':raw';


sub utf8_to_esc
{
  my $string = shift;
  my $oldstring = $string;
  my $count = 0;
  $string =~ s/([\x80-\xFF])/$count++; sprintf("\\x%02X", ord($1))/eg;
  $string = '"' . $string . '"';
  $string .= " /* " . $oldstring . " */" if $count;
  return $string;
}

while (<>)
{
  s/"((?:[^"\\]++|\\.)*+)"/utf8_to_esc($1)/eg;
  print;
}

For example, the input

"fööbär"

gets converted to

"f\xC3\xB6\xC3\xB6b\xC3\xA4r" /* fööbär */

Finally, my question: I'm not very good at Perl, and I wonder whether it is possible to rewrite the code in a more elegant (or more 'Perlish') way. I would also like it if someone could point to similar code written in Python.

score 4 · Accepted Answer

I think it's best if you don't use :raw. You are processing text, so you should properly decode and encode. That will be far less error prone, and it will allow your parser to use predefined character classes if you so desire.
You parse as if you expect backslashes in the literal, but then you completely ignore them when you escape. Because of that, you could end up with "...\\xC3\xA3...", where the original backslash now escapes the backslash of the generated sequence. Working with decoded text will also help here.
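To make the backslash problem concrete, here is a minimal Python 3 sketch. The function name naive_escape is made up for illustration; it mimics the question's substitution (every byte >= 0x80 becomes \xXX) and pays no attention to a preceding backslash.

```python
import re

def naive_escape(literal_body: bytes) -> bytes:
    """Replace every byte >= 0x80 with \\xXX, ignoring any backslash before it."""
    return re.sub(rb'[\x80-\xff]',
                  lambda m: b'\\x%02X' % m.group()[0],
                  literal_body)

# A literal body consisting of a backslash followed by ã (UTF-8: 0xC3 0xA3).
body = '\\ã'.encode('utf-8')
print(naive_escape(body).decode('ascii'))  # prints \\xC3\xA3
```

A C compiler reads that output as an escaped backslash followed by the plain characters xC3, which is not what was intended.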

So forget "perlish"; let's actually fix the bugs.

use open ':std', ':locale';

sub convert_char {
   my ($s) = @_;
   utf8::encode($s);
   $s = uc unpack 'H*', $s;
   $s =~ s/\G(..)/\\x$1/sg;
   return $s;
}

sub convert_literal {
   my $orig = my $s = substr($_[0], 1, -1);

   my $safe          = '\x20-\x7E';          # ASCII printables and space
   my $safe_no_slash = '\x20-\x5B\x5D-\x7E'; # ASCII printables and space, no \
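   $s =~ s{
      (?: \\? ( [^$safe] )
      |   ( (?: [$safe_no_slash] | \\[$safe] )+ )
      )
   }{
      defined($1) ? convert_char($1) : $2
   }egx;

   # Compare the strings instead of using the substitution count:
   # the count also includes runs of safe characters that were left unchanged.
   # XXX Assumes $orig doesn't contain "*/"
   return qq{"$s"} . ( $s ne $orig ? " /* $orig */" : '' );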
}

while (<>) {
   s/(" (?:[^"\\]++|\\.)*+ ")/ convert_literal($1) /segx;
   print;
}
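For the Python side of the question, the same idea might look roughly like the following in Python 3. This is a sketch under assumptions, not a port that has been checked against the Perl: it works on decoded text (str), the names TOKEN, LITERAL and convert_line are mine, and it decides whether to add the comment by comparing the old and new strings.

```python
import re

SAFE = r'\x20-\x7E'                    # ASCII printables and space
SAFE_NO_SLASH = r'\x20-\x5B\x5D-\x7E'  # the same range without the backslash

# Either one unsafe character (with an optional backslash in front of it),
# or a run of safe characters and backslash-escape pairs.
TOKEN = re.compile(
    rf'\\?([^{SAFE}])'
    rf'|((?:[{SAFE_NO_SLASH}]|\\[{SAFE}])+)'
)

# A C string literal that starts and ends on the same line.
LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')

def convert_char(ch: str) -> str:
    """Return the UTF-8 bytes of ch as a sequence of \\xXX escapes."""
    return ''.join(f'\\x{b:02X}' for b in ch.encode('utf-8'))

def convert_literal(literal: str) -> str:
    body = literal[1:-1]  # strip the surrounding quotes
    new = TOKEN.sub(
        lambda m: convert_char(m.group(1)) if m.group(1) is not None else m.group(2),
        body,
    )
    # Assumes body does not contain "*/".
    comment = f' /* {body} */' if new != body else ''
    return f'"{new}"{comment}'

def convert_line(line: str) -> str:
    return LITERAL.sub(lambda m: convert_literal(m.group(0)), line)
```

With this, convert_line('"fööbär"') gives "f\xC3\xB6\xC3\xB6b\xC3\xA4r" /* fööbär */, and a literal without any unsafe character comes back unchanged.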

score 3 · Accepted Answer

Re: a more Perlish way.

You can use arbitrary delimiters for quote operators, so you can use string interpolation instead of explicit concatenation, which can look nicer. Also, counting the number of substitutions is unnecessary: a substitution in scalar context evaluates to the number of substitutions made.

I would have written your (misnamed!) function as

use strict; use warnings;
use Carp;

sub escape_high_bytes {
  my ($orig) = @_;

  # Complain if the input is not a string of bytes.
  utf8::downgrade($orig, 1)
    or carp "Input must be binary data";

  if ((my $changed = $orig) =~ s/([\P{ASCII}\P{Print}])/sprintf '\\x%02X', ord $1/eg) {
    # TODO make sure $orig does not contain "*/"
    return qq("$changed" /* $orig */);
  } else {
    return qq("$orig");
  }
}

The (my $copy = $str) =~ s/foo/bar/ is the standard idiom to run a replace in a copy of a string. With 5.14, we could also use the /r modifier, but then we don't know whether the pattern matched, and we would have to resort to counting.
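Since the question also asked about Python: the closest counterpart I know of is re.subn, which returns both the new string and the number of replacements, so no manual counter is needed there either. A minimal Python 3 sketch working on bytes; the character class [^\x20-\x7E] is my approximation of "non-ASCII or non-printable", not an exact translation of the Perl properties.

```python
import re

def escape_high_bytes(orig: bytes) -> bytes:
    # subn returns a (new_bytes, number_of_replacements) pair.
    changed, count = re.subn(rb'[^\x20-\x7E]',
                             lambda m: b'\\x%02X' % m.group()[0],
                             orig)
    if count:
        # TODO make sure orig does not contain "*/"
        return b'"' + changed + b'" /* ' + orig + b' */'
    return b'"' + orig + b'"'
```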

Please be aware that this function has nothing to do with Unicode or UTF-8. The utf8::downgrade($string, $fail_ok) makes sure that the string can be represented using single bytes. If this can't be done (and the second argument is true), then it returns a false value.
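Python has no direct equivalent of utf8::downgrade, but a loosely comparable check is to try encoding a str as Latin-1, which only succeeds when every character fits into a single byte. The helper name below is made up, and this is only an analogy for the "can this be represented using single bytes" test, not a description of what Perl does internally.

```python
def fits_in_single_bytes(text: str) -> bool:
    """Return True if every character of text has a code point <= 0xFF."""
    try:
        text.encode('latin-1')
    except UnicodeEncodeError:
        return False
    return True
```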

The regex operators \p{...} and the negation \P{...} match codepoints that have a certain Unicode property. E.g. \P{ASCII} matches all characters that are not in the range [\x00-\x7F], and \P{Print} matches all characters that are not visible, e.g. control codes like \x00 but not whitespace.

Your while (<>) loop is arguably buggy: it does not necessarily iterate over STDIN. Rather, it iterates over the contents of the files listed in @ARGV (the command line arguments), or defaults to STDIN if that array is empty. Note that the :raw layer will not be declared for the files from @ARGV. Possible solutions:

You can use the open pragma to declare default layers for all filehandles.
You can while (<STDIN>).
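For comparison, Python's standard fileinput module has the same "files from the command line, otherwise standard input" behaviour as while (<>), and it lets you ask for binary mode for every file at once. A small Python 3 sketch; the function name raw_lines is mine.

```python
import fileinput

def raw_lines(files=None):
    """Yield each input line as bytes.

    With files=None, fileinput reads the files named in sys.argv[1:],
    or standard input when no file names were given.
    """
    with fileinput.input(files=files, mode='rb') as stream:
        for line in stream:
            yield line
```

Calling raw_lines() with no argument in a script then behaves like the Perl loop, except that every source is read as raw bytes.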

Do you know what is Perlish? Using modules. As it happens, String::Escape already implements much of the functionality you want.

score 1 · Accepted Answer

Similar code written in Python

Python 2.7

import re
import sys

def utf8_to_esc(matched):
    s = matched.group(1)
    s2 = s.encode('string-escape')
    result = '"{}"'.format(s2)
    if s != s2:
        result += ' /* {} */'.format(s)
    return result

sys.stdout.writelines(re.sub(r'"([^"]+)"', utf8_to_esc, line) for line in sys.stdin)

Python 3.x

def utf8_to_esc(matched):
    ...
    s2 = s.encode('unicode-escape').decode('ascii')
    ...
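One caveat with the Python 3 variant: on a str, unicode-escape produces code point escapes (ö becomes \xf6), not the UTF-8 bytes \xC3\xB6 that the question asks for. If you want the UTF-8 byte escapes, one option is to work on bytes throughout. A minimal sketch, assuming UTF-8 input and literals that start and end on the same line; LITERAL and convert_line are names I made up.

```python
import re

# A C string literal on a single line; group 1 is the text between the quotes.
LITERAL = re.compile(rb'"((?:[^"\\]|\\.)*)"')

def utf8_to_esc(matched):
    s = matched.group(1)
    # Turn every byte >= 0x80 into a \xXX escape.
    s2 = re.sub(rb'[\x80-\xff]', lambda m: b'\\x%02X' % m.group()[0], s)
    result = b'"' + s2 + b'"'
    if s != s2:
        result += b' /* ' + s + b' */'
    return result

def convert_line(line: bytes) -> bytes:
    return LITERAL.sub(utf8_to_esc, line)

# Possible use as a filter:
#   import sys
#   sys.stdout.buffer.writelines(convert_line(l) for l in sys.stdin.buffer)
```

Like the code in the question, this ignores a backslash that happens to precede a non-ASCII byte.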
