json - Normalization on utf8 filenames stored in JSON with perl

Question

I have two JSON files that come from different operating systems.

Both files are encoded in UTF-8 and contain UTF-8 encoded filenames.

One file comes from OS X, and its filename is in NFD form (output of od -bc):

0000160   166 145 164 154 141 314 201 057 110 157 165 163 145 040 155 145
           v   e   t   l   a    ́  **   /   H   o   u   s   e       m   e

The second contains the same filename, but in NFC form:

000760   166 145 164 154 303 241 057 110 157 165 163 145 040 155 145 163
           v   e   t   l   á  **   /   H   o   u   s   e       m   e   s

As I have learned, this is called 'different normalization', and there is a CPAN module, Unicode::Normalize, for handling it.
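
To make the difference concrete, here is a minimal sketch using the core Unicode::Normalize module. The two strings are hand-built stand-ins for the names in the dumps above, not read from the real files.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# "vetlá" with a precomposed á (U+00E1), as in the NFC dump
my $composed   = "vetl\x{E1}";
# the same name as a + COMBINING ACUTE ACCENT (U+0301), as in the OS X dump
my $decomposed = "vetla\x{301}";

# Compared as they are, the strings differ (5 characters versus 6)
my $same_raw = $composed eq $decomposed ? 1 : 0;

# Normalized to one common form, they are identical
my $same_nfd = NFD($composed) eq NFD($decomposed) ? 1 : 0;
my $same_nfc = NFC($composed) eq NFC($decomposed) ? 1 : 0;

printf "raw: %d, NFD: %d, NFC: %d\n", $same_raw, $same_nfd, $same_nfc;
# prints "raw: 0, NFD: 1, NFC: 1"
```

Which form is chosen matters less than applying the same one to both sides before comparing.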

I'm reading both files like this:

my $json1 = decode_json read_file($file1, {binmode => ':raw'}) or die "..." ;
my $json2 = decode_json read_file($file2, {binmode => ':raw'}) or die "..." ;

read_file is from File::Slurp and decode_json is from JSON::XS.

After reading the JSON into Perl structures, the filename ends up as a hash key in the data from one file and as a value in the data from the other. I need to find where a hash key from the first hash is equivalent to a value from the second hash, so I need to ensure that they are "binary" identical.

I tried the following:

 grep 'House' file1.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc

and

 grep 'House' file2.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc

Both produce the same output for me.

Now the questions:

How can I simply read both JSON files so that both $hashrefs end up with the same normalization?

Or do I need to run something like this on both hashes after decode_json?

while(my($k,$v) = each(%$json1)) {
    $copy->{ NFD($k) } = NFD($v);
}

In short:

How can I read different JSON files so that the data 'inside' the Perl $href has the same normalization? Is there a nicer way than explicitly doing NFD on each key and value and creating another NFD-normalized (big) copy of the hashes?

Any hints or suggestions, please...

Because my English is very bad, here is a simulation of the problem:

use 5.014;
use warnings;

use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);

use File::Slurp;
use Data::Dumper;
use JSON::XS;

#Create two files that contain different "normalizations"
my($nfc, $nfd);
$nfc->{ NFC('key') } = NFC('vál');
$nfd->{ NFD('vál') } = 'something';

#save as NFC - this comes from "FreeBSD"
my $jnfc =  JSON::XS->new->encode($nfc);
open my $fd, ">:utf8", "nfc.json" or die("nfc");
print $fd $jnfc;
close $fd;

#save as NFD - this comes from "OS X"
my $jnfd =  JSON::XS->new->encode($nfd);
open $fd, ">:utf8", "nfd.json" or die("nfd");
print $fd $jnfd;
close $fd;

#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;

say $jd->{ $jc->{key} } // "NO FOUND";    #wanted to print "something"

my $jc2;
#is there a better way to do this?
while(my($k,$v) = each(%$jc)) {
    $jc2->{ NFD($k) } = NFD($v);
}
say $jd->{ $jc2->{key} } // "NO FOUND";    #OK

score 1 · Accepted Answer

Hm. I can't suggest a better "programming" solution. But why not simply run the following before your script?

perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' < freebsd.json >bsdok.json
perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' < osx.json     >osxok.json

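Your script can then read and use both files, because they are in the same normalization. So rather than searching for a programming solution inside your script, solve the problem before the data enters the script. (The second command is strictly unnecessary, since the OS X file is already in NFD.) This is a simple conversion at the file level, and certainly easier than traversing data structures.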
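
If you would rather keep this step in a script than in a one-liner, a minimal sketch could look like the following. normalize_file is a hypothetical helper name, the file names come from the command line, and only core modules are used.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# normalize_file() is a hypothetical helper name, not from the answer above.
# It reads a UTF-8 text file and writes an NFD-normalized copy.
sub normalize_file {
    my ($in, $out) = @_;

    open my $ifh, '<:encoding(UTF-8)', $in or die "Cannot read $in: $!";
    my $text = do { local $/; <$ifh> };    # slurp the whole file, like -0777
    close $ifh;

    open my $ofh, '>:encoding(UTF-8)', $out or die "Cannot write $out: $!";
    print {$ofh} NFD($text);
    close $ofh or die "Cannot close $out: $!";
    return;
}

# Run only when called with exactly two arguments: input file, output file
normalize_file(@ARGV) if @ARGV == 2;
```

Saved under a name of your choice (normalize.pl here is only an example), it would be called as `perl normalize.pl freebsd.json bsdok.json`.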

score 1 · Accepted Answer

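While searching for the right solution to your question, I discovered that the software is c*rp :) See: https://stackoverflow.com/a/17448888/632407.

Anyway, I found a solution for your particular question: how to read JSON containing filenames regardless of normalization.

Instead of your: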

#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;

use this:

#now read them
my $jc = get_json_from_utf8_file('nfc.json') ;
my $jd = get_json_from_utf8_file('nfd.json') ;
...

sub get_json_from_utf8_file {
    my $file = shift;
    return
      decode_json      #parse the JSON into a perl structure
        encode 'utf8', #decode_json wants a UTF-8 encoded binary string, so encode it
          NFC          #convert to precomposed normalization, regardless of the source
            read_file  #the file contains UTF-8 encoded text, so read it as such
              $file, { binmode => ':utf8' } ;
}

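This should (at least I hope) ensure that whatever decomposition the JSON content uses, NFC converts it to the precomposed form, and JSON::XS then reads and parses it into the same internal perl structure.

So your example prints: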

something

without any need to traverse $json.

The idea comes from Joseph Myers and Nemo ;)

Maybe some more skilled programmers will give more hints.

score 1 · Accepted Answer

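Although right now it may only matter that a few filenames are converted to the same normalization for comparison, other unexpected problems could show up almost anywhere if your JSON data has mixed normalization.

So my suggestion is to normalize the entire input from both sources as your first step, before doing any parsing (that is, at the point where you read the file and before decode_json). This should not corrupt any of your JSON structures, since those are delimited with ASCII characters. Your existing perl code should then be able to blindly assume that all UTF-8 characters have the same normalization.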

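my $rawdata1 = read_file($file1, {binmode => ':raw'}) or die "...";
my $rawdata2 = read_file($file2, {binmode => ':raw'}) or die "...";

# NFD works on characters, not UTF-8 bytes: decode first (Encode::decode),
# then parse the character string with a JSON::XS object without ->utf8
my $json1 = JSON::XS->new->decode( NFD( decode('UTF-8', $rawdata1) ) );
my $json2 = JSON::XS->new->decode( NFD( decode('UTF-8', $rawdata2) ) );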
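
To try the idea without any files or CPAN modules, here is a minimal sketch under a few assumptions: the core JSON::PP module stands in for JSON::XS, two in-memory byte strings stand in for the files, and load_normalized is a hypothetical helper name. Note that the bytes are decoded to characters before NFD is applied, because NFD operates on characters rather than on UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);
use JSON::PP ();
use Unicode::Normalize qw(NFD);

# load_normalized() is a hypothetical helper: UTF-8 JSON bytes in,
# perl structure with NFD-normalized strings out
sub load_normalized {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);           # bytes -> characters
    return JSON::PP->new->decode( NFD($text) );   # normalize, then parse characters
}

# Stand-ins for the two files: an NFC value on one side, an NFD key on the other
my $jc = load_normalized(qq({"key":"v\xC3\xA1l"}));          # á as bytes C3 A1
my $jd = load_normalized(qq({"va\xCC\x81l":"something"}));   # a + bytes CC 81

my $found = $jd->{ $jc->{key} } // 'NOT FOUND';
print "$found\n";    # prints "something"
```

The JSON object is created without ->utf8 because its input is already a character string at that point.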

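To make this process slightly faster (it should already be plenty fast, since the module uses fast XS routines), you can find out whether one of the two data files is already in a certain normalization form, then leave that file unchanged and convert the other file into that form.

For example: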

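use Encode qw(decode);
use Unicode::Normalize qw(NFD checkNFD);   # checkNFD is not exported by default

my $rawdata1 = read_file($file1, {binmode => ':raw'}) or die "...";
my $rawdata2 = read_file($file2, {binmode => ':raw'}) or die "...";

# decode the raw UTF-8 bytes so the checks see characters, not bytes
my $text1 = decode('UTF-8', $rawdata1);
my $text2 = decode('UTF-8', $rawdata2);

if (checkNFD($text1)) {
    # then you know $file1 is already in Normalization Form D
    # (i.e., it was formed by canonical decomposition),
    # so you only need to convert $file2 into NFD
    $text2 = NFD($text2);
}
# no ->utf8 here: the input is already a character string
my $json1 = JSON::XS->new->decode($text1);
my $json2 = JSON::XS->new->decode($text2);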

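Of course, you will have to experiment at development time to see whether one or the other of the input files is already in a normalized form. In the final version of your code you will then no longer need the conditional statement; you can simply convert the other input file into the same normalized form.

Also note that it is recommended to produce output in NFC form (if your program produces any output that will be stored and used later). See here, for example: http://www.perl.com/pub/2012/05/perlunicookbook-unicode-normalization.html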
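
As a rough illustration of the output side, the sketch below serializes a structure and normalizes the resulting text to NFC before encoding it as UTF-8. It assumes the core JSON::PP module instead of JSON::XS, and to_nfc_json is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);
use JSON::PP ();
use Unicode::Normalize qw(NFC);

# to_nfc_json() is a hypothetical helper, not part of any module
sub to_nfc_json {
    my ($data) = @_;
    # canonical() sorts keys so the output is stable; without ->utf8,
    # encode() returns a character string that NFC can work on
    my $text = JSON::PP->new->canonical->encode($data);
    return encode('UTF-8', NFC($text));    # UTF-8 bytes, ready for a :raw filehandle
}

# The key is given in NFD here; the serialized bytes come out in NFC
my $bytes = to_nfc_json({ "vetla\x{301}/House" => 'something' });
```

The returned value is a byte string, so it should be written through a filehandle opened with :raw to avoid encoding it a second time.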

score -1 · Accepted Answer

Rather than walking the data structure by hand, let a module handle it for you.
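
The answer does not say which module it has in mind, so the following is only a hand-rolled stand-in, not that module: a small recursive helper (normalize_deep is a hypothetical name) that also handles nested hashes and arrays, which the flat loop in the question does not.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# normalize_deep() is a hypothetical helper; it returns a normalized copy
# and leaves the original structure untouched
sub normalize_deep {
    my ($data) = @_;

    if (ref $data eq 'HASH') {
        my %copy;
        $copy{ NFD($_) } = normalize_deep($data->{$_}) for keys %$data;
        return \%copy;
    }
    if (ref $data eq 'ARRAY') {
        return [ map { normalize_deep($_) } @$data ];
    }

    # undef and other references (for example JSON boolean objects) pass through
    return $data if !defined $data || ref $data;

    # plain scalar: normalize it (note that numbers come back as strings)
    return NFD($data);
}
```

It still builds a full copy, so for large inputs normalizing the JSON text once before parsing is likely the cheaper route.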
