I have two JSON files which come from different OSes. Both files are encoded in UTF-8 and contain UTF-8 encoded filenames.
One file comes from OS X and the filename is in NFD form (od -bc output):
0000160 166 145 164 154 141 314 201 057 110 157 165 163 145 040 155 145
v e t l a ́ ** / H o u s e m e
The second contains the same filename, but in NFC form:
000760 166 145 164 154 303 241 057 110 157 165 163 145 040 155 145 163
v e t l á ** / H o u s e m e s
As I have learned, this is called 'different normalization', and there is a CPAN module, Unicode::Normalize, for handling it.
I'm reading both files with the following:
my $json1 = decode_json read_file($file1, {binmode => ':raw'}) or die "..." ;
my $json2 = decode_json read_file($file2, {binmode => ':raw'}) or die "..." ;
read_file is from File::Slurp and decode_json is from JSON::XS.
After decoding the JSON into Perl structures, the filename ends up as a hash key in the data from one file and as a hash value in the data from the other. I need to find where a hash key from the first hash is equivalent to a value from the second hash, so I need to ensure that they are "binary" identical.
I tried the following:
grep 'House' file1.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
and
grep 'House' file2.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc
Both commands produce the same output for me.
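The same check can be sketched inside Perl, without grep and od. This is only an illustration: the two strings are built by hand from the code points visible in the od dumps above (U+00E1 for the NFC form, 'a' followed by U+0301 for the NFD form), and the variable names are made up for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Built with escapes so the encoding of this script does not matter.
my $nfc = "vetl\x{E1}";      # precomposed a-acute, as in the second file
my $nfd = "vetla\x{301}";    # 'a' + combining acute accent, as in the OS X file

my $raw_equal = ($nfc eq $nfd)           ? 1 : 0;   # compared as read
my $nfd_equal = (NFD($nfc) eq NFD($nfd)) ? 1 : 0;   # both decomposed first
my $nfc_equal = (NFC($nfc) eq NFC($nfd)) ? 1 : 0;   # both composed first

print "raw: $raw_equal, NFD: $nfd_equal, NFC: $nfc_equal\n";
# prints "raw: 0, NFD: 1, NFC: 1"
```

So the two forms never compare equal as they are, but they do once both sides go through the same normalization; whether that is NFD or NFC should not matter as long as it is the same on both sides.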
Now the questions:
- How can I simply read both JSON files so that both $hashrefs end up with the same normalization?
Or do I need to run something like this on both hashes after decode_json?
while (my ($k, $v) = each %$json1) {
    $copy->{ NFD($k) } = NFD($v);
}
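The loop above only covers a flat hash whose values are plain strings. If the decoded data is nested, a recursive variant might look like the sketch below. The helper name nfd_deep is made up for this example; it is not part of JSON::XS or Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: returns a copy of a decoded JSON structure
# with every hash key and every plain scalar value normalized to NFD.
sub nfd_deep {
    my ($data) = @_;
    my $type = ref $data;

    if ($type eq 'HASH') {
        my %copy;
        $copy{ NFD($_) } = nfd_deep($data->{$_}) for keys %$data;
        return \%copy;
    }
    if ($type eq 'ARRAY') {
        return [ map { nfd_deep($_) } @$data ];
    }

    # Leave undef (JSON null) and any other reference (e.g. JSON booleans) as-is.
    return $data if $type or not defined $data;

    # Plain scalar: NFD returns a string, so numbers come back as strings.
    return NFD($data);
}

# Usage: an NFC key and nested NFC values all come back in NFD form.
my $in  = { "vetl\x{E1}" => [ "vetl\x{E1}", { name => "\x{E1}" } ], n => undef };
my $out = nfd_deep($in);
```

Two caveats with this sketch: it still builds a full copy, and two keys that differ only in normalization would collapse into a single key in the copy.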
In short:
- How can I read different JSON files and get the same normalization 'inside' the Perl $href? Is there a nicer way than explicitly calling NFD on each key and value and creating another (big) NFD-normalized copy of the hashes?
Any hints or suggestions, please...
Because my English is very bad, here is a simulation of the problem:
use 5.014;
use warnings;
use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);
use File::Slurp;
use Data::Dumper;
use JSON::XS;
# Create two files that contain different "normalizations"
my ($nfc, $nfd);
$nfc->{ NFC('key') } = NFC('vál');
$nfd->{ NFD('vál') } = 'something';
#save as NFC - this comes from "FreeBSD"
my $jnfc = JSON::XS->new->encode($nfc);
open my $fd, ">:utf8", "nfc.json" or die("nfc");
print $fd $jnfc;
close $fd;
#save as NFD - this comes from "OS X"
my $jnfd = JSON::XS->new->encode($nfd);
open $fd, ">:utf8", "nfd.json" or die("nfd");
print $fd $jnfd;
close $fd;
#now read them
my $jc = decode_json read_file( "nfc.json", { binmode => ':raw' } ) or die "No file" ;
my $jd = decode_json read_file( "nfd.json", { binmode => ':raw' } ) or die "No file" ;
say $jd->{ $jc->{key} } // "NOT FOUND"; # wanted to print "something"
my $jc2;
# is there a better way to do this?
while (my ($k, $v) = each %$jc) {
    $jc2->{ NFD($k) } = NFD($v);
}
say $jd->{ $jc2->{key} } // "NOT FOUND"; # OK