Does anyone recognize this format (see the paste at the bottom)? It comes from Répertoire de vedettes-matière (RVM). It is neither of these:

I can program in Perl. This is also posted as https://github.com/LibreCat/Catmandu-MARC/issues/88

I can hack it with just JSON::XS, but I don't know how to handle this odd accent encoding (a few sample lines out of 325):

{grave}e
{ring}Z
{ringb}h
{ringb}s
{rlig}a
{rlig}A

Here is the odd MARC JSON:

{
"rows" : [
{
    "RecordNumber" : "1",
    "Tag" : "LDR",
    "Indicators" : "",
    "Content" : "00533nz   2200205n  4500"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "001",
    "Indicators" : "\"  \"",
    "Content" : "201-0000001"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "005",
    "Indicators" : "\"  \"",
    "Content" : "20121025110000.0"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "008",
    "Indicators" : "\"  \"",
    "Content" : "790704\\nfanvnnbabn\\\\\\\\\\\\\\\\\\\\\\b\\ana\\\\\\\\\\\\"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "016",
    "Indicators" : "\\\\",
    "Content" : "$a0509B3366"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "035",
    "Indicators" : "\\\\",
    "Content" : "$a(ISM)8013850"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "035",
    "Indicators" : "9\\",
    "Content" : "$a201-0000001"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "040",
    "Indicators" : "\\\\",
    "Content" : "$aCaQQLa$bfre"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "150",
    "Indicators" : "\\\\",
    "Content" : "$aAlg{grave}ebres de Von Neumann"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "450",
    "Indicators" : "\\\\",
    "Content" : "$wnne$aVon Neumann, Alg{grave}ebres de"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "450",
    "Indicators" : "\\\\",
    "Content" : "$aW*-alg{grave}ebres"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "550",
    "Indicators" : "\\\\",
    "Content" : "$wg$aC*-alg{grave}ebres"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "550",
    "Indicators" : "\\\\",
    "Content" : "$wg$aEspace de Hilbert"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "697",
    "Indicators" : "\\\\",
    "Content" : "$amm."
}
,
{
    "RecordNumber" : "1",
    "Tag" : "750",
    "Indicators" : "\\7",
    "Content" : "$aVon Neumann, Alg{grave}ebres de$2ram"
}
,
{
    "RecordNumber" : "1",
    "Tag" : "750",
    "Indicators" : "\\0",
    "Content" : "$aVon Neumann algebras"
}
]
}
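For readers who want to experiment with the paste, here is a minimal sketch of regrouping the flat "rows" list into records. It assumes, as in the MARCMaker convention, that a backslash stands for a blank in indicators and fixed fields and that "$" introduces a subfield code. The helper names (parse_subfields, row_to_field, rows_to_records) are hypothetical, and JSON::PP is used only because it ships with Perl; JSON::XS exports the same decode_json.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use JSON::PP qw(decode_json);    # core module

# A few rows copied from the paste above; the single-quoted heredoc
# keeps the backslashes exactly as they appear in the JSON file.
my $json = <<'JSON';
{ "rows" : [
  { "RecordNumber" : "1", "Tag" : "LDR", "Indicators" : "",       "Content" : "00533nz   2200205n  4500" },
  { "RecordNumber" : "1", "Tag" : "001", "Indicators" : "\"  \"", "Content" : "201-0000001" },
  { "RecordNumber" : "1", "Tag" : "035", "Indicators" : "9\\",    "Content" : "$a201-0000001" },
  { "RecordNumber" : "1", "Tag" : "450", "Indicators" : "\\\\",   "Content" : "$wnne$aVon Neumann, Alg{grave}ebres de" }
] }
JSON

# Hypothetical helper: split "$aFoo$bBar" into [code, value] pairs.
sub parse_subfields {
    my ($content) = @_;
    my (undef, @chunks) = split /\$/, $content;    # drop text before the first "$"
    return [ map { [ substr($_, 0, 1), substr($_, 1) ] } grep { length } @chunks ];
}

# Hypothetical helper: turn one flat row into a field structure.
sub row_to_field {
    my ($row) = @_;
    my ($tag, $content) = @{$row}{qw(Tag Content)};
    return { tag => $tag, data => $content } if $tag eq 'LDR';
    if ($tag =~ /^00\d$/) {                        # control field, no subfields
        (my $data = $content) =~ tr/\\/ /;         # assumption: backslash means blank
        return { tag => $tag, data => $data };
    }
    (my $ind = $row->{Indicators}) =~ tr/\\/ /;    # assumption: backslash means blank
    return { tag => $tag, ind => $ind, subfields => parse_subfields($content) };
}

# Group rows by RecordNumber, keeping first-seen order.
sub rows_to_records {
    my ($rows) = @_;
    my (%fields_of, @order);
    for my $row (@$rows) {
        my $num = $row->{RecordNumber};
        push @order, $num unless exists $fields_of{$num};
        push @{ $fields_of{$num} }, row_to_field($row);
    }
    return [ map { +{ number => $_, fields => $fields_of{$_} } } @order ];
}

my $records = rows_to_records(decode_json($json)->{rows});
for my $record (@$records) {
    print "Record $record->{number}\n";
    for my $field (@{ $record->{fields} }) {
        if ($field->{subfields}) {
            my $subs = join ' ', map { '$' . $_->[0] . ' ' . $_->[1] } @{ $field->{subfields} };
            print "  $field->{tag} [$field->{ind}] $subs\n";
        }
        else {
            print "  $field->{tag} $field->{data}\n";
        }
    }
}
```

The accent mnemonics such as {grave} are deliberately left untouched here; this sketch only recovers the record structure.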

Added: this accent encoding comes from MARCmkr. I used the following:

use MARC::File::MARCMaker; # https://metacpan.org/pod/MARC::File::MARCMaker
# for some reason can't be found by module name, so use:
# cpanm http://www.cpan.org/authors/id/E/EI/EIJABB/MARC-File-MARCMaker-0.05.tar.gz
my $marc_charset = MARC::File::MARCMaker::usmarc_default();
$content = MARC::File::MARCMaker::_maker2char ($content, $marc_charset);

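However, when I tested it on this text https://github.com/gmcharlt/marc-perl/blob/e8e0ecc92946d6dcb3c2270706041a30eff0f68d/marc-marcmaker/t/marcmaker.t#L92 it only converted the accents/ligatures to XML entities. I tried opening the converted text in a browser: some entities were not interpreted, and none of the accents was applied to the following character. So I guess I now need some "XML to Unicode" module to finish the conversion.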

This a test of diacritics like the uppercase Polish L in
Ł´od´z, the uppercase Scandinavia O in &Ostrok;st, the
uppercase D with crossbar in Đuro, the uppercase Icelandic
thorn in Þann, the uppercase digraph AE in Ægir, the
uppercase digraph OE in Œuvres, the soft sign in
rech&softsign;, the middle dot in col·lecci´o, the musical
flat in F♭, the patent mark in Frizbee®, the plus or minus
sign in ±54%, the uppercase O-hook in B&Ohorn;, the
uppercase U-hook in X&Uhorn;A, the alif in
mas&mlrhring;alah, the ayn in &mllhring;arab, the lowercase
Polish l in Włocław, the lowercase Scandinavian o in
K&ostrok;benhavn, the lowercase d with crossbar in đavola,
the lowercase Icelandic thorn in þann, the lowercase digraph
ae in være, the lowercase digraph oe in cœur, the lowercase
hardsign in s&hardsign;ezd, the Turkish dotless i in masalı,
the British pound sign in £5.95, the lowercase eth in
verður, the lowercase o-hook (with pseudo question mark) in
S&hooka;&ohorn;, the lowercase u-hook in T&uhorn; D&uhorn;c,
the pseudo question mark in c&hooka;ui, the grave accent in
tr`es, the acute accent in d´esir´ee, the circumflex in
cˆote, the tilde in ma˜nana, the macron in T¯okyo, the breve
in russki˘i, the dot above in ˙zaba, the dieresis (umlaut)
in L¨owenbr¨au, the caron (hachek) in ˇcrny, the circle
above (angstrom) in ˚arbok, the ligature first and second
halves in d&llig;i&rlig;ad&llig;i&rlig;a, the high comma off
center in rozdel&rcommaa;ovac, the double acute in
id˝oszaki, the candrabindu (breve with dot above) in
Ali&candra;iev, the cedilla in ¸ca va comme ¸ca, the right
hook in viet˛a, the dot below in te&dotb;da, the double dot
below in &under;k&under;hu&dbldotb;tbah, the circle below in
Sa&dotb;msk&ringb;rta, the double underscore in
&dblunder;Ghulam, the left hook in Lech Wał&commab;esa, the
right cedilla (comma below) in khŗong, the upadhmaniya (half
circle below) in &breveb;humantuˇs, double tilde, first and
second halves in &ldbltil;n&rdbltil;galan, high comma
(centered) in g&commaa;eotermika.
1 Answer

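This is an encoding problem. The record leader says the data is encoded in MARC-8 (leader position 09 is blank), whereas your JSON data should be encoded in UTF-8. _maker2char() uses usmarc_default(), which maps the mnemonic accent codes to MARC-8 encoded characters. Use MARC::Charset to convert the data to UTF-8. This should work: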

#!/usr/bin/env perl

use 5.014;

use utf8;
use strict;
use autodie;
use warnings;

use MARC::File::MARCMaker;
use MARC::Charset qw(marc8_to_utf8);

my $data = q{This is a test of diacritics like the uppercase Polish L in {Lstrok}{acute}od{acute}z
the uppercase Scandinavia O in {Ostrok}st
the uppercase D with crossbar in {Dstrok}uro
the uppercase Icelandic thorn in {THORN}ann
the uppercase digraph AE in {AElig}gir
the uppercase digraph OE in {OElig}uvres
the soft sign in rech{softsign}
the middle dot in col{middot}lecci{acute}o
the musical flat in F{flat}
the patent mark in Frizbee{reg}
the plus or minus sign in {plusmn}54%
the uppercase O-hook in B{Ohorn}
the uppercase U-hook in X{Uhorn}A
the alif in mas{mlrhring}alah
the ayn in {mllhring}arab
the lowercase Polish l in W{lstrok}oc{lstrok}aw
the lowercase Scandinavian o in K{ostrok}benhavn
the lowercase d with crossbar in {dstrok}avola
the lowercase Icelandic thorn in {thorn}ann
the lowercase digraph ae in v{aelig}re
the lowercase digraph oe in c{oelig}ur
the lowercase hardsign in s{hardsign}ezd
the Turkish dotless i in masal{inodot}
the British pound sign in {pound}5.95
the lowercase eth in ver{eth}ur
the lowercase o-hook (with pseudo question mark) in S{hooka}{ohorn}
the lowercase u-hook in T{uhorn} D{uhorn}c
the pseudo question mark in c{hooka}ui
the grave accent in tr{grave}es
the acute accent in d{acute}esir{acute}ee
the circumflex in c{circ}ote
the tilde in ma{tilde}nana
the macron in T{macr}okyo
the breve in russki{breve}i
the dot above in {dot}zaba
the dieresis (umlaut) in L{uml}owenbr{uml}au
the caron (hachek) in {caron}crny
the circle above (angstrom) in {ring}arbok
the ligature first and second halves in d{llig}i{rlig}ad{llig}i{rlig}a
the high comma off center in rozdel{rcommaa}ovac
the double acute in id{dblac}oszaki
the candrabindu (breve with dot above) in Ali{candra}iev
the cedilla in {cedil}ca va comme {cedil}ca
the right hook in viet{ogon}a
the dot below in te{dotb}da
the double dot below in {under}k{under}hu{dbldotb}tbah
the circle below in Sa{dotb}msk{ringb}rta
the double underscore in {dblunder}Ghulam
the left hook in Lech Wa{lstrok}{commab}esa
the right cedilla (comma below) in kh{rcedil}ong
the upadhmaniya (half circle below) in {breveb}humantu{caron}s
double tilde
first and second halves in {ldbltil}n{rdbltil}galan
high comma (centered) in g{commaa}eotermika.
Alg{grave}ebres de Von Neumann
{grave}e
{ring}Z
{ringb}h
{ringb}s
{rlig}a
{rlig}A
};

my $marc_charset = MARC::File::MARCMaker::usmarc_default();
my $marc8 = MARC::File::MARCMaker::_maker2char($data, $marc_charset);

# prepare STDOUT for UTF-8 output (marc8_to_utf8 returns a character string)
binmode(STDOUT, ':encoding(UTF-8)');

# convert marc8 to utf8
my $utf8 = marc8_to_utf8($marc8);

say $utf8;

Output:

This is a test of diacritics like the uppercase Polish L in Łódź
the uppercase Scandinavia O in Øst
the uppercase D with crossbar in Đuro
the uppercase Icelandic thorn in Þann
the uppercase digraph AE in Ægir
the uppercase digraph OE in Œuvres
the soft sign in rechʹ
the middle dot in col·lecció
the musical flat in F♭
the patent mark in Frizbee®
the plus or minus sign in ±54%
the uppercase O-hook in BƠ
the uppercase U-hook in XƯA
the alif in masʼalah
the ayn in ʻarab
the lowercase Polish l in Włocław
the lowercase Scandinavian o in København
the lowercase d with crossbar in đavola
the lowercase Icelandic thorn in þann
the lowercase digraph ae in være
the lowercase digraph oe in cœur
the lowercase hardsign in sʺezd
the Turkish dotless i in masalı
the British pound sign in £5.95
the lowercase eth in verður
the lowercase o-hook (with pseudo question mark) in Sở
the lowercase u-hook in Tư Dưc
the pseudo question mark in củi
the grave accent in très
the acute accent in désirée
the circumflex in côte
the tilde in mañana
the macron in Tōkyo
the breve in russkiĭ
the dot above in żaba
the dieresis (umlaut) in Löwenbräu
the caron (hachek) in črny
the circle above (angstrom) in årbok
the ligature first and second halves in di͡adi͡a
the high comma off center in rozdelo̕vac
the double acute in időszaki
the candrabindu (breve with dot above) in Alii̐ev
the cedilla in ça va comme ça
the right hook in vietą
the dot below in teḍa
the double dot below in k̲h̲ut̤bah
the circle below in Saṃskr̥ta
the double underscore in G̳hulam
the left hook in Lech Wałe̦sa
the right cedilla (comma below) in kho̜ng
the upadhmaniya (half circle below) in ḫumantuš
double tilde
first and second halves in n͠galan
high comma (centered) in ge̓otermika.
Algèbres de Von Neumann
è
Z̊
h̥
s̥
a
A
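To illustrate why the intermediate result in the question did not render correctly: in MARCMaker and MARC-8 a diacritic precedes the letter it modifies, while a Unicode combining mark follows its base character. The sketch below shows that reordering for a handful of mnemonics using core Perl only. It is not how MARC::File::MARCMaker or MARC::Charset are implemented; the helper names (leader_encoding, mnemonics_to_unicode) and the two small tables are assumptions made for illustration and cover only a few of the mnemonics.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);    # core module

# Illustrative subset only, NOT the full MARCMaker mnemonic table.
my %COMBINING = (
    grave => "\x{0300}", acute => "\x{0301}", circ  => "\x{0302}",
    tilde => "\x{0303}", uml   => "\x{0308}", ring  => "\x{030A}",
    cedil => "\x{0327}", ringb => "\x{0325}",
);
my %SPACING = (
    Lstrok => "\x{0141}", lstrok => "\x{0142}",
    Ostrok => "\x{00D8}", ostrok => "\x{00F8}",
    AElig  => "\x{00C6}", aelig  => "\x{00E6}",
);

# MARC 21 leader position 09: 'a' means UCS/Unicode, blank means MARC-8.
sub leader_encoding {
    my ($leader) = @_;
    return substr($leader, 9, 1) eq 'a' ? 'UTF-8' : 'MARC-8';
}

# Move a run of diacritic mnemonics behind the base character that follows it.
sub _reorder {
    my ($marks, $base) = @_;                 # copy before running another match
    my @names = $marks =~ /\{(\w+)\}/g;
    return $base . join '', map { $COMBINING{$_} } @names;
}

# Hypothetical helper: convert the mnemonics known to the tables above.
sub mnemonics_to_unicode {
    my ($text) = @_;
    my $comb = join '|', sort keys %COMBINING;
    # 1. Replace spacing characters; unknown mnemonics are left untouched.
    $text =~ s/\{(\w+)\}/exists $SPACING{$1} ? $SPACING{$1} : '{' . $1 . '}'/ge;
    # 2. Diacritic(s) + base letter becomes base letter + combining mark(s).
    $text =~ s/((?:\{(?:$comb)\})+)(.)/_reorder($1, $2)/ges;
    # 3. Compose to precomposed characters where Unicode has them.
    return NFC($text);
}

binmode(STDOUT, ':encoding(UTF-8)');
print leader_encoding('00533nz   2200205n  4500'), "\n";
print mnemonics_to_unicode($_), "\n"
    for 'Alg{grave}ebres de Von Neumann', '{Lstrok}{acute}od{acute}z', '{ring}Z', '{ringb}h';
```

Sequences without a precomposed form, such as {ring}Z or {ringb}h, stay as a base letter followed by a combining mark, which matches the last lines of the output above.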
Answered 2018-03-12T13:18:50.983