google-apps-script - Character encoding problem when extracting data from a web page with Google Apps Script

Question

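I wrote a script in Google Apps Script to extract text from a web page into Google Sheets. I only need it to handle one specific set of pages, so it doesn't need to be versatile. The script works almost exactly as I want, except that I've run into a character encoding problem. I'm extracting both Hebrew and English text. The meta tag in the HTML has charset=Windows-1255. The English extracts perfectly, but the Hebrew shows up as black diamonds containing question marks.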

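I found this question, which says to pass the data into a blob and then use the getDataAsString method to convert it to another encoding. I tried converting to different encodings and got different results. UTF-8 shows the black diamonds with question marks, UTF-16 shows Korean characters, ISO 8859-8 returns an error saying it is not a valid argument, and the original Windows-1255 shows one Hebrew character along with a bunch of other gibberish.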

However, I can manually copy and paste the Hebrew text into Google Sheets, and it displays correctly.

I even tested passing Hebrew directly from the Google Apps Script code, like this:

function passHebrew() {
  return "וַיְדַבֵּר";
}

This displays the Hebrew text correctly in Google Sheets.

How the Hebrew displays with each of the encodings I mentioned

My code is as follows:

function parseText(book, chapter) {
  //var bk = book;
  //var ch = chapter;
  var bk = '04'; //hard-coded for testing purposes
  var ch = '01'; //hard-coded for testing purposes
  var url = 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';

  var xml = UrlFetchApp.fetch(url).getContentText();

  //I had to "fix" these xml errors for XmlService.parse(xml) below
  //to function.
  xml = xml.replace('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">', '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">');
  xml = xml.replace('<LINK REL="stylesheet" HREF="p.css" TYPE="text/css">', '<LINK REL="stylesheet" HREF="p.css" TYPE="text/css"></LINK>');
  xml = xml.replace('<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255">', '<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255"></meta>');
  xml = xml.replace(/ALIGN=CENTER/gi, 'ALIGN="CENTER"');
  xml = xml.replace(/<BR>/gi, '<BR></BR>');
  xml = xml.replace(/class=h/gi, 'class="h"');

  //This section is the specific route to the table in the page I want
  var document = XmlService.parse(xml);
  var body = document.getRootElement().getChildren("BODY");
  var maintable = body[0].getChildren("TABLE");
  var maintablechildren = maintable[0].getChildren();

  //This creates a two-dimensional array so that I can store the Hebrew
  //in the first column and the English in the second column
  var array = new Array(maintablechildren.length);
  for (var i = 0; i < maintablechildren.length; i++) {
    array[i] = new Array(2);
  }

  //This is where the table gets parsed into the array
  for (var i = 0; i < maintablechildren.length; i++) {
    var verse = maintablechildren[i].getChildren();

    //This is where the encoding problem occurs.
    //I originally tried verse[0].getText() but it didn't work.
    array[i][0] = Utilities.newBlob(verse[0].getText()).getDataAsString('UTF-8');
    //This array receives the English text and works fine.
    array[i][1] = verse[1].getText();
  }

  return array;
}

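What am I overlooking, misunderstanding, or doing wrong? I don't know much about how encoding works, so I don't understand why converting it to UTF-8 doesn't work.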

score 6 · Accepted Answer

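Your problem occurs before the line you've commented as the encoding problem: UrlFetchApp's default decoding has already mangled the Hebrew text by the time getContentText() returns, so no later conversion can recover it.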
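
To see why the damage cannot be undone later, here is a minimal sketch in plain JavaScript (Node or a browser, not Apps Script). It decodes the same bytes two ways. The sample word and the helper decodeHebrewLetters are assumptions made up for illustration; the helper only covers the Windows-1255 Hebrew letter range and is not a full decoder.

```javascript
// Bytes of the Hebrew word "ישראל" as a Windows-1255 page would send them.
const win1255Bytes = new Uint8Array([0xE9, 0xF9, 0xF8, 0xE0, 0xEC]);

// Wrong charset: these bytes are not valid UTF-8, so the decoder
// substitutes U+FFFD (the black diamond with a question mark).
const wrong = new TextDecoder('utf-8').decode(win1255Bytes);

// Hypothetical helper, for illustration only: maps the Windows-1255
// Hebrew letter range 0xE0-0xFA onto Unicode U+05D0-U+05EA and passes
// ASCII through. Vowel points and all other bytes are not handled.
function decodeHebrewLetters(bytes) {
  let out = '';
  for (const b of bytes) {
    if (b >= 0xE0 && b <= 0xFA) {
      out += String.fromCharCode(0x05D0 + (b - 0xE0));
    } else if (b < 0x80) {
      out += String.fromCharCode(b);
    } else {
      out += '\uFFFD'; // outside the range this sketch covers
    }
  }
  return out;
}

const right = decodeHebrewLetters(win1255Bytes);

console.log(wrong.includes('\uFFFD')); // true
console.log(right);                    // ישראל
```

Once the original bytes have been replaced by U+FFFD, the information is gone, which is why re-encoding the string through a blob afterwards cannot help.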

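You should use the variant of the .getContentText() method that returns the content of the HTTP response as a string decoded with a given charset. In your case: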

var xml = UrlFetchApp.fetch(url).getContentText("Windows-1255");

That should be all you need to change, although the blob() workaround is no longer needed. (It is harmless, though.) Other comments:

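The logical OR operator (||) is handy for setting default values. I've tweaked the first few lines to enable testing while still letting the function work normally when parameters are passed.
Setting up an empty array of the right size before filling it with strings is Bad JavaScript; it's needlessly complicated code, so toss it. Instead, we'll declare array as an empty array, then push() rows onto it.
The .replace() calls can be reduced with more clever use of RegExp; I've included URLs to demos of the really tricky ones.
There are \n newlines in the text, which I'm guessing are unnecessary for your purposes, so I added a replace() for them as well.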
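
The first two points can be sketched in isolation as below. buildUrl and toRows are hypothetical helpers written for this illustration, not part of the original script, and the input objects passed to toRows are made-up sample data.

```javascript
// Hypothetical stand-in for parseText's argument handling:
// `||` falls back to the test values when an argument is missing.
function buildUrl(book, chapter) {
  var bk = book || '04';
  var ch = chapter || '01';
  return 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';
}

// Hypothetical helper: builds the two-column result by pushing one
// [hebrew, english] row at a time instead of pre-sizing the array.
function toRows(pairs) {
  var rows = [];
  for (var i = 0; i < pairs.length; i++) {
    rows.push([pairs[i].hebrew, pairs[i].english]);
  }
  return rows;
}

console.log(buildUrl());           // http://www.mechon-mamre.org/p/pt/pt0401.htm
console.log(buildUrl('01', '02')); // http://www.mechon-mamre.org/p/pt/pt0102.htm
```

Note that || treats every falsy value (an empty string, 0, null) as missing. That is fine here because the book and chapter codes are non-empty strings.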

Here is what you're left with:

function parseText(book, chapter) {
  var bk = book || '04'; //hard-coded for testing purposes
  var ch = chapter || '01'; //hard-coded for testing purposes
  var url = 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';

  var xml = UrlFetchApp.fetch(url).getContentText("Windows-1255");

  //I had to "fix" these xml errors for XmlService.parse(xml) below
  //to function.
  xml = xml.replace(/(<!DOCTYPE.*EN")>/gi, '$1 "">')
           .replace(/(<(LINK|meta).*>)/gi,'$1</$2>')        // https://regex101.com/r/nH3pU8/1
           .replace(/(<.*?=)([^"']*?)([ >])/gi,'$1"$2"$3')  // https://regex101.com/r/eP7wO7/1
           .replace(/<BR>/gi, '<BR/>')
           .replace(/\n/g, '');

  //This section is the specific route to the table in the page I want
  var document = XmlService.parse(xml);
  var body = document.getRootElement().getChildren("BODY");
  var maintable = body[0].getChildren("TABLE");
  var maintablechildren = maintable[0].getChildren();

  //This is where the table gets parsed into the array
  var array = [];
  for (var i = 0; i < maintablechildren.length; i++) {
    var verse = maintablechildren[i].getChildren();

    //I originally tried verse[0].getText() but it didn't work. It does now!
    var hebrew = verse[0].getText();
    //This array receives the English text and works fine.
    var english = verse[1].getText();
    array.push([hebrew,english]);
  }

  return array;
}

Result

 [
  [
    "  וַיְדַבֵּר יְהוָה אֶל-מֹשֶׁה בְּמִדְבַּר סִינַי, בְּאֹהֶל מוֹעֵד:  בְּאֶחָד לַחֹדֶשׁ הַשֵּׁנִי בַּשָּׁנָה הַשֵּׁנִית, לְצֵאתָם מֵאֶרֶץ מִצְרַיִם--לֵאמֹר.",
    " And the LORD spoke unto Moses in the wilderness of Sinai, in the tent of meeting, on the first day of the second month, in the second year after they were come out of the land of Egypt, saying:"
  ],
  [
    "  שְׂאוּ, אֶת-רֹאשׁ כָּל-עֲדַת בְּנֵי-יִשְׂרָאֵל, לְמִשְׁפְּחֹתָם, לְבֵית אֲבֹתָם--בְּמִסְפַּר שֵׁמוֹת, כָּל-זָכָר לְגֻלְגְּלֹתָם.",
    " 'Take ye the sum of all the congregation of the children of Israel, by their families, by their fathers' houses, according to the number of names, every male, by their polls;"
  ],
  [
    "  מִבֶּן עֶשְׂרִים שָׁנָה וָמַעְלָה, כָּל-יֹצֵא צָבָא בְּיִשְׂרָאֵל--תִּפְקְדוּ אֹתָם לְצִבְאֹתָם, אַתָּה וְאַהֲרֹן.",
    " from twenty years old and upward, all that are able to go forth to war in Israel: ye shall number them by their hosts, even thou and Aaron."
  ],
  ...
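
The replace() chain in parseText can also be checked outside Apps Script, since it is plain JavaScript. The sketch below wraps the same five expressions in a hypothetical helper, fixMarkup, and runs it on a few single-line samples modeled on the markup quoted in the question. It has only been checked against these samples, not against the full page.

```javascript
// Hypothetical wrapper around the replace() chain from parseText.
function fixMarkup(html) {
  return html
    .replace(/(<!DOCTYPE.*EN")>/gi, '$1 "">')         // add an empty system id
    .replace(/(<(LINK|meta).*>)/gi, '$1</$2>')        // close LINK and meta tags
    .replace(/(<.*?=)([^"']*?)([ >])/gi, '$1"$2"$3')  // quote bare attribute values
    .replace(/<BR>/gi, '<BR/>')                       // self-close BR
    .replace(/\n/g, '');                              // drop newlines
}

console.log(fixMarkup('<P ALIGN=CENTER>'));        // <P ALIGN="CENTER">
console.log(fixMarkup('line one<BR>\nline two'));  // line one<BR/>line two
console.log(fixMarkup('<LINK REL="stylesheet" HREF="p.css" TYPE="text/css">'));
```

Because .* does not match newlines, the LINK and meta expression works one line at a time, so it relies on each of those tags sitting on its own line in the source.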
