javascript - Why does Google Natural Language return an incorrect beginOffset for the analyzed string?

Question

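I'm using the google-cloud/language API to make an #annotate call and analyze entities and sentiment in a CSV of comments that I collected from various online sources.

At first, the string I want to analyze includes the commentId, so I reformat this: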

youtubez22htrtb1ymtdlka404t1aokg2kirffb53u3pya0,i just bot a Nostromo... ( ._.)
youtubez22oet0bruejcdf0gacdp431wxg3vb2zxoiov1da,Good Job Baby! MSI Propeller Blade Technology!
youtubez22ri11akra4tfku3acdp432h1qyzap3yy4ziifc,"exactly, i have to deal with that damned brick, and the power supply can&#39;t be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51&#39;s"
youtubez23ttpsyolztc1ep004t1aokg5zuyqxfqykgyjqs,"I like how people are liking your comment about liking the fact that Sky DID put Deadlox&#39;s channel in the description instead of Ryan&#39;s. Nice Alienware thing logo thing, btw"
youtubez12zjp5rupbcttvmy220ghf4ctqnerqwa04,"You know, If you actually made this. People would actually buy it."

so that it no longer contains any comment IDs:

I just bot a Nostromo... ( ._.)
Good Job Baby! MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can&#39;t be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51&#39;s"
"I like how people are liking your comment about liking the fact that Sky DID put Deadlox&#39;s channel in the description instead of Ryan&#39;s.   Nice Alienware thing logo thing, btw"
"You know, If you actually made this. People would actually buy it."

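After sending the request to google-cloud/language to #annotate the text, I receive a response that includes the sentiment score and magnitude for various substrings. Each substring is also given a beginOffset value, which is supposed to be the index of that substring in the original string (the string sent in the request).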

{ content: 'i just bot a Nostromo... ( ._.)\nGood Job Baby!',
  beginOffset: 0 }
{ content: 'MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can&#39;t be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51&#39;s"\n"I like how people are liking your comment about liking the fact that Sky DID put Deadlox&#39;s channel in the description instead of Ryan&#39;s.',
  beginOffset: 50 }
{ content: 'Nice Alienware thing logo thing, btw"\n"You know, If you actually made this.',
  beginOffset: 462 }

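My goal is then to find the original comment in the original string, which should be simple enough. Something like (originalString[beginOffset])......

But this value is incorrect!

I assume the offsets leave out certain characters, but I have tried several regular expressions and nothing seems to work perfectly. Does anyone know what might be causing the problem?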

score 3 · Accepted Answer

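I know this is an old question, but the problem still seems to exist today. I recently ran into the same issue and solved it by interpreting Google's offsets as byte offsets rather than string offsets in the chosen encoding. It works well. I hope this helps someone.

Below is some C# code, but anyone should be able to read it and rewrite it in their preferred language. Assuming that text is the text being analyzed for sentiment, the following code converts Google's offset into the correct string offset.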

using System.Text;

// Converts a UTF-8 byte offset into a character index into text.
int TransformOffset(string text, int offset)
{
   return Encoding.UTF8.GetString(
             Encoding.UTF8.GetBytes(text),
             0,
             offset)
          .Length;
}
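Since the question is about JavaScript, the same idea can be sketched there too. This is only a minimal sketch, assuming the offsets really are UTF-8 byte offsets as described in this answer. byteOffsetToStringIndex is a hypothetical helper name, not part of any Google library; it uses the standard TextEncoder and TextDecoder classes.

```javascript
// Hypothetical helper (not part of the Google client library).
// Assumption: byteOffset counts UTF-8 bytes from the start of text.
function byteOffsetToStringIndex(text, byteOffset) {
  // Encode the whole string as UTF-8 bytes.
  const bytes = new TextEncoder().encode(text);
  // Decode only the bytes before the offset; ignoreBOM keeps a leading BOM in the count.
  const decoder = new TextDecoder('utf-8', { ignoreBOM: true });
  const prefix = decoder.decode(bytes.subarray(0, byteOffset));
  // The prefix length is the index in JavaScript (UTF-16 code unit) terms.
  return prefix.length;
}

// "\u2019" (right single quotation mark) takes 3 bytes in UTF-8,
// so "stop" starts at byte 8 but at string index 6.
const sample = 'can\u2019t stop';
const index = byteOffsetToStringIndex(sample, 8);
console.log(sample.slice(index)); // prints "stop"
```

If an offset lands in the middle of a multi-byte character, the decoder substitutes a replacement character, so the result is only meaningful for offsets on character boundaries.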

score 0 · Accepted Answer

This has to do with encoding. Specify one of the encodings, or simply use one of the sample methods provided in their GitHub repository:

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/language/api/analyze.py

The key code block:


import sys


def get_native_encoding_type():
    """Returns the encoding type that matches Python's native strings."""
    if sys.maxunicode == 65535:
        return 'UTF16'
    else:
        return 'UTF32'

This worked for me. Before that, the offsets were being thrown off by characters such as ’ (that is, \u2019 in Unicode).
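The Python helper picks the encoding whose unit matches Python's own string indexing. JavaScript strings are indexed by UTF-16 code units, so by the same reasoning UTF16 would be the matching choice there. The sketch below (lengthsByEncoding is a hypothetical name used only for illustration) shows how the three unit counts diverge as soon as a non-ASCII character appears:

```javascript
// Hypothetical helper: measure a string in the units each encoding type would count.
function lengthsByEncoding(text) {
  return {
    utf8: new TextEncoder().encode(text).length, // bytes
    utf16: text.length,                          // UTF-16 code units (native JS indexing)
    utf32: Array.from(text).length,              // Unicode code points
  };
}

// "\u2019" is 3 bytes in UTF-8 but a single UTF-16 code unit.
console.log(lengthsByEncoding('can\u2019t'));    // { utf8: 7, utf16: 5, utf32: 5 }
// An emoji outside the BMP is 4 bytes, 2 code units, 1 code point.
console.log(lengthsByEncoding('ok \u{1F600}'));  // { utf8: 7, utf16: 5, utf32: 4 }
```

Any offset that falls after such a character differs between encodings, which matches the symptom described in the question.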

score 0 · Accepted Answer

You should set the EncodingType on the request.

An example using the Java client library with UTF-8 encoded text:

Document doc = Document.newBuilder().setContent(dreamText).setType(Type.PLAIN_TEXT).build();
        
AnalyzeEntitiesRequest request = AnalyzeEntitiesRequest.newBuilder().setEncodingType(EncodingType.UTF8).setDocument(doc).build();
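A rough JavaScript counterpart, as a sketch only: it assumes the Node.js client accepts an encodingType field next to document, mirroring the request shape of the REST API, and that UTF16 makes the offsets line up with JavaScript string indices. buildSentimentRequest and sentenceMatches are hypothetical helper names. The actual client call is left as a comment because it needs credentials.

```javascript
// Hypothetical helper: build a request whose offsets should match JS string indices.
function buildSentimentRequest(text) {
  return {
    document: { content: text, type: 'PLAIN_TEXT' },
    // Assumption: with UTF16, beginOffset counts UTF-16 code units, like String.prototype.slice.
    encodingType: 'UTF16',
  };
}

// Hypothetical check: does the returned sentence really start at beginOffset?
function sentenceMatches(original, sentenceText) {
  const { content, beginOffset } = sentenceText;
  return original.startsWith(content, beginOffset);
}

// With the real client this would look roughly like (not run here):
//   const [result] = await client.analyzeSentiment(buildSentimentRequest(text));
//   result.sentences.every((s) => sentenceMatches(text, s.text));
const text = 'Good Job Baby! MSI Propeller Blade Technology!';
console.log(buildSentimentRequest(text).encodingType); // prints "UTF16"
console.log(sentenceMatches(text, { content: 'MSI Propeller Blade Technology!', beginOffset: 15 })); // prints true
```

Running a check like sentenceMatches over every returned sentence is a quick way to confirm that the chosen encoding type and the string indexing agree.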
