I'm using the google-cloud/language API to make an #annotate call, analyzing entities and sentiment from a CSV of comments that I collected from various online sources.

First, the strings I want to analyze include a commentId, so I reformat the input:

youtubez22htrtb1ymtdlka404t1aokg2kirffb53u3pya0,i just bot a Nostromo... ( ._.)
youtubez22oet0bruejcdf0gacdp431wxg3vb2zxoiov1da,Good Job Baby! MSI Propeller Blade Technology!
youtubez22ri11akra4tfku3acdp432h1qyzap3yy4ziifc,"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"
youtubez23ttpsyolztc1ep004t1aokg5zuyqxfqykgyjqs,"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's. Nice Alienware thing logo thing, btw"
youtubez12zjp5rupbcttvmy220ghf4ctqnerqwa04,"You know, If you actually made this. People would actually buy it."

After reformatting, it no longer contains any comment IDs:

I just bot a Nostromo... ( ._.)
Good Job Baby! MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"
"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's.   Nice Alienware thing logo thing, btw"
"You know, If you actually made this. People would actually buy it."

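After sending the #annotate request to google-cloud/language, I receive a response that includes the sentiment and magnitude of various substrings. Each substring is also given a beginOffset value, which relates to the index of that substring in the original string (the string in the request).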

{ content: 'i just bot a Nostromo... ( ._.)\nGood Job Baby!',
  beginOffset: 0 }
{ content: 'MSI Propeller Blade Technology!\n"exactly, i have to deal with that damned brick, and the power supply can't be upgraded because of it, because as far as power supply goes, i have never seen an external one on newegg that has more power then the x51's"\n"I like how people are liking your comment about liking the fact that Sky DID put Deadlox's channel in the description instead of Ryan's.',
  beginOffset: 50 }
{ content: 'Nice Alienware thing logo thing, btw"\n"You know, If you actually made this.',
  beginOffset: 462 }

My goal is then to locate the original comment in the original string, which should be simple enough. Something like (originalString[beginOffset])...

This value is incorrect!

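I assume the offsets don't count certain characters, but I have tried a variety of regexes and nothing seems to work perfectly. Does anyone know what might be causing the issue?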

3 Answers

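I know this is an old question, but the problem still seems to exist today. I ran into the same issue recently and solved it by interpreting Google's offsets as byte offsets rather than as string offsets in the chosen encoding. It works great. I hope this helps someone.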

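Here is some C# code, but anyone should be able to interpret it and recode it in their favorite language. If we assume that text is the text actually being analyzed for sentiment, the following code converts Google's offset into the correct offset.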

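using System.Text;

// Converts a UTF-8 byte offset returned by the API into an index into the C# string.
int TransformOffset(string text, int offset)
{
   // Decode only the first `offset` bytes; the length of the result is the string index.
   return Encoding.UTF8.GetString(
             Encoding.UTF8.GetBytes(text),
             0,
             offset)
          .Length;
}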
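
The same idea can be sketched in Python. This is a minimal sketch, assuming the returned offsets are UTF-8 byte offsets as described above; the function name `byte_offset_to_char_offset` and the sample string are hypothetical and not part of any Google library.

```python
def byte_offset_to_char_offset(text: str, byte_offset: int) -> int:
    """Map a UTF-8 byte offset into `text` to a Python string index."""
    encoded = text.encode("utf-8")
    # Decoding only the first `byte_offset` bytes yields the prefix that precedes
    # the target; its length in code points is the index we want. A strict decode
    # raises UnicodeDecodeError if the offset falls inside a multi-byte character.
    return len(encoded[:byte_offset].decode("utf-8"))


# Hypothetical sample: "é" takes 2 bytes and "’" takes 3 bytes in UTF-8.
sample = "caf\u00e9 \u2019ok\u2019 done"
index = byte_offset_to_char_offset(sample, 15)
print(sample[index:])  # prints "done"
```

Note that Python strings are indexed by code point, whereas C# strings are indexed by UTF-16 code unit, so the two versions can return different numbers for text containing characters outside the Basic Multilingual Plane.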
answered 2019-12-18T09:54:34.037

This is related to encoding. Use one of the encodings, or simply use one of the sample methods provided in their GitHub repository:

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/language/api/analyze.py

The key block of code:


import sys


def get_native_encoding_type():
    """Returns the encoding type that matches Python's native strings."""
    if sys.maxunicode == 65535:
        return 'UTF16'
    else:
        return 'UTF32'

This worked for me. The offsets had been getting messed up by characters such as ’ (i.e. \u2019 in Unicode).
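
To see why a character such as ’ shifts the offsets, the sketch below computes where the same substring starts when counted in UTF8, UTF16, and UTF32 units. The helper name `offset_in_units` and the sample text are assumptions made for illustration; only the three encoding type names come from the API.

```python
def offset_in_units(text: str, char_index: int, encoding_type: str) -> int:
    """Count how many units of the given encoding precede text[char_index]."""
    prefix = text[:char_index]
    if encoding_type == "UTF8":
        return len(prefix.encode("utf-8"))           # one unit = one byte
    if encoding_type == "UTF16":
        return len(prefix.encode("utf-16-le")) // 2  # one unit = two bytes
    if encoding_type == "UTF32":
        return len(prefix)                           # one unit = one code point
    raise ValueError(f"unsupported encoding type: {encoding_type}")


# Hypothetical sample containing \u2019 and an emoji outside the BMP.
text = "it\u2019s \U0001F600 ok"
start = text.index("ok")
for encoding_type in ("UTF8", "UTF16", "UTF32"):
    print(encoding_type, offset_in_units(text, start, encoding_type))
# UTF8 12
# UTF16 8
# UTF32 7
```

Only the value counted in the same units as the language's native string indexing can be used directly as an index, which is what get_native_encoding_type selects.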

answered 2019-10-10T03:02:08.903

You should set the EncodingType on the request.

An example using the Java client library and handling UTF-8 encoded text:

Document doc = Document.newBuilder().setContent(dreamText).setType(Type.PLAIN_TEXT).build();
        
AnalyzeEntitiesRequest request = AnalyzeEntitiesRequest.newBuilder().setEncodingType(EncodingType.UTF8).setDocument(doc).build();
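
For readers not using the Java client, the same setting appears as the encodingType field in the JSON body of a REST call. The following is a minimal sketch in Python that only builds the request body and sends nothing; the function name `build_annotate_request` is hypothetical, and the chosen features are an assumption about what the question needs.

```python
import json

# Encoding type names accepted in the request body.
VALID_ENCODING_TYPES = {"NONE", "UTF8", "UTF16", "UTF32"}


def build_annotate_request(text: str, encoding_type: str = "UTF8") -> dict:
    """Build a JSON-serializable body for an annotateText-style request."""
    if encoding_type not in VALID_ENCODING_TYPES:
        raise ValueError(f"unknown encoding type: {encoding_type}")
    return {
        "document": {"type": "PLAIN_TEXT", "content": text},
        # Assumed feature selection: entities plus document sentiment.
        "features": {"extractEntities": True, "extractDocumentSentiment": True},
        # Offsets in the response are counted in units of this encoding.
        "encodingType": encoding_type,
    }


print(json.dumps(build_annotate_request("Good Job Baby!"), indent=2))
```

Picking the encoding type that matches how the calling language indexes strings (for example UTF16 in JavaScript or Java) means beginOffset can be used as a string index without further conversion.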
answered 2021-01-08T19:42:14.523