0

我正在使用 Google Natural Language API 和 PHP 客户端库处理一些文本,并且我想生成原始文本的副本,该副本复制您可以在 Google Natural Language 试用页面的屏幕截图中看到的格式,其中这些实体被突出显示,并有一个将它们与其父术语相关联的索引:

样本输出的屏幕截图

从结果中,我有实体名称、提及和 beginOffset。例如:

array (
    'name' => 'Google Cloud Natural Language API',
    'type' => 'OTHER',
    'metadata' => 
    array (
      'mid' => '/g/11bc5pm43l',
      'wikipedia_url' => 'https://pl.wikipedia.org/wiki/NAPI_(API)',
    ),
    'salience' => 0.045935749999999997417177155512035824358463287353515625,
    'mentions' => 
    array (
      0 => 
      array (
        'text' => 
        array (
          'content' => 'Google Cloud Natural Language API',
          'beginOffset' => 90,
        ),
        'type' => 'PROPER',
        'sentiment' => 
        array (
          'magnitude' => 0.90000000000000002220446049250313080847263336181640625,
          'score' => 0.90000000000000002220446049250313080847263336181640625,
        ),
      ), //and so on

我正在从结果中提取主要变量以修改原始文本以包含结果中的相关信息:

  array (
    0 => 'Google Cloud Natural Language API', //The parent term
    1 => 'OTHER',
    2 => 0.045935749999999997417177155512035824358463287353515625,
    3 => 1.600000000000000088817841970012523233890533447265625,
    4 => 0,
    5 => 16, //This term has 16 associated mentions
    6 => 
    array ( //Array containing all of the associated mentions
      0 => 'Google Cloud Natural Language API',
      1 => 'Natural Language API',
      2 => 'Natural Language API',
      3 => 'Natural Language API',
      4 => 'Natural Language API',
      5 => 'REST API',
      6 => 'Natural Language API',
      7 => 'Natural Language API',
      8 => 'Natural Language API',
      9 => 'Natural Language API',
      10 => 'Natural Language API',
      11 => 'Natural Language API',
      12 => 'Natural Language API',
      13 => 'Natural Language API',
      14 => 'Natural Language API',
      15 => 'Natural Language API',
    ),
    7 => 
    array ( //Array containing the beginOffset of each associated mention
      0 => 90,
      1 => 196,
      2 => 321,
      3 => 463,
      4 => 2421,
      5 => 2447,
      6 => 2946,
      7 => 6167,
      8 => 6414,
      9 => 8958,
      10 => 12039,
      11 => 12168,
      12 => 12256,
      13 => 13179,
      14 => 13294,
      15 => 13802,
    ),
  ),

到目前为止,我已经尝试过这个:

<?php

# Open file
$myfile = fopen("sampleText.txt", "r") or die("Unable to open file!");
$data = fread($myfile, filesize("sampleText.txt"));
fclose($myfile);

echo 'Original Text: <br>';
echo $data;
echo '<br>';
echo '<br>';

# Entities occurrence List
$entitiesList = array (
      0 => 'Google Cloud Natural Language API',
      1 => 'Natural Language API',
      2 => 'Natural Language API',
      3 => 'Natural Language API',
      4 => 'Natural Language API',
      5 => 'REST API',
      6 => 'Natural Language API',
      7 => 'Natural Language API',
      8 => 'Natural Language API',
      9 => 'Natural Language API',
      10 => 'Natural Language API',
      11 => 'Natural Language API',
      12 => 'Natural Language API',
      13 => 'Natural Language API',
      14 => 'Natural Language API',
      15 => 'Natural Language API',
    );

# Samples of ofsetts
$ofsettList = array (
      0 => 90,
      1 => 196,
      2 => 321,
      3 => 463,
      4 => 2421,
      5 => 2447,
      6 => 2946,
      7 => 6167,
      8 => 6414,
      9 => 8958,
      10 => 12039,
      11 => 12168,
      12 => 12256,
      13 => 13179,
      14 => 13294,
      15 => 13802,
    );

# Size of ofsetts List
$ofsettListLenght = sizeof($ofsettList);

# Index of the entity in the returned results
$index = 1;

# Temporal values array with new formatted string
$tempAmendedEntity = [];

for($i = 0; $i < $ofsettListLenght; $i++) {
    $tempAmendedEntity[] = '('. $entitiesList[$i] . ')' . $index;
}

echo 'List of new amended  Strings';
echo '<pre>', var_export($tempAmendedEntity, true), '</pre>', "\n";
echo '<br>';
echo '<br>';

// Method 1
for($i = 0; $i < $ofsettListLenght; $i++) {
    $temp1 = str_replace(substr($data, $ofsettList[$i], strlen($entitiesList[$i])), $tempAmendedEntity[$i] , $data);    
}

echo 'Text after method 1: <br>';
echo '<pre>', var_export($temp1, true), '</pre>', "\n";
echo '<br>';
echo '<br>';


// Method 2
$keyPairArray = [];

for($i = 0; $i < $ofsettListLenght; $i++) {

    $keyPairArray[$entitiesList[$i]] = $tempAmendedEntity[$i];
}

echo 'List of key => value strings';
echo '<pre>', var_export($keyPairArray, true), '</pre>', "\n";
echo '<br>';
echo '<br>';

$temp2 = strtr($data, $keyPairArray);

echo 'Text after method 2: <br>';
echo '<pre>', var_export($temp2, true), '</pre>', "\n";

使用方法 1:并非所有新字符串值都被替换,例如,结果中不存在 (Google Cloud Natural Language API)1 和 (REST API)1。

使用方法2:将所有与原始字符串匹配的字符串替换为新字符串,但这包括与父项无关的匹配字符串的出现。

如果能够仅替换以特定“beginOffset”开头的字符串来替换新修改的字符串,那就太好了。

我用于测试的文本可以从这里下载:sampleText.txt

4

1 回答 1

0

我发现这对我正在从事的项目很有用,因为当您从 Google Natural Language 收到答案时,JSON 响应包含对实体的所有提及以及原始文本中的偏移量,但很难想象在文字是具体提到的是。因此,我将所有提及返回并排序在一个 key => value 对数组中,然后使用这两个函数来重建原始文本,包括每个单词的偏移量作为唯一标识符,以便轻松找到该单词在文本中的位置。

$originalString 是key => value 实体数组中的实体值,$originalOfsett 是key。然后我形成了一个新的字符串,包括一个用于样式的 HTML span 标签:

function stringReplacementInfo($originalString, $originalOfsett) {

    $modifiedString = " <span id=\"entity\">(" . $originalString . ")</span><span id=\"index\">" . $originalOfsett . "</span> ";
    replaceString($originalString, $originalOfsett, $modifiedString);
}

然后,我将原文中修改后的字符串一一替换:

function replaceString($originalString, $originalOfsett, $modifiedString) {

    global $data;

    $originalStringLenght = mb_strlen($originalString, '8bit');

    $preChunk = '';
    $postChunk = '';
    $chunk = '';

    $preChunk = substr($data, 0, $originalOfsett);
    $chunk = substr($data, $originalOfsett, $originalStringLenght);
    $postChunk = substr($data, $originalOfsett + $originalStringLenght);
    $data = $preChunk . $modifiedString . $postChunk;
}

然后我使用了一些 CSS 让它看起来更像自然语言 API 网络测试版本的结果:

#index{
    font-size: 0.7em;
    color: #3d1766;
}

通过偏移值可以更轻松地找到每个返回的实体提及,最终结果如下所示:

在此处输入图像描述

希望这对其他人有帮助。

于 2018-08-04T12:12:23.157 回答