I have collected posts from several authors. Each author has a unique signature or link that appears in all of their texts.

Example for Author1:

$texts=['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

Expected output for Author1: @jhsad.sadas.com


Example for Author2:

$texts=['This is some random string representative of non-signature text.

This is the
*author\'s* signature.',
'Different message body text.      This is the
*author\'s* signature.

This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];

Expected output for Author2:

This is the
 *author's* signature.

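Please note in particular that there are no reliable identifying characters (or positions) that mark the start or end of a signature. It could be a URL, a Twitter mention, or any kind of plain text, of any length, containing any sequence of characters, and it may occur at the start, end, or middle of the string.

I am looking for a method that will extract the longest substring that exists in all elements of $texts for a single author.

For the purposes of this task, assume that every author has a signature substring that is present in every post/text.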
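
To make the goal concrete, here is a brute-force sketch of a character-level longest-common-substring search. The function name longestCommonSubstring is only illustrative, and the sketch assumes the signature is byte-for-byte identical in every text, including its whitespace and line breaks. It checks every substring of the shortest text, so it is only practical for short posts.

```php
<?php
// Hypothetical helper: returns the longest substring found in every text.
function longestCommonSubstring(array $texts): string
{
    if (!$texts) {
        return '';
    }
    // Only substrings of the shortest text can be common to all texts.
    usort($texts, function ($a, $b) {
        return strlen($a) <=> strlen($b);
    });
    $shortest = array_shift($texts);
    $length = strlen($shortest);

    // Try the longest candidates first, so the first hit is the answer.
    for ($size = $length; $size > 0; --$size) {
        for ($start = 0; $start + $size <= $length; ++$start) {
            $candidate = substr($shortest, $start, $size);
            foreach ($texts as $text) {
                if (strpos($text, $candidate) === false) {
                    continue 2;  // missing from one text: try the next candidate
                }
            }
            return $candidate;
        }
    }
    return '';
}

// Made-up sample data, not the posts from this question.
echo longestCommonSubstring([
    'foo @sig.example.com bar',
    'xx yy @sig.example.com',
    '@sig.example.com zzz',
]);
```

The result can carry stray leading or trailing whitespace when every text happens to share it, so a trim() on the return value may be needed.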

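IDEA: I am thinking of converting the words to vectors and measuring the similarity between the texts. We could use cosine similarity to find the signature. I think the solution must be something along these lines.

mickmackusa's commented code captures the gist of what is needed, but I would like to see whether there are other ways to achieve the desired result.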

2 Answers

You can do this with preg_match() and a regular expression.

$str = "KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf";

preg_match("/\@[^\s]+/", $str, $match);

var_dump($match); //Will output the signature
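
Note that this pattern only fits signatures that begin with @, such as Author1's; Author2's plain-text signature would not match. As a hedged sketch, the same pattern can be run over every text and reduced to the mentions that all texts share (commonMentions is a hypothetical name, not a built-in function):

```php
<?php
// Hypothetical helper: returns the @-prefixed tokens present in every text.
function commonMentions(array $texts): array
{
    $common = null;
    foreach ($texts as $text) {
        preg_match_all('/@\S+/', $text, $matches);  // every @token in this text
        $found = array_unique($matches[0]);
        // The first text seeds the list; later texts can only narrow it.
        $common = $common === null ? $found : array_intersect($common, $found);
    }
    return array_values($common ?? []);
}

// Made-up sample data.
var_export(commonMentions(['a @x.com b @y.com', 'c @x.com', '@x.com d @z.com']));
```

An empty array means no @-prefixed token is shared by all texts, in which case a more general approach is needed.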
answered 2017-10-13T11:21:07.433

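Here is my thinking:

  1. Sort the author's collection of posts by string length (ascending) so that you work from the smaller texts to the larger ones.
  2. Split each post's text on one or more whitespace characters, so that only wholly non-whitespace substrings are handled during processing.
  3. Find the substrings in each subsequent post that match an ever-narrowing array of substrings (overlaps).
  4. Group the consecutive matching substrings by analyzing their index values.
  5. "Reconstitute" the grouped consecutive substrings into their original string form (trimmed of leading and trailing whitespace, of course).
  6. Sort the reconstituted strings by string length (descending) so that the longest string is assigned index 0.
  7. Print the substring that is assumed to be the author's signature (as a best guess) based on commonality and length.

Code: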

$posts['Author1'] = ['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

$posts['Author2'] = ['This is some random string representative of non-signature text.

This is the
 *author\'s* signature.',
        'Different message body text.      This is the
 *author\'s* signature.

    This is an afterthought that expresses that a signature is not always at the end.',
        'Finally, this is unwanted stuff. This is the
 *author\'s* signature.'];

foreach ($posts as $author => $texts) {
    echo "Author: $author\n";
    
    usort($texts, function($a, $b) {
        return strlen($a) <=> strlen($b);  // sort ASC by strlen; mb_strlen probably isn't advantageous
    });
    var_export($texts);
    echo "\n";

    foreach ($texts as $index => $string) {
        if (!$index) {
            $overlaps = preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY);  // declare with all non-white-space substrings from first text
        } else {
            $overlaps = array_intersect($overlaps, preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY));  // filter word bank using narrowing number of words
        }
    }
    var_export($overlaps);
    echo "\n";
    
    // batch consecutive substrings
    $group = null;
    $consecutives = [];  // clear previous iteration's data
    foreach ($overlaps as $i => $word) {
        if ($group === null || $i - $last > 1) {
            $group = $i;
        }
        $last = $i;
        $consecutives[$group][] = $word;
    }
    var_export($consecutives);
    echo "\n";
    
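    $potential_signatures = [];  // reset so one author's matches never leak into the next
    foreach ($consecutives as $words) {
        // match potential signatures in first text for measurement:
        if (preg_match_all('/\Q' . implode('\E\s+\Q', $words) . '\E/', $texts[0], $out)) {  // \Q & \E make each word literal
            $potential_signatures = array_merge($potential_signatures, $out[0]);  // keep every group's matches, not just the last group's
        }
    }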
    usort($potential_signatures, function($a,$b){
        return strlen($b) <=> strlen($a); // sort DESC by strlen; mb_strlen probably isn't advantageous
    });
    
    echo "Assumed Signature: {$potential_signatures[0]}\n\n";
}

Output:

Author: Author1
array (
  0 => 'sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
  1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
  2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
)
array (
  11 => '@jhsad.sadas.com',
)
array (
  11 => 
  array (
    0 => '@jhsad.sadas.com',
  ),
)
Assumed Signature: @jhsad.sadas.com

Author: Author2
array (
  0 => 'Finally, this is unwanted stuff. This is the
 *author\'s* signature.',
  1 => 'This is some random string representative of non-signature text.

This is the
 *author\'s* signature.',
  2 => 'Different message body text.      This is the
 *author\'s* signature.

    This is an afterthought that expresses that a signature is not always at the end.',
)
array (
  2 => 'is',
  5 => 'This',
  6 => 'is',
  7 => 'the',
  8 => '*author\'s*',
  9 => 'signature.',
)
array (
  2 => 
  array (
    0 => 'is',
  ),
  5 => 
  array (
    0 => 'This',
    1 => 'is',
    2 => 'the',
    3 => '*author\'s*',
    4 => 'signature.',
  ),
)
Assumed Signature: This is the
 *author's* signature.
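
The grouping step (step 4) is probably the least obvious part, so here it is in isolation as a small sketch. The helper name groupConsecutive is hypothetical and is not part of the code above; it relies on the fact that array_intersect() preserves the keys of the first array, so adjacent words keep adjacent indexes.

```php
<?php
// Hypothetical helper: batches words whose original indexes are adjacent.
// Each batch is keyed by the index of its first word.
function groupConsecutive(array $words): array
{
    $groups = [];
    $group = null;
    $last = null;
    foreach ($words as $i => $word) {
        if ($last === null || $i - $last > 1) {
            $group = $i;  // a gap in the indexes starts a new batch
        }
        $groups[$group][] = $word;
        $last = $i;
    }
    return $groups;
}

// Index 2 stands alone; indexes 5, 6 and 7 form one batch.
var_export(groupConsecutive([2 => 'is', 5 => 'This', 6 => 'is', 7 => 'the']));
```

This mirrors the Author2 case in the output above, where the lone 'is' at index 2 is separated from the run that starts at index 5.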
answered 2017-11-07T02:24:15.203