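Here is my thinking:
- Sort the author's collection of posts by string length (ascending) so that you work from the smaller texts toward the larger ones.
- Split each post's text on one or more whitespace characters so that only wholly non-whitespace substrings are handled during processing.
- Find the substrings that also occur in each subsequent post by comparing against an ever-narrowing array of substrings (`$overlaps`).
- Group the consecutive matching substrings by analyzing their index values.
- "Reconstitute" each group of consecutive substrings into its original string form (with leading and trailing whitespace trimmed, of course).
- Sort the reconstituted strings by string length (descending) so that the longest string is assigned the `0` index.
- Print the substring that is assumed to be the author's signature (as a best guess) based on commonality and length.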
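The grouping step relies on one detail: `array_intersect()` preserves the keys of its first argument, so the surviving keys are still the word positions in the shortest text. Here is a minimal sketch of that idea on made-up sample strings. The function name `groupConsecutive` is hypothetical and only used for illustration.

```php
<?php
// Hypothetical sample data, not taken from the posts in this answer.
$first  = preg_split('/\s+/', 'hi all thanks -- Jane Doe', 0, PREG_SPLIT_NO_EMPTY);
$second = preg_split('/\s+/', 'all good here -- Jane Doe', 0, PREG_SPLIT_NO_EMPTY);

// array_intersect() keeps the keys (word positions) of $first.
$overlaps = array_intersect($first, $second);

// Batch words whose keys are consecutive; each group is keyed by its first position.
function groupConsecutive(array $overlaps): array
{
    $groups = [];
    $group = null;
    $last = null;
    foreach ($overlaps as $i => $word) {
        if ($group === null || $i - $last > 1) {
            $group = $i; // a gap in the keys starts a new group
        }
        $last = $i;
        $groups[$group][] = $word;
    }
    return $groups;
}

var_export(groupConsecutive($overlaps));
```

In this sample, "all" survives the intersection but sits apart from the other shared words, so it ends up in its own one-word group, while "--", "Jane" and "Doe" are batched together.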
Code: (Demo)
$posts['Author1'] = ['sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd
@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];
$posts['Author2'] = ['This is some random string representative of non-signature text.
This is the
*author\'s* signature.',
'Different message body text. This is the
*author\'s* signature.
This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];
foreach ($posts as $author => $texts) {
    echo "Author: $author\n";
    usort($texts, function($a, $b) {
        return strlen($a) <=> strlen($b); // sort ASC by strlen; mb_strlen probably isn't advantageous
    });
    var_export($texts);
    echo "\n";
    foreach ($texts as $index => $string) {
        if (!$index) {
            $overlaps = preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY); // declare with all non-white-space substrings from first (shortest) text
        } else {
            $overlaps = array_intersect($overlaps, preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY)); // filter word bank using narrowing number of words; keys are preserved
        }
    }
    var_export($overlaps);
    echo "\n";
    // batch consecutive substrings
    $group = null;
    $consecutives = []; // clear previous iteration's data
    foreach ($overlaps as $i => $word) {
        if ($group === null || $i - $last > 1) {
            $group = $i; // a gap in the keys starts a new group
        }
        $last = $i;
        $consecutives[$group][] = $word;
    }
    var_export($consecutives);
    echo "\n";
    $potential_signatures = []; // clear previous author's data
    foreach ($consecutives as $words) {
        // match potential signatures in first text for measurement;
        // preg_quote() makes every character literal, including the "/" delimiter
        $pattern = '/' . implode('\s+', array_map(function($word) { return preg_quote($word, '/'); }, $words)) . '/';
        if (preg_match_all($pattern, $texts[0], $out)) {
            $potential_signatures = array_merge($potential_signatures, $out[0]); // keep the matches from every group
        }
    }
    usort($potential_signatures, function($a, $b) {
        return strlen($b) <=> strlen($a); // sort DESC by strlen; mb_strlen probably isn't advantageous
    });
    $signature = $potential_signatures[0] ?? '(none found)'; // guard against authors with no common words
    echo "Assumed Signature: $signature\n\n";
}
Output:
Author: Author1
array (
0 => 'sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd
@jhsad.sadas.com sdsdADSA sada',
1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
)
array (
11 => '@jhsad.sadas.com',
)
array (
11 =>
array (
0 => '@jhsad.sadas.com',
),
)
Assumed Signature: @jhsad.sadas.com
Author: Author2
array (
0 => 'Finally, this is unwanted stuff. This is the
*author\'s* signature.',
1 => 'This is some random string representative of non-signature text.
This is the
*author\'s* signature.',
2 => 'Different message body text. This is the
*author\'s* signature.
This is an afterthought that expresses that a signature is not always at the end.',
)
array (
2 => 'is',
5 => 'This',
6 => 'is',
7 => 'the',
8 => '*author\'s*',
9 => 'signature.',
)
array (
2 =>
array (
0 => 'is',
),
5 =>
array (
0 => 'This',
1 => 'is',
2 => 'the',
3 => '*author\'s*',
4 => 'signature.',
),
)
Assumed Signature: This is the
*author's* signature.
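Note that the second signature comes back with its line break intact, even though the texts were split on whitespace. The word group is matched back against the shortest post with `\s+` between the words, so whatever whitespace originally separated them is recovered. Below is a minimal sketch of that one step in isolation. `buildPhrasePattern` is a hypothetical helper name, and escaping each word with `preg_quote()` is my assumption for the sketch; it also copes with words that contain the `/` delimiter, such as a URL in a signature.

```php
<?php
// Hypothetical helper: turn a list of words into a regex that matches them
// in order, separated by any run of whitespace.
function buildPhrasePattern(array $words): string
{
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/'); // escape regex metacharacters and the delimiter
    }, $words);
    return '/' . implode('\s+', $quoted) . '/';
}

$words = ['This', 'is', 'the', "*author's*", 'signature.'];
$text  = "Finally, this is unwanted stuff. This is the\n*author's* signature.";

if (preg_match(buildPhrasePattern($words), $text, $match)) {
    echo $match[0]; // the phrase exactly as it appears in $text, line break included
}
```

The same pattern builder can be used with `preg_match_all()` when a group may occur more than once in the shortest post.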