
I am writing code that splits text into words and then does things like compute word lengths.

I came up with this (after some searching):

$text = preg_replace("/[^[:alnum:][:space:]]/u", ' ', $text);
$words = mb_split( ' +', $text );

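However, contractions don't survive this, because an apostrophe and a single quote look the same (because they are the same character).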
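To make the problem concrete, here is a minimal sketch that runs the two lines above on a short sample sentence (the sample is my own, chosen to contain one contraction):

```php
<?php
// Sample input with a single contraction.
$text = "It wasn't a dream";

// Same two steps as in the question: replace everything that is not
// alphanumeric or whitespace with a space, then split on runs of spaces.
$text = preg_replace("/[^[:alnum:][:space:]]/u", ' ', $text);
$words = mb_split(' +', $text);

// The apostrophe was replaced by a space, so "wasn't" ends up as the
// two separate entries "wasn" and "t".
print_r($words);
```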

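I need a way to separate words while keeping contractions intact. For now, I have added every contraction I can think of to my stop-word list, but that is far from satisfactory. I'm not good with regular expressions and could use some advice.

Although I have posted my own inelegant solution, I'm leaving this question open in the hope of encouraging a better answer.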


2 Answers


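I found a better way: by combining word boundaries with the set of characters allowed inside a word, you can match and count the words directly: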

<?php

$text = "One morning, when Gregor Samsa woke from troubled dreams, 
he found himself transformed in his bed into a horrible vermin. 
'He lay on his armour-like back', and if he lifted his head a 
little he could see his brown belly, slightly domed and divided by arches
into stiff sections. The bedding was hardly able to cover it and 
seemed ready to slide off any moment. His many legs, pitifully thin 
compared with the size of the rest of him, waved about helplessly as he 
looked. \"What's happened to me?\" he thought. It wasn't a dream. His 
room, a proper human room although a little too small, lay peacefully
between its four familiar walls. A collection of textile samples lay 
spread out on the table - Samsa was a travelling salesman - and 
above it there hung a picture that he had recently cut out of an 
illustrated magazine and housed in a nice, gilded frame. It showed 
a lady fitted out with a fur hat and fur boa who sat upright, 
raising a heavy fur muff that covered the whole of her lower arm 
towards the viewer. Gregor then turned to look out the window at the 
dull weather";

preg_match_all("/\b[\w'-]+\b/", $text, $words);
print_r(count($words[0]));

Note: I allowed - and ' to appear inside words, so something like "armour-like" counts as one word.

Regex test: regexr.com/4ego6
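As a quick check of how the pattern treats quotes, hyphens, and contractions, here is a small sketch on a shortened sample (my own abbreviation of the text above). It also maps each word to its length, since that was the goal in the question. `strlen` counts bytes, which is only safe here because the sample is plain ASCII.

```php
<?php
// Shortened sample: a quoted phrase, a hyphenated word, two contractions.
$sample = "'He lay on his armour-like back'. \"What's happened?\" It wasn't a dream.";

// Same pattern as above: word characters, apostrophes, and hyphens,
// anchored by word boundaries so surrounding quotes are left out.
preg_match_all("/\b[\w'-]+\b/", $sample, $matches);
$words = $matches[0];

// Byte length of each word (equal to character count for ASCII text).
$lengths = array_map('strlen', $words);

foreach ($words as $i => $word) {
    echo $word, ' => ', $lengths[$i], "\n";
}
```

The quotes around the first phrase are dropped because there is no word boundary between a quote and the neighbouring space or period, while the apostrophes inside "What's" and "wasn't" sit between two word characters and are kept.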

answered 2019-05-22T21:37:13.583

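I've been struggling with this for a while. The comments and Taha Paksu's very effective solution helped me think the problem through. Taha Paksu's solution isolates words cleanly, except for words with accented letters. From what I found by googling, regular expressions do not seem to be very friendly to non-ASCII characters.

It was only when I gave up on regex voodoo (anyone who can do it has my deepest respect) that I came up with this not-so-elegant hack.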
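On the non-ASCII point: PHP's PCRE functions can handle accented letters if the pattern carries the `u` modifier and the input is valid UTF-8. The sketch below is only an illustration under that assumption; the sample string and variable names are mine. It uses the Unicode property classes `\p{L}` (any letter) and `\p{N}` (any number) instead of `\w`.

```php
<?php
// Assumes this file and the string are UTF-8 encoded.
$text = "Café, pokémon, Löwen";

// Without the u modifier PCRE works on bytes. In the default locale the
// bytes of "é" are not word characters, so "Café" is typically cut short.
preg_match_all("/\b[\w'-]+\b/", $text, $bytewise);

// With the u modifier, \p{L} and \p{N} match letters and digits from any
// script, so accented words stay in one piece.
preg_match_all("/[\p{L}\p{N}'-]+/u", $text, $unicode);

print_r($unicode[0]);
```

Note that this simple character class would also match a stray apostrophe or hyphen standing on its own, so it is a starting point rather than a finished tokenizer.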

$text = "Testing text. Café is spelled true. And pokémon too... ‘bad quotes’. (brackets)... Löwen, Bären, Vögel und Käfer sind Tiere. That’s what I said.";
$text = str_replace(array('’',"'"), '000AP000', $text);
$text = str_replace("-", '000HY000', $text);
$text = preg_replace("/[^[:alnum:][:space:]]/u", ' ', $text);
$text = str_replace('000AP000', "'", $text);
$text = str_replace('000HY000', "-", $text);
$text = str_replace(array("' ",'- ','  '," '",' -','  '), ' ', $text);
$words = mb_split( ' +', $text );

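It uses two statistically unlikely strings as placeholders for apostrophes and hyphens, strips out all other punctuation, puts the apostrophes and hyphens back, and then removes any of them that touch a space (along with runs of multiple spaces). It works on everything I could find to throw at it.

I would like to find a less cumbersome solution if I can, but my regex skills may not be up to the task (even with a cheat sheet open).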
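For what it's worth, one possible less cumbersome route is a single Unicode-aware pattern that only accepts an apostrophe or hyphen when it sits between letters or digits. This is a sketch, not a drop-in equivalent of the placeholder steps: `split_words` is a hypothetical helper name, and edge cases such as doubled hyphens may come out differently.

```php
<?php
// Hypothetical helper, not part of the original answer.
function split_words(string $text): array
{
    // Treat the typographic apostrophe like a plain one, as the hack does.
    $text = str_replace('’', "'", $text);

    // A word is a run of letters/digits, optionally followed by further
    // runs that are each joined by a single apostrophe or hyphen.
    preg_match_all("/[\p{L}\p{N}]+(?:['-][\p{L}\p{N}]+)*/u", $text, $matches);

    return $matches[0];
}

$text = "Café is spelled true. ‘bad quotes’. (brackets)... Löwen und Käfer. That’s what I said.";
$words = split_words($text);
print_r($words);

// Character counts rather than byte counts, since the text is UTF-8.
$lengths = array_map('mb_strlen', $words);
```

Because the apostrophe and hyphen must be followed by another letter or digit, closing quotes and trailing dashes are never included in a word, so no cleanup pass is needed afterwards.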

answered 2019-05-23T01:12:36.337