10

I want to split text into individual words using PHP. Do you know how to achieve this?

My approach:

function tokenizer($text) {
    $text = trim(strtolower($text));
    $punctuation = '/[^a-z0-9äöüß-]/';
    $result = preg_split($punctuation, $text, -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($result); $i++) {
        $result[$i] = trim($result[$i]);
    }
    return $result; // contains the single words
}
$text = 'This is an example text, it contains commas and full-stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
print_r(tokenizer($text));

Is this a good approach? Do you have any ideas for improvement?

Thanks in advance!


6 Answers

30

Use the class \p{P}, which matches any Unicode punctuation character, combined with the \s whitespace class.

$result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY);

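This splits on a run of one or more whitespace characters, but also absorbs any punctuation characters surrounding it. It also matches punctuation at the beginning or end of the string. That way it distinguishes cases such as "don't" from "he said 'ouch!'": the apostrophe inside a word is kept, while quotes next to whitespace are dropped.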
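
A minimal sketch of how this pattern behaves on a sentence with quotes and a contraction. The sample sentence and the `u` modifier are assumptions added here, not part of the pattern above; `u` tells PCRE to treat the input as UTF-8, so \p{P} can also see non-ASCII punctuation.

```php
<?php
// Sample input (hypothetical): quotes, a contraction, and trailing punctuation.
$text = "He said 'ouch!' and I don't care.";

// Same alternation as above, plus the `u` modifier (an assumption) for UTF-8 input.
$pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';

$words = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

// The apostrophe inside "don't" survives because no whitespace touches it,
// while the quotes around 'ouch!' are absorbed together with the adjacent spaces.
print_r($words);
```

Note that with `u`, preg_split() returns false if the input is not valid UTF-8, so it may be worth checking the return value.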

answered 2009-04-26T10:24:50.277
13

Tokenize - strtok

<?php
$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$delim = " \n\t,.!?:;"; // double quotes, so \n and \t are a real newline and tab rather than the letters n and t

$tok = strtok($text, $delim);

while ($tok !== false) {
    echo "Word=$tok<br />";
    $tok = strtok($delim);
}
?>
answered 2009-04-26T10:23:26.973
3

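I would lowercase the string before splitting it. That makes the i modifier and the array processing afterwards unnecessary. Additionally, I would use the \W shorthand for non-word characters and add a + quantifier.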

$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$result = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

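Edit: Use the Unicode character properties instead of \W, as marcog suggested. Something like [\p{P}\p{Z}] (punctuation and separator characters) would cover the characters more specifically than \W.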
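
A rough sketch of what the edit describes, under a few assumptions: the sample text is made up, the `u` modifier is added so the pattern works on UTF-8, \s is added to the class because tabs and newlines are control characters rather than \p{Z}, and mb_strtolower() (which needs the mbstring extension) replaces strtolower() so non-ASCII letters are lowercased too.

```php
<?php
// Hypothetical sample with typographic quotes and an en dash.
$text = 'Smart “Quotes”, dashes – and naïve text!';

// \p{P} = punctuation, \p{Z} = separators; \s is an added assumption
// so that tabs and newlines also act as delimiters.
$pattern = '/[\p{P}\p{Z}\s]+/u';

// Lowercase first (multibyte-aware), then split on runs of delimiters.
$words = preg_split($pattern, mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

Because apostrophes and hyphens are punctuation too, this class also splits "don't" and "full-stops" into two tokens each, which may or may not be what you want.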

answered 2009-04-26T10:35:09.240
1

Do:

str_word_count($text, 1);

Or, if you need Unicode support:

function str_word_count_Helper($string, $format = 0, $search = null)
{
    $result = array();
    $matches = array();

    // Cast $search to string: passing null to preg_quote() is deprecated as of PHP 8.1.
    if (preg_match_all('~[\p{L}\p{Mn}\p{Pd}\'\x{2019}' . preg_quote((string) $search, '~') . ']+~u', $string, $matches) > 0)
    {
        $result = $matches[0];
    }

    if ($format == 0)
    {
        return count($result);
    }

    return $result;
}
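
A short usage sketch for the helper, assuming UTF-8 input. The sample text is made up, and the function is repeated in condensed form (with the same pattern) only so that the snippet runs on its own.

```php
<?php
// Condensed copy of the helper above, guarded so it is not declared twice.
if (!function_exists('str_word_count_Helper')) {
    function str_word_count_Helper($string, $format = 0, $search = null)
    {
        $pattern = '~[\p{L}\p{Mn}\p{Pd}\'\x{2019}' . preg_quote((string) $search, '~') . ']+~u';
        $found = preg_match_all($pattern, $string, $matches);
        $result = $found > 0 ? $matches[0] : array();
        return ($format == 0) ? count($result) : $result;
    }
}

// Hypothetical sample text with umlauts, a typographic apostrophe and a hyphen.
$text = 'Größe zählt: l’été, don\'t, well-known.';

$count = str_word_count_Helper($text);     // format 0: number of words
$words = str_word_count_Helper($text, 1);  // format 1: the words themselves

print_r($words);
echo $count, "\n";
```

One thing to keep in mind: since \p{Pd} is part of the character class, a dash standing alone between spaces would be returned as a "word" of its own, and digits are not matched unless you pass them in via $search.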
answered 2009-04-26T10:24:48.950
1

You can also use PHP's strtok() function to fetch tokens from a larger string. You can use it like this:

 $result = array();
 // your original string
 $text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
 // pass strtok() your string and a delimiter that specifies how tokens are separated; here, words are separated by a space
 $word = strtok($text,' ');
 while ( $word !== false ) {
     $result[] = $word;
     $word = strtok(' ');
 }

See the PHP documentation for more on strtok().

answered 2009-04-26T10:29:45.413
1

You can also use the explode() function: http://php.net/manual/en/function.explode.php

$words = explode(" ", $sentence);
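
Keep in mind that explode() only splits on the exact delimiter, so punctuation stays attached to the words. A small sketch of one way to clean that up afterwards; the sample sentence and the list of characters passed to trim() are assumptions, so adjust them to your text.

```php
<?php
// Hypothetical input; the variable name $sentence follows the line above.
$sentence = 'Exclamation marks, too! Question marks?';

// explode() splits on the single space only, so punctuation stays attached.
$raw = explode(' ', $sentence);

// Strip common punctuation from both ends of each piece (character list is an assumption).
$trimmed = array_map(function ($w) {
    return trim($w, ".,!?:;\"'()");
}, $raw);

// Drop pieces that became empty (for example from double spaces) and reindex.
$words = array_values(array_filter($trimmed, 'strlen'));

print_r($words);
```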
answered 2012-10-10T00:23:46.320