
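I have a function that strips the HTML from a page, puts the words into an array, and then calls array_count_values on it. I'm trying to report the number of occurrences of each word, but the resulting array is messy. I've tried to clean it up and I'm getting nowhere. I'd like to remove phone numbers, and for some reason whole phrases end up grouped together under a single key. The first element also appears to have an empty key, but neither isset() nor empty() lets me unset it.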

$body = $this->get_response($domain);
                $body = preg_replace('/<body(.*?)>/i', '<body>', $body);
                $body = preg_replace('#</body>#i', '</body>', $body);

                $openTag = '<body>';
                $start = strpos($body, $openTag);
                $start += strlen($openTag);

                $closeTag = '</body>';
                $end = strpos($body, $closeTag);

                // Return if cannot cut-out the body
                if ($end <= $start || $start === false || $end === false) {
                    $this->setValue('');
                    return;
                }

                $body = substr($body, $start, $end - $start);
                $body = preg_replace(array(
                       '@<script[^>]*?>.*?</script>@si',    // Strip out javascript
                       '@<style[^>]*?>.*?</style>@siU',     // Strip style tags properly
                       '@<![\s\S]*?--[ \t\n\r]*>@',         // Strip multi-line comments including CDATA
                       '/style=([\"\']??)([^\">]*?)\\1/siU',// Strip inline style attribute
                       ), '', $body);

                $body = strip_tags($body);
                $body = array_filter(explode(' ', $body), create_function('$str', 'return strlen($str) > 2;'));
                $body = array_map('trim', $body);
                $words = $body;

                $i = 0;

                $words = array_count_values($words);

                foreach($words as $word){

                    if (empty($word)) unset($words[$i]);
                    $i++;

                }

                echo "<pre>";
                print_r($words);
                echo "</pre>";

Output:

Array
(
    [] => 28
    [333.444.5555] => 1
    [facebook] => 2
    [twitter] => 2
    [linkedin] => 2
    [youtube

                googleplus] => 1
    [About

    History
    Our] => 1
    [Mission
    Who] => 1
    [This
     That
     Other] => 1
    [Us


English

    FA
    Football] => 1
    [Media
    Pay] => 2
    [Per] => 4
    [Think
    Fast] => 2
    [Marketing
    Design] => 1
    [Consulting


Case] => 2

1 Answer


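I'm afraid explode(' ', $body) isn't enough, because a space is not the only whitespace character. Newlines and tabs are left inside the pieces, which is why several words end up under one key. Try preg_split instead.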

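// Split on any run of whitespace, then keep only strings longer than 2 characters.
// create_function() was removed in PHP 8.0; an anonymous function does the same job.
$body = array_filter(preg_split('/\s+/', $body), function ($str) {
    return strlen($str) > 2;
});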
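
As a rough sketch of how the whole cleanup could fit together, the snippet below splits on whitespace, drops short tokens and phone numbers, and only then counts. The helper name `count_words`, the `$minLength` parameter, and the phone pattern (three digit groups separated by dots or hyphens, as in `333.444.5555`) are assumptions made for illustration, not part of the original code; other phone formats would need a different pattern.

```php
<?php
// Hypothetical helper (not from the original post): counts words in plain text.
function count_words(string $text, int $minLength = 3): array
{
    // Split on any run of whitespace (spaces, tabs, newlines) and
    // discard empty pieces, so no empty key can reach the result.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $words = array_filter($tokens, function ($token) use ($minLength) {
        // Assumed phone format: 3-3-4 digits separated by "." or "-".
        if (preg_match('/^\d{3}[.\-]\d{3}[.\-]\d{4}$/', $token)) {
            return false;
        }
        // Same rule as the original filter: keep strings longer than 2 characters.
        return strlen($token) >= $minLength;
    });

    // Keys of the result are the words, values are their counts.
    return array_count_values($words);
}

$sample = "333.444.5555 facebook twitter\n\n   youtube\tgoogleplus facebook Us";
print_r(count_words($sample));
```

Filtering before counting also sidesteps the cleanup loop in the question. array_count_values returns an array keyed by the words themselves, so `unset($words[$i])` with a numeric counter never matches a key, and the values being tested are counts, which are always at least 1 and therefore never empty.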
answered 2012-09-20T16:18:17.767