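I have a function that strips HTML and puts the words into an array, then runs array_count_values on it. I'm trying to report how many times each word occurs. The resulting array is messy. I've tried to clean it up, but I'm getting nowhere. I want to remove the phone numbers, and for some reason phrases are being grouped together under one key. The first array entry also appears to be empty, but isset() or empty() doesn't seem to unset it.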
$body = $this->get_response($domain);
$body = preg_replace('/<body(.*?)>/i', '<body>', $body);
$body = preg_replace('#</body>#i', '</body>', $body);
$openTag = '<body>';
$start = strpos($body, $openTag);
$closeTag = '</body>';
$end = strpos($body, $closeTag);
// Return if the body cannot be cut out. Check for false before offsetting
// $start, because false + strlen(...) is an integer and never === false.
if ($start === false || $end === false || $end <= $start + strlen($openTag)) {
$this->setValue('');
return;
}
$start += strlen($openTag);
$body = substr($body, $start, $end - $start);
$body = preg_replace(array(
'@<script[^>]*?>.*?</script>@si', // Strip out javascript
'@<style[^>]*?>.*?</style>@siU', // Strip style tags properly
'@<![\s\S]*?--[ \t\n\r]*>@', // Strip multi-line comments including CDATA
'/style=([\"\']??)([^\">]*?)\\1/siU',// Strip inline style attribute
), '', $body);
$body = strip_tags($body);
// create_function() was deprecated in PHP 7.2 and removed in PHP 8.0; a closure behaves the same here
$body = array_filter(explode(' ', $body), function ($str) { return strlen($str) > 2; });
$body = array_map('trim', $body);
$words = $body;
$i = 0;
$words = array_count_values($words);
foreach($words as $word){
if (empty($word)) unset($words[$i]);
$i++;
}
echo "<pre>";
print_r($words);
echo "</pre>";
Output:
Array
(
[] => 28
[333.444.5555] => 1
[facebook] => 2
[twitter] => 2
[linkedin] => 2
[youtube
googleplus] => 1
[About
History
Our] => 1
[Mission
Who] => 1
[This
That
Other] => 1
[Us
English
FA
Football] => 1
[Media
Pay] => 2
[Per] => 4
[Think
Fast] => 2
[Marketing
Design] => 1
[Consulting
Case] => 2
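
A possible explanation, offered as a sketch rather than a confirmed fix. explode(' ', ...) splits only on single spaces, so words separated by newlines or tabs stay glued together as one key. array_filter runs before trim, so whitespace-only strings longer than 2 characters survive the filter and then trim down to the empty key. The foreach loop iterates over the counts, which are never empty, and unset($words[$i]) targets integer indexes, while the keys produced by array_count_values are the words themselves. The helper name count_words below is hypothetical (it is not in the original code), and treating any token that contains a digit as a phone number is an assumption that may be too aggressive for some pages.

```php
<?php
// Hypothetical helper (not part of the original code): counts the words in
// text that has already been passed through strip_tags().
function count_words(string $text): array
{
    // Split on any run of whitespace (spaces, tabs, newlines), dropping
    // empty pieces, so words on separate lines become separate tokens.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Trim surrounding punctuation BEFORE filtering, so nothing can shrink
    // to an empty or too-short string after the length check.
    $tokens = array_map(function ($t) {
        return trim($t, '.,;:!?()[]"\'');
    }, $tokens);

    $tokens = array_filter($tokens, function ($t) {
        // Keep tokens longer than 2 characters.
        // Assumption: any token containing a digit (e.g. 333.444.5555)
        // is a phone number or similar noise and can be dropped.
        return strlen($t) > 2 && !preg_match('/\d/', $t);
    });

    // Keys of the result are the words, values are their counts.
    return array_count_values($tokens);
}

$sample = "333.444.5555 facebook twitter\nyoutube\ngoogleplus  \n\t facebook";
print_r(count_words($sample));
```

In the original code this would stand in for everything from the array_filter call through the foreach loop, called as $words = count_words($body) right after strip_tags. If individual entries still need removing afterwards, unset has to use the word as the key, for example unset($words['']), not a numeric counter.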