php - Website keywords, infinite loading, OOP

Question

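I am generating keywords from different websites and saving them in a database.

$webhtml holds a web page that I downloaded using DOM. The problem is that when the extractor starts, the page loads forever and nothing is ever saved to the database.

The code that triggers the error is: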

if (empty($keywords)){
$ekeywords = new KeyPer;
$keywords = $ekeywords->Keys($webhtml);
}

I store $keywords with a function: saveweb($url, $description, $keywords);

I use the following include:

include("Extkeys.php");

The code of "Extkeys":

<?php
class Extkeys {
function Keys($webhtml) { 
$webhtml = $this->clean($webhtml); 
$blacklist='de,la,los,las,el,ella,nosotros,yo,tu,el,te,mi,del,ellos'; 
$sticklist='test'; 
$minlength = 3; 
$count = 17; 

$webhtml = preg_replace('/[\.;:|\'|\"|\`|\,|\(|\)|\-]/', ' ', $webhtml); 
$webhtml = preg_replace('/¡/', '', $webhtml); 
$webhtml = preg_replace('/¿/', '', $webhtml);

$keysArray = explode(" ", $webhtml); 
$keysArray = array_count_values(array_map('strtolower', $keysArray)); 
$blackArray = explode(",", $blacklist); 

foreach($blackArray as $blackWord){ 
if(isset($keysArray[trim($blackWord)])) 
unset($keysArray[trim($blackWord)]); 
} 
arsort($keysArray); 
$i = 1; 
$keywords = ""; 
foreach($keysArray as $word => $instances){ 
if($i > $count) break; 
if(istrlen(trim($word)) >= $minlength && is_string($word)) { 
$keywords .= $word . ", "; 
$i++; 
} 
} 

$keywords = rtrim($keywords, ", "); 

return $keywords=$sticklist.''.$keywords; 
} 

function clean($webhtml) { 

$regex = '/(([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)*@([A-Za-z0-9-]+)(\\.[A-Za-z0-9-]+)*)/iex'; 
$desc = preg_replace($regex, '', $webhtml); 
$webhtml = preg_replace( "'<script[^>]*>.*?</script>'si", '', $webhtml ); 
$webhtml = preg_replace( '/<a\s+.*?href="([^"]+)"[^>]*>([^<]+)<\/a>/is', '\2 (\1)', $webhtml ); 
$webhtml = preg_replace( '/<!--.+?-->/', '', $webhtml ); 
$webhtml = preg_replace( '/{.+?}/', '', $webhtml ); 
$webhtml = preg_replace( '/&nbsp;/', ' ', $webhtml ); 
$webhtml = preg_replace( '/&amp;/', ' ', $webhtml ); 
$webhtml = preg_replace( '/&quot;/', ' ', $webhtml ); 
$webhtml = strip_tags( $webhtml ); 
$webhtml = htmlspecialchars($webhtml); 
$webhtml = str_replace(array("\r\n", "\r", "\n", "\t"), " ", $webhtml); 

while (strchr($webhtml," ")) { 
$webhtml = str_replace(" ", " ",$webhtml); 
} 

for ($cnt = 1; $cnt < strlen($webhtml)-1; $cnt++) {
if (($webhtml{$cnt} == '.') || ($webhtml{$cnt} == ',')) { 
if ($webhtml{$cnt+1} != ' ') { 
$webhtml = substr_replace($webhtml, ' ', $cnt + 1, 0); 
} 
} 
} 
return $webhtml; 
} 
}
?>

How can I avoid the infinite loading and get the code to work properly?

Regards!

score 2 · Accepted Answer

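In your code you are replacing a space with a space, so the string never changes and the loop condition stays true forever: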

while (strchr($webhtml," ")) { 
    $webhtml = str_replace(" ", " ",$webhtml); 
}

It should be like this:

while (strchr($webhtml," ")) { 
    $webhtml = str_replace(" ", "",$webhtml); 
}
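
If the intent of that loop was to collapse runs of whitespace into a single space (removing every space would also glue the words together before explode(" ", ...) splits them), a loop-free sketch could use preg_replace instead. The helper name collapseSpaces is hypothetical and is not part of the original class.

```php
<?php
// Hypothetical helper (not in the original Extkeys class):
// collapses every run of whitespace into one space, without a loop.
function collapseSpaces(string $text): string
{
    // \s+ matches one or more whitespace characters (spaces, tabs, newlines).
    $collapsed = preg_replace('/\s+/', ' ', $text);

    // preg_replace returns null on a regex error; fall back to the input.
    return trim($collapsed ?? $text);
}

echo collapseSpaces("hello    world \t again"), PHP_EOL;
```

Because the replacement happens in a single pass, there is no loop condition that can stay true forever.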

score 0 · Accepted Answer

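Dude, I see what you are trying to do. You need lynx or links to dump a web page to a (text) file. That lets you skip everything about stripping tags and so on, because lynx and links are both text-based web browsers (usually started from the command line on Linux) and they show only the text on the page, nothing else.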
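
A minimal sketch of calling lynx from PHP, assuming lynx is installed on the server and shell_exec is not disabled. The function names buildLynxCommand and fetchPageText are hypothetical; -dump and -nolist are lynx command-line options.

```php
<?php
// Hypothetical helper: builds the shell command that asks lynx to print
// a page as plain text. -dump writes the rendered text to stdout and
// -nolist omits the numbered list of links at the end.
function buildLynxCommand(string $url): string
{
    // escapeshellarg quotes the URL so it is passed as a single argument.
    return 'lynx -dump -nolist ' . escapeshellarg($url);
}

// Hypothetical helper: returns the page text, or null if lynx is not
// available or produced no output.
function fetchPageText(string $url): ?string
{
    $output = shell_exec(buildLynxCommand($url));
    return is_string($output) && $output !== '' ? $output : null;
}

// Shows the command that would be run; fetchPageText() actually runs it.
echo buildLynxCommand('http://example.com/'), PHP_EOL;
```

The returned text could then be passed to the keyword function without the tag-stripping step.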

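Also, your keyword function is not good. You need tf-idf. More about tf-idf.

With tf-idf you can extract real keywords from a web page (this is how Google keywords are made). tf-idf is used to extract the real meaning of a text, that is, the keywords that best describe a page or document.

At that link I gave you the formula for computing tf-idf.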
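
As a rough illustration, here is a sketch of tf-idf in its common textbook form, tf(t, d) * log(N / df(t)), which may differ from the formula at the link. It assumes each document has already been split into lowercase words; the function name tfIdf and the sample documents are hypothetical.

```php
<?php
// Sketch of tf-idf scoring; function and variable names are hypothetical.
// $documents: list of documents, each one a list of lowercase words.
// Returns, per document, an array word => score sorted from high to low.
function tfIdf(array $documents): array
{
    $docCount = count($documents);

    // Document frequency: in how many documents each word appears.
    $docFreq = [];
    foreach ($documents as $words) {
        foreach (array_unique($words) as $word) {
            $docFreq[$word] = ($docFreq[$word] ?? 0) + 1;
        }
    }

    $scores = [];
    foreach ($documents as $index => $words) {
        $scores[$index] = [];
        $total = count($words);
        foreach (array_count_values($words) as $word => $count) {
            $tf  = $count / $total;                   // term frequency
            $idf = log($docCount / $docFreq[$word]);  // inverse document frequency
            $scores[$index][$word] = $tf * $idf;
        }
        arsort($scores[$index]);
    }
    return $scores;
}

$docs = [
    explode(' ', 'php keyword extractor for php pages'),
    explode(' ', 'keyword list for search pages'),
    explode(' ', 'cooking recipes for pasta'),
];
print_r(tfIdf($docs)[0]);
```

A word that appears in every document (here "for") gets a score of zero, which is why tf-idf needs no hand-written blacklist for common words, provided the collection of documents is large enough.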
