php - PHP script to find/identify words in a domain name

Question

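I am looking for PHP code or a script that can identify the words in a domain name.

For example, when a user looks up the domain snapnames.com, the script would display SnapNames.com (recognizing the 2 words in this domain: Snap Names).

Hope someone can help.

Thanks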

score 2 · Accepted Answer

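I'm afraid there is no perfect answer... As Arnold said, a domain like "expertsexchange.com" can be read as "Expert Sex Change.com" as well as "Experts Exchange.com".

Not only that, but a feature like this would be fairly intensive in memory and processing power. You would need huge files to recognize all the words, and so on. It would be good to know why you need it, so we can try to find a different solution.

If you have some kind of service that displays website information, showing "Snapnames.com" is perfectly acceptable. There is no need to capitalize the words inside it, or anything like that.

However, if you are dead set on this behaviour, even though it won't be 100% accurate and will be fairly heavy on your server...

First you need a way to check whether a string is a word. That is a separate problem with a perfectly reasonable answer. Ask it as its own question and see whether you can find a dictionary library for PHP.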
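To make the "is this string a word?" check concrete, here is a minimal sketch that keeps the words as array keys so the lookup is a single `isset()`. The helper names `buildWordSet` and `isWord` and the word-list path in the comment are my own assumptions, not an existing library; a real dictionary would be far larger.

```php
<?php
// Build a lookup set from a list of words (hypothetical helper).
// Keys are the lowercased words, so isset() gives a fast membership check.
function buildWordSet(array $words): array
{
    $set = [];
    foreach ($words as $word) {
        $word = strtolower(trim($word));
        if ($word !== '') {
            $set[$word] = true;
        }
    }
    return $set;
}

// Returns true if $candidate appears in the word set (case-insensitive).
function isWord(string $candidate, array $wordSet): bool
{
    return isset($wordSet[strtolower($candidate)]);
}

// Assumption: in practice the list would come from a file with one word per line, e.g.
// $wordSet = buildWordSet(file('/path/to/words.txt', FILE_IGNORE_NEW_LINES));
$wordSet = buildWordSet(['snap', 'names', 'experts', 'exchange']);

var_dump(isWord('Snap', $wordSet));   // bool(true)
var_dump(isWord('wittle', $wordSet)); // bool(false)
```

Holding the whole list in memory is exactly the cost mentioned above, so treat this as an illustration rather than something tuned for a large dictionary.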

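Basically, start with the whole string and drop letters from the end until what remains is a word, remove that word from the front of the string, and repeat with the rest. For example:

For expertsexchange.com, you would check like this:

The first {} is your list of words. The first "" is all the letters you still have to check, and the last "" is the subset of letters you are currently checking.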

{} "expertsexchange" "expertsexchange" <-- not a word
{} "expertsexchange" "expertsexchang" <-- not a word
{} "expertsexchange" "expertsexchan" <-- not a word
{} "expertsexchange" "expertsexcha" <-- not a word
{} "expertsexchange" "expertsexch" <-- not a word
{} "expertsexchange" "expertsexc" <-- not a word
{} "expertsexchange" "expertsex" <-- not a word
{} "expertsexchange" "expertse" <-- not a word
{} "expertsexchange" "experts" <-- WORD! Add it to our list of words
{"experts"} "exchange" "exchange" <-- WORD! Add it to our list of words
{"experts", "exchange"} "" "" <-- No more letters to check, we have found all of our words.
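The trace above could be written in PHP roughly as follows. This is only a sketch: `greedySplit` is a name I made up, and `$wordSet` is assumed to be an array whose keys are the known lowercase words. When no prefix of the remaining letters is a word, the leftover letters are kept as one chunk.

```php
<?php
// Greedy longest-prefix split: try the whole remaining string, drop one
// trailing letter at a time until the prefix is a word, then repeat on the rest.
// $wordSet is assumed to be an array whose keys are known lowercase words.
function greedySplit(string $name, array $wordSet): array
{
    $found = [];
    $rest = strtolower($name);
    while ($rest !== '') {
        $match = null;
        for ($len = strlen($rest); $len >= 1; $len--) {
            $candidate = substr($rest, 0, $len);
            if (isset($wordSet[$candidate])) {
                $match = $candidate;
                break;
            }
        }
        if ($match === null) {
            // No prefix is a word: keep the leftover letters as one chunk.
            $found[] = $rest;
            break;
        }
        $found[] = $match;
        $rest = substr($rest, strlen($match));
    }
    return $found;
}

// Tiny illustrative dictionary; a real one would be much larger.
$wordSet = array_flip(['expert', 'experts', 'sex', 'change', 'exchange']);
echo implode(' ', greedySplit('expertsexchange', $wordSet)), "\n"; // experts exchange
```

Because it always takes the longest matching prefix, it never reconsiders an earlier choice, which is what makes it cheap and also what makes it fragile.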

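Let's try a different example...

hellotherewittlekitty. This contains a "word" that the dictionary won't recognize ("wittle"). Unfortunately, this is how the algorithm handles it: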

{} "hellotherewittlekitty" "hellotherewittlekitty" <-- not a word
{} "hellotherewittlekitty" "hellotherewittlekitt" <-- not a word
{} "hellotherewittlekitty" "hellotherewittlekit" <-- not a word
{} "hellotherewittlekitty" "hellotherewittleki" <-- not a word
{} "hellotherewittlekitty" "hellotherewittlek" <-- not a word
{} "hellotherewittlekitty" "hellotherewittle" <-- not a word
{} "hellotherewittlekitty" "hellotherewittl" <-- not a word
{} "hellotherewittlekitty" "hellotherewitt" <-- not a word
{} "hellotherewittlekitty" "hellotherewit" <-- not a word
{} "hellotherewittlekitty" "hellotherewi" <-- not a word
{} "hellotherewittlekitty" "hellotherew" <-- not a word
{} "hellotherewittlekitty" "hellothere" <-- not a word
{} "hellotherewittlekitty" "hellother" <-- not a word
{} "hellotherewittlekitty" "hellothe" <-- not a word
{} "hellotherewittlekitty" "helloth" <-- not a word
{} "hellotherewittlekitty" "hellot" <-- not a word
{} "hellotherewittlekitty" "hello" <-- WORD! add it to list, and remove from main string!
{"hello"} "therewittlekitty" "therewittlekitty" <-- not a word
{"hello"} "therewittlekitty" "therewittlekitt" <-- not a word
{"hello"} "therewittlekitty" "therewittlekit" <-- not a word
{"hello"} "therewittlekitty" "therewittleki" <-- not a word
{"hello"} "therewittlekitty" "therewittlek" <-- not a word
{"hello"} "therewittlekitty" "therewittle" <-- not a word
{"hello"} "therewittlekitty" "therewittl" <-- not a word
{"hello"} "therewittlekitty" "therewitt" <-- not a word
{"hello"} "therewittlekitty" "therewit" <-- not a word
{"hello"} "therewittlekitty" "therewi" <-- not a word
{"hello"} "therewittlekitty" "therew" <-- not a word
{"hello"} "therewittlekitty" "there" <-- WORD! add it to list, and remove from main string
{"hello", "there"} "wittlekitty" "wittlekitty" <-- not a word
{"hello", "there"} "wittlekitty" "wittlekitt" <-- not a word
{"hello", "there"} "wittlekitty" "wittlekit" <-- not a word
{"hello", "there"} "wittlekitty" "wittleki" <-- not a word
{"hello", "there"} "wittlekitty" "wittlek" <-- not a word
{"hello", "there"} "wittlekitty" "wittle" <-- not a word (even though humans read it as one)
{"hello", "there"} "wittlekitty" "wittl" <-- not a word
{"hello", "there"} "wittlekitty" "witt" <-- WORD! add it to list and remove from main string
{"hello", "there", "witt"} "lekitty" "lekitty" <-- not a word
{"hello", "there", "witt"} "lekitty" "lekitt" <-- not a word
{"hello", "there", "witt"} "lekitty" "lekit" <-- not a word
{"hello", "there", "witt"} "lekitty" "leki" <-- WORD! (biology, wikipedia)
{"hello", "there", "witt", "leki"} "tty" "tty" <-- not a word
{"hello", "there", "witt", "leki"} "tty" "tt" <-- not a word
{"hello", "there", "witt", "leki"} "tty" "t" <-- not a word
{"hello", "there", "witt", "leki"} "tty" "" <-- No more letters, add it to the list!
{"hello", "there", "witt", "leki", "tty"} "" ""

So hellotherewittlekitty would come out as HelloThereWittLekiTty, which is worse than just leaving it all lowercase.
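One way to soften that failure, purely as a suggestion, is to capitalize the pieces only when every piece is a known word, and otherwise fall back to the plain "Snapnames" style. The function name `formatDomainLabel` is hypothetical, and `$wordSet` is again assumed to be an array keyed by lowercase words.

```php
<?php
// Join split parts as capitalized words, but only when every part is a known
// word; otherwise fall back to capitalizing just the first letter of the label.
// $wordSet is assumed to be an array whose keys are known lowercase words.
function formatDomainLabel(string $label, array $parts, array $wordSet): string
{
    foreach ($parts as $part) {
        if (!isset($wordSet[$part])) {
            return ucfirst(strtolower($label));
        }
    }
    return implode('', array_map('ucfirst', $parts));
}

$wordSet = array_flip(['hello', 'there', 'witt', 'leki', 'snap', 'names']);

echo formatDomainLabel('snapnames', ['snap', 'names'], $wordSet), "\n";
// SnapNames
echo formatDomainLabel(
    'hellotherewittlekitty',
    ['hello', 'there', 'witt', 'leki', 'tty'],
    $wordSet
), "\n";
// Hellotherewittlekitty
```

This only catches splits that end in leftover letters like "tty". A wrong split made entirely of real words would still get through.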

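There are other algorithms that are even heavier on your CPU and need more data, and they might give you better accuracy. But all in all, doing all of that work to get maybe 30% accuracy isn't worth it, especially because when the algorithm fails, it ruins your words. That means adding it could leave something like 60% of your sites looking wrong.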
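As one illustration of a heavier approach, the sketch below lists every way to split a string into dictionary words instead of committing to the first long match. It is not a recommendation and not any particular library's method: `allSplits` is a made-up name, `$wordSet` is assumed to be an array keyed by lowercase words, and the number of results can grow very quickly for long strings.

```php
<?php
// Return every way to split $text into words from $wordSet (keys = words).
// Results are memoized per suffix so each suffix is solved only once.
function allSplits(string $text, array $wordSet, array &$memo = []): array
{
    if ($text === '') {
        return [[]]; // exactly one way to split the empty string: no words
    }
    if (isset($memo[$text])) {
        return $memo[$text];
    }
    $results = [];
    for ($len = 1; $len <= strlen($text); $len++) {
        $head = substr($text, 0, $len);
        if (!isset($wordSet[$head])) {
            continue;
        }
        // Combine this head word with every split of the remaining letters.
        foreach (allSplits(substr($text, $len), $wordSet, $memo) as $tail) {
            $results[] = array_merge([$head], $tail);
        }
    }
    return $memo[$text] = $results;
}

$wordSet = array_flip(['expert', 'experts', 'sex', 'change', 'exchange']);
foreach (allSplits('expertsexchange', $wordSet) as $split) {
    echo implode(' ', $split), "\n";
}
// expert sex change
// experts exchange
```

It finds both readings of expertsexchange, but it still cannot tell you which one the site owner meant, so you would need some extra ranking (word frequency, for instance) on top of it.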
