7

I wrote this PHP code to implement the Flesch-Kincaid readability score as a function:

function readability($text) {
    $total_sentences = 1; // one full stop = two sentences => start with 1
    $punctuation_marks = array('.', '?', '!', ':');
    foreach ($punctuation_marks as $punctuation_mark) {
        $total_sentences += substr_count($text, $punctuation_mark);
    }
    $total_words = str_word_count($text);
    $total_syllables = 3; // assuming this value since I don't know how to count them
    $score = 206.835 - (1.015 * $total_words / $total_sentences) - (84.6 * $total_syllables / $total_words);
    return $score;
}

Do you have any suggestions for improving the code? Is it correct? Will it work?

I hope you can help me. Thanks in advance!

4 Answers

16

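As far as heuristics go, the code looks fine. Here are some points to consider; they make the quantities you need to compute considerably difficult for a machine:

  1. What is a sentence?

    Seriously, what is a sentence? We have periods, but they are also used in Ph.D., e.g., Y.M.C.A., and for other purposes that do not end a sentence. Once you consider exclamation points, question marks, and ellipses, you are really doing yourself a disservice by assuming a period will do the trick. I have looked at this problem before, and if you really want a more reliable sentence count for real text, you need to parse the text. That can be computationally intensive and time-consuming, and free resources for it are hard to find. In the end, you still have to worry about the error rate of the particular parser implementation. However, only a full parse will tell you what is a sentence and what is just one of the period's many other uses. Furthermore, if you are using text "in the wild" (HTML, say), you also have to worry about sentences that end not with punctuation but with a closing tag. For instance, many sites do not add punctuation inside h1 and h2 tags, yet those are clearly separate sentences or phrases.

  2. Syllables aren't something we should be approximating

    This is a major hallmark of this readability heuristic, and it is the hardest part to implement. Computational analysis of the syllable count of a text requires assuming that the presumed reader speaks the same dialect as the one your syllable counter was trained on. How sounds fall around a syllable is actually a major part of what makes accents accents. If you don't believe me, try visiting Jamaica sometime. This means that even if a human did the calculation by hand, it would still be a dialect-specific score.

  3. What is a word?

    No exaggeration: you will find that space-separated words and the words a speaker actually conceptualizes are quite different things. This makes the very idea of a computable readability score somewhat questionable.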
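To make the sentence-boundary problem from point 1 concrete, here is a minimal sketch of a counter that is only slightly more careful than `substr_count()`. The function name `count_sentences` and the abbreviation list are assumptions made up for this illustration; the list is far from complete, and this is not a substitute for the real parsing described above. Unlike the question's code, it does not treat a colon as a sentence end.

```php
<?php
// Hypothetical helper, for illustration only.
function count_sentences(string $text): int
{
    // Headings often carry no punctuation: treat a closing h1/h2 tag as a
    // sentence end, then drop the remaining markup.
    $text = preg_replace('`</h[12]>`i', '. ', $text);
    $text = trim(strip_tags($text));
    if ($text === '') {
        return 0;
    }

    // Neutralise a few known abbreviations so their periods are not counted.
    // (Assumed, incomplete list.)
    $abbreviations = ['Ph.D.', 'e.g.', 'i.e.', 'Y.M.C.A.'];
    foreach ($abbreviations as $abbreviation) {
        $text = str_ireplace($abbreviation, str_replace('.', '', $abbreviation), $text);
    }

    // A run of terminators ("...", "?!") counts as one sentence end.
    $count = (int) preg_match_all('`[.?!]+`', $text);

    // Trailing text without a terminator still forms a sentence.
    if (!preg_match('`[.?!]$`', $text)) {
        $count++;
    }
    return $count;
}
```

Even this small sketch shows the ambiguity: in "He has a Ph.D. She left." the period after "Ph.D" also ends a sentence, so stripping it makes the count come out one too low. A fixed list of abbreviations cannot resolve that case.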

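So in the end, I can answer your "will it work" question. If you want to take a piece of text and display this readability score alongside other metrics to offer some kind of conceivable added value, the discerning user will not raise all of these questions. If you are trying to do something scientific, or even something pedagogical (which is what this score and others like it were ultimately intended for), I wouldn't really bother. In fact, if you plan to use it to make any kind of suggestion to users about content they have generated, I would be extremely hesitant.

A better way to measure the reading difficulty of a text would more likely have to do with the ratio of low-frequency words to high-frequency words, together with the number of hapax legomena in the text. But I wouldn't pursue actually coming up with such a heuristic, because it would be very difficult to test anything like it empirically.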
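As a rough sketch of the two measures just mentioned: the function below counts hapax legomena (words that occur exactly once in the text) and the ratio of low-frequency to high-frequency words. The name `lexical_profile` and the `$commonWords` parameter are hypothetical; a real high-frequency word list would have to come from a corpus, and nothing here says how the two numbers should be turned into a difficulty score.

```php
<?php
// Hypothetical helper, for illustration only. $commonWords stands in for a
// corpus-derived list of high-frequency words supplied by the caller.
function lexical_profile(string $text, array $commonWords): array
{
    // Lowercase first so "The" and "the" are counted as the same word.
    $words = str_word_count(strtolower($text), 1);
    $frequencies = array_count_values($words);

    // Hapax legomena: words that occur exactly once in this text.
    $hapaxes = array_filter($frequencies, function ($n) {
        return $n === 1;
    });

    // Classify every token as high-frequency (in the list) or low-frequency.
    $isCommon = array_flip($commonWords);
    $high = 0;
    $low = 0;
    foreach ($words as $word) {
        if (isset($isCommon[$word])) {
            $high++;
        } else {
            $low++;
        }
    }

    return [
        'hapax_count' => count($hapaxes),
        // Undefined (null) when the text contains no high-frequency words.
        'low_to_high_ratio' => $high > 0 ? $low / $high : null,
    ];
}
```

For short texts almost every content word is a hapax, so the count only becomes informative on longer passages or when normalised by text length.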

Answered 2009-07-02T22:14:18.403
8

Take a look at the PHP Text Statistics class on GitHub.

Answered 2009-09-07T04:47:30.950
7

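Please take a look at the following two classes and their usage information. They will certainly help you.

The readability syllable-count pattern library class: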

<?php
class ReadabilitySyllableCheckPattern {

public $probWords = [
    'abalone' => 4,
    'abare' => 3,
    'abed' => 2,
    'abruzzese' => 4,
    'abbruzzese' => 4,
    'aborigine' => 5,
    'acreage' => 3,
    'adame' => 3,
    'adieu' => 2,
    'adobe' => 3,
    'anemone' => 4,
    'apache' => 3,
    'aphrodite' => 4,
    'apostrophe' => 4,
    'ariadne' => 4,
    'cafe' => 2,
    'calliope' => 4,
    'catastrophe' => 4,
    'chile' => 2,
    'chloe' => 2,
    'circe' => 2,
    'coyote' => 3,
    'epitome' => 4,
    'forever' => 3,
    'gethsemane' => 4,
    'guacamole' => 4,
    'hyperbole' => 4,
    'jesse' => 2,
    'jukebox' => 2,
    'karate' => 3,
    'machete' => 3,
    'maybe' => 2,
    'people' => 2,
    'recipe' => 3,
    'sesame' => 3,
    'shoreline' => 2,
    'simile' => 3,
    'syncope' => 3,
    'tamale' => 3,
    'yosemite' => 4,
    'daphne' => 2,
    'eurydice' => 4,
    'euterpe' => 3,
    'hermione' => 4,
    'penelope' => 4,
    'persephone' => 4,
    'phoebe' => 2,
    'zoe' => 2
];

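public $addSyllablePatterns = [
    "([^s]|^)ia",
    "iu",
    "io",
    "eo($|[b-df-hj-np-tv-z])",
    "ii",
    "[ou]a$",
    "[aeiouym]bl$",
    "[aeiou]{3}",
    "[aeiou]y[aeiou]",
    "^mc",
    "ism$",
    "asm$",
    "thm$",
    // Single quotes: in a double-quoted string PHP reads \1 as an octal
    // escape, which would destroy the regex backreference
    '([^aeiouy])\1l$',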
    "[^l]lien",
    "^coa[dglx].",
    "[^gq]ua[^auieo]",
    "dnt$",
    "uity$",
    "[^aeiouy]ie(r|st|t)$",
    "eings?$",
    "[aeiouy]sh?e[rsd]$",
    "iell",
    "dea$",
    "real",
    "[^aeiou]y[ae]",
    "gean$",
    "riet",
    "dien",
    "uen"
];

public $prefixSuffixPatterns = [
    "^un",
    "^fore",
    "^ware",
    "^none?",
    "^out",
    "^post",
    "^sub",
    "^pre",
    "^pro",
    "^dis",
    "^side",
    "ly$",
    "less$",
    "some$",
    "ful$",
    "ers?$",
    "ness$",
    "cians?$",
    "ments?$",
    "ettes?$",
    "villes?$",
    "ships?$",
    "sides?$",
    "ports?$",
    "shires?$",
    "tion(ed)?$"
];

public $subSyllablePatterns = [
    "cia(l|$)",
    "tia",
    "cius",
    "cious",
    "[^aeiou]giu",
    "[aeiouy][^aeiouy]ion",
    "iou",
    "sia$",
    "eous$",
    "[oa]gue$",
    ".[^aeiuoycgltdb]{2,}ed$",
    ".ely$",
    "^jua",
    "uai",
    "eau",
    "[aeiouy](b|c|ch|d|dg|f|g|gh|gn|k|l|ll|lv|m|mm|n|nc|ng|nn|p|r|rc|rn|rs|rv|s|sc|sk|sl|squ|ss|st|t|th|v|y|z)e$",
    "[aeiouy](b|c|ch|dg|f|g|gh|gn|k|l|lch|ll|lv|m|mm|n|nc|ng|nch|nn|p|r|rc|rn|rs|rv|s|sc|sk|sl|squ|ss|th|v|y|z)ed$",
    "[aeiouy](b|ch|d|f|gh|gn|k|l|lch|ll|lv|m|mm|n|nch|nn|p|r|rn|rs|rv|s|sc|sk|sl|squ|ss|st|t|th|v|y)es$",
    "^busi$"
];
}
?>

The other class is the readability algorithm class, which has two methods for calculating the score:

<?php
class ReadabilityAlgorithm {
function countSyllable($strWord) {
    $pattern = new ReadabilitySyllableCheckPattern();
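    // Normalise the word: the lookup table and the patterns are lowercase
    $strWord = strtolower(trim($strWord));

    // Remove non-letter characters (punctuation, digits) from the word
    $strWord = preg_replace('`[^a-z]`', '', $strWord);
    if ($strWord === '') {
        return 0; // nothing left to count
    }

    // Check for problem words
    if (isset($pattern->probWords[$strWord])) {
        return $pattern->probWords[$strWord];
    }

    // Strip prefixes and suffixes, counting how many were removed.
    // The patterns are regular expressions, so they need delimiters and
    // preg_replace(); str_replace() would look for "^un" as literal text.
    $affixPatterns = array_map(function ($strPattern) {
        return '`' . $strPattern . '`';
    }, $pattern->prefixSuffixPatterns);
    $strWord = preg_replace($affixPatterns, '', $strWord, -1, $tmpPrefixSuffixCount);

    // Split on non-vowels and count the vowel groups that remain
    $arrWordParts = preg_split('`[^aeiouy]+`', $strWord);
    $wordPartCount = 0;
    foreach ($arrWordParts as $strWordPart) {
        if ($strWordPart !== '') {
            $wordPartCount++;
        }
    }
    $intSyllableCount = $wordPartCount + $tmpPrefixSuffixCount;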

    // Check syllable patterns 
    foreach ($pattern->{'subSyllablePatterns'} as $strSyllable) {
        $intSyllableCount -= preg_match('`' . $strSyllable . '`', $strWord);
    }

    foreach ($pattern->{'addSyllablePatterns'} as $strSyllable) {
        $intSyllableCount += preg_match('`' . $strSyllable . '`', $strWord);
    }

    // Every word has at least one syllable; this also guards against a
    // negative total when several subtract patterns match
    return max(1, $intSyllableCount);
}

function calculateReadabilityScore($stringText) {
    // Estimate the sentence count from punctuation. Starting at 1 assumes
    // the text does not end with one of these marks.
    $totalSentences = 1;
    $punctuationMarks = array('.', '!', '?', ':', ';');

    foreach ($punctuationMarks as $punctuationMark) {
        $totalSentences += substr_count($stringText, $punctuationMark);
    }

    // Split into words once, so the word total and the syllable total are
    // computed over the same list
    $arrWords = str_word_count($stringText, 1);
    $totalWords = count($arrWords);
    if ($totalWords === 0) {
        return null; // the score is undefined for text without words
    }

    // ASL: average sentence length (words per sentence)
    $ASL = $totalWords / $totalSentences;

    // Total syllables across all words
    $syllableCount = 0;
    foreach ($arrWords as $strWord) {
        $syllableCount += $this->countSyllable($strWord);
    }

    // ASW: average number of syllables per word
    $ASW = $syllableCount / $totalWords;

    // Compute the readability score
    $score = 206.835 - (1.015 * $ASL) - (84.6 * $ASW);
    return $score;
}
}
?>

// Example: how to use it

<?php
// Create an object to compute the readability score
$readObj = new ReadabilityAlgorithm();
echo $readObj->calculateReadabilityScore("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into: electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently; with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum!");
?>
Answered 2017-01-25T05:30:49.650
0

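I don't actually see any problem with that code. Of course, if you really wanted to, you could optimize it a bit by replacing all the separate function calls with a single counting loop. However, I would strongly argue that this is unnecessary and possibly even outright wrong. Your current code is very readable and easy to understand, and from that point of view any optimization would probably make things worse. Use it as is, and don't try to optimize it unless it actually turns out to be a performance bottleneck.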
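For comparison only, the single counting loop mentioned above might look roughly like the sketch below. The name `count_in_one_pass` is hypothetical, and the word rule (a word is a run of ASCII letters) is a simplifying assumption, so its word count will not always agree with `str_word_count()`; for example, "don't" counts as two words here.

```php
<?php
// Hypothetical single-pass counter, for illustration only: one loop over
// the characters tallies sentence marks and words together.
function count_in_one_pass(string $text): array
{
    $marks = ['.' => true, '?' => true, '!' => true, ':' => true];
    $sentences = 1; // same starting value as the question's code
    $words = 0;
    $inWord = false;

    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $ch = $text[$i];
        if (isset($marks[$ch])) {
            $sentences++;
        }
        if (ctype_alpha($ch)) {
            // The first letter after a non-letter starts a new word.
            if (!$inWord) {
                $words++;
                $inWord = true;
            }
        } else {
            $inWord = false;
        }
    }

    return ['sentences' => $sentences, 'words' => $words];
}
```

The loop is noticeably longer and harder to check than the few built-in calls it replaces, which is the readability cost argued against above.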

Answered 2009-07-02T21:51:34.737