
I have a dictionary of profanity in a database, and the following works well:

preg_match_all("/\b".$f."(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

$t is the input text and, put simply, $f = preg_quote("punk"); where "punk" comes from the database dictionary, so at this point in the loop the expression is:

preg_match_all("/\bpunk(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

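preg_quote escapes symbols (for example, # becomes \#) so that the expression is safe to run. But when the dictionary word is something like "F@CK" or "A$$", the expression above does not detect it in the input string. I have both a$$ and f@ck in the dictionary, and neither works. If I remove preg_quote() from around the word, the regular expression is invalid because the symbols are not escaped.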

Any suggestions on how I can detect "a$$"?

EDIT:

So I guess the expression that isn't working as expected is, for example:

preg_match_all("/\bf\@ck(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

which should find f@ck in $t.

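UPDATE:

This is my usage, simply put: if there are any matches in $m, replace them with "****". The whole block sits inside a loop over every word in the dictionary; $f is the dictionary word and $t is the input.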

$f = preg_quote($f);
preg_match_all("/\b$f(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);
if (count($m) > 0) {
     $t = preg_replace("/(\b$f(?:ing|er|es|s)?\b)/si","*****",$t);
}

UPDATE: Behold, the var_dump:

preg_quote($f) = string(5) "a\$\$"
$t = string(18) "You're such an a$$"
expression = string(29) "/\ba\$\$(?:ing|er|es|s)?\b/si"

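UPDATE: This only happens when the word ends in a symbol. I tested "a$$hole" and it is fine, but "a$$" doesn't work.

Another update: try this simplified version, with $words as a makeshift dictionary: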

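// Single quotes: inside double quotes, "a$$hole" would interpolate $hole.
$words = array('a$$','asshole','a$$hole','f@ck','f#ck','f*ck');
$text = 'Input whatever you feel like here eg. a$$';

foreach ($words as $word) {
   $f = preg_quote($word,"/");
   // Mask with the length of the raw word, not the escaped one.
   $text = preg_replace("/\b".$f."(?:ing|er|es|s)?\b/si",
                         str_repeat("*",strlen($word)),
                        $text);
}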

I would expect to see "Input whatever you feel like here eg. ***" as the result.


3 Answers


Can't Be Done

I'm sorry, but this "problem" is truly impossible to solve. Consider these:

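  • ꜰᴜᴄᴋ is U+A730.1D1C.1D04.1D0B, "\N{LATIN LETTER SMALL CAPITAL F}\N{LATIN LETTER SMALL CAPITAL U}\N{LATIN LETTER SMALL CAPITAL C}\N{LATIN LETTER SMALL CAPITAL K}"
  • ᶠᵘᶜᵏ is U+1DA0.1D58.1D9C.1D4F, "\N{MODIFIER LETTER SMALL F}\N{MODIFIER LETTER SMALL U}\N{MODIFIER LETTER SMALL C}\N{MODIFIER LETTER SMALL K}"
  • 𝒻𝓊𝒸𝓀 is U+1D4BB.1D4CA.1D4B8.1D4C0, "\N{MATHEMATICAL SCRIPT SMALL F}\N{MATHEMATICAL SCRIPT SMALL U}\N{MATHEMATICAL SCRIPT SMALL C}\N{MATHEMATICAL SCRIPT SMALL K}"
  • 𝖋𝖚𝖈𝖐 is U+1D58B.1D59A.1D588.1D590, "\N{MATHEMATICAL BOLD FRAKTUR SMALL F}\N{MATHEMATICAL BOLD FRAKTUR SMALL U}\N{MATHEMATICAL BOLD FRAKTUR SMALL C}\N{MATHEMATICAL BOLD FRAKTUR SMALL K}"
  • 𝓕𝒰𝒞𝒦 is U+1D4D5.1D4B0.1D49E.1D4A6, "\N{MATHEMATICAL BOLD SCRIPT CAPITAL F}\N{MATHEMATICAL SCRIPT CAPITAL U}\N{MATHEMATICAL SCRIPT CAPITAL C}\N{MATHEMATICAL SCRIPT CAPITAL K}"
  • ⓕⓤⓒⓚ is U+24D5.24E4.24D2.24DA, "\N{CIRCLED LATIN SMALL LETTER F}\N{CIRCLED LATIN SMALL LETTER U}\N{CIRCLED LATIN SMALL LETTER C}\N{CIRCLED LATIN SMALL LETTER K}"
  • Γ̵𐌵ᏟᏦ is U+393.335.10335.13DF.13E6, "\N{GREEK CAPITAL LETTER GAMMA}\N{COMBINING SHORT STROKE OVERLAY}\N{GOTHIC LETTER QAIRTHRA}\N{CHEROKEE LETTER TLI}\N{CHEROKEE LETTER TSO}"
  • ƒμɕѤ is U+192.3BC.255.464, "\N{LATIN SMALL LETTER F WITH HOOK}\N{GREEK SMALL LETTER MU}\N{LATIN SMALL LETTER C WITH CURL}\N{CYRILLIC CAPITAL LETTER IOTIFIED E}"
  • Г̵ЦСК is U+413.335.426.421.41A, "\N{CYRILLIC CAPITAL LETTER GHE}\N{COMBINING SHORT STROKE OVERLAY}\N{CYRILLIC CAPITAL LETTER TSE}\N{CYRILLIC CAPITAL LETTER ES}\N{CYRILLIC CAPITAL LETTER KA}"
  • ғᵾȼƙ is U+493.1D7E.23C.199, "\N{CYRILLIC SMALL LETTER GHE WITH STROKE}\N{LATIN SMALL CAPITAL LETTER U WITH STROKE}\N{LATIN SMALL LETTER C WITH STROKE}\N{LATIN SMALL LETTER K WITH HOOK}"
  • ϜυϚΚ is U+3DC.3C5.3DA.39A, "\N{GREEK LETTER DIGAMMA}\N{GREEK SMALL LETTER UPSILON}\N{GREEK LETTER STIGMA}\N{GREEK CAPITAL LETTER KAPPA}"
  • ЖↃUᆿ is U+416.2183.55.11BF, "\N{CYRILLIC CAPITAL LETTER ZHE}\N{ROMAN NUMERAL REVERSED ONE HUNDRED}\N{LATIN CAPITAL LETTER U}\N{HANGUL JONGSEONG KHIEUKH}"
  • ʞɔnɟ is U+29E.254.6E.25F, "\N{LATIN SMALL LETTER TURNED K}\N{LATIN SMALL LETTER OPEN O}\N{LATIN SMALL LETTER N}\N{LATIN SMALL LETTER DOTLESS J WITH STROKE}"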

It Gets Worse

If you think those are easy, try coping with all of these:

 00 Ↄ ʞ, F ᵾ ⒞ K, K ⓒ Ц ⒡ , K , ғ ∞ Ϛ k, f Ꮯ K, ⓕ oo ɔ ⓚ , ɟ ⒰ ¢ K, ȼ , Ù ȼ ⒦ , f ⒞ƙ, ᶜ , F ∞ Ж , @ Ꮯ , ɟ ᵘ , F Ц ¢ , foo Ꮯ ʞ, oo ¢ Ж , υ ᶜ Κ , Ϝ ú * ʞ, ꜰ c K, ƒ ᵘ ȼ k, U ȼ , Ж ɔ μ ƒ, F ⓤ ⒞ k, ƒ Cƙ, ғ 00 ɔ Ѥ, ƒ U c ᴋ, ∞ Ꮶ ⓒ , ꜰ ᴄ ⒦ , ⒰ Ꮯ Ѥ, ꜰ ᴜ ⒦ , F ʞ, f 00 , ғ u С K, f ɔ Κ, f μ Ↄ K, ɟ c ʞ, f Ↄ , F μ ¢ , ᆿ ᴄ ⒦ , Κ ¢ oo ɟ, ᶠ μ ᶜ Ѥ, ᶠ ⓤ Ꮯ Ж , ⒞ ᵘ F, F @ C ⓚ , Ѥ ᴄ u F, ⒡ ᵾ k , ƒ μ ᶜ ᴋ, F C , f ᵘ ¢ ᵏ, ᆿ 00 , ꜰ υ ȼ K, Ϝ ȼ К , oo ɕ ᴋ, ғ Ꮯ ᴋ, ꜰ n K, ꜰ μ Ϛ К , F ∞ ȼ , ⒡ Ↄ Κ , ƒ ⒞ , ᶠ U C Ꮶ, ᶠ υ Ↄ ƙ, C , Ϝ Ѥ, Ϝ U Ↄ , U ⒞ ᵏ, F @ C К , ғ ᴜ ᴋ, ⒡ UК , ɟU * ᵏ, Ц c Κ, ғ U Ↄ , ƒ ⒰ ᵏ, ғ * K, n ⓚ , ᶠ 00 С К , Ц k, ƙ c Ц ᶠ, ⒰ Ѥ , ꜰ ǔ ᴄ ⒦ , F Ↄ , υ ꜰ, * ᵏ, 00 Ж , Κ C , ᶠ U С K, ꜰ Κ, ɟ U ᶜ ⓚ , ∞ ȼ ᴋ, ƒ U К ć, ƒ υ ȼ ᴋ, ⒡ ∞ Ж ɕ, ᵘ ᵏ, F U Ϛ ʞ, ⓕ Ж , Ↄ, Ϝ n * K,oo c ⓚ , ƒ U ¢ ʞ, ƒ u C ʞ, K ¢ μ ⒡ , ɟ ⒰ K ɔ, F U c k, F Ц ⓚ , U ᴋ ɔ, Ꮯ , ⓚ , ⓕ C К , ɟ ᵾ * ⒦ , ᶠ ᵘ ⒞ ⒦ , ƒ ⒰ ᴄ ᵏ, ⒡ ⒰ С K, ⒰ * ᴋ, ᆿ ∞ ʞ ɕ, n * Ѥ, Ϝ μ ᴄ , k ć ᵘ ƒ, ᵘ ɕ , ɟ Ц Ꮨ ᴄ, ᵓ ᵏ, ⒞ ᵏ, ᵏ, ᵾ * Ѥ, F Ꮯ K, ғ ⓤ ᴋ, ƒ u ɕ , ƙ c ⒰ F, ⓒ Κ, K ᶜ Ц , ɟ c ⒦ , ƒ @ c Κ, Ϝ Ц ȼ Ḱ, ⒡ ᵘ ⒦ , ɟ ᵾ Ѥ ¢, F Ↄ , Ϝ ᴜ , Ϝ ⒞ , U Ꮯ ʞ, ƒ υ Ꮯ ᵏ, F ᵾ Ꮯ Κ, Ϝ ᵘ ⓒ ʞ, ⓤ ᶜ ƙ, ᆿ ⒞ , f ↰ Ѥ, U K, Ϝ ᴜ * @ ⓒ ʞ, ƒ u ⓒ , f u ⒞ k, 00 ᴄ Ѥ, υ С K, F ᴜ ᴄ , ⓕ oo Ↄ ⓚ , ⒡ ᵘ ɕ , ⓕ υ ᴄ Κ, ᆿ U Ꮯ , Ꮯ ɏ, Ć , К , f @ Ↄ ⓚ , ᴋ ᶜ U ꜰ, ᴜ c ⒦ , F ᵘ C , 00 Ꮶ, ꜰ 00 К , Ϝ Ϛ ᵏ, F c Ѥ, ⓕ oo Ↄ K, f ᵾ С ᵏ, ⓕ Ц c , c Ж , ⓕ ƙ, ⓚ C n ғ, ɟ U ȼ , 00 K ȼ, ᴄ , Ц Ç , Ц ¢ , Ϝ ᵘ c k, ⒡ ¢ k, ƒ ⓤ ⓚ Ↄ, k, ƒ U Ↄ K, ᴄ Ꮶ, ᆿ ⓤ ⒦ , Ж ɔ U , ƒ υ * ᴋ, ƒ k, U С ⒦ , C Ж , ƒ μ Ꮯ ƙ, ⓕ n ᴄ ⒦ , ⓕ μ ⓒ Ж , ⒡ 00 ɕ , ᴜ ᶜ , ᆿ Ù Ж ,⒦ Ѥ , k C ⓤ ᆿ, Ϝ n ȼ ᵏ, ᴋ ȼ ᵾ ɟ, F ȼ Ѥ, ғ ⒰ ȼ , f U Ж ⒞ , F ῠ ᵏ, F u Κ, F 00 ȼ , ꜰ μ Ϛ Ꮶ, ᆿ K, ⒡ n Ↄ Ж , F @ƙ, ᶠ ὺ К , U C ᵏ, F U ⒦ , 00 Ↄ , ᶠ c К , ғ ⓤ , ⓤ Κ, U Ж , ⒡ ɔ Ꮶ, ⓚ ɔ f, U C K, F @ C Ѥ, ғ ᴜ С k, ɟ u *ƙ, ⓕ ᵾ ɕ , 00 ȼ K, υ , ƒ ⒰ * ʞ, ⓕ U Ↄ Ж , ꜰ U ƙ, ⒡ u С ⒦ , ꜰ ᴜ Ќ, ᆿ μ ⒦ , ⓕ @ ᴄ К , ᶠ υ ɔ ᵏ, ƙ Ↄ oo ꜰ, F ᴜ , ⒰ C ᵏ, U ƙ, ƒ ∞ C Ꮶ, ⒰ * K, u Ↄ ᴋ, ᆿ U ⓒ , ᆿ U Ꮶ , n , ƒ Ц Cƙ, ⒦ ꜰ, K ¢ ᵘ f, ⒰ Ꮶ, ᴄ 00 , Ϝ U k, u ¢ ⒦, *Ѥ, ƒ С ᴋ, C Ꮶ, @ Κ, ʞ С ᶠ, ᵾ Ϛ Ꮶ, ᶠ ⒰ ɔ , F Ц ⒞ ʞ, ⒡ ⒰ К ɔ, ɟ υ ¢ , Ѥ ȼ U ᆿ, ᴜ Ↄ ʞ, 
ғ * K, ᴄ ʞ, F ʞ, @ ȼ , ⒰ * , ᵾ ȼ , F ¢ Ѥ, ꜰ ⓤƙ Ϛ, ⓕ 00 c ʞ, 00 Ϛ K, υ Ↄ Κ, ꜰ μ ⓒ Ж , ᵘ Ϛ ʞ, Ϝ ᵘ Ↄ ᵏ, ⒡ ᵾ Ꮯ , Ϝ ⒰ Ȧ Ѥ, ƒ n Ѥ, ᆿ μ ⓒ ɕ Κ, ғ μ Ѥ, f ⓤ Ꮯ , ᵏ μ ƒ, ᵏ С , ᆿ ∞ , ғ ᵘ Ꮯ , ƒ μ Ↄ k, f oo K ȼ, ɟ С , ꜰ n K, 00 ᵏ, ᶠ μ ⓒ ,c ∞ Ϝ, ᆿ Ц Ć ⒦ , ᵘ ᴄ , F 00 ⓚ , ᶠ @ ȼ К , ...

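And that's not all: there are at least a bazingatillion more where those came from. Do you see now why this simply cannot be done?

Full Disclosure

Because I don't believe in security through obscurity, here is the program that generated all of those: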

#!/usr/bin/env perl
#
# unifuck - print infinite permutations of fuck in unicode aliases
#
# Tom Christiansen <tchrist@perl.com>
# Mon May 23 09:37:27 MDT 2011

use strict;
use warnings;
use charnames ":full";

use Unicode::Normalize;

binmode(STDOUT, ":utf8");

our(@diddle, @fuck, %fuck); # initted down below
while (my($f,$u,$c,$k) = splice(@fuck, 0, 4)) {
    $fuck{F}{$f}++;
    $fuck{U}{$u}++;
    $fuck{C}{$c}++;
    $fuck{K}{$k}++;
} 

my @F = keys %{ $fuck{F} };
my @U = keys %{ $fuck{U} };
my @C = keys %{ $fuck{C} };
my @K = keys %{ $fuck{K} };

while (1) { 
    my $f = $F[rand @F];
    my $u = $U[rand @U];
    my $c = $C[rand @C];
    my $k = $K[rand @K];

    for ($f,$u,$c,$k) {  
        next if length > 1;
        next if /\p{EA=W}/;
        next if /\pM/;
        next if /\p{InEnclosedAlphanumerics}/;
        s/$/$diddle[rand @diddle]/          if rand(100) < 15;
        s/$/\N{COMBINING ENCLOSING KEYCAP}/ if rand(100) <  1;
    }

    if    (             0) {                                       }
    elsif (rand(100) <  5) {     $u        = q(@)                  } 
    elsif (rand(100) <  5) {        $c     = q(*)                  } 
    elsif (rand(100) < 10) {       ($c,$k) = ($k,$c)               } 
    elsif (rand(100) < 15) { ($f,$u,$c,$k) = reverse ($f,$u,$c,$k) }

    print NFC("$f $u $c $k\n");
}

BEGIN {

    # ok to have repeats in each position, since they'll be counted only once
    # per unique strings
    @fuck = (

        "\N{LATIN CAPITAL LETTER F}",
        "\N{LATIN CAPITAL LETTER U}",
        "\N{LATIN CAPITAL LETTER C}",
        "\N{LATIN CAPITAL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{LATIN SMALL LETTER U}",
        "\N{LATIN SMALL LETTER C}",
        "\N{LATIN SMALL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{INFINITY}",
        "\N{LATIN SMALL LETTER C}",
        "\N{LATIN SMALL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{LATIN SMALL LETTER O}\N{LATIN SMALL LETTER O}",
        "\N{LATIN SMALL LETTER C}",
        "\N{KELVIN SIGN}",

        "\N{LATIN SMALL LETTER F}",
        "\N{DIGIT ZERO}\N{DIGIT ZERO}",
        "\N{CENT SIGN}",
        "\N{LATIN CAPITAL LETTER K}",

        "\N{LATIN LETTER SMALL CAPITAL F}",
        "\N{LATIN LETTER SMALL CAPITAL U}",
        "\N{LATIN LETTER SMALL CAPITAL C}",
        "\N{LATIN LETTER SMALL CAPITAL K}",

        "\N{MODIFIER LETTER SMALL F}",
        "\N{MODIFIER LETTER SMALL U}",
        "\N{MODIFIER LETTER SMALL C}",
        "\N{MODIFIER LETTER SMALL K}",

        "\N{MATHEMATICAL SCRIPT SMALL F}",
        "\N{MATHEMATICAL SCRIPT SMALL U}",
        "\N{MATHEMATICAL SCRIPT SMALL C}",
        "\N{MATHEMATICAL SCRIPT SMALL K}",

        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL F}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL U}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL C}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL K}",

        "\N{MATHEMATICAL BOLD FRAKTUR SMALL F}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL U}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL C}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL K}",

        "\N{MATHEMATICAL BOLD SCRIPT CAPITAL F}",
        "\N{MATHEMATICAL SCRIPT CAPITAL U}",
        "\N{MATHEMATICAL SCRIPT CAPITAL C}",
        "\N{MATHEMATICAL SCRIPT CAPITAL K}",

        "\N{CIRCLED LATIN SMALL LETTER F}",
        "\N{CIRCLED LATIN SMALL LETTER U}",
        "\N{CIRCLED LATIN SMALL LETTER C}",
        "\N{CIRCLED LATIN SMALL LETTER K}",

        "\N{PARENTHESIZED LATIN SMALL LETTER F}",
        "\N{PARENTHESIZED LATIN SMALL LETTER U}",
        "\N{PARENTHESIZED LATIN SMALL LETTER C}",
        "\N{PARENTHESIZED LATIN SMALL LETTER K}",

        "\N{GREEK CAPITAL LETTER GAMMA}\N{COMBINING SHORT STROKE OVERLAY}",
        "\N{GOTHIC LETTER QAIRTHRA}",
        "\N{CHEROKEE LETTER TLI}",
        "\N{CHEROKEE LETTER TSO}",

        "\N{LATIN SMALL LETTER F WITH HOOK}",
        "\N{GREEK SMALL LETTER MU}",
        "\N{LATIN SMALL LETTER C WITH CURL}",
        "\N{CYRILLIC CAPITAL LETTER IOTIFIED E}",

        "\N{CYRILLIC CAPITAL LETTER GHE}\N{COMBINING SHORT STROKE OVERLAY}",
        "\N{CYRILLIC CAPITAL LETTER TSE}",
        "\N{CYRILLIC CAPITAL LETTER ES}",
        "\N{CYRILLIC CAPITAL LETTER KA}",

        "\N{CYRILLIC SMALL LETTER GHE WITH STROKE}",
        "\N{LATIN SMALL CAPITAL LETTER U WITH STROKE}",
        "\N{LATIN SMALL LETTER C WITH STROKE}",
        "\N{LATIN SMALL LETTER K WITH HOOK}",

        "\N{GREEK LETTER DIGAMMA}",
        "\N{GREEK SMALL LETTER UPSILON}",
        "\N{GREEK LETTER STIGMA}",
        "\N{GREEK CAPITAL LETTER KAPPA}",

        "\N{HANGUL JONGSEONG KHIEUKH}",
        "\N{LATIN CAPITAL LETTER U}",
        "\N{ROMAN NUMERAL REVERSED ONE HUNDRED}",
        "\N{CYRILLIC CAPITAL LETTER ZHE}",

        "\N{LATIN SMALL LETTER DOTLESS J WITH STROKE}",
        "\N{LATIN SMALL LETTER N}",
        "\N{LATIN SMALL LETTER OPEN O}",
        "\N{LATIN SMALL LETTER TURNED K}",

        "\N{FULLWIDTH LATIN CAPITAL LETTER F}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER U}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER C}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER K}",

    );

    @diddle = (
        "\N{COMBINING GRAVE ACCENT}",
        "\N{COMBINING ACUTE ACCENT}",
        "\N{COMBINING CIRCUMFLEX ACCENT}",
        "\N{COMBINING TILDE}",
        "\N{COMBINING BREVE}",
        "\N{COMBINING DOT ABOVE}",
        "\N{COMBINING DIAERESIS}",
        "\N{COMBINING CARON}",
        "\N{COMBINING CANDRABINDU}",
        "\N{COMBINING INVERTED BREVE}",
        "\N{COMBINING GRAVE TONE MARK}",
        "\N{COMBINING ACUTE TONE MARK}",
        "\N{COMBINING GREEK PERISPOMENI}",
        "\N{COMBINING FERMATA}",
        "\N{COMBINING SUSPENSION MARK}",
    );

}
answered 2011-05-23T15:46:30.740

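\b checks for a word boundary. According to http://www.regular-expressions.info/wordboundaries.html

There are three different positions that qualify as word boundaries:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.

A "word character" is a letter, digit, or underscore, so in the string "a$$" the word boundary occurs after the "a", not after the second "$".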
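
To make the rule above concrete, here is a small sketch that runs the question's pattern against a few inputs. The helper name matchesWithWordBoundaries and the sample sentences are made up for illustration; only the pattern shape comes from the question.

```php
<?php
// Hypothetical helper: does $word, wrapped in \b...\b, match anywhere in $text?
function matchesWithWordBoundaries(string $word, string $text): bool
{
    $pattern = '/\b' . preg_quote($word, '/') . '(?:ing|er|es|s)?\b/si';
    return preg_match($pattern, $text) === 1;
}

// "a" is a word character and "$" is not, so there are boundaries on both
// sides of the "a", but none between "$" and end of string or a space.
var_dump(matchesWithWordBoundaries('a$$', 'such an a$$'));      // bool(false)
var_dump(matchesWithWordBoundaries('a$$', 'such an a$$ here')); // bool(false)

// Followed by a letter, the last "$" sits on a boundary, so this matches.
var_dump(matchesWithWordBoundaries('a$$', 'such an a$$hole'));  // bool(true)

// A word that ends in a letter is unaffected.
var_dump(matchesWithWordBoundaries('f@ck', 'what the f@ck'));   // bool(true)
```

This agrees with what the question reports: "a$$hole" is caught, while a bare "a$$" is not.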

You may need to specify explicitly which characters you consider a "word boundary" by using a character class (e.g., [- '"]).
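
One way to express a custom boundary, sketched here as an assumption rather than as what this answer prescribes, is to swap \b for lookarounds that only require "no word character on this side". The censor function below is a hypothetical helper; it assumes ASCII dictionary words, since strlen counts bytes.

```php
<?php
/**
 * Hypothetical helper: replace each dictionary word (plus an optional
 * suffix) with asterisks of the same length as the matched text.
 */
function censor(string $text, array $words): string
{
    foreach ($words as $word) {
        $quoted = preg_quote($word, '/');
        // (?<!\w) and (?!\w) assert that no word character sits directly
        // before or after the match; unlike \b they do not care whether
        // the word itself starts or ends with a symbol.
        $pattern = '/(?<!\w)' . $quoted . '(?:ing|er|es|s)?(?!\w)/i';
        $text = preg_replace_callback(
            $pattern,
            function (array $m): string {
                return str_repeat('*', strlen($m[0]));
            },
            $text
        );
    }
    return $text;
}

$words = ['a$$', 'asshole', 'a$$hole', 'f@ck', 'f#ck', 'f*ck'];
echo censor('Input whatever you feel like here eg. a$$', $words), "\n";
// Input whatever you feel like here eg. ***
```

Note one behavioral difference from \b: with these lookarounds the entry "a$$" no longer matches inside "a$$hole", so that longer form has to be in the dictionary itself, as it is in the question's sample list.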

answered 2011-05-23T14:05:18.200

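Now that you say it doesn't work at the end of a word, I see the problem. $, @, and other such special characters are not part of a word, so in the case of 'a$$' \b breaks the word after the 'a' (if no other letters follow in the input string). I suggest fixing it by using [^a-z] to mark the end of the word.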

preg_match_all("/\b".$f."(?:ing|er|es|s)?[^a-z]/si",$t,$m,PREG_SET_ORDER);
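
A caveat on the line above, with a hedged variant: [^a-z] has to consume a real character, so it cannot match when the word is the very last thing in the input (as in the question's "You're such an a$$"), and when it does match, the following character becomes part of the match. A negative lookahead avoids both. The input string below is a made-up example.

```php
<?php
$f = preg_quote('a$$', '/');
$t = 'You are such an a$$';   // hypothetical input; the word ends the string

// [^a-z] must consume one character, so it fails at end of input.
$consuming = '/\b' . $f . '(?:ing|er|es|s)?[^a-z]/si';
// (?![a-z]) only asserts that the next character, if any, is not a letter.
$lookahead = '/\b' . $f . '(?:ing|er|es|s)?(?![a-z])/si';

var_dump(preg_match($consuming, $t)); // int(0)
var_dump(preg_match($lookahead, $t)); // int(1)
```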
answered 2011-05-23T11:54:18.810