I'm developing an anti-spam bot, and it has trouble decoding homoglyphs.

Here is a sample message:

ɪ ᴄᴀɴ'ᴛ ꜱᴛᴏᴘ ꜱʜᴀʀɪɴɢ ᴛʜᴇ ɢᴏᴏᴅ ɴᴇᴡꜱ ᴀʙᴏᴜᴛ ꜰᴏʀᴇx ᴍᴀʀᴋᴇᴛ ᴄᴏᴍᴘᴀɴʏ.
ᴡʜᴇɴ ɪ ꜰɪʀꜱᴛ ʜᴇᴀʀᴅ ɪᴛ, ɪ ᴡᴀꜱ ᴀꜰʀᴀɪᴅ ʙᴜᴛ ʟᴀᴛᴇʀ ꜱᴜᴍᴍᴏɴᴇᴅ ᴄᴏᴜʀᴀɢᴇ ᴀɴᴅ ᴍᴀᴅᴇ ᴀ ᴍᴏᴠᴇ ᴡɪᴛʜ $200
ɪ ꜱᴛɪʟʟ ᴄᴀɴ'ᴛ ʙᴇʟɪᴇᴠᴇ ᴛʜᴇ ᴘʟᴀᴛꜰᴏʀᴍ ɪꜱ ꜱo ʀᴇᴀʟ ᴜɴᴛɪʟ ɪ ʀᴇᴄᴇɪᴠᴇᴅ $3,100 IN 48HOURS of trade ᴀꜱ ᴍʏ ᴘʀᴏꜰɪᴛ
ᴛʜɪꜱ ɪꜱ ʏᴏᴜʀ ᴍᴏᴍᴇɴᴛ ᴏꜰ ʀᴇᴅᴇᴍᴘᴛɪᴏɴ ᴊᴜꜱᴛ ᴏɴᴇ ᴄʟɪᴄᴋ ᴀᴡᴀʏ ꜰʀᴏᴍ ɢʀᴇᴀᴛɴᴇꜱꜱ, ᴍᴀᴋᴇ ᴀ ᴍᴏᴠᴇ ɴᴏᴡ ʟᴇᴛ ʜɪꜱᴛᴏʀʏ ʙᴇ ᴍᴀᴅᴇ
ʜᴇʀᴇ ɪꜱ ᴛʜᴇ ʟɪɴᴋ ʙᴇʟᴏᴡ

I have tried several solutions, but none of them seems to do the job correctly. This is the code I currently have:

<?php
$text = "ɪ ᴄᴀɴ'ᴛ ꜱᴛᴏᴘ ꜱʜᴀʀɪɴɢ ᴛʜᴇ ɢᴏᴏᴅ ɴᴇᴡꜱ ᴀʙᴏᴜᴛ ꜰᴏʀᴇx ᴍᴀʀᴋᴇᴛ ᴄᴏᴍᴘᴀɴʏ.
ᴡʜᴇɴ ɪ ꜰɪʀꜱᴛ ʜᴇᴀʀᴅ ɪᴛ, ɪ ᴡᴀꜱ ᴀꜰʀᴀɪᴅ ʙᴜᴛ ʟᴀᴛᴇʀ ꜱᴜᴍᴍᴏɴᴇᴅ ᴄᴏᴜʀᴀɢᴇ ᴀɴᴅ ᴍᴀᴅᴇ ᴀ ᴍᴏᴠᴇ ᴡɪᴛʜ $200
ɪ ꜱᴛɪʟʟ ᴄᴀɴ'ᴛ ʙᴇʟɪᴇᴠᴇ ᴛʜᴇ ᴘʟᴀᴛꜰᴏʀᴍ ɪꜱ ꜱo ʀᴇᴀʟ ᴜɴᴛɪʟ ɪ ʀᴇᴄᴇɪᴠᴇᴅ $3,100 IN 48HOURS of trade ᴀꜱ ᴍʏ ᴘʀᴏꜰɪᴛ
ᴛʜɪꜱ ɪꜱ ʏᴏᴜʀ ᴍᴏᴍᴇɴᴛ ᴏꜰ ʀᴇᴅᴇᴍᴘᴛɪᴏɴ ᴊᴜꜱᴛ ᴏɴᴇ ᴄʟɪᴄᴋ ᴀᴡᴀʏ ꜰʀᴏᴍ ɢʀᴇᴀᴛɴᴇꜱꜱ, ᴍᴀᴋᴇ ᴀ ᴍᴏᴠᴇ ɴᴏᴡ ʟᴇᴛ ʜɪꜱᴛᴏʀʏ ʙᴇ ᴍᴀᴅᴇ
ʜᴇʀᴇ ɪꜱ ᴛʜᴇ ʟɪɴᴋ ʙᴇʟᴏᴡ
";



$homoglyphes = array(
    " " => "\s",
    "A" => "AꭺᗅꓮᎪÅÁÀᴀÂÃАAÄΑ",
    "B" => "ᗷßꞴBΒвᛒꓐВᏼℬBβʙᏴ",
    "C" => "ⲤCℭꓚᏟℂCⅭСϹ",
    "D" => "ᗞĐᗪĎꓓDⅅⅮᴅDᎠꭰ",
    "E" => "ÈĚÉᴇЕĒℰ⋿ĔΕËꭼĖEEĘꓰÊᎬⴹ",
    "F" => "FꓝᖴꞘℱFϜ",
    "G" => "GԍɢᏀնꮐᏻꓖԌGᏳ",
    "H" => "ℍⲎꓧһнᎻℋꮋHᕼʜΗHНℌ",
    "I" => "ιⅠiᛁꭵاӏΙІlᎥ˛⍳IιіꙇⅰɪīiͺɩℹⅈıI",
    "J" => "ᎫᴊJͿյJꭻЈᒍꓙꞲ",
    "K" => "КᛕꓗKKⲔᏦΚK",
    "L" => "ιLⳐLlⳑʟⅬꓡᏞᒪℒꮮⅼ",
    "M" => "ᎷℳΜϺⅯᗰМMꓟᛖⲘM",
    "N" => "NℕⲚNɴꓠΝ",
    "O" => "οΟoՕО0OoOо",
    "P" => "ᏢꮲℙРᑭΡꓑᴩⲢᴘPP",
    "Q" => "QℚႳႭⵕQ",
    "R" => "ꭱRℝꮢᖇℛᚱℜƦRꓣᎡᏒʀ",
    "S" => "ᏕႽЅSSꓢssᏚՏѕ",
    "T" => "⟙ᎢΤтᴛⲦτꭲTT⊤Тꓔ",
    "U" => "ՍUUԱ⋃uμυሀ∪ꓴᑌ",
    "V" => "ꓦᏙѴⅤVꛟV۷٧ⴸᐯ",
    "W" => "ԜWwꓪWwᏔᎳ",
    "X" => "xꞳXꓫⅩΧ╳ᚷXⲬⵝχХ᙭",
    "Y" => "ᎩʏyҮϒγᎽꓬyуYYУⲨΥ",
    "Z" => "ℨℤᏃΖꓜZZ",
    "a" => "ã⍺αǎɑâаaáạäàăåȧaą",
    "b" => "ЬḇƅᏏᖯḅdḃlɓƄbbʙ",
    "c" => "ᴄⲥꮯᏟϲсⅭcⅽc",
    "d" => "ꓒԁᏧɗḏďddɖlᑯⅾḓժḑḋđcḍbⅆ",
    "e" => "ꬲ℮êėⅇȩҽēḛĕɇẹℯęéeëèеěce",
    "f" => "ꞙƒfẝfքꬵſϝḟꜰ",
    "g" => "ɡᶃɢǧgqģgնցġℊĝǥƍğǵ",
    "h" => "ħȟհᏂⱨẖһlḥḩℎɦhhĥḧḣḫ",
    "i" => "ιⅠiᛁɨꭵاӏ1lȋᎥ˛⍳ιіꙇⅰɪỉīĭiͺíɩℹịǐïⅈıIì",
    "j" => "jϳյɉʝјⅉj",
    "k" => "ḳḵkκⱪkķᴋ",
    "m" => "ᴍmmṁⅿḿṃɱrn",
    "n" => "nñrռmꞑṅńņǹɴnṇňṉո",
    "o" => "ᴏ",
    "p" => "ƥṗᏢṕpρ⍴ƿϱⲣPpр",
    "q" => "gգqʠqႭԛႳզ",
    "r" => "ṛrᴦꭈɼṙṟꭇȑԻгɾŕɍȓⲅŗrřʀɽꮁ",
    "s" => "ꜱႽЅṣƽŝṡSʂśSssᏚѕꮪșšՏ",
    "t" => "ṫᎢțƫτţtṭtŧ",
    "u" => "ůūǔùUꭎuՍUųűưꞟʉսûԱú⋃uũȗụüυμʋŭȕᴜꭒ",
    "v" => "⋁ѵѴvvⱱνטⱴᴠ∨ⅴṽꮩṿᶌ",
    "w" => "ẅẘɯWvwẇẁẉWwẃԝꮃաⱳᎳŵᴡѡ",
    "x" => "x⤬ᕽⅩᕁ᙮х×⤫ⅹχx⨯",
    "y" => "ʏɣyҮŷγƴỿℽɏꭚẏყỵүȳyýÿуYYᶌΥ",
    "z" => "ꮓźzᏃʐƶżⱬẕᴢẓz"
);

foreach ($homoglyphes as $letter=>$glyphes) {
    $tab = mb_str_split($glyphes);
    $text = str_replace($tab, $letter, $text);
}
echo $text;

?>

The output is wrong:

I dAN'T sToP sHARING THE GooD NEws ABouT foREx nARkET donPANy.
wHEN I fIRsT HEARD IT, I wAs AfRAID BuT LATER sunnoNED douRAGE AND nADE A nowE wITH $2OO
I sTILL dAN'T BELIEwE THE PLATfoRn Is sO REAL uNTIL I REdEIwED $3,iOO IN 48HOuRs Of tnade As ny PRofIT
THIs Is youR nonENT of REDEnPTIoN JusT oNE dLIdk AwAy fRon GREATNEss, nAkE A nowE Now LET HIsToRy BE nADE
HERE Is THE LINk BELow

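I don't know why. The only way I have found to get the correct result is TESSERACT-OCR (optical character recognition), but that requires rendering the text into an image first, which is not an option for a bot that processes hundreds of messages per second.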

Any help would be greatly appreciated. Thank you.
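
A likely cause, judging from the table in the code: several lists contain plain ASCII letters that belong to other keys (for example `c` in the `"d"` list and `m` in the `"n"` list), and each `str_replace` pass runs over the result of the previous one. So `ᴄ` first becomes `c` and is then rewritten to `d`, which matches `dAN'T` in the output. The sketch below is one possible way around that, not a definitive fix: it builds a single glyph-to-letter map and applies it with `strtr`, which does not re-scan text it has already replaced. The names `$groups`, `buildMap` and `normalizeHomoglyphs` are made up for illustration, the table only covers the small-capital letters from the sample message, and it assumes PHP 7.4+ with mbstring for `mb_str_split`.

```php
<?php
// Hypothetical, deliberately small table: only the small-capital look-alikes
// that occur in the sample message. Extend it with further glyphs as needed.
$groups = [
    'a' => 'ᴀ', 'b' => 'ʙ', 'c' => 'ᴄ', 'd' => 'ᴅ', 'e' => 'ᴇ', 'f' => 'ꜰ',
    'g' => 'ɢ', 'h' => 'ʜ', 'i' => 'ɪ', 'j' => 'ᴊ', 'k' => 'ᴋ', 'l' => 'ʟ',
    'm' => 'ᴍ', 'n' => 'ɴ', 'o' => 'ᴏ', 'p' => 'ᴘ', 'r' => 'ʀ', 's' => 'ꜱ',
    't' => 'ᴛ', 'u' => 'ᴜ', 'v' => 'ᴠ', 'w' => 'ᴡ', 'y' => 'ʏ', 'z' => 'ᴢ',
];

// Flatten the per-letter lists into one glyph => letter lookup table.
function buildMap(array $groups): array
{
    $map = [];
    foreach ($groups as $letter => $glyphs) {
        foreach (mb_str_split($glyphs, 1, 'UTF-8') as $glyph) {
            // Skip single-byte (ASCII) characters so plain text is never rewritten.
            if (strlen($glyph) === 1) {
                continue;
            }
            // First mapping wins, so a glyph never maps to two different letters.
            $map[$glyph] ??= $letter;
        }
    }
    return $map;
}

// One pass over the text: strtr() with an array does not touch its own output,
// so a replaced letter cannot be replaced again by a later rule.
function normalizeHomoglyphs(string $text, array $map): string
{
    return strtr($text, $map);
}

$map = buildMap($groups);
echo normalizeHomoglyphs("ꜰᴏʀᴇx ᴍᴀʀᴋᴇᴛ ᴄᴏᴍᴘᴀɴʏ.", $map), PHP_EOL; // prints: forex market company.
```

Everything is mapped to lowercase here on the assumption that the spam filter compares case-insensitively anyway; ASCII text already in the message, such as `48HOURS`, is left exactly as it was.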
