22

Is there a PHP function that converts named HTML entities to their respective numeric HTML entities?

For example:

$str = "Oggi &egrave; un bel&nbsp;giorno";
echo entities_to_unicode($str); // Oggi &#232; un bel&#160;giorno

Thanks in advance, and have a nice day!

4

6 Answers

28

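You are looking for a simple conversion function from named HTML entities to their numeric counterparts.

This can be done with a translation table (i.e. an array) and a string translation function (strtr):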

$translated = strtr($string, $HTML401NamedToNumeric);

This works whether $string is UTF-8 encoded or uses a single-byte character set.
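As a minimal sketch of the strtr call above, the following uses a three-entry excerpt of such a table. The variable name $excerpt and the sample string are assumptions made for illustration. When given an array, strtr tries the longest keys first and does not rescan text it has already replaced, so each entity is rewritten exactly once.

```php
<?php
// Three-entry excerpt of a named-to-numeric table (illustrative only).
$excerpt = array(
    '&nbsp;'   => '&#160;',
    '&egrave;' => '&#232;',
    '&euro;'   => '&#8364;',
);

$string = 'Oggi &egrave; un bel&nbsp;giorno';

// Replace every key found in $string with its numeric counterpart.
$translated = strtr($string, $excerpt);

echo $translated, "\n"; // Oggi &#232; un bel&#160;giorno
```

Entities that are missing from the table are simply left as they are, which is why the full table matters.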

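An example array of the HTML 4.01 named entities specified by the W3C, as used above, follows. It contains 252 entities. If you want to support XHTML, there is one more (I have put it at the end):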

$HTML401NamedToNumeric = array(
    '&nbsp;'     => '&#160;',  # no-break space = non-breaking space, U+00A0 ISOnum
    '&iexcl;'    => '&#161;',  # inverted exclamation mark, U+00A1 ISOnum
    '&cent;'     => '&#162;',  # cent sign, U+00A2 ISOnum
    '&pound;'    => '&#163;',  # pound sign, U+00A3 ISOnum
    '&curren;'   => '&#164;',  # currency sign, U+00A4 ISOnum
    '&yen;'      => '&#165;',  # yen sign = yuan sign, U+00A5 ISOnum
    '&brvbar;'   => '&#166;',  # broken bar = broken vertical bar, U+00A6 ISOnum
    '&sect;'     => '&#167;',  # section sign, U+00A7 ISOnum
    '&uml;'      => '&#168;',  # diaeresis = spacing diaeresis, U+00A8 ISOdia
    '&copy;'     => '&#169;',  # copyright sign, U+00A9 ISOnum
    '&ordf;'     => '&#170;',  # feminine ordinal indicator, U+00AA ISOnum
    '&laquo;'    => '&#171;',  # left-pointing double angle quotation mark = left pointing guillemet, U+00AB ISOnum
    '&not;'      => '&#172;',  # not sign, U+00AC ISOnum
    '&shy;'      => '&#173;',  # soft hyphen = discretionary hyphen, U+00AD ISOnum
    '&reg;'      => '&#174;',  # registered sign = registered trade mark sign, U+00AE ISOnum
    '&macr;'     => '&#175;',  # macron = spacing macron = overline = APL overbar, U+00AF ISOdia
    '&deg;'      => '&#176;',  # degree sign, U+00B0 ISOnum
    '&plusmn;'   => '&#177;',  # plus-minus sign = plus-or-minus sign, U+00B1 ISOnum
    '&sup2;'     => '&#178;',  # superscript two = superscript digit two = squared, U+00B2 ISOnum
    '&sup3;'     => '&#179;',  # superscript three = superscript digit three = cubed, U+00B3 ISOnum
    '&acute;'    => '&#180;',  # acute accent = spacing acute, U+00B4 ISOdia
    '&micro;'    => '&#181;',  # micro sign, U+00B5 ISOnum
    '&para;'     => '&#182;',  # pilcrow sign = paragraph sign, U+00B6 ISOnum
    '&middot;'   => '&#183;',  # middle dot = Georgian comma = Greek middle dot, U+00B7 ISOnum
    '&cedil;'    => '&#184;',  # cedilla = spacing cedilla, U+00B8 ISOdia
    '&sup1;'     => '&#185;',  # superscript one = superscript digit one, U+00B9 ISOnum
    '&ordm;'     => '&#186;',  # masculine ordinal indicator, U+00BA ISOnum
    '&raquo;'    => '&#187;',  # right-pointing double angle quotation mark = right pointing guillemet, U+00BB ISOnum
    '&frac14;'   => '&#188;',  # vulgar fraction one quarter = fraction one quarter, U+00BC ISOnum
    '&frac12;'   => '&#189;',  # vulgar fraction one half = fraction one half, U+00BD ISOnum
    '&frac34;'   => '&#190;',  # vulgar fraction three quarters = fraction three quarters, U+00BE ISOnum
    '&iquest;'   => '&#191;',  # inverted question mark = turned question mark, U+00BF ISOnum
    '&Agrave;'   => '&#192;',  # latin capital letter A with grave = latin capital letter A grave, U+00C0 ISOlat1
    '&Aacute;'   => '&#193;',  # latin capital letter A with acute, U+00C1 ISOlat1
    '&Acirc;'    => '&#194;',  # latin capital letter A with circumflex, U+00C2 ISOlat1
    '&Atilde;'   => '&#195;',  # latin capital letter A with tilde, U+00C3 ISOlat1
    '&Auml;'     => '&#196;',  # latin capital letter A with diaeresis, U+00C4 ISOlat1
    '&Aring;'    => '&#197;',  # latin capital letter A with ring above = latin capital letter A ring, U+00C5 ISOlat1
    '&AElig;'    => '&#198;',  # latin capital letter AE = latin capital ligature AE, U+00C6 ISOlat1
    '&Ccedil;'   => '&#199;',  # latin capital letter C with cedilla, U+00C7 ISOlat1
    '&Egrave;'   => '&#200;',  # latin capital letter E with grave, U+00C8 ISOlat1
    '&Eacute;'   => '&#201;',  # latin capital letter E with acute, U+00C9 ISOlat1
    '&Ecirc;'    => '&#202;',  # latin capital letter E with circumflex, U+00CA ISOlat1
    '&Euml;'     => '&#203;',  # latin capital letter E with diaeresis, U+00CB ISOlat1
    '&Igrave;'   => '&#204;',  # latin capital letter I with grave, U+00CC ISOlat1
    '&Iacute;'   => '&#205;',  # latin capital letter I with acute, U+00CD ISOlat1
    '&Icirc;'    => '&#206;',  # latin capital letter I with circumflex, U+00CE ISOlat1
    '&Iuml;'     => '&#207;',  # latin capital letter I with diaeresis, U+00CF ISOlat1
    '&ETH;'      => '&#208;',  # latin capital letter ETH, U+00D0 ISOlat1
    '&Ntilde;'   => '&#209;',  # latin capital letter N with tilde, U+00D1 ISOlat1
    '&Ograve;'   => '&#210;',  # latin capital letter O with grave, U+00D2 ISOlat1
    '&Oacute;'   => '&#211;',  # latin capital letter O with acute, U+00D3 ISOlat1
    '&Ocirc;'    => '&#212;',  # latin capital letter O with circumflex, U+00D4 ISOlat1
    '&Otilde;'   => '&#213;',  # latin capital letter O with tilde, U+00D5 ISOlat1
    '&Ouml;'     => '&#214;',  # latin capital letter O with diaeresis, U+00D6 ISOlat1
    '&times;'    => '&#215;',  # multiplication sign, U+00D7 ISOnum
    '&Oslash;'   => '&#216;',  # latin capital letter O with stroke = latin capital letter O slash, U+00D8 ISOlat1
    '&Ugrave;'   => '&#217;',  # latin capital letter U with grave, U+00D9 ISOlat1
    '&Uacute;'   => '&#218;',  # latin capital letter U with acute, U+00DA ISOlat1
    '&Ucirc;'    => '&#219;',  # latin capital letter U with circumflex, U+00DB ISOlat1
    '&Uuml;'     => '&#220;',  # latin capital letter U with diaeresis, U+00DC ISOlat1
    '&Yacute;'   => '&#221;',  # latin capital letter Y with acute, U+00DD ISOlat1
    '&THORN;'    => '&#222;',  # latin capital letter THORN, U+00DE ISOlat1
    '&szlig;'    => '&#223;',  # latin small letter sharp s = ess-zed, U+00DF ISOlat1
    '&agrave;'   => '&#224;',  # latin small letter a with grave = latin small letter a grave, U+00E0 ISOlat1
    '&aacute;'   => '&#225;',  # latin small letter a with acute, U+00E1 ISOlat1
    '&acirc;'    => '&#226;',  # latin small letter a with circumflex, U+00E2 ISOlat1
    '&atilde;'   => '&#227;',  # latin small letter a with tilde, U+00E3 ISOlat1
    '&auml;'     => '&#228;',  # latin small letter a with diaeresis, U+00E4 ISOlat1
    '&aring;'    => '&#229;',  # latin small letter a with ring above = latin small letter a ring, U+00E5 ISOlat1
    '&aelig;'    => '&#230;',  # latin small letter ae = latin small ligature ae, U+00E6 ISOlat1
    '&ccedil;'   => '&#231;',  # latin small letter c with cedilla, U+00E7 ISOlat1
    '&egrave;'   => '&#232;',  # latin small letter e with grave, U+00E8 ISOlat1
    '&eacute;'   => '&#233;',  # latin small letter e with acute, U+00E9 ISOlat1
    '&ecirc;'    => '&#234;',  # latin small letter e with circumflex, U+00EA ISOlat1
    '&euml;'     => '&#235;',  # latin small letter e with diaeresis, U+00EB ISOlat1
    '&igrave;'   => '&#236;',  # latin small letter i with grave, U+00EC ISOlat1
    '&iacute;'   => '&#237;',  # latin small letter i with acute, U+00ED ISOlat1
    '&icirc;'    => '&#238;',  # latin small letter i with circumflex, U+00EE ISOlat1
    '&iuml;'     => '&#239;',  # latin small letter i with diaeresis, U+00EF ISOlat1
    '&eth;'      => '&#240;',  # latin small letter eth, U+00F0 ISOlat1
    '&ntilde;'   => '&#241;',  # latin small letter n with tilde, U+00F1 ISOlat1
    '&ograve;'   => '&#242;',  # latin small letter o with grave, U+00F2 ISOlat1
    '&oacute;'   => '&#243;',  # latin small letter o with acute, U+00F3 ISOlat1
    '&ocirc;'    => '&#244;',  # latin small letter o with circumflex, U+00F4 ISOlat1
    '&otilde;'   => '&#245;',  # latin small letter o with tilde, U+00F5 ISOlat1
    '&ouml;'     => '&#246;',  # latin small letter o with diaeresis, U+00F6 ISOlat1
    '&divide;'   => '&#247;',  # division sign, U+00F7 ISOnum
    '&oslash;'   => '&#248;',  # latin small letter o with stroke, = latin small letter o slash, U+00F8 ISOlat1
    '&ugrave;'   => '&#249;',  # latin small letter u with grave, U+00F9 ISOlat1
    '&uacute;'   => '&#250;',  # latin small letter u with acute, U+00FA ISOlat1
    '&ucirc;'    => '&#251;',  # latin small letter u with circumflex, U+00FB ISOlat1
    '&uuml;'     => '&#252;',  # latin small letter u with diaeresis, U+00FC ISOlat1
    '&yacute;'   => '&#253;',  # latin small letter y with acute, U+00FD ISOlat1
    '&thorn;'    => '&#254;',  # latin small letter thorn, U+00FE ISOlat1
    '&yuml;'     => '&#255;',  # latin small letter y with diaeresis, U+00FF ISOlat1
    '&fnof;'     => '&#402;',  # latin small f with hook = function = florin, U+0192 ISOtech
    '&Alpha;'    => '&#913;',  # greek capital letter alpha, U+0391
    '&Beta;'     => '&#914;',  # greek capital letter beta, U+0392
    '&Gamma;'    => '&#915;',  # greek capital letter gamma, U+0393 ISOgrk3
    '&Delta;'    => '&#916;',  # greek capital letter delta, U+0394 ISOgrk3
    '&Epsilon;'  => '&#917;',  # greek capital letter epsilon, U+0395
    '&Zeta;'     => '&#918;',  # greek capital letter zeta, U+0396
    '&Eta;'      => '&#919;',  # greek capital letter eta, U+0397
    '&Theta;'    => '&#920;',  # greek capital letter theta, U+0398 ISOgrk3
    '&Iota;'     => '&#921;',  # greek capital letter iota, U+0399
    '&Kappa;'    => '&#922;',  # greek capital letter kappa, U+039A
    '&Lambda;'   => '&#923;',  # greek capital letter lambda, U+039B ISOgrk3
    '&Mu;'       => '&#924;',  # greek capital letter mu, U+039C
    '&Nu;'       => '&#925;',  # greek capital letter nu, U+039D
    '&Xi;'       => '&#926;',  # greek capital letter xi, U+039E ISOgrk3
    '&Omicron;'  => '&#927;',  # greek capital letter omicron, U+039F
    '&Pi;'       => '&#928;',  # greek capital letter pi, U+03A0 ISOgrk3
    '&Rho;'      => '&#929;',  # greek capital letter rho, U+03A1
    '&Sigma;'    => '&#931;',  # greek capital letter sigma, U+03A3 ISOgrk3
    '&Tau;'      => '&#932;',  # greek capital letter tau, U+03A4
    '&Upsilon;'  => '&#933;',  # greek capital letter upsilon, U+03A5 ISOgrk3
    '&Phi;'      => '&#934;',  # greek capital letter phi, U+03A6 ISOgrk3
    '&Chi;'      => '&#935;',  # greek capital letter chi, U+03A7
    '&Psi;'      => '&#936;',  # greek capital letter psi, U+03A8 ISOgrk3
    '&Omega;'    => '&#937;',  # greek capital letter omega, U+03A9 ISOgrk3
    '&alpha;'    => '&#945;',  # greek small letter alpha, U+03B1 ISOgrk3
    '&beta;'     => '&#946;',  # greek small letter beta, U+03B2 ISOgrk3
    '&gamma;'    => '&#947;',  # greek small letter gamma, U+03B3 ISOgrk3
    '&delta;'    => '&#948;',  # greek small letter delta, U+03B4 ISOgrk3
    '&epsilon;'  => '&#949;',  # greek small letter epsilon, U+03B5 ISOgrk3
    '&zeta;'     => '&#950;',  # greek small letter zeta, U+03B6 ISOgrk3
    '&eta;'      => '&#951;',  # greek small letter eta, U+03B7 ISOgrk3
    '&theta;'    => '&#952;',  # greek small letter theta, U+03B8 ISOgrk3
    '&iota;'     => '&#953;',  # greek small letter iota, U+03B9 ISOgrk3
    '&kappa;'    => '&#954;',  # greek small letter kappa, U+03BA ISOgrk3
    '&lambda;'   => '&#955;',  # greek small letter lambda, U+03BB ISOgrk3
    '&mu;'       => '&#956;',  # greek small letter mu, U+03BC ISOgrk3
    '&nu;'       => '&#957;',  # greek small letter nu, U+03BD ISOgrk3
    '&xi;'       => '&#958;',  # greek small letter xi, U+03BE ISOgrk3
    '&omicron;'  => '&#959;',  # greek small letter omicron, U+03BF NEW
    '&pi;'       => '&#960;',  # greek small letter pi, U+03C0 ISOgrk3
    '&rho;'      => '&#961;',  # greek small letter rho, U+03C1 ISOgrk3
    '&sigmaf;'   => '&#962;',  # greek small letter final sigma, U+03C2 ISOgrk3
    '&sigma;'    => '&#963;',  # greek small letter sigma, U+03C3 ISOgrk3
    '&tau;'      => '&#964;',  # greek small letter tau, U+03C4 ISOgrk3
    '&upsilon;'  => '&#965;',  # greek small letter upsilon, U+03C5 ISOgrk3
    '&phi;'      => '&#966;',  # greek small letter phi, U+03C6 ISOgrk3
    '&chi;'      => '&#967;',  # greek small letter chi, U+03C7 ISOgrk3
    '&psi;'      => '&#968;',  # greek small letter psi, U+03C8 ISOgrk3
    '&omega;'    => '&#969;',  # greek small letter omega, U+03C9 ISOgrk3
    '&thetasym;' => '&#977;',  # greek small letter theta symbol, U+03D1 NEW
    '&upsih;'    => '&#978;',  # greek upsilon with hook symbol, U+03D2 NEW
    '&piv;'      => '&#982;',  # greek pi symbol, U+03D6 ISOgrk3
    '&bull;'     => '&#8226;', # bullet = black small circle, U+2022 ISOpub
    '&hellip;'   => '&#8230;', # horizontal ellipsis = three dot leader, U+2026 ISOpub
    '&prime;'    => '&#8242;', # prime = minutes = feet, U+2032 ISOtech
    '&Prime;'    => '&#8243;', # double prime = seconds = inches, U+2033 ISOtech
    '&oline;'    => '&#8254;', # overline = spacing overscore, U+203E NEW
    '&frasl;'    => '&#8260;', # fraction slash, U+2044 NEW
    '&weierp;'   => '&#8472;', # script capital P = power set = Weierstrass p, U+2118 ISOamso
    '&image;'    => '&#8465;', # blackletter capital I = imaginary part, U+2111 ISOamso
    '&real;'     => '&#8476;', # blackletter capital R = real part symbol, U+211C ISOamso
    '&trade;'    => '&#8482;', # trade mark sign, U+2122 ISOnum
    '&alefsym;'  => '&#8501;', # alef symbol = first transfinite cardinal, U+2135 NEW
    '&larr;'     => '&#8592;', # leftwards arrow, U+2190 ISOnum
    '&uarr;'     => '&#8593;', # upwards arrow, U+2191 ISOnum
    '&rarr;'     => '&#8594;', # rightwards arrow, U+2192 ISOnum
    '&darr;'     => '&#8595;', # downwards arrow, U+2193 ISOnum
    '&harr;'     => '&#8596;', # left right arrow, U+2194 ISOamsa
    '&crarr;'    => '&#8629;', # downwards arrow with corner leftwards = carriage return, U+21B5 NEW
    '&lArr;'     => '&#8656;', # leftwards double arrow, U+21D0 ISOtech
    '&uArr;'     => '&#8657;', # upwards double arrow, U+21D1 ISOamsa
    '&rArr;'     => '&#8658;', # rightwards double arrow, U+21D2 ISOtech
    '&dArr;'     => '&#8659;', # downwards double arrow, U+21D3 ISOamsa
    '&hArr;'     => '&#8660;', # left right double arrow, U+21D4 ISOamsa
    '&forall;'   => '&#8704;', # for all, U+2200 ISOtech
    '&part;'     => '&#8706;', # partial differential, U+2202 ISOtech
    '&exist;'    => '&#8707;', # there exists, U+2203 ISOtech
    '&empty;'    => '&#8709;', # empty set = null set = diameter, U+2205 ISOamso
    '&nabla;'    => '&#8711;', # nabla = backward difference, U+2207 ISOtech
    '&isin;'     => '&#8712;', # element of, U+2208 ISOtech
    '&notin;'    => '&#8713;', # not an element of, U+2209 ISOtech
    '&ni;'       => '&#8715;', # contains as member, U+220B ISOtech
    '&prod;'     => '&#8719;', # n-ary product = product sign, U+220F ISOamsb
    '&sum;'      => '&#8721;', # n-ary summation, U+2211 ISOamsb
    '&minus;'    => '&#8722;', # minus sign, U+2212 ISOtech
    '&lowast;'   => '&#8727;', # asterisk operator, U+2217 ISOtech
    '&radic;'    => '&#8730;', # square root = radical sign, U+221A ISOtech
    '&prop;'     => '&#8733;', # proportional to, U+221D ISOtech
    '&infin;'    => '&#8734;', # infinity, U+221E ISOtech
    '&ang;'      => '&#8736;', # angle, U+2220 ISOamso
    '&and;'      => '&#8743;', # logical and = wedge, U+2227 ISOtech
    '&or;'       => '&#8744;', # logical or = vee, U+2228 ISOtech
    '&cap;'      => '&#8745;', # intersection = cap, U+2229 ISOtech
    '&cup;'      => '&#8746;', # union = cup, U+222A ISOtech
    '&int;'      => '&#8747;', # integral, U+222B ISOtech
    '&there4;'   => '&#8756;', # therefore, U+2234 ISOtech
    '&sim;'      => '&#8764;', # tilde operator = varies with = similar to, U+223C ISOtech
    '&cong;'     => '&#8773;', # approximately equal to, U+2245 ISOtech
    '&asymp;'    => '&#8776;', # almost equal to = asymptotic to, U+2248 ISOamsr
    '&ne;'       => '&#8800;', # not equal to, U+2260 ISOtech
    '&equiv;'    => '&#8801;', # identical to, U+2261 ISOtech
    '&le;'       => '&#8804;', # less-than or equal to, U+2264 ISOtech
    '&ge;'       => '&#8805;', # greater-than or equal to, U+2265 ISOtech
    '&sub;'      => '&#8834;', # subset of, U+2282 ISOtech
    '&sup;'      => '&#8835;', # superset of, U+2283 ISOtech
    '&nsub;'     => '&#8836;', # not a subset of, U+2284 ISOamsn
    '&sube;'     => '&#8838;', # subset of or equal to, U+2286 ISOtech
    '&supe;'     => '&#8839;', # superset of or equal to, U+2287 ISOtech
    '&oplus;'    => '&#8853;', # circled plus = direct sum, U+2295 ISOamsb
    '&otimes;'   => '&#8855;', # circled times = vector product, U+2297 ISOamsb
    '&perp;'     => '&#8869;', # up tack = orthogonal to = perpendicular, U+22A5 ISOtech
    '&sdot;'     => '&#8901;', # dot operator, U+22C5 ISOamsb
    '&lceil;'    => '&#8968;', # left ceiling = apl upstile, U+2308 ISOamsc
    '&rceil;'    => '&#8969;', # right ceiling, U+2309 ISOamsc
    '&lfloor;'   => '&#8970;', # left floor = apl downstile, U+230A ISOamsc
    '&rfloor;'   => '&#8971;', # right floor, U+230B ISOamsc
    '&lang;'     => '&#9001;', # left-pointing angle bracket = bra, U+2329 ISOtech
    '&rang;'     => '&#9002;', # right-pointing angle bracket = ket, U+232A ISOtech
    '&loz;'      => '&#9674;', # lozenge, U+25CA ISOpub
    '&spades;'   => '&#9824;', # black spade suit, U+2660 ISOpub
    '&clubs;'    => '&#9827;', # black club suit = shamrock, U+2663 ISOpub
    '&hearts;'   => '&#9829;', # black heart suit = valentine, U+2665 ISOpub
    '&diams;'    => '&#9830;', # black diamond suit, U+2666 ISOpub
    '&quot;'     => '&#34;',   # quotation mark = APL quote, U+0022 ISOnum
    '&amp;'      => '&#38;',   # ampersand, U+0026 ISOnum
    '&lt;'       => '&#60;',   # less-than sign, U+003C ISOnum
    '&gt;'       => '&#62;',   # greater-than sign, U+003E ISOnum
    '&OElig;'    => '&#338;',  # latin capital ligature OE, U+0152 ISOlat2
    '&oelig;'    => '&#339;',  # latin small ligature oe, U+0153 ISOlat2
    '&Scaron;'   => '&#352;',  # latin capital letter S with caron, U+0160 ISOlat2
    '&scaron;'   => '&#353;',  # latin small letter s with caron, U+0161 ISOlat2
    '&Yuml;'     => '&#376;',  # latin capital letter Y with diaeresis, U+0178 ISOlat2
    '&circ;'     => '&#710;',  # modifier letter circumflex accent, U+02C6 ISOpub
    '&tilde;'    => '&#732;',  # small tilde, U+02DC ISOdia
    '&ensp;'     => '&#8194;', # en space, U+2002 ISOpub
    '&emsp;'     => '&#8195;', # em space, U+2003 ISOpub
    '&thinsp;'   => '&#8201;', # thin space, U+2009 ISOpub
    '&zwnj;'     => '&#8204;', # zero width non-joiner, U+200C NEW RFC 2070
    '&zwj;'      => '&#8205;', # zero width joiner, U+200D NEW RFC 2070
    '&lrm;'      => '&#8206;', # left-to-right mark, U+200E NEW RFC 2070
    '&rlm;'      => '&#8207;', # right-to-left mark, U+200F NEW RFC 2070
    '&ndash;'    => '&#8211;', # en dash, U+2013 ISOpub
    '&mdash;'    => '&#8212;', # em dash, U+2014 ISOpub
    '&lsquo;'    => '&#8216;', # left single quotation mark, U+2018 ISOnum
    '&rsquo;'    => '&#8217;', # right single quotation mark, U+2019 ISOnum
    '&sbquo;'    => '&#8218;', # single low-9 quotation mark, U+201A NEW
    '&ldquo;'    => '&#8220;', # left double quotation mark, U+201C ISOnum
    '&rdquo;'    => '&#8221;', # right double quotation mark, U+201D ISOnum
    '&bdquo;'    => '&#8222;', # double low-9 quotation mark, U+201E NEW
    '&dagger;'   => '&#8224;', # dagger, U+2020 ISOpub
    '&Dagger;'   => '&#8225;', # double dagger, U+2021 ISOpub
    '&permil;'   => '&#8240;', # per mille sign, U+2030 ISOtech
    '&lsaquo;'   => '&#8249;', # single left-pointing angle quotation mark, U+2039 ISO proposed
    '&rsaquo;'   => '&#8250;', # single right-pointing angle quotation mark, U+203A ISO proposed
    '&euro;'     => '&#8364;', # euro sign, U+20AC NEW
);

And the additional one for XHTML:

    '&apos;'     => '&#39;',   # apostrophe = APL quote, U+0027 ISOnum
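If maintaining the table by hand is undesirable, it can probably be derived from PHP's own entity list instead. The sketch below assumes the mbstring extension (for mb_ord, available since PHP 7.2); build_named_to_numeric is a hypothetical helper name, not something from the answer above.

```php
<?php
// Sketch: derive a named-to-numeric table from PHP's built-in HTML 4.01 entity list.
// build_named_to_numeric() is a hypothetical helper; requires ext-mbstring (mb_ord).
function build_named_to_numeric(): array
{
    // Maps each character to its entity, e.g. "è" => "&egrave;".
    $charToEntity = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    $map = array();
    foreach ($charToEntity as $char => $entity) {
        // Skip entries that are already numeric, such as "&#039;" for the single quote.
        if (strpos($entity, '&#') === 0) {
            continue;
        }
        // Key: named entity. Value: decimal reference built from the code point.
        $map[$entity] = '&#' . mb_ord((string) $char, 'UTF-8') . ';';
    }
    return $map;
}

$map = build_named_to_numeric();
echo strtr('Oggi &egrave; un bel&nbsp;giorno', $map), "\n"; // Oggi &#232; un bel&#160;giorno
```

The generated table follows whatever entity list the running PHP version ships, so it is worth comparing against the hand-written array before relying on it.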
answered 2012-06-24T18:00:08.820
6

This solution is based on code from php.net:

function entities_to_unicode($str) {
    $str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
    $str = preg_replace_callback("/(&#[0-9]+;)/", function($m) { return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); }, $str);
    return $str;
}

$str = 'Oggi &egrave; un bel&nbsp;giorno';
echo entities_to_unicode($str);
answered 2012-06-24T14:10:57.300
5

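This solution

  1. does not require enumerating the entities in user code,
  2. works on HTML code that contains named entities (bluntly applying html_entity_decode to the whole string would confuse the < and > decoded from &lt; and &gt; with the characters that start and end HTML tags).

Here it is: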

function htmlent2xml($s) {
    return preg_replace_callback("/(&[a-zA-Z][a-zA-Z0-9]*;)/",function($m){
       $c = html_entity_decode($m[0],ENT_HTML5,"UTF-8");
       # return htmlentities($c,ENT_XML1,"UTF-8"); -- see update below

       $convmap = array(0x80, 0xffff, 0, 0xffff);
       return mb_encode_numericentity($c, $convmap, 'UTF-8');
    },$s);
}

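Update: following Carlos's comment, the proposed solution was fixed, based on another answer, so that it produces the required numeric entities.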
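One caveat seems worth noting: because the conversion map in htmlent2xml starts at 0x80, an entity that decodes to an ASCII character (for example &lt; or &amp;) comes back as the raw character rather than as a numeric reference. The variant below is only a sketch under that reading; the name htmlent2numeric is hypothetical, and it assumes the mbstring extension (mb_str_split needs PHP 7.4+, mb_ord PHP 7.2+).

```php
<?php
// Sketch: convert each named entity to decimal references, including ASCII ones.
// htmlent2numeric() is a hypothetical name; requires ext-mbstring.
function htmlent2numeric(string $html): string
{
    $result = preg_replace_callback('/&[a-zA-Z][a-zA-Z0-9]*;/', function (array $m): string {
        $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if ($decoded === $m[0]) {
            return $m[0]; // Unknown entity name: leave it untouched.
        }
        // Some HTML5 entities decode to more than one code point, so encode each one.
        $out = '';
        foreach (mb_str_split($decoded, 1, 'UTF-8') as $char) {
            $out .= '&#' . mb_ord($char, 'UTF-8') . ';';
        }
        return $out;
    }, $html);

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

echo htmlent2numeric('<p>1 &lt; 2 &amp; Oggi &egrave; un bel&nbsp;giorno</p>'), "\n";
// <p>1 &#60; 2 &#38; Oggi &#232; un bel&#160;giorno</p>
```

Real tags are never touched, since only text matching the entity pattern reaches the callback.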

answered 2016-10-06T14:36:19.510
3
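// The /e modifier was removed in PHP 7, so a callback is used; ord() reads one byte, hence the ISO-8859-1 decoding (the default before PHP 5.4).
echo preg_replace_callback('/[^!-%\x27-;=?-~ ]/', function ($m) { return '&#' . ord($m[0]) . ';'; }, html_entity_decode($str, ENT_COMPAT, 'ISO-8859-1'));
answered 2012-06-24T10:41:55.150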
0

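First use html_entity_decode to get an unencoded version of the source code. If necessary, set the third parameter (the encoding) to an appropriate value.

Then use utf8_encode on that source code.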

$source_code_without_entities = html_entity_decode($source_code_with_entities);
$utf8_source_code = utf8_encode($source_code_without_entities);
answered 2012-06-24T13:01:47.063
-1

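The answer provided by @hakre is the only one that actually solves the question as asked. Interestingly, none of the other answers work, including the accepted one. By the way, the accepted answer does not really do anything! The others at least do something, but it is wrong. The people who tried to answer do not seem to have understood that the author wanted to convert named entities to their numeric counterparts.

Below is my contribution, based on a comment in the PHP documentation ( https://www.php.net/manual/pt_BR/function.htmlentities.php#106535 ):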

function xmlentities($aString) {
    $validChars = "A-Z0-9a-z\s_-";
    $twoChars = null;
    return preg_replace_callback("/[^$validChars]/"
    // Adding use(&$twoChars) makes $twoChars visible inside the
    // anonymous function. The "&" is required if the function is meant to
    // change the value of this variable
                                ,function ($aMatches) use(&$twoChars) { 
                                    $oneChar = $aMatches[0];
                                    switch($oneChar) {
    // Perform direct substitutions: here, the entities that XML
    // recognizes. I could have used a built-in PHP function for
    // this, but chose not to because there are only 5 characters to replace
                                        case "'": return "&apos;";
                                        case '"': return "&quot;";
                                        case '&': return "&amp;";
                                        case '<': return "&lt;";
                                        case '>': return "&gt;";
    // If the character is not an entity recognized by XML, special handling
    // is needed to work out whether we are dealing with
    // ISO-8859-1 or UTF-8 characters
                                        default: 
    // UTF-8 extends the ASCII table compatibly. The first 128 characters
    // take 1 byte; the characters handled here beyond that take two bytes.
    // This code treats lead bytes from C2 (194) through CF (207) as the
    // start of a two-byte UTF-8 character. The condition below detects
    // one of these bytes, which marks a UTF-8 character.
    // In that case the byte is stored in a variable so that the two
    // bytes can later be converted into a single ISO-8859-1 byte.
    // This first branch only stores the byte in the
    // variable. Nothing is returned
                                            if (194 <= ord($oneChar) && ord($oneChar) <= 207) { 
                                                $twoChars = $oneChar;
                                                return;
    // If $twoChars holds a value, an earlier step stored the first
    // byte of a UTF-8 pair in it. In that case we append the second
    // byte, convert the pair to ISO-8859-1 and reset the control
    // variable ($twoChars) to null. We then return the output
    // formatted with the character's ordinal in the
    // ISO-8859-1 table
                                            } else if ($twoChars) { 
                                                $twoChars .= $oneChar;
                                                $ansiChar = utf8_decode($twoChars);
                                                $twoChars = null;
                                                return "&#" . str_pad(ord($ansiChar), 3, "0", STR_PAD_LEFT) . ";";
    // If the string passed in the $aString argument is already
    // encoded in ISO-8859-1, every character is 1 byte long, so
    // it is enough to format that byte directly
                                            } else {
                                                return "&#" . str_pad(ord($oneChar), 3, "0", STR_PAD_LEFT) . ";";       
                                            }
                                    }
                                }
                                ,$aString);
}

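My version is commented (with the help of Google Translate) and can only handle "raw" strings without entities (&xxx;). So if your string contains named entities, first convert it to its raw form: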

$text = "Oggi &egrave; un bel&nbsp;giorno";

$text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, "UTF-8"); // bitwise | combines the flags; || would yield a boolean

$text = xmlentities($text);

echo($text); // Output = Oggi &#232; un bel&#160;giorno
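The byte-level logic above appears to cover only characters that fit in ISO-8859-1, so a character such as the euro sign would not come out as a single reference. As a hedged alternative sketch, the same "decode, then re-encode" idea can be expressed with htmlspecialchars and mb_encode_numericentity. The name xml_numeric_entities is hypothetical, and the mbstring extension is assumed.

```php
<?php
// Sketch: escape XML-special characters, then encode all non-ASCII code points.
// xml_numeric_entities() is a hypothetical name; requires ext-mbstring.
function xml_numeric_entities(string $text): string
{
    // Escape &, <, >, " and ' (ENT_XML1 selects the XML entity set).
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');

    // Conversion map: start, end, offset, mask. Everything from U+0080 up
    // becomes a decimal reference; ASCII (including the & just added) is kept.
    $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($escaped, $convmap, 'UTF-8');
}

$text = 'Oggi &egrave; un bel&nbsp;giorno, 5 &euro;';
$text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo xml_numeric_entities($text), "\n"; // Oggi &#232; un bel&#160;giorno, 5 &#8364;
```

Unlike xmlentities, this sketch leaves other ASCII punctuation unencoded, which is usually sufficient for XML output.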
answered 2020-06-27T20:08:09.943