
I am trying to replace special characters with HTML entities, but the results are random with the same input and I don't understand why.

Here is the code:

def secure(text):
    hsconvert = {"\'": "\\'", "\"": "\\\"", "¢": "&cent;", "©": "&copy;", "÷": "&divide;", ">": "&gt;", "<": "&lt;", "µ": "&micro;", "·": "&middot;", "¶": "&para;", "±": "&plusmn;", "€": "&euro;", "£": "&pound;", "®": "&reg;", "§": "&sect;", "™": "&trade;", "¥": "&yen;", "á": "&aacute;", "Á": "&Aacute;", "à": "&agrave;", "À": "&Agrave;", "â": "&acirc;", "Â": "&Acirc;", "å": "&aring;", "Å": "&Aring;", "ã": "&atilde;", "Ã": "&Atilde;", "ä": "&auml;", "Ä": "&Auml;", "æ": "&aelig;", "Æ": "&AElig;", "ç": "&ccedil;", "Ç": "&Ccedil;", "é": "&eacute;", "É": "&Eacute;", "è": "&egrave;", "È": "&Egrave;", "ê": "&ecirc;", "Ê": "&Ecirc;", "ë": "&euml;", "Ë": "&Euml;", "í": "&iacute;", "Í": "&Iacute;", "ì": "&igrave;", "Ì": "&Igrave;", "î": "&icirc;", "Î": "&Icirc;", "ï": "&iuml;", "Ï": "&Iuml;", "ñ": "&ntilde;", "Ñ": "&Ntilde;", "ó": "&oacute;", "Ó": "&Oacute;", "ò": "&ograve;", "Ò": "&Ograve;", "ô": "&ocirc;", "Ô": "&Ocirc;", "ø": "&oslash;", "Ø": "&Oslash;", "õ": "&otilde;", "Õ": "&Otilde;", "ö": "&ouml;", "Ö": "&Ouml;", "ß": "&szlig;", "ú": "&uacute;", "Ú": "&Uacute;", "ù": "&ugrave;", "Ù": "&Ugrave;", "û": "&ucirc;", "Û": "&Ucirc;", "ü": "&uuml;", "Ü": "&Uuml;", "ÿ": "&yuml;", "\\":"\\\\"};
    for i, j in hsconvert.items():
        text = text.replace(i, j)
    return text

print(secure("La Vie d'Adèle, chapitres 1 & 2"))

Here are the console outputs:

>>> ================================ RESTART ================================
>>> 
La Vie d\'Ad&egrave;le, chapitres 1 & 2
['TV Movie', 'Video Game', 'TV Episode', 'TV Series', 'TV Series ', 'Short', 'TV Mini-Series']
>>> ================================ RESTART ================================
>>> 
La Vie d\\'Ad&egrave;le, chapitres 1 & 2
['TV Movie', 'Video Game', 'TV Episode', 'TV Series', 'TV Series ', 'Short', 'TV Mini-Series']

The problem is with the ' character which is sometimes returned as \' and sometimes as \\'.

I think it is coming from the last item in the dictionary, "\\":"\\\\" but I don't understand why it is not interpreted the same on each run.


3 Answers


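As you guessed in your answer, the problem is that iteration over a dictionary has no defined order.

From the Python 3 documentation:

Performing list(d.keys()) on a dictionary returns a list of all the keys used in the dictionary, in arbitrary order (if you want it sorted, just use sorted(d.keys()) instead).

It doesn't say so explicitly, but the same applies to items().

I'm a little surprised to see the order change between runs in this case, but here arbitrary means undefined: any order is technically valid. If you want consistent results, I'd suggest redesigning your algorithm so that it is completely insensitive to the order of the items; failing that, sorting the items first or using an OrderedDict will at least solve the consistency problem.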
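One way to make the function insensitive to item order is to do every replacement in a single pass. The sketch below is only an illustration, not the asker's code: it assumes every key is a single character (true for the table in the question), uses a shortened table, and the name secure_single_pass is made up for this example.

```python
def secure_single_pass(text):
    # Shortened, illustrative table; every key must be a single character.
    table = str.maketrans({
        "\\": "\\\\",
        "'": "\\'",
        '"': '\\"',
        "<": "&lt;",
        ">": "&gt;",
        "è": "&egrave;",
    })
    # translate() maps each input character exactly once, so text that was
    # just inserted is never scanned again and entry order is irrelevant.
    return text.translate(table)


print(secure_single_pass("La Vie d'Adèle, chapitres 1 & 2"))
```

Because the backslash produced by escaping the quote is never looked at again, the result is the same no matter how the table is ordered.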

answered 2013-11-03T19:53:40.243

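Sometimes your code first replaces "\\" with "\\\\" and then "\'" with "\\'". Sometimes it does it the other way around.

Example (using "\'" as input, written as Python string literals):

If we first do "\\" -> "\\\\" and then "\'" -> "\\'", we get "\'" after the first replacement (nothing happens, because there is no "\\" in it), and then "\\'" after the second. Printed, that is \'.

But if we do it the other way around, we get "\\'" after the first replacement, and then the second one replaces "\\" with "\\\\", so we end up with "\\\\'"! Printed, that is \\'.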
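The two cases can be reproduced by forcing each order explicitly. This is a small sketch for illustration; apply_in_order and the variable names are invented here and are not part of the original code.

```python
def apply_in_order(text, pairs):
    # Apply each (old, new) replacement in exactly the order given.
    for old, new in pairs:
        text = text.replace(old, new)
    return text


backslash_rule = ("\\", "\\\\")  # one backslash becomes two
quote_rule = ("'", "\\'")        # a quote gets a backslash in front

backslash_first = apply_in_order("'", [backslash_rule, quote_rule])
quote_first = apply_in_order("'", [quote_rule, backslash_rule])

print(backslash_first)  # backslash followed by a quote
print(quote_first)      # two backslashes followed by a quote
```

As for why the order differed between restarts: Python 3.3 and later randomize string hashes per process by default, and before dicts became insertion-ordered (CPython 3.6, guaranteed from 3.7) that could change the iteration order from one run to the next.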

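This happens because hsconvert is a dictionary, so it is not ordered, and iterating over it (the for loop) does not necessarily happen in the same order every time.

Your way of solving it is fine, but for future reference, there is an OrderedDict in the collections module.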
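A minimal sketch of the OrderedDict approach, assuming the backslash rule should always run first. The table is shortened for illustration, and HSCONVERT and secure_ordered are names chosen for this example only.

```python
from collections import OrderedDict

# Shortened, illustrative table. The backslash rule is listed first on
# purpose, so backslashes added by later rules are never escaped again.
HSCONVERT = OrderedDict([
    ("\\", "\\\\"),
    ("'", "\\'"),
    ('"', '\\"'),
    ("è", "&egrave;"),
])


def secure_ordered(text):
    # OrderedDict iterates in insertion order on every run.
    for old, new in HSCONVERT.items():
        text = text.replace(old, new)
    return text


print(secure_ordered("La Vie d'Adèle, chapitres 1 & 2"))
```

From Python 3.7 on, a plain dict also preserves insertion order, so the same table written as a dict literal would behave the same way there.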

answered 2013-11-03T19:51:37.510

I have modified the function as follows and it is working:

def secure(text):
    # Escape backslashes first; str.replace returns a new string, so reassign it.
    text = text.replace("\\", "\\\\")
    hsconvert = {"\'": "\\'", "\"": "\\\"", "¢": "&cent;", "©": "&copy;", "÷": "&divide;", ">": "&gt;", "<": "&lt;", "µ": "&micro;", "·": "&middot;", "¶": "&para;", "±": "&plusmn;", "€": "&euro;", "£": "&pound;", "®": "&reg;", "§": "&sect;", "™": "&trade;", "¥": "&yen;", "á": "&aacute;", "Á": "&Aacute;", "à": "&agrave;", "À": "&Agrave;", "â": "&acirc;", "Â": "&Acirc;", "å": "&aring;", "Å": "&Aring;", "ã": "&atilde;", "Ã": "&Atilde;", "ä": "&auml;", "Ä": "&Auml;", "æ": "&aelig;", "Æ": "&AElig;", "ç": "&ccedil;", "Ç": "&Ccedil;", "é": "&eacute;", "É": "&Eacute;", "è": "&egrave;", "È": "&Egrave;", "ê": "&ecirc;", "Ê": "&Ecirc;", "ë": "&euml;", "Ë": "&Euml;", "í": "&iacute;", "Í": "&Iacute;", "ì": "&igrave;", "Ì": "&Igrave;", "î": "&icirc;", "Î": "&Icirc;", "ï": "&iuml;", "Ï": "&Iuml;", "ñ": "&ntilde;", "Ñ": "&Ntilde;", "ó": "&oacute;", "Ó": "&Oacute;", "ò": "&ograve;", "Ò": "&Ograve;", "ô": "&ocirc;", "Ô": "&Ocirc;", "ø": "&oslash;", "Ø": "&Oslash;", "õ": "&otilde;", "Õ": "&Otilde;", "ö": "&ouml;", "Ö": "&Ouml;", "ß": "&szlig;", "ú": "&uacute;", "Ú": "&Uacute;", "ù": "&ugrave;", "Ù": "&Ugrave;", "û": "&ucirc;", "Û": "&Ucirc;", "ü": "&uuml;", "Ü": "&Uuml;", "ÿ": "&yuml;"};
    for i, j in hsconvert.items():
        text = text.replace(i, j)
    return text

But I don't understand why the old function didn't work... Isn't a for x in ... always in the same order?

answered 2013-11-03T19:45:43.177