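I am trying to fetch data from http://rates.fxcm.com/RatesXML using SimpleXML. With simplexml_load_file() I sometimes get errors, because the site always puts strange strings/numbers before and after the XML document. Example: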

2000<?xml version="1.0" encoding="UTF-8"?>
<Rates>
    <Rate Symbol="EURUSD">
    <Bid>1.27595</Bid>
    <Ask>1.2762</Ask>
    <High>1.27748</High>
    <Low>1.27385</Low>
    <Direction>-1</Direction>
    <Last>23:29:11</Last>
</Rate>
</Rates>
0

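I then decided to use file_get_contents and parse the result as a string with simplexml_load_string(), using substr() to strip the strings before and after the document. However, sometimes random strings also appear between nodes, like this: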

<Rate Symbol="EURTRY">
    <Bid>2.29443</Bid>
    <Ask>2.29562</Ask>
    <High>2.29841</High>
    <Low>2.28999</Low>

137b

 <Direction>1</Direction>
    <Last>23:29:11</Last>
</Rate>
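
For reference, the substr() step described above could look roughly like the following. This is only a sketch: `trimOuterJunk` is a made-up helper name, and the payload is a shortened copy of the first example rather than a live response. It handles the leading and trailing strings, but not ones that show up between nodes.

```php
<?php
// Hypothetical helper (not from the original post): keep only the text
// between the first '<' and the last '>' of the raw response.
function trimOuterJunk(string $raw): string
{
    $start = strpos($raw, '<');
    $end   = strrpos($raw, '>');
    if ($start === false || $end === false || $end < $start) {
        return ''; // nothing that looks like markup
    }
    return substr($raw, $start, $end - $start + 1);
}

// Shortened sample payload; in practice this would come from
// file_get_contents('http://rates.fxcm.com/RatesXML').
$raw = <<<'EOT'
2000<?xml version="1.0" encoding="UTF-8"?>
<Rates>
    <Rate Symbol="EURUSD">
        <Bid>1.27595</Bid>
        <Ask>1.2762</Ask>
    </Rate>
</Rates>
0
EOT;

$xml = simplexml_load_string(trimOuterJunk($raw));
echo $xml->Rate->Bid, "\n"; // prints 1.27595
```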

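My question is: is there any way to handle all of these random strings with a regular-expression function, regardless of where they appear? (I think that would be a better idea than contacting the site to get them to serve a proper XML file.)

1 Answer

I believe that preprocessing XML with regular expressions may be just as bad as parsing it with them.

But here is a preg_replace that removes all non-whitespace characters from the beginning of the string, from the end of the string, and after closing/self-closing tags: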

$string = preg_replace( '~
    (?|           # start of alternation where capturing group count starts from
                  # 1 for each alternative
      ^[^<]*      # match non-< characters at the beginning of the string
    |             # OR
      [^>]*$      # match non-> characters at the end of the string
    |             # OR
      (           # start of capturing group $1: closing tag
        </[^>]++> # match a closing tag; note the possessive quantifier (++); it
                  # suppresses backtracking, which is a convenient optimization,
                  # the following bit is mutually exclusive anyway (this will be
                  # used throughout the regex)
        \s++      # and the following whitespace
      )           # end of $1
      [^<\s]*+    # match non-<, non-whitespace characters (the "bad" ones)
      (?:         # start subgroup to repeat for more whitespace/non-whitespace
                  # sequences
        \s++      # match whitespace
        [^<\s]++  # match at least one "bad" character
      )*          # repeat
                  # note that this kind of pattern keeps all whitespace
                  # before the first and after the last "bad" character
    |             # OR
      (           # start of capturing group $1: self-closing tag
        <[^>/]+/> # match a self-closing tag
        \s++      # and the following whitespace
      )
      [^<\s]*+(?:\s++[^<\s]++)*
                  # same as before
    )             # end of alternation
    ~x',
    '$1',
    $input);

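Then we simply write back the closing or self-closing tag (if there was one) through the $1 replacement.

One reason this approach is not safe is that closing or self-closing tags may appear inside comments or attribute strings. But I can hardly suggest that you use an XML parser instead, since your XML parser cannot parse this XML either.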
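
To show how the cleanup might be wired into SimpleXML, here is a minimal sketch. It assumes a condensed, comment-free form of the pattern above, with `[^<\s]*+` used in both tag branches. `stripStrayTokens` is a hypothetical name, and the sample payload is assembled from the question's examples rather than fetched from the site.

```php
<?php
// Hypothetical wrapper around the cleanup regex (condensed, no x flag).
function stripStrayTokens(string $input): string
{
    $pattern = '~(?|^[^<]*'                                 // junk before the first tag
             . '|[^>]*$'                                    // junk after the last tag
             . '|(</[^>]++>\s++)[^<\s]*+(?:\s++[^<\s]++)*'  // junk after a closing tag
             . '|(<[^>/]+/>\s++)[^<\s]*+(?:\s++[^<\s]++)*'  // junk after a self-closing tag
             . ')~';
    $clean = preg_replace($pattern, '$1', $input);
    if ($clean === null) {
        throw new RuntimeException('preg_replace failed');
    }
    return $clean;
}

// Sample payload with stray tokens at the start, at the end and between nodes.
$raw = <<<'EOT'
2000<?xml version="1.0" encoding="UTF-8"?>
<Rates>
<Rate Symbol="EURTRY">
    <Bid>2.29443</Bid>
    <Low>2.28999</Low>

137b

 <Direction>1</Direction>
</Rate>
</Rates>
0
EOT;

$clean = stripStrayTokens($raw);
$xml = simplexml_load_string($clean);
if ($xml === false) {
    throw new RuntimeException('cleaned response is still not well-formed XML');
}
echo $xml->Rate['Symbol'], ' ', $xml->Rate->Low, "\n"; // prints EURTRY 2.28999
```

Checking the result of simplexml_load_string() matters here: the pattern only looks after closing and self-closing tags, so a stray string placed right after an opening tag would end up as text content instead of being removed.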

Answered 2012-11-19T08:46:36.683