9

I need help writing a regex function that converts an HTML string into a valid XML tag name. For example, it takes a string and does the following:

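  • If a letter or an underscore occurs in the string, keep it.
  • If any other character occurs, remove it from the output string.
  • If any other character occurs between words or letters, replace it with an underscore.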
Examples:
Input: Date Created
Output: Date_Created

Input: Date<br/>Created
Output: Date_Created

Input: Date\nCreated
Output: Date_Created

Input: Date    1 2 3 Created
Output: Date_Created

Basically, the regex function should convert the HTML string into a valid XML tag.

4 Answers

5

A bit of regex and a bit of standard functions:

function mystrip($s)
{
        // add spaces around angle brackets to separate tag-like parts
        // e.g. "<br />" becomes " <br /> "
        // then let strip_tags take care of removing html tags
        $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));

        // any sequence of characters that are not alphabet or underscore
        // gets replaced by a single underscore
        return preg_replace('/[^a-z_]+/i', '_', $s);
}
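
As a quick check, the function can be run against the four inputs from the question. This is only a sketch: the function body is repeated so the snippet runs on its own, the `$inputs` and `$results` names are made up for illustration, and `\n` is assumed to mean a real newline.

```php
<?php
// Repeated from the answer above so this snippet runs standalone.
if (!function_exists('mystrip')) {
    function mystrip($s)
    {
        $s = strip_tags(str_replace(array('<', '>'), array(' <', '> '), $s));
        return preg_replace('/[^a-z_]+/i', '_', $s);
    }
}

// The four inputs from the question; "\n" is a real newline (double quotes).
$inputs = array(
    'Date Created',
    'Date<br/>Created',
    "Date\nCreated",
    'Date    1 2 3 Created',
);

$results = array_map('mystrip', $inputs);

foreach ($results as $r) {
    echo $r, PHP_EOL; // prints Date_Created each time
}
```

Note that non-letter characters at the very start or end of the input also turn into an underscore, so `mystrip(' Date ')` gives `_Date_`; wrap the result in `trim($result, '_')` if that is not wanted.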
answered 2012-06-03T04:39:18.543
2

Try this:

$result = preg_replace('/([\d\s]|<[^<>]+>)/', '_', $subject);

Explanation

"
(               # Match the regular expression below and capture its match into backreference number 1
                   # Match either the regular expression below (attempting the next alternative only if this one fails)
      [\d\s]          # Match a single character present in the list below
                         # A single digit 0..9
                         # A whitespace character (spaces, tabs, and line breaks)
   |               # Or match regular expression number 2 below (the entire group fails if this one fails to match)
      <               # Match the character “<” literally
      [^<>]           # Match a single character NOT present in the list “<>”
         +               # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      >               # Match the character “>” literally
)
"
answered 2012-06-03T04:20:34.823
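
Running the pattern from the answer above against the question's inputs shows one thing to watch for: it replaces each digit or whitespace character separately, so a run of them produces a run of underscores. The sketch below compares it with a variant that adds a `+` quantifier; that variant and the variable names are assumptions for illustration, not part of the answer.

```php
<?php
// Pattern exactly as given in the answer: one underscore per matched item.
$original = '/([\d\s]|<[^<>]+>)/';

// Hypothetical variant (not from the answer): a non-capturing group with "+"
// so a run of digits, whitespace and tags collapses into one underscore.
$collapsed = '/(?:[\d\s]|<[^<>]+>)+/';

// "Date", four spaces, then "1 2 3 Created", as in the question.
$spaced = 'Date' . str_repeat(' ', 4) . '1 2 3 Created';

$a = preg_replace($original, '_', 'Date<br/>Created'); // Date_Created
$b = preg_replace($original, '_', $spaced);            // "Date", ten underscores, "Created"
$c = preg_replace($collapsed, '_', $spaced);           // Date_Created

echo $a, PHP_EOL, $b, PHP_EOL, $c, PHP_EOL;
```

Either way, characters other than digits, whitespace and tags (punctuation, for instance) are left in place, so this covers the examples but not the full rule set in the question.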
2

This should work:

$text = preg_replace( '/(?<=[a-zA-Z])[^a-zA-Z_]+(?=[a-zA-Z])/', '_', $text );

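So it uses lookarounds to check that there is an alphabetic character before and after, and replaces any run of non-letter, non-underscore characters between them.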
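
To see what the lookarounds do in practice, here is a small sketch applying the same pattern to a few inputs. The variable names are made up, and the tag-stripping step at the end is borrowed as an assumption (it is not part of this answer).

```php
<?php
$pattern = '/(?<=[a-zA-Z])[^a-zA-Z_]+(?=[a-zA-Z])/';

// A run of spaces and digits between two letters becomes one underscore.
$plain = preg_replace($pattern, '_', 'Date    1 2 3 Created'); // Date_Created

// Leading and trailing characters have no letter on one side, so they stay.
$edges = preg_replace($pattern, '_', ' Date Created '); // " Date_Created "

// Tag names are letters too, so the tag is not removed as a whole.
$tag = preg_replace($pattern, '_', 'Date<br/>Created'); // Date_br_Created

// Assumption: pad "<" with a space and strip tags first, then apply the pattern.
$stripped = preg_replace(
    $pattern,
    '_',
    strip_tags(str_replace('<', ' <', 'Date<br/>Created'))
); // Date_Created

echo $plain, PHP_EOL, $edges, PHP_EOL, $tag, PHP_EOL, $stripped, PHP_EOL;
```

So the pattern handles the separator cases on its own, but HTML input probably needs the tags removed beforehand.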

answered 2012-06-03T04:20:57.443
1

I believe the following should work.

preg_replace('/[^A-Za-z_]+(.*)?([^A-Za-z_]+)?/', '_', $string);

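The first part of the regex, [^A-Za-z_]+, matches one or more characters that are not letters or underscores. The ending part is the same, except that it is optional. This allows the middle part, (.*)?, which is also optional, to capture any characters (even letters and underscores) between two blacklisted characters.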
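
One thing worth testing before relying on this pattern: `.*` is greedy, so after the first disallowed character it consumes the rest of the line, letters included. A minimal sketch of the effect follows; the variable names and the simplified pattern are assumptions for comparison, not part of the answer.

```php
<?php
// Pattern exactly as given in the answer.
$pattern = '/[^A-Za-z_]+(.*)?([^A-Za-z_]+)?/';

// The space matches first, then (.*)? swallows "Created" as well.
$full = preg_replace($pattern, '_', 'Date Created'); // Date_

// Assumed simplification: only the leading character class, no middle group.
$simple = preg_replace('/[^A-Za-z_]+/', '_', 'Date Created'); // Date_Created

echo $full, PHP_EOL, $simple, PHP_EOL;
```

With only the first part of the pattern, each run of disallowed characters is replaced on its own and the text after it is kept.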

answered 2012-06-03T04:22:20.970