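I have a WYSIWYG editor on a site. The problem is that users copy and paste a lot of data into it, leaving many unclosed and improperly formatted div tags that break the site layout.
Is there an easy way to strip all occurrences of <div> and </div>?
str_replace won't work because some of the divs have styling and other things in them, so it would need to account for <div style="some styling">, <div align="center">, etc.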
It is best to use a DOM-based HTML parser, but if you have no choice and must use a regex, you can do it like this:
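// $html holds the pasted markup.
$patterns = array();
$patterns[0] = '/<div[^>]*>/';   // opening tags, with or without attributes
$patterns[1] = '/<\/div>/';      // closing tags
// Keep the replacement keys aligned with the pattern keys.
$replacements = array();
$replacements[0] = '';
$replacements[1] = '';
echo preg_replace($patterns, $replacements, $html);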
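The two patterns above can also be folded into one. The following is only a sketch: the function name strip_div_tags_regex is made up for illustration, and the pattern still assumes that no attribute value contains a literal > character.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function strip_div_tags_regex(string $html): string
{
    // <\/?div  matches an opening or a closing tag
    // \b       keeps the match from also hitting tags such as <divider>
    // [^>]*    skips any attributes up to the closing bracket
    // i flag   also catches <DIV> pasted from older editors
    $result = preg_replace('/<\/?div\b[^>]*>/i', '', $html);

    // preg_replace() returns null on a regex engine error; fall back to the input.
    return $result ?? $html;
}

echo strip_div_tags_regex('<div style="color:red"><DIV align="center">Hello</div> world');
// prints: Hello world
```

Only the tags are removed; the text and any other markup between them stay in place.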
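No, you should never parse or manipulate HTML with regular expressions.
Regular expressions can't be bargained with. They can't be reasoned with. They don't understand HTML, and they don't understand XML. And they absolutely will not stop until your DOM tree is dead.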
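For comparison, here is a rough sketch of what the DOM route could look like. It assumes the dom extension is available and that the input is UTF-8. The function name strip_divs_dom is hypothetical. Note that the parser repairs the markup as it loads it, so the output may be normalised in other ways as well (for example, stray closing tags are dropped).

```php
<?php
// Sketch only: strip_divs_dom() is a hypothetical name, not from the thread.
function strip_divs_dom(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings instead of printing them; pasted markup is messy.
    $previous = libxml_use_internal_errors(true);
    // Assumption: input is UTF-8. The meta tag tells libxml which charset to use,
    // and the explicit body gives a known container to serialise from.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body>' . $html . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so snapshot it before editing.
    $divs = iterator_to_array($doc->getElementsByTagName('div'));
    foreach ($divs as $div) {
        // Move every child in front of the div, then remove the emptied div.
        while ($div->firstChild !== null) {
            $div->parentNode->insertBefore($div->firstChild, $div);
        }
        $div->parentNode->removeChild($div);
    }

    // Serialise only the children of <body>, not the wrapper added above.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo strip_divs_dom('<div style="x"><p>Hi</p><div align="center">there</div>');
```

Unwrapping the elements rather than deleting them keeps the content that sat inside each div, which is what the question asks for.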
Here is a simplified example of how to do it in PHP:
/**
 * Removes the divs because why not
 */
function strip_divs(&$text, $id = 'html') {
  $replacements = array();
  worker($text, $replacements, $id);
  // Swap each [[div-N]] placeholder for its replacement (an empty string).
  foreach ($replacements as $key => $val) {
    $text = mb_str_replace($key, $val, $text);
  }
  return $text;
}
function worker(&$body, &$replacements, $id) {
  // Per-id recursion counter; it caps the work at 200 passes per id.
  static $call_count;
  if (empty($call_count)) {
    $call_count = array();
  }
  if (empty($call_count[$id])) {
    $call_count[$id] = 0;
  }
  // Compare against FALSE so a closing tag at position 0 is not missed.
  if (mb_strpos($body, '</div>') !== FALSE) {
    $body = mb_str_replace('</div>', '', $body);
  }
  if (mb_strpos($body, '<di') !== FALSE) {
    $call_count[$id] ++;
    // Gets the important junk
    $rm = '<di' . xml_get($body, '<di', '>') . '>';
    // Builds the replacements HTML
    $replacement_html = '';
    $next_id = count($replacements);
    $replacement_id = "[[div-$next_id]]";
    $replacements[$replacement_id] = $replacement_html;
    $body = mb_str_replace($rm, $replacement_id, $body);
    if (mb_strpos($body, '<di') !== FALSE && $call_count[$id] < 200) {
      worker($body, $replacements, $id);
    }
  }
}
/**
 * Returns text by specifying a start and end point
 * @param string $str
 *   The text to search
 * @param string $start
 *   The beginning identifier
 * @param string $end
 *   The ending identifier
 */
function xml_get($str, $start, $end) {
  // Pad the string so a match at the very start has a position greater than 0.
  $str = "|" . $str . "|";
  $len = mb_strlen($start);
  if (mb_strpos($str, $start) > 0) {
    $int_start = mb_strpos($str, $start) + $len;
    $temp = right($str, (mb_strlen($str) - $int_start));
    $int_end = mb_strpos($temp, $end);
    $return = trim(left($temp, $int_end));
    return $return;
  }
  else {
    return FALSE;
  }
}
function right($str, $count) {
  return mb_substr($str, ($count * -1));
}
function left($str, $count) {
  return mb_substr($str, 0, $count);
}
/**
 * Multibyte str replace
 */
if (!function_exists('mb_str_replace')) {
  function mb_str_replace($search, $replace, $subject, &$count = 0) {
    if (!is_array($subject)) {
      $searches = is_array($search) ? array_values($search) : array($search);
      $replacements = is_array($replace) ? array_values($replace) : array($replace);
      $replacements = array_pad($replacements, count($searches), '');
      foreach ($searches as $key => $search) {
        $parts = mb_split(preg_quote($search), $subject);
        $count += count($parts) - 1;
        $subject = implode($replacements[$key], $parts);
      }
    }
    else {
      foreach ($subject as $key => $value) {
        $subject[$key] = mb_str_replace($search, $replace, $value, $count);
      }
    }
    return $subject;
  }
}
$html = <<<HTML
<td class="votecell">
<div class="vote">
<input type="hidden" name="_id_" value="9607101">
<a class="vote-up-off" title="This question shows research effort; it is useful and clear">up vote</a>
<span itemprop="upvoteCount" class="vote-count-post ">0</span>
<a class="vote-down-off" title="This question does not show any research effort; it is unclear or not useful">down vote</a>
<a class="star-off" href="#">favorite</a>
<div class="favoritecount"><b></b></div>
<td class="postcell">
<div class="post-text" itemprop="text">
<p>I have a wysiwyg on a site. The problem is that the users are copy pasting a lot of data in to it leaving a lot of unclosed and improperly formatted div tags that are breaking the site layout. </p>
<p>Is there an easy an easy way to strip all occurrences of <code><div></code> and <code></div></code>?</p>
<p>str_replace won't work because some of the divs have styling and other things in them so it would need to account for <code><div style="some styling"> <div align="center"></code> etc</p>
<p>I'm guessing this could be done with a regular expression but I am total a total beginner when it comes to those. </p>
<p>Thanks a lot,
<div class="post-taglist">
<a href="/questions/tagged/php" class="post-tag js-gps-track" title="show questions tagged 'php'" rel="tag">php</a> <a href="/questions/tagged/regex" class="post-tag js-gps-track" title="show questions tagged 'regex'" rel="tag">regex</a> <a href="/questions/tagged/replace" class="post-tag js-gps-track" title="show questions tagged 'replace'" rel="tag">replace</a> <a href="/questions/tagged/str-replace" class="post-tag js-gps-track" title="" rel="tag">str-replace</a> <a href="/questions/tagged/strip-tags" class="post-tag js-gps-track" title="show questions tagged 'strip-tags'" rel="tag">strip-tags</a>
<table class="fw">
<td class="vt">
<div class="post-menu"><a href="/q/9607101" title="short permalink to this question" class="short-link" id="link-post-9607101">share</a><span class="lsep">|</span><a href="/posts/9607101/edit" class="suggest-edit-post" title="">improve this question</a></div>
<td align="right" class="post-signature">
<div class="user-info ">
<div class="user-action-time">
<a href="/posts/9607101/revisions" title="show all edits to this post">edited <span title="2012-03-07 18:32:29Z" class="relativetime">Mar 7 '12 at 18:32</span></a>
<div class="user-gravatar32">
<div class="user-details">
<div class="-flair">
<td class="post-signature owner">
<div class="user-info ">
<div class="user-action-time">
asked <span title="2012-03-07 18:31:11Z" class="relativetime">Mar 7 '12 at 18:31</span>
<div class="user-gravatar32">
<a href="/users/702826/martin-hunt">
<div class="gravatar-wrapper-32"><img src="https://www.gravatar.com/avatar/a578c3eae229c86dbe46d4b1603e071b?s=32&d=identicon&r=PG" alt="" width="32" height="32"></div>
<div class="user-details">
<a href="/users/702826/martin-hunt">Martin Hunt</a>
<div class="-flair">
<span class="reputation-score" title="reputation score " dir="ltr">313</span><span title="7 silver badges"><span class="badge2"></span><span class="badgecount">7</span></span><span title="20 bronze badges"><span class="badge3"></span><span class="badgecount">20</span></span>
<td class="votecell"></td>
<div id="comments-9607101" class="comments ">
<tbody data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true">
<tr id="comment-12187969" class="comment ">
<td class="comment-actions">
<td class=" comment-score">
<span title="number of 'useful comment' votes received" class="cool">1</span>
<td class="comment-text">
<div style="display: block;" class="comment-body">
<span class="comment-copy">So you need to remove all the div tags but not the content between the div. Am I right?</span>
–&nbsp;<a href="/users/500725/siva-charan" title="14,075 reputation" class="comment-user">Siva Charan</a>
<span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12187969_9607101"><span title="2012-03-07 18:34:11Z" class="relativetime-clean">Mar 7 '12 at 18:34</span></a></span>
<tr id="comment-12189778" class="comment ">
<td class=" comment-score">
<td class="comment-text">
<div style="display: block;" class="comment-body">
<span class="comment-copy"><a href="http://stackoverflow.com/a/4667535/208809">Replace the XPath with <code>//div[not[@*]]</code></a> to remove all div elements (incl. content) without attributes.</span>
–&nbsp;<a href="/users/208809/gordon" title="225,421 reputation" class="comment-user">Gordon</a>
<span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12189778_9607101"><span title="2012-03-07 19:58:21Z" class="relativetime-clean">Mar 7 '12 at 19:58</span></a></span>
<span class="edited-yes" title="this comment was edited 2 times"></span>
<div id="comments-link-9607101" data-rep="50" data-anon="true">
<a class="js-add-link comments-link disabled-link " title="Use comments to ask for more information or suggest improvements. Avoid answering questions in comments.">add a comment</a><span class="js-link-separator dno"> | </span>
<a class="js-show-link comments-link dno" title="expand to show all comments on this post" href="#" onclick=""></a>
HTML;
echo strip_divs($html);
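strip_tags($str, '<div>'); // Caution: the second argument lists the tags to keep, so this call strips every tag except <div>.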
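Because the second argument of strip_tags is an allowlist, getting rid of divs with it means listing every tag that should survive and leaving div out. A minimal sketch follows; the function name and the particular allowlist are assumptions chosen for illustration, not something from the thread.

```php
<?php
// Hypothetical helper and allowlist; adjust the list to what the editor should emit.
function strip_all_but_allowed(string $html): string
{
    // Every tag NOT named here (including div) is removed; its inner text is kept.
    $allowed = '<p><br><a><b><strong><i><em><ul><ol><li><span>';
    return strip_tags($html, $allowed);
}

echo strip_all_but_allowed('<div style="x"><p>Hello <b>world</b></p><div align="center">again');
// prints: <p>Hello <b>world</b></p>again
```

Keep in mind that strip_tags leaves the attributes of the allowed tags untouched, so inline styles on the kept tags will still come through.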