html - Matching any character (including newlines) in sed

Question

I have a sed command that I want to run on a huge, terrible, ugly HTML file that was created from a Microsoft Word document. All it should do is remove any instance of the string

style='text-align:center; color:blue;
exampleStyle:exampleValue'

The sed command that I am trying to modify is

sed "s/ style='[^']*'//" fileA > fileB

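It works nicely, except that it fails to match whenever the text to be matched contains a newline. Is there a modifier for sed, or something else I can do, to force it to match any character, including a newline?

I understand that regular expressions are terrible at XML and HTML, blah blah blah, but in this case the string pattern is well-formed: the style attribute always starts with a single quote and ends with a single quote. So if I could just solve the newline problem, I could cut the size of the HTML by more than 50% with that one command.

In the end, it turned out that Sinan Ünür's Perl script worked best. It was almost instantaneous, and it reduced the file size from 2.3 MB to 850k. Good ol' Perl...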

score 4 · Accepted Answer

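sed goes over the input file line by line, which means that, as far as I understand, what you want is not possible in sed.

You could, however, use the following Perl script (untested):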

#!/usr/bin/perl

use strict;
use warnings;

{
    local $/; # slurp mode
    my $html = <>;
    $html =~ s/ style='[^']*'//g;
    print $html;
}

__END__

A one-liner would be:

$ perl -e 'local $/; $_ = <>; s/ style=\047[^\047]*\047//g; print' fileA > fileB

score 4 · Accepted Answer

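sed reads its input line by line, so processing that spans more than one line is not simple... but it is not impossible either: you need to make use of sed branching. The following will work. I have commented it to explain what is going on (not the most readable syntax!):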

sed "# if the line matches 'style='', then branch to label, 
     # otherwise process next line
     /style='/b style
     b
     # the line contains 'style', try to do a replace
     : style
     s/ style='[^']*'//
     # if the replace worked, then process next line
     t
     # otherwise append the next line to the pattern space and try again.
     N
     b style
 " fileA > fileB
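
The same building blocks (a label, N and b) can also be arranged to pull the whole input into the pattern space first and substitute once at the end. This is a sketch of that variation, not the script above: it assumes the file fits comfortably in memory, it adds the g flag so every occurrence is removed, and strip_styles_sed is a made-up name:

```shell
#!/bin/sh
# strip_styles_sed is a hypothetical helper name; it reads HTML on stdin.
strip_styles_sed() {
    # :a      define label "a"
    # $!{...} on every line except the last: append the next line (N)
    #         and branch back to "a" (ba)
    # s///g   runs once, when the whole input is in the pattern space
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' \
        -e "s/ style='[^']*'//g"
}

# Sample input with a style attribute that spans two lines.
printf '%s\n' "<p style='text-align:center; color:blue;" \
              "exampleStyle:exampleValue'>Hi</p>" | strip_styles_sed
```

Because [^'] also matches the newlines embedded in the pattern space, the attribute is removed even when it is split across lines.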

score 1 · Accepted Answer

You can remove all CR/LF characters using tr, run sed, and then import the result into an editor that auto-formats.

Answered 2009-07-22T12:38:42.113
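
A rough sketch of that pipeline, assuming it is acceptable to lose every line break in the document (which is why the auto-formatting editor comes afterwards); join_then_strip is a made-up name:

```shell
#!/bin/sh
# join_then_strip is a hypothetical helper name; it reads HTML on stdin.
join_then_strip() {
    # tr -d deletes every carriage return and line feed, so sed sees
    # one long line and the original per-line expression can match.
    tr -d '\r\n' | sed "s/ style='[^']*'//g"
}

# Sample input: one attribute split over two lines, then a plain line.
printf '%s\n' "<p style='text-align:center; color:blue;" \
              "exampleStyle:exampleValue'>Hi</p>" \
              "<p>x</p>" | join_then_strip
```

Note that text which was separated only by a line break ends up glued together, so this suits markup better than running prose.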

score 1 · Accepted Answer

You can try this:

awk '/style/&&/exampleValue/{
    gsub(/style.*exampleValue\047/,"")
}
/style/&&!/exampleValue/{     
    gsub(/style.* /,"")
    f=1        
}
f &&/exampleValue/{  
  gsub(/.*exampleValue\047 /,"")
  f=0
}
1
' file

Output:

# more file
this is a line
    style='text-align:center; color:blue; exampleStyle:exampleValue'
this is a line
blah
blah
style='text-align:center; color:blue;
exampleStyle:exampleValue' blah blah....

# ./test.sh
this is a line

this is a line
blah
blah
blah blah....

score 1 · Accepted Answer

Another way is like this:

$ cat toreplace.txt 
I want to make \
this into one line

I also want to \
merge this line

$ sed -e 'N;N;s/\\\n//g;P;D;' toreplace.txt

Output:

I want to make this into one line

I also want to merge this line

N loads another line, P prints the pattern space up to the first newline, and D deletes the pattern space up to the first newline.
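
To watch N, P and D cooperate on a smaller scale, here is a sketch that keeps a two-line window instead of three. It is a variation, not the command above, and it only joins a single continuation at a time (a line that ends in a backslash and is followed by another such line is not fully merged); join_continuations is a made-up name:

```shell
#!/bin/sh
# join_continuations is a hypothetical helper name; it reads text on stdin.
join_continuations() {
    # $!N     append the next line unless this is the last one
    # s///    drop a backslash that is directly followed by the newline
    # P       print up to the first newline (or everything if none is left)
    # D       delete what P printed and restart with the remainder
    sed '$!N; s/\\\n//; P; D'
}

printf '%s\n' 'I want to make \' 'this into one line' '' \
              'I also want to \' 'merge this line' | join_continuations
```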

score 0 · Accepted Answer

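Removing XML elements across several lines

My use case was pretty much the same, but I needed to match the opening and closing tags of XML elements and remove them completely, including whatever was inside.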

<xmlTag whatever="parameter that holds in the tag header">
    <whatever_is_inside/>
    <InWhicheverFormat>
        <AcrossSeveralLines/>
    </InWhicheverFormat>
</xmlTag>

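Still, sed works on a single line at a time. What we do here is trick it into appending the subsequent lines to the current one, so that we can edit all the lines we like and then rewrite the output (\n is a legal character that you can output with sed to divide the lines again).

Inspired by the answer from @beano, and another answer on Unix Stack Exchange, I built up my working sed "program":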

 sed -s --in-place=.back -e '/\(^[ ]*\)<xmlTag/{  # whenever you encounter the xmlTag
       $! {                                       # do
            :begin                                # label to return to
            N;                                    # append next line
            s/\(^[ ]*\)<\(xmlTag\)[^·]\+<\/\2>//; # Attempt substitution (elimination) of pattern
            t end                                 # if substitution succeeds, jump to :end
            b begin                               # unconditional jump to :begin to append yet another line
            :end                                  # label to mark the end
          }
       }'  myxmlfile.xml

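Some explanation:

I match <xmlTag without the closing > because my XML element contains parameters.
What precedes <xmlTag is a very helpful piece of regex that matches any existing indentation: \(^[ ]*\), so you can later output it again with just \1 (even though it was not needed this time).
The ; added in several places tells sed that the command (N, s, or whichever) ends there and that the following characters are another command.
Most of my trouble was finding a regex that would match "anything in between". I finally settled on anything but · (i.e. [^·]\+), counting on that character not appearing in any of the data files. I needed to escape the + (as \+) because in GNU sed's basic regular expressions only the escaped form acts as the "one or more" quantifier.
My original files remain as .back, just in case something goes wrong (the tests can still fail after the modification), and they are easily flagged by version control for removal in a batch.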
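
When the opening and closing tags each sit on a line of their own, as in the sample element above, a plain address range may be enough. This is a different and simpler technique than the loop above, sketched under that assumption only; drop_xml_tag is a made-up name:

```shell
#!/bin/sh
# drop_xml_tag is a hypothetical helper name; it reads XML on stdin.
drop_xml_tag() {
    # Delete every line from one matching <xmlTag through the next line
    # matching </xmlTag>. Whole lines are removed, so anything else that
    # shares a line with either tag is lost as well.
    sed '/<xmlTag/,/<\/xmlTag>/d'
}

printf '%s\n' '<root>' \
              '  <xmlTag whatever="parameter">' \
              '    <whatever_is_inside/>' \
              '  </xmlTag>' \
              '  <keep/>' \
              '</root>' | drop_xml_tag
```

One caveat: sed only starts looking for the end of the range on the line after the one that opened it, so an element that opens and closes on the same line would make the range run on to the next closing tag.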

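I use this kind of sed automation to evolve the .XML files that hold the serialized data we run our unit and integration tests with. Whenever our classes change (lose or gain fields), the data has to be updated. I do that with a single "find" that executes the sed automation on the files that contain the modified class. We have hundreds of XML data files.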
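
The find step itself is not shown here. A rough sketch of what it might look like, where update_xml_files, the directory and the class name are all hypothetical and the sed expression is only a stand-in for the real edit:

```shell
#!/bin/sh
# update_xml_files is a hypothetical helper.
#   $1: directory to search    $2: text that identifies the modified class
update_xml_files() {
    # The first -exec acts as a filter: grep -q succeeds only for files
    # that mention the class, and only those reach the second -exec.
    # sed -i.back edits in place and keeps the original as FILE.back.
    find "$1" -type f -name '*.xml' \
        -exec grep -q -e "$2" {} \; \
        -exec sed -i.back '/<xmlTag/,/<\/xmlTag>/d' {} \;
}

# Example call (paths and names are placeholders):
# update_xml_files ./testdata MyClass
```

Running it on a copy of the data first seems prudent, since the edit is applied to every matching file in one pass.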
