linux - Delete all characters in a text file that do not match a pattern

Question

I have a very large file containing information in this pattern:

 0 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>274</font>
 1 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>284</font>
 2 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>299</font>
 3 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>296</font>
 4 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>273</font>

I want to reduce each of these lines to just the number.

The pattern is:

'#4e9a06'>[0-9]*</font>

I used this:

perl -i.bak -pe 's/.*4e9a06//' copy.txt

But I am still left with:

'>274</font>
'>284</font>
'>299</font>
'>296</font>
'>273</font>
'>272</font>

I tried using sed:

cat file.bak | sed 's/form>/ /g' > copy2.txt

But that does not work. Can you help me delete the remaining characters? Thanks for your answers.

score 2 · Accepted Answer

Suppose you have a file named copy.txt that holds your data. Then you just run:

cat copy.txt |egrep -o ">[0123456789]+<"|tr -d  "<"|tr -d ">"

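This prints the lines of the file, then, because of -o, outputs only the part of each line that matches the regular expression (rather than the whole line, which is what egrep prints by default). Finally you strip the "<" and ">", which are part of the match as well.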
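For readers who prefer to see the two steps spelled out, here is a minimal Python sketch of the same idea: keep only the matching fragment, then drop the angle brackets. The names `SAMPLE`, `only_matching`, and `strip_brackets` are illustrative and do not come from the answer.

```python
import re

# Two lines copied from the question's sample data.
SAMPLE = """\
 0 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>274</font>
 1 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>284</font>
"""

# Same expression as the egrep call: ">" + one or more digits + "<".
PATTERN = re.compile(r">[0-9]+<")


def only_matching(text):
    """Return each matched fragment, roughly what `egrep -o` prints."""
    return PATTERN.findall(text)


def strip_brackets(fragment):
    """Drop "<" and ">" from a fragment, like the two `tr -d` calls."""
    return fragment.replace("<", "").replace(">", "")


fragments = only_matching(SAMPLE)
numbers = [strip_brackets(f) for f in fragments]
print(fragments)  # fragments still carry the surrounding brackets
print(numbers)    # bare digit strings
```

The intermediate `fragments` list shows why the `tr -d` step is needed: the brackets are part of the match, so they have to be removed afterwards.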

-Edit-

Perhaps a friendlier syntax, plus a few extra fixes.

cat copy.txt |egrep -o ">[1-9][0-9]*<"|tr -d  "<"|tr -d ">"

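Here the number must start with a digit from 1 to 9; further digits may or may not follow. Note that a value of 0, or one with leading zeros, is therefore not matched.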
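The difference between the two expressions is easiest to see on a few edge cases. The following sketch is only an illustration; the test strings in `CASES` are made up and do not appear in the question.

```python
import re

LOOSE = re.compile(r">[0-9]+<")        # expression from the first command
STRICT = re.compile(r">[1-9][0-9]*<")  # expression from the edited command

# Hypothetical inputs chosen to show where the two expressions differ.
CASES = [">274<", ">10<", ">0<", ">007<"]


def matches(pattern, text):
    """True if the pattern is found anywhere in text."""
    return pattern.search(text) is not None


for case in CASES:
    print(case, "loose:", matches(LOOSE, case), "strict:", matches(STRICT, case))
```

If the file can contain the value 0, the first expression is the safer choice.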

score 0 · Accepted Answer


Please try the following:

sed -e "s#.*>\([0-9]*\)</font>\$#\\1#" source.txt >out.txt

Answered 2012-10-01T13:22:07.663
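The sed substitution replaces the whole line with its captured digits. A rough Python equivalent, assuming each line ends in `</font>` as in the question, could look like this (`LINE_RE` and `reduce_line` are illustrative names):

```python
import re

# Greedy ".*" runs up to the last ">" that is followed by digits and a
# closing </font> at the end of the line; the digits are captured.
LINE_RE = re.compile(r".*>([0-9]*)</font>$")


def reduce_line(line):
    """Replace a matching line with its captured digits.

    Lines that do not match are returned unchanged, which mirrors
    sed's behaviour when the substitution does not apply.
    """
    return LINE_RE.sub(r"\1", line)


sample = " 0 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>274</font>"
print(reduce_line(sample))
print(reduce_line("a line without markup"))
```

Unlike the `-o` approach, this keeps non-matching lines in the output, so the result has the same number of lines as the input.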

score 0 · Accepted Answer

I have a solution using Python:

$ python -c 'import re,sys; print "\n".join(",".join(j for j in re.findall("06'\''>(.*)</fo", i)) for i in sys.stdin)' <xy
274
284
299
296
273

Not a nice program, but I wanted to do it as a one-liner.
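The one-liner uses the Python 2 `print` statement. For anyone on Python 3, an unrolled sketch of the same logic might look like the following; the function name `extract` and the in-memory `sample` list are assumptions made for illustration.

```python
import re

# Same expression as in the one-liner: capture whatever sits between
# "06'>" and "</fo".
VALUE_RE = re.compile(r"06'>(.*)</fo")


def extract(lines):
    """Yield, for each input line, its captures joined by commas."""
    for line in lines:
        yield ",".join(VALUE_RE.findall(line))


sample = [
    " 0 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>274</font>\n",
    " 1 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>284</font>\n",
]

# To use this as a filter, pass sys.stdin instead of `sample`.
print("\n".join(extract(sample)))
```

As in the original, a line without a match produces an empty output line instead of being skipped.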

score 0 · Accepted Answer

Please do not parse HTML with regular expressions.

cat<<EOF | html2text | perl -lne 'print for /int (\d+)/g'
0 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>274</font>
1 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>284</font>
2 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>299</font>
3 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>296</font>
4 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>273</font>
EOF

Output:
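As a parser-based alternative to the pipeline above, the same advice can be followed with Python's standard `html.parser` module. This is only a sketch under the assumption that the wanted values always sit inside a `font` tag with the colour `#4e9a06`; the class and function names are illustrative.

```python
from html.parser import HTMLParser

TARGET_COLOR = "#4e9a06"  # colour of the <font> tags holding the values


class FontValueParser(HTMLParser):
    """Collect the text inside <font color='#4e9a06'> elements."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes removed.
        if tag == "font" and dict(attrs).get("color") == TARGET_COLOR:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "font":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.values.append(data.strip())


def extract_values(html):
    """Return the text of every target <font> element, in document order."""
    parser = FontValueParser()
    parser.feed(html)
    parser.close()
    return parser.values


SAMPLE = """\
0 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>274</font>
1 <font color='#888a85'>=&gt;</font> <small>int</small> <font color='#4e9a06'>284</font>
"""

print("\n".join(extract_values(SAMPLE)))
```

Selecting by tag and attribute rather than by surrounding characters means the extraction does not depend on the exact quoting or spacing of the markup.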
