regex - Print all words after a specific word using Python

Question

Suppose I have a file containing the following data:

<td class="w"><a href="show.cgi?id=120012" title="[Title] &#64;Blue: Session_TIMEOUT after 60033 ms">[Title] &#64;Blue: Session_TIMEOUT after 60033 ms</a></td>'
<td class="w"><a href="show.cgi?id=120012" title="[Title] &#64;Blue: Session_TIMEOUT after 60500 ms">[Title] &#64;Blue: Session_TIMEOUT after 60033 ms</a></td>'

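In the data above, how can I retrieve the string that follows [Title] in the title attribute (title="[Title] @Blue: Session_TIMEOUT after 60033 ms") of each of the two HTML lines, and write the retrieved string on the line after it?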

I want output like this:

<td class="w"><a href="show.cgi?id=120012" title="[Title] &#64;Blue: Session_TIMEOUT after 60033 ms">[Title] &#64;Blue: Session_TIMEOUT after 60033 ms</a></td>'
&#64;Blue: Session_TIMEOUT after 60033 ms
<td class="w"><a href="show.cgi?id=120012" title="[Title] &#64;Blue: Session_TIMEOUT after 60500 ms">[Title] &#64;Blue: Session_TIMEOUT after 60033 ms</a></td>'
&#64;Blue: Session_TIMEOUT after 60500 ms

Please help me with this. Thanks in advance.

score 0 · Accepted Answer

Using the Beautiful Soup library, you can do this easily:

from bs4 import BeautifulSoup  # Beautiful Soup 4

myHTML = '<td class="w"><a href="show.cgi?id=120012" title="[Title] &#64;Blue: Session_TIMEOUT after 60033 ms">[Title] &#64;BlueScreen: RCU_PCPU_TIMEOUT after 60033 ms</a></td>'
html_doc = BeautifulSoup(myHTML, 'html.parser')
print(html_doc.td.a.string)  # text content of the <a> element

Beautiful Soup can be installed with pip or easy_install, or with apt-get if you are on a Debian-based operating system:

pip install beautifulsoup4
easy_install beautifulsoup4
apt-get install python3-bs4
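
The snippet above prints the text of the link, whereas the question asks for the value of the title attribute. As a rough sketch of reading that attribute with a parser, the example below swaps in Python's standard-library html.parser module so that it runs without a third-party install. The class name TitleCollector is hypothetical, and note that the parser decodes &#64; to @ in attribute values.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Hypothetical helper: collects the title attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with entities already decoded
        if tag == 'a':
            title = dict(attrs).get('title')
            if title is not None:
                self.titles.append(title)


my_html = ('<td class="w"><a href="show.cgi?id=120012" '
           'title="[Title] &#64;Blue: Session_TIMEOUT after 60033 ms">'
           '[Title] &#64;Blue: Session_TIMEOUT after 60033 ms</a></td>')

collector = TitleCollector()
collector.feed(my_html)
collector.close()

for title in collector.titles:
    # Drop the first '[Title]' marker and surrounding whitespace
    print(title.replace('[Title]', '', 1).strip())
```

If the encoded form (&#64;) must be preserved exactly as it appears in the file, a plain string or regex approach on the raw line is probably the simpler choice.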

score 0 · Accepted Answer

A simple approach:

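# Skip past the first '[Title]' (inside the title attribute)
rest = line[line.index('[Title]') + len('[Title]'):]
# Skip past the second '[Title]' (inside the link text)
rest = rest[rest.index('[Title]') + len('[Title]'):]
# Keep everything before the closing tags
text = rest[:rest.index('</a></td>')]
print(line + '\n' + text)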

However, a better way to solve this problem is to use a regular expression, as CodeChordsman mentioned.
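
Slicing past both occurrences of [Title] ends up in the link text. The expected output in the question (60500 ms on the second line) suggests the title attribute is what is wanted, so a variant of the same idea might look like the sketch below. The helper name text_after_marker is hypothetical, and it assumes the first [Title] on the line sits inside the title attribute and that the attribute value contains no double quote.

```python
def text_after_marker(line, marker='[Title]'):
    """Return the text between the first marker and the next double quote.

    Hypothetical helper; returns None when the marker or the closing
    quote is missing from the line.
    """
    _, found, rest = line.partition(marker)
    if not found:
        return None  # marker not present on this line
    value, quote, _ = rest.partition('"')
    if not quote:
        return None  # no closing quote after the marker
    return value.strip()


sample = ('<td class="w"><a href="show.cgi?id=120012" '
          'title="[Title] &#64;Blue: Session_TIMEOUT after 60500 ms">'
          '[Title] &#64;Blue: Session_TIMEOUT after 60033 ms</a></td>')

print(sample)
print(text_after_marker(sample))
```

Using str.partition instead of str.index avoids a ValueError on lines that do not contain the marker.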

score 0 · Accepted Answer

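You can use a regular expression. If you can see that the string you are interested in is always anchored between, say, title=" and a trailing ms, then you can do this: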

import re  # regular expression module
g = re.compile('title="(.*?ms)').search(line)  # search your string

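Your string is then available as g.group(1). You may find it useful to read about regular expressions in the Python documentation; they are a very important programming tool in every language, especially for scripting.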

You may also want to add the regex tag to your question.
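
To tie the regular expression back to the output requested in the question, here is a minimal sketch that emits each line followed by the text captured from its title attribute. The pattern, the name TITLE_RE, and the helper annotate are assumptions made for illustration: the pattern additionally skips the [Title] prefix and stops at the closing double quote rather than at ms.

```python
import re

# Assumed pattern: capture what follows '[Title]' inside the title attribute,
# up to (not including) the closing double quote.
TITLE_RE = re.compile(r'title="\[Title\]\s*([^"]*)"')


def annotate(lines):
    """Yield each line, followed by the captured title text when it matches."""
    for line in lines:
        line = line.rstrip('\n')
        yield line
        match = TITLE_RE.search(line)
        if match:
            yield match.group(1)


rows = [
    '<td class="w"><a href="show.cgi?id=120012" '
    'title="[Title] &#64;Blue: Session_TIMEOUT after 60033 ms">'
    '[Title] &#64;Blue: Session_TIMEOUT after 60033 ms</a></td>',
    '<td class="w"><a href="show.cgi?id=120012" '
    'title="[Title] &#64;Blue: Session_TIMEOUT after 60500 ms">'
    '[Title] &#64;Blue: Session_TIMEOUT after 60033 ms</a></td>',
]

for out in annotate(rows):
    print(out)
```

Because annotate accepts any iterable of strings, an open file object could be passed in place of the rows list.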
