python - Text Extraction from HTML Data

Question

I have a problem extracting information from messy HTML data. Basically, I want to extract only the words that are actually displayed from a given piece of HTML code. Here is an example of the raw HTML data I am given:

<p>I have an app which send mail to my defined mail address "myemail@own.com". For this i create my own Custom Email View Which contains check boxes message body and other options. Now i want that when send button is pressed my app should not go to gmail view or other email client view it directly submit the data</p>

<p>String recepientEmail = "myemail@own.comm"; </p>

<p>// either set to destination email or leave empty</p>

<pre><code>    Intent intent = new Intent(Intent.ACTION_SENDTO);

    intent.setData(Uri.parse("mailto:" + recepientEmail));

    startActivity(intent);
</code></pre>

<p>but on submit it opens gmail or chooser email client view but i dont want to show gmail view</p>

and I want to transform it into this

I have an app which send mail to my defined mail address "myemail@own.com". For this i create my own Custom Email View Which contains check boxes message body and other options. Now i want that when send button is pressed my app should not go to gmail view or other email client view it directly submit the data String recepientEmail = "myemail@own.comm"; // either set to destination email or leave empty but on submit it opens gmail or chooser email client view but i dont want to show gmail view

So basically I just want to retrieve everything within each of the <p> tags and concatenate the pieces. I am using Python, so I think BeautifulSoup is probably the best way to do this, but I can't figure out how. I also want to repeat this over many such examples (millions, actually), and each example should have at least one <p> tag.

score 3 · Accepted Answer

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

<span id="midArticle_1"></span><p>Here is the First Paragraph.</p><span id="midArticle_2"></span><p>Here is the second Paragraph.</p><span id="midArticle_3"></span><p>Paragraph Three."</p>

from lxml import html  # html.parse() and .xpath() come from the lxml package
print(html.parse(url).xpath('//p/text()'))

Output

['Here is the First Paragraph.', 'Here is the second Paragraph.', 'Paragraph Three."']
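
To reproduce this without a live URL, the same markup can be parsed from a string. This is a minimal sketch, assuming the third-party `lxml` package is installed; `html.fromstring` stands in for `html.parse(url)`, and the variable name `snippet` is hypothetical.

```python
from lxml import html

# The sample markup from the answer, held in a string instead of fetched from a URL
snippet = (
    '<span id="midArticle_1"></span><p>Here is the First Paragraph.</p>'
    '<span id="midArticle_2"></span><p>Here is the second Paragraph.</p>'
    '<span id="midArticle_3"></span><p>Paragraph Three."</p>'
)

tree = html.fromstring(snippet)
# //p/text() selects the direct text nodes of every <p> element
paragraphs = tree.xpath('//p/text()')
print(' '.join(paragraphs))
```

Note that `text()` matches only direct text children, so text inside a nested tag such as `<b>` would need `//p//text()` instead.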

score 2 · Accepted Answer

One way to extract all the text from the <p> tags, using the BeautifulSoup module.

Contents of script.py:

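from bs4 import BeautifulSoup
import sys

# Name the parser explicitly so the result does not depend on which parsers are installed
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

# get_text() also handles <p> tags with nested markup, where .string would be None
print(' '.join(p.get_text() for p in soup.find_all('p')))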

Run it like this:

python3 script.py infile

This yields:

I have an app which send mail to my defined mail address "myemail@own.com". For this i create my own Custom Email View Which contains check boxes message body and other options. Now i want that when send button is pressed my app should not go to gmail view or other email client view it directly submit the data String recepientEmail = "myemail@own.comm";  // either set to destination email or leave empty but on submit it opens gmail or chooser email client view but i dont want to show gmail view
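
For the many-documents case in the question, the same idea can be wrapped in a function and applied to each document in turn. This is a minimal sketch, assuming every document is already available as a string; the function name `paragraph_text` and the sample strings are hypothetical. It uses `get_text()` because `.string` is `None` for a `<p>` that contains nested tags.

```python
from bs4 import BeautifulSoup


def paragraph_text(html_doc):
    """Join the visible text of every <p> element with single spaces."""
    soup = BeautifulSoup(html_doc, "html.parser")
    # Strip each paragraph and skip empty ones so no stray spaces pile up
    parts = (p.get_text().strip() for p in soup.find_all("p"))
    return " ".join(part for part in parts if part)


# Hypothetical sample documents; <pre><code> content is ignored because only <p> is searched
documents = [
    "<p>First <b>bold</b> one.</p><pre><code>x = 1</code></pre><p>Second.</p>",
    "<p>Only paragraph.</p>",
]
results = [paragraph_text(doc) for doc in documents]
print(results)
```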

score 1 · Accepted Answer

I recently started playing with Beautiful Soup, and I found this line of code very useful. I'll show you my whole example.

import requests
from bs4 import BeautifulSoup

r = requests.get("your url")

html_text = r.text

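# Name the parser explicitly; omitting it triggers a warning and picks whichever parser is installed
soup = BeautifulSoup(html_text, "html.parser")

# find_all(string=True) returns every text node; it is the current spelling of findAll(text=True)
clean_html = ''.join(soup.find_all(string=True))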

print(clean_html)

Hope this is useful to you and answers your question.
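
Joining every text node also keeps the contents of `<pre><code>` blocks, which the question wanted to leave out. Below is a dependency-free sketch that keeps only text inside `<p>` elements, using the standard library's `html.parser.HTMLParser` rather than Beautiful Soup. The names `ParagraphTextParser` and `extract_p_text` are hypothetical, and the sketch assumes every `<p>` tag is explicitly closed.

```python
from html.parser import HTMLParser


class ParagraphTextParser(HTMLParser):
    """Collect text that appears inside <p> elements only."""

    def __init__(self):
        super().__init__()  # character references are decoded by default
        self.depth = 0      # greater than 0 while inside an open <p>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1
            self.chunks.append(" ")  # separate consecutive paragraphs

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_p_text(html_doc):
    parser = ParagraphTextParser()
    parser.feed(html_doc)
    parser.close()
    # Collapse runs of whitespace into single spaces and trim the ends
    return " ".join("".join(parser.chunks).split())


sample = (
    '<p>String x = "a";</p>\n'
    "<pre><code>startActivity(intent);\n</code></pre>\n"
    "<p>but it opens <b>gmail</b></p>"
)
print(extract_p_text(sample))
```

Because it avoids building a full parse tree, this kind of streaming parser may also be worth benchmarking when millions of documents have to be processed.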
