goose - How to extract an article from a Hindi web page using Goose?

Question

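I am using Python Goose to extract articles from web pages. It works for many languages, but not for Hindi. I tried adding Hindi stop words as stopwords-hi.txt and setting target_language to hi, but without success. Thanks.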

score 0 · Accepted Answer

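Yes, I have had the same problem. I have been working on extracting articles in all the Indian regional languages, and I could not extract the content with Goose alone. If the article description by itself is enough for you, meta_description works perfectly. You can use it in place of cleaned_text, which returns nothing.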
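The fallback described above could be sketched roughly as follows. This is only an illustration: the helper names `best_text` and `extract_hindi` are hypothetical, the configuration keys follow the pattern shown in the python-goose README for other languages, and it assumes that either goose3 or the original goose package is installed. The demo at the bottom uses a stand-in object rather than a real Goose article, so it runs without network access.

```python
from types import SimpleNamespace  # only needed for the offline demo below


def best_text(article):
    """Return cleaned_text if it is non-empty, otherwise meta_description.

    `article` is any object exposing those two attributes, such as the
    object returned by Goose().extract(). (Hypothetical helper name.)
    """
    body = (getattr(article, "cleaned_text", "") or "").strip()
    if body:
        return body
    return (getattr(article, "meta_description", "") or "").strip()


def extract_hindi(url):
    """Run Goose on `url` and fall back to the meta description.

    Assumption: goose3 (Python 3) or the original goose package is
    installed; the import is deferred so the helper above works without it.
    """
    try:
        from goose3 import Goose
    except ImportError:
        from goose import Goose
    g = Goose({"use_meta_language": False, "target_language": "hi"})
    return best_text(g.extract(url=url))


# Offline demo: a stand-in article whose cleaned_text came back empty.
fake_article = SimpleNamespace(cleaned_text="", meta_description="लेख का सारांश")
print(best_text(fake_article))
```

With a real page you would call `extract_hindi(url)` instead; whether `cleaned_text` is empty depends on the site and on the stop-word files Goose can find.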

Another option, though it takes more lines of code:

# Python 3: urlopen lives in urllib.request (Python 2 used urllib.urlopen)
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "http://www.jagran.com/news/national-this-pay-scale-calculator-will-tell-your-new-salary-after-7th-pay-commission-14132357.html"
html = urlopen(url).read()
soup = BeautifulSoup(html, "lxml")  # requires the lxml package

# Remove all script, style and link elements so that only the article content remains.
# ("href" and "formfield" from the original list are not HTML tags, so they matched nothing.)
for tag in soup(["script", "style", "a"]):
    tag.extract()

text = soup.get_text()

# Strip each line, split on double spaces, and drop the empty pieces
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

Full disclosure: I actually got the original code from somewhere on Stack Overflow and modified it slightly.
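To try the cleanup step without fetching a live page, the same logic can be wrapped in a function and run on an inline HTML string. This is a minimal sketch: the function name `html_to_text` and the sample markup are made up for illustration, and it uses the built-in "html.parser" backend so that lxml is not required.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Strip script, style and link elements, then return the visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "a"]):
        tag.extract()
    text = soup.get_text()
    # Same whitespace cleanup as in the snippet above
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


# Made-up sample page containing Hindi text
sample = """<html><head><style>p {color: red}</style></head>
<body><script>var x = 1;</script>
<p>यह एक परीक्षण है।</p>
<a href="/more">और पढ़ें</a>
</body></html>"""

result = html_to_text(sample)
print(result)
```

Note that this approach keeps every piece of visible text outside the removed tags, so on a real news page it will also pick up menus, captions and footers, not only the article body.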
