python - LDA for HTML Documents in Gensim

Question

I have a set of 10-15 HTML documents on which I need to run the LDA algorithm in gensim. I am stuck on creating the corpus, because I don't understand how to build one from a collection of HTML documents. The example on the gensim site builds the corpus from a compressed Wikipedia dump (.xml.bz2).

Can anyone please guide me on how to apply LDA to a set of HTML documents? Thanks in advance.

score 1 · Accepted Answer

Look at HTML processing libraries, such as lxml or BeautifulSoup.

For higher-level processing (removing boilerplate, extracting plain text from HTML), take a look at, for example, the jusText package by Honza Pomikalek.
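As a rough illustration of the plain-text extraction step, here is a minimal sketch that uses only the standard library's `html.parser` as a dependency-free stand-in for lxml or BeautifulSoup. The names `TextExtractor` and `html_to_text` are made up for this example. It only drops `<script>` and `<style>` contents and does no real boilerplate removal, which is what a tool like jusText is for.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join fragments with a space so text from adjacent block elements
    # does not run together, then collapse repeated whitespace.
    return " ".join(" ".join(parser.parts).split())
```

For a folder of files, you would read each file into a string and pass it to `html_to_text`, which gives one plain-text string per document.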

Once you have the plain text files, you can continue as per the gensim tutorials.
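For a small collection like this, the tutorial steps come down to tokenizing each document, building a dictionary, converting each document to a bag-of-words vector, and training the model. Below is a minimal sketch, assuming gensim is installed and you already have one plain-text string per document. The helper names `tokenize` and `train_lda`, the tiny stopword list, and the values of `num_topics` and `passes` are illustrative choices, not recommendations from the answer.

```python
import re

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is"})


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def train_lda(plain_texts, num_topics=2, passes=10):
    """Build a gensim dictionary and corpus from strings, then fit LDA."""
    # Imported here so tokenize() stays usable without gensim installed.
    from gensim import corpora, models

    tokenized = [tokenize(text) for text in plain_texts]
    # Map each distinct token to an integer id.
    dictionary = corpora.Dictionary(tokenized)
    # One sparse bag-of-words vector (list of (id, count)) per document.
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
    lda = models.LdaModel(
        corpus, id2word=dictionary, num_topics=num_topics, passes=passes
    )
    return lda, dictionary, corpus
```

Calling `train_lda(texts)` on the list of extracted strings returns the trained model, and `lda.print_topics()` then lists the top words per topic. With only 10-15 documents the topics are likely to be noisy, so a small `num_topics` and several `passes` are a reasonable starting point.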
