1

I have a set of 10-15 HTML documents to which I need to apply the LDA algorithm in gensim. I am stuck on creating the corpus, because I don't understand how to build one from a collection of HTML documents. The example on the gensim site builds the corpus from a compressed Wikipedia dump (.xml.bz2).

Could anyone please guide me on how to apply LDA to a set of HTML documents? Thanks in advance.

4

1 Answer

1

Take a look at HTML processing libraries, such as lxml or BeautifulSoup.
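
As a rough illustration of that first step, here is a minimal sketch that strips markup with BeautifulSoup. It assumes the `bs4` package is installed and uses Python's built-in `html.parser` backend. The helper names `html_to_text` and `load_documents`, and the folder of `*.html` files, are hypothetical and not part of the question.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements, whose contents are not prose.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(soup.get_text(separator=" ").split())


def load_documents(folder: str) -> list[str]:
    """Read every *.html file in `folder` (a hypothetical path) as plain text."""
    return [
        html_to_text(path.read_text(encoding="utf-8", errors="ignore"))
        for path in sorted(Path(folder).glob("*.html"))
    ]
```

This keeps all visible text, including menus and footers, so it may be too noisy for pages with a lot of navigation markup.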

For higher-level processing (removing boilerplate, extracting plain text from HTML), check out, for example, the jusText package by Honza Pomikalek.

Once you have plain text files, you can continue by following the gensim tutorial.

answered 2014-03-18T23:45:01.367