python - Counting keywords in the body text of SEC Edgar 10-K filings with Python

Asked 2020-04-11T08:46:12.343

Viewed 340 times

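I am trying to parse the text portion of an SEC Edgar filing in Python 3, for example: https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt

My goal is to count how often certain keywords occur in the visible body text of the 10-K filing and store the counts in a dictionary (i.e., I am not interested in any tables, exhibits, etc.).

I am very new to Python and would appreciate any help!

Here is what I have written so far. However, this code does not return the correct number of occurrences, and it does not capture the main body text that is visible to the end user.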

import requests
from bs4 import BeautifulSoup

# this part I would like to change such that it only collects words visible to the normal user in the page (is that the body?) 

def count_words(url, the_word):
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)
    print('*'*20)
    return len(words)


def main():
    url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
    word_list = ['assets']
    for word in word_list:
        count = count_words(url, word)
        print('\nUrl: {}\ncontains {} occurrences of word: {}'.format(url, count, word))
        print('--'*20)

# this part I don't understand
if __name__ == '__main__':
    main()
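
For reference, one reason the count above is off: `soup.find(...)` returns only the first matching text node, so `len(words)` is the length of that string rather than a number of occurrences. Below is a minimal sketch of one possible approach, not a definitive solution. It assumes the `.txt` submission wraps each filing part in `<DOCUMENT>` blocks with `<TYPE>` and `<TEXT>` markers, and that "body text" means everything in the 10-K document outside `<table>`, `<script>`, `<style>`, and `<head>`. The names `VisibleTextParser`, `extract_main_document`, and `count_keywords` are hypothetical helpers introduced here, and the sketch uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: text inside these tags is not part of the "visible body text".
SKIPPED_TAGS = {"table", "script", "style", "head"}


class VisibleTextParser(HTMLParser):
    """Collects text nodes that are not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped tag
        self.chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_main_document(submission, form_type="10-K"):
    """Return the <TEXT> body of the first <DOCUMENT> whose <TYPE> matches exactly."""
    for doc in re.findall(r"<DOCUMENT>(.*?)</DOCUMENT>", submission, re.S | re.I):
        type_match = re.search(r"<TYPE>([^\s<]+)", doc, re.I)
        if type_match and type_match.group(1).upper() == form_type.upper():
            text_match = re.search(r"<TEXT>(.*?)</TEXT>", doc, re.S | re.I)
            return text_match.group(1) if text_match else ""
    return ""


def count_keywords(submission, keywords):
    """Count whole-word, case-insensitive hits for single-word keywords."""
    parser = VisibleTextParser()
    parser.feed(extract_main_document(submission))
    parser.close()
    visible_text = " ".join(parser.chunks).lower()
    tokens = Counter(re.findall(r"[a-z]+", visible_text))
    return {word: tokens[word.lower()] for word in keywords}
```

To try it on the filing above, the downloaded text would be passed in, e.g. `count_keywords(requests.get(url, headers={"User-Agent": "..."}).text, ["assets"])`; SEC asks automated clients to send a descriptive `User-Agent`, and the value is left elided here. The sketch only handles single-word keywords and an exact form type, so variants such as amended filings or multi-word phrases would need further work.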


0 answers
