
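I need to extract information from a website. If you go to this website, there is a list of items on the left, and clicking an option shows a table with a name and a code on the right. I need to create a dataframe containing the Code and Name columns scraped from the site. Some options do not provide a name and code table; those should be skipped.

Output dataframe columns: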

Name   Code

1 Answer


You can use this script to get all IDs, names, and codes from the site:

import re
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://taxonomy.nucc.org/"
page_url = "https://taxonomy.nucc.org/Default/GetContentByItemId/"

html_doc = requests.get(url).text

# The item tree is embedded in the page as a JavaScript array literal
treenodes = re.search(r"var treenodes = (\[.*\]);", html_doc)
treenodes = json.loads(treenodes.group(1))

all_data = []

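for n in treenodes:
    # Each item's details come back as JSON wrapping an HTML fragment
    data = requests.get(page_url + str(n["id"])).json()
    soup = BeautifulSoup(data.get("PartialViewHtml", ""), "html.parser")
    # The code is in the cell after the "Code" label; items without one get None
    code = soup.select_one('label[for="Code"]')
    code = code.find_next("td").get_text(strip=True) if code else None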

    print(n["id"], n["name"], code)

    all_data.append(
        {
            "id": n["id"],
            "name": n["name"],
            "code": code,
        }
    )

df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=False)

Prints:

...
82   1863                            Attendant Care Provider  3747A0650X
83   1864                            Personal Care Attendant  3747P1801X
84   1866                 Advanced Practice Dental Therapist  125K00000X
85   1867                                   Dental Assistant  126800000X
86   1868                                   Dental Hygienist  124Q00000X
87   1869                       Dental Laboratory Technician  126900000X
88   1870                                   Dental Therapist  125J00000X
89   1871                                            Dentist  122300000X
90   1884                                          Denturist  122400000X
91   1885                                    Oral Medicinist  125Q00000X
92   1872                               Dental Public Health  1223D0001X
93   1873                           Dentist Anesthesiologist  1223D0004X
94   1874                                        Endodontics  1223E0200X
95   1875                                   General Practice  1223G0001X
96   1876                   Oral and Maxillofacial Pathology  1223P0106X
97   1877                   Oral and Maxillofacial Radiology  1223X0008X
98   1878                     Oral and Maxillofacial Surgery  1223S0112X
99   1879                                     Orofacial Pain  1223X2210X
100  1880           Orthodontics and Dentofacial Orthopedics  1223X0400X
...

The script also saves the result to data.csv.
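The question also asks for only Name and Code columns, skipping options that have no code. The script above stores None for those options rather than skipping them, so one way to drop them afterward is sketched below. This is a minimal sketch and not part of the original script: the helper name rows_with_codes and the first sample row are hypothetical, chosen only for illustration.

```python
def rows_with_codes(all_data):
    """Return Name/Code rows, skipping entries that have no code."""
    return [
        {"Name": row["name"], "Code": row["code"]}
        for row in all_data
        if row.get("code")  # drops None and empty strings
    ]


# Sample in the same shape as all_data from the script above.
# The first row is made up to stand in for an option without a code.
sample = [
    {"id": 1, "name": "Group heading", "code": None},
    {"id": 1871, "name": "Dentist", "code": "122300000X"},
]

rows = rows_with_codes(sample)
print(rows)
```

Passing the filtered rows to pd.DataFrame(rows, columns=["Name", "Code"]) should then give the two-column dataframe described in the question.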

Answered 2021-05-01T19:30:14.590