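I need to extract information from a website. If you go to the site, there is a list of items on the left; clicking an option shows a table with a name and a code on the right. I need to create a DataFrame containing the Name and Code columns scraped from the site. Some options do not provide a name-and-code table, and those should be skipped.
Output DataFrame columns: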
Name Code
You can use this script to get all IDs, names, and codes from the site:
import re
import json

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://taxonomy.nucc.org/"
page_url = "https://taxonomy.nucc.org/Default/GetContentByItemId/"

# The tree shown on the left is embedded in the page as a JavaScript array.
html_doc = requests.get(url).text
treenodes = re.search(r"var treenodes = (\[.*\]);", html_doc)
treenodes = json.loads(treenodes.group(1))

all_data = []
for n in treenodes:
    # Each node's detail panel is returned as JSON containing an HTML fragment.
    data = requests.get(page_url + str(n["id"])).json()
    soup = BeautifulSoup(data.get("PartialViewHtml", ""), "html.parser")

    # The code sits in the table cell after the "Code" label; None if absent.
    code = soup.select_one('label[for="Code"]')
    code = code.find_next("td").get_text(strip=True) if code else None

    print(n["id"], n["name"], code)
    all_data.append(
        {
            "id": n["id"],
            "name": n["name"],
            "code": code,
        }
    )

df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=False)
Prints:
...
82 1863 Attendant Care Provider 3747A0650X
83 1864 Personal Care Attendant 3747P1801X
84 1866 Advanced Practice Dental Therapist 125K00000X
85 1867 Dental Assistant 126800000X
86 1868 Dental Hygienist 124Q00000X
87 1869 Dental Laboratory Technician 126900000X
88 1870 Dental Therapist 125J00000X
89 1871 Dentist 122300000X
90 1884 Denturist 122400000X
91 1885 Oral Medicinist 125Q00000X
92 1872 Dental Public Health 1223D0001X
93 1873 Dentist Anesthesiologist 1223D0004X
94 1874 Endodontics 1223E0200X
95 1875 General Practice 1223G0001X
96 1876 Oral and Maxillofacial Pathology 1223P0106X
97 1877 Oral and Maxillofacial Radiology 1223X0008X
98 1878 Oral and Maxillofacial Surgery 1223S0112X
99 1879 Orofacial Pain 1223X2210X
100 1880 Orthodontics and Dentofacial Orthopedics 1223X0400X
...
and saves the same table to data.csv (the original post showed it in a LibreOffice screenshot).
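The question asks for only the Name and Code columns, with entries that have no code skipped. The script above keeps those entries with `code` set to `None`. A minimal sketch of one way to post-process the result follows, assuming a DataFrame with the same `id`, `name`, and `code` columns. The first two sample rows are taken from the printed output; the third row is a made-up entry without a code, included only to show the filtering.

```python
import pandas as pd

# Sample rows in the same shape as the scraped data. The first two rows come
# from the printed output; the last one is a hypothetical entry with no code.
df = pd.DataFrame(
    [
        {"id": 1871, "name": "Dentist", "code": "122300000X"},
        {"id": 1884, "name": "Denturist", "code": "122400000X"},
        {"id": 0, "name": "Hypothetical group heading", "code": None},
    ]
)

# Drop entries without a code, then keep only the two requested columns.
result = (
    df.dropna(subset=["code"])
    .rename(columns={"name": "Name", "code": "Code"})[["Name", "Code"]]
    .reset_index(drop=True)
)
print(result)
```

With the real data, the same three chained calls could be applied to `df` just before `df.to_csv(...)`.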