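I am trying to process a JSON file with BeautifulSoup, but I don't know how to go about it.

Below is a copy of the JSON. I am trying to iterate over each id in the JSON and extract certain data from it. Can anyone suggest a different route?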

{
    "line_type":"Test",
    "title":"Test Test Test",
    "timestamp":"201310200000",
    "line": [
                                        { 
            "id":10,
            "text": "<h1 id=\"r021\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":9,
            "text": "<h1 id=\"r023\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":8,
            "text": "<h1 id=\"r024\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":7,
            "text": "<h1 id=\"r026\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":6,
            "text": "<h1 id=\"r027\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":5,
            "text": "<h1 id=\"r028\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":4,
            "text": "<h1 id=\"r029\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":3,
            "text": "<h1 id=\"r031\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":2,
            "text": "<h1 id=\"r032\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":1,
            "text": "<h1 id=\"r035\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                }                     ]
}

Thanks in advance - Hyflex

4 Answers

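I'm fairly sure this does what you want: for each line, it loads the "text" attribute into BeautifulSoup and then extracts all the attributes you might want. You can generalize it to whatever behavior you need; it should be quite readable.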

import json
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
myjson = r"""{
    "line_type":"Test",
    "title":"Test Test Test",
    "timestamp":"201310200000",
    "line": [
                                        { 
            "id":10,
            "text": "<h1 id=\"r021\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":9,
            "text": "<h1 id=\"r023\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":8,
            "text": "<h1 id=\"r024\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":7,
            "text": "<h1 id=\"r026\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":6,
            "text": "<h1 id=\"r027\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":5,
            "text": "<h1 id=\"r028\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":4,
            "text": "<h1 id=\"r029\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":3,
            "text": "<h1 id=\"r031\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":2,
            "text": "<h1 id=\"r032\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":1,
            "text": "<h1 id=\"r035\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                }                     ]
}"""

data = json.loads(myjson)

for entry in data['line']:
    # With bs4 you can also name a parser explicitly, e.g. BeautifulSoup(entry['text'], 'html.parser')
    soup = BeautifulSoup(entry['text'])
    # print(soup.prettify())
    h1 = soup.find('h1')
    a = soup.find('a')
    # Get the H1 ID
    print(h1['id'])
    # Get the text
    print(h1.contents[0].strip())
    # Get the <a> href
    print(a['href'])
    # Get the <a> class (with bs4 this is a list such as ['but-cl1'])
    print(a['class'])
    # Get the <a> text
    print(a.contents[0].strip())
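
The loop above prints the extracted fields but does not tie them back to each entry's id, which the question asks about. A minimal sketch of one way to keep that association, assuming a shortened two-entry sample with the same shape as the question's JSON (the sample string and the name by_id are made up for illustration):

```python
import json

# Shortened, hypothetical sample with the same shape as the question's JSON.
sample = r"""{
    "line": [
        {"id": 2, "text": "<h1 id=\"r032\">Titles here<\/h1>"},
        {"id": 1, "text": "<h1 id=\"r035\">Titles here<\/h1>"}
    ]
}"""

data = json.loads(sample)

# Map each entry's id to its HTML fragment.
by_id = {entry["id"]: entry["text"] for entry in data["line"]}

# Visit the fragments in ascending id order.
for line_id in sorted(by_id):
    print(line_id, by_id[line_id])
```

Each value in by_id is an ordinary HTML string, so it can be handed to BeautifulSoup in the same way as in the loop above.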

Answered 2013-10-22T18:36:04.983

BeautifulSoup cannot parse JSON itself. Use the json module instead, like this:

import json
from pprint import pprint

json_data = r"""
{
    "line_type":"Test",
    "title":"Test Test Test",
    "timestamp":"201310200000",
    "line": [
                                        {
            "id":10,
            "text": "<h1 id=\"r021\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":9,
            "text": "<h1 id=\"r023\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":8,
            "text": "<h1 id=\"r024\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":7,
            "text": "<h1 id=\"r026\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":6,
            "text": "<h1 id=\"r027\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":5,
            "text": "<h1 id=\"r028\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":4,
            "text": "<h1 id=\"r029\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":3,
            "text": "<h1 id=\"r031\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":2,
            "text": "<h1 id=\"r032\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":1,
            "text": "<h1 id=\"r035\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                }                     ]
}
"""

s = json.loads(json_data)

# Print the "text" value of every entry in "line"
for entry in s['line']:
    pprint(entry['text'])

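A working example is linked here. You were probably getting a ValueError because you forgot to put r in front of the string declaration. Without the r prefix, Python interprets escapes such as \" and \n itself before json.loads ever sees them, and what is left is no longer valid JSON.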
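
To make the point about the r prefix concrete, here is a small sketch that parses the same kind of payload twice, once as a raw string and once as a regular string. The one-field sample and the variable names are made up for illustration; they are not taken from the question.

```python
import json

# Raw string: the backslashes reach json.loads untouched, so the JSON
# escapes \" \/ and \n are decoded by the JSON parser.
raw = r'{"text": "<a href=\"x\">Link<\/a>\n"}'
parsed = json.loads(raw)
print(repr(parsed["text"]))

# Regular string: Python turns \" into " and \n into a real newline
# before json.loads runs, which leaves invalid JSON behind.
cooked = '{"text": "<a href=\"x\">Link</a>\n"}'
try:
    json.loads(cooked)
    failed = False
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    failed = True
    print("ValueError:", exc)
```

The same applies to the triple-quoted string above: keeping the r prefix is what lets the embedded HTML escapes survive until the JSON parser handles them.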

You can also use BeautifulSoup on top of this, like so, but it will make things slower:

# Imports
import json
from pprint import pprint
from bs4 import BeautifulSoup

# json_data is the same raw JSON string defined in the previous snippet
s = json.loads(json_data)
list_of_html_in_json = [entry['text'] for entry in s['line']]
soup = BeautifulSoup(" ".join(list_of_html_in_json), "html.parser")
print(soup.find_all("h1", {"id": "r035"}))  # Example

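I'm afraid that because this uses an external library (bs4), I can't show you an online version of the code. I assure you, though, that I have tried and tested it.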
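
If depending on bs4 is a concern, a rough alternative is the standard library's html.parser module. This is a different technique from the BeautifulSoup code above, not a drop-in replacement, and the LineParser class and its attribute names are hypothetical helpers written for this sketch. It assumes one already-decoded "text" value from the question's JSON.

```python
from html.parser import HTMLParser


class LineParser(HTMLParser):
    """Collects the h1 id, the link href/class, and the visible text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.h1_id = None
        self.href = None
        self.link_class = None
        self.texts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self.h1_id = attrs.get("id")
        elif tag == "a":
            self.href = attrs.get("href")
            self.link_class = attrs.get("class")

    def handle_data(self, data):
        # Keep only non-blank text, with surrounding whitespace removed.
        if data.strip():
            self.texts.append(data.strip())


# One "text" value from the question, as it looks after json.loads.
fragment = (
    '<h1 id="r021">\n        Titles here    </h3>\n\n'
    '            <a href="/restofthewebsite/here" class="but-cl1">Link</a>\n        \n'
)

parser = LineParser()
parser.feed(fragment)
parser.close()
print(parser.h1_id, parser.texts, parser.href, parser.link_class)
```

The stray </h3> in the data is simply ignored here because the sketch defines no end-tag handler. For anything more involved than a couple of tags, BeautifulSoup is likely to be the more comfortable choice.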

Answered 2013-10-20T16:34:52.807

Just my attempt:

import requests
import json
from bs4 import BeautifulSoup

# Use requests library to get the JSON data
response = requests.get("http://www.websitehere.com/")  # Make sure you include the http part
# Decode the response body as JSON
JSONDATA = response.json()

# Cycle through each `line` in the JSON
for line in JSONDATA['line']:
    # Load the HTML fragment into BeautifulSoup
    soup = BeautifulSoup(line['text'], 'html.parser')
    # Print tidy HTML
    print(soup.prettify())

Hope that helps :)
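
Since the question mentions a JSON file rather than a URL, the same loop should also work when the data is read from disk. A minimal sketch, assuming the file has the same top-level "line" list; the helper name load_fragments and the file name lines.json are made up, and the sketch writes a tiny stand-in file only so that it runs on its own.

```python
import json
import os
import tempfile


def load_fragments(path):
    """Return the HTML fragments stored under "line" in a JSON file (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return [entry["text"] for entry in data["line"]]


# Write a tiny stand-in file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lines.json")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(r'{"line": [{"id": 1, "text": "<a href=\"\/here\">Link<\/a>"}]}')
    fragments = load_fragments(path)

print(fragments)
```

Each fragment can then be passed to BeautifulSoup in place of line['text'] in the loop above.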

Answered 2013-10-22T21:18:01.423

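With the latest BeautifulSoup package, the import is now

from bs4 import BeautifulSoup

This will help you avoid running into trouble when you try to run Christian Ternus's script above.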

Answered 2013-10-22T18:58:30.567