
I'm using pickle to save an object graph by dumping the root. When I load the root, it has all of its instance variables and connected object nodes. However, I store all the nodes in a class variable that is a dictionary. The class variable is full before saving, but after I unpickle the data it is empty.

Here is the class I'm using:

from urllib import urlopen  # Python 2
import re
import pickle

class Page():

    __crawled = {}

    def __init__(self, title = '', link = '', relatedURLs = []):
        self.__title = title
        self.__link = link
        self.__relatedURLs = relatedURLs
        self.__related = [] 

    @property
    def relatedURLs(self):
        return self.__relatedURLs

    @property
    def title(self):
        return self.__title

    @property
    def related(self):
        return self.__related

    @property
    def crawled(self):
        return self.__crawled

    def crawl(self,url):
        if url not in self.__crawled:
            webpage = urlopen(url).read()
            patFinderTitle = re.compile('<title>(.*)</title>')
            patFinderLink = re.compile('<link rel="canonical" href="([^"]*)" />')
            patFinderRelated = re.compile('<li><a href="([^"]*)"')

            findPatTitle = re.findall(patFinderTitle, webpage)
            findPatLink = re.findall(patFinderLink, webpage)
            findPatRelated = re.findall(patFinderRelated, webpage)
            newPage = Page(findPatTitle,findPatLink,findPatRelated)
            self.__related.append(newPage)
            self.__crawled[url] = newPage
        else:
            self.__related.append(self.__crawled[url])

    def crawlRelated(self):
        for link in self.__relatedURLs:
            self.crawl(link)

I save it like this:

with open('medTwiceGraph.dat','w') as outf:
    pickle.dump(root,outf)

I load it like this:

def loadGraph(filename): #returns root
    with open(filename,'r') as inf:
        return pickle.load(inf)

root = loadGraph('medTwiceGraph.dat')

All of the data loads except the class variable __crawled.

What am I doing wrong?

4 Answers

Python doesn't really pickle class objects. It simply saves their names and where to find them. From the pickle documentation:

Similarly, classes are pickled by named reference, so the same restrictions in the unpickling environment apply. Note that none of the class's code or data is pickled, so in the following example the class attribute attr is not restored in the unpickling environment:

class Foo:
    attr = 'a class attr'

picklestring = pickle.dumps(Foo)

These restrictions are why picklable functions and classes must be defined in the top level of a module.

Similarly, when class instances are pickled, their class's code and data are not pickled along with them. Only the instance data is pickled. This is done on purpose, so you can fix bugs in a class or add methods to the class and still load objects that were created with an earlier version of the class. If you plan to have long-lived objects that will see many versions of a class, it may be worthwhile to put a version number in the objects so that suitable conversions can be made by the class's __setstate__() method.
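The versioning idea from the docs can be sketched like this (the Record class, the 'label' key, and the version field are illustrative, not part of the question):

```python
import pickle

class Record(object):
    VERSION = 2

    def __init__(self, name):
        self.version = Record.VERSION
        self.name = name

    def __setstate__(self, state):
        # Upgrade state pickled by an older version of the class:
        # version 1 stored the name under the key 'label'.
        if state.get('version', 1) < 2:
            state['name'] = state.pop('label', '')
            state['version'] = 2
        self.__dict__.update(state)

# A current instance round-trips unchanged...
r = pickle.loads(pickle.dumps(Record('new')))

# ...while state saved by version 1 of the class gets converted.
old = Record.__new__(Record)
old.__setstate__({'version': 1, 'label': 'legacy'})
print(old.name)  # 'legacy'
```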

In your example, you can fix the problem by changing __crawled into an instance attribute or a global variable.
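A minimal sketch of the instance-attribute fix, using a trimmed-down hypothetical Page rather than the asker's full class: once the cache lives in self.__dict__, it survives the round trip.

```python
import pickle

class Page(object):
    def __init__(self, title=''):
        self.title = title
        self.crawled = {}   # instance attribute -> stored in self.__dict__

p = Page('home')
p.crawled['http://example.com'] = 'cached page'
restored = pickle.loads(pickle.dumps(p))
print(restored.crawled)   # {'http://example.com': 'cached page'}
```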

Answered 2013-05-19T19:11:42.687

By default, pickle only uses the contents of self.__dict__, and does not use self.__class__.__dict__, which is what you think you want.

I say "what you think you want" because unpickling an instance should not mutate class-level state.

If you want to change this behavior, look at __getstate__ and __setstate__ in the docs.
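One way that could look, as a sketch (the snapshot key name is made up, and note the caveat above: unpickling now deliberately mutates class-level state, which is exactly what the default behavior avoids):

```python
import pickle

class Page(object):
    crawled = {}  # class-level cache, normally skipped by pickle

    def __init__(self, url):
        self.url = url

    def __getstate__(self):
        # Bundle a snapshot of the class-level dict into the instance state.
        state = self.__dict__.copy()
        state['_crawled_snapshot'] = dict(Page.crawled)
        return state

    def __setstate__(self, state):
        # Merge the snapshot back into the class on unpickling.
        Page.crawled.update(state.pop('_crawled_snapshot'))
        self.__dict__.update(state)

Page.crawled['http://example.com'] = 'seen'
data = pickle.dumps(Page('http://example.com'))
Page.crawled.clear()               # simulate a fresh interpreter
p = pickle.loads(data)
print('http://example.com' in Page.crawled)  # True
```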

Answered 2013-05-19T17:42:13.207

For anyone interested: what I did was create a superclass, Graph, which contains the instance variable __crawled, and moved my crawling functions into Graph. Page now only contains attributes describing the page and its related pages. I pickle my Graph instance, which contains all of my Page instances. Here is my code.

from urllib import urlopen
#from bs4 import BeautifulSoup
import re
import pickle

###################CLASS GRAPH####################
class Graph(object):
    def __init__(self,roots = [],crawled = {}):
        self.__roots = roots
        self.__crawled = crawled
    @property
    def roots(self):
        return self.__roots
    @property
    def crawled(self):
        return self.__crawled
    def crawl(self,page,url):
        if url not in self.__crawled:
            webpage = urlopen(url).read()
            patFinderTitle = re.compile('<title>(.*)</title>')
            patFinderLink = re.compile('<link rel="canonical" href="([^"]*)" />')
            patFinderRelated = re.compile('<li><a href="([^"]*)"')

            findPatTitle = re.findall(patFinderTitle, webpage)
            findPatLink = re.findall(patFinderLink, webpage)
            findPatRelated = re.findall(patFinderRelated, webpage)
            newPage = Page(findPatTitle,findPatLink,findPatRelated)
            page.related.append(newPage)
            self.__crawled[url] = newPage
        else:
            page.related.append(self.__crawled[url])

    def crawlRelated(self,page):
        for link in page.relatedURLs:
            self.crawl(page,link)
    def crawlAll(self,obj,limit = 2,i = 0):
        print 'number of crawled pages:', len(self.crawled)
        i += 1
        if i > limit:
            return
        else:
            for rel in obj.related:
                print 'crawling', rel.title
                self.crawlRelated(rel)
            for rel2 in obj.related:
                self.crawlAll(rel2,limit,i)          
    def loadGraph(self,filename):
        with open(filename,'r') as inf:
            return pickle.load(inf)
    def saveGraph(self,obj,filename):
        with open(filename,'w') as outf:
            pickle.dump(obj,outf)
###################CLASS PAGE#####################
class Page(Graph):
    def __init__(self, title = '', link = '', relatedURLs = []):
        self.__title = title
        self.__link = link
        self.__relatedURLs = relatedURLs
        self.__related = []      
    @property
    def relatedURLs(self):
        return self.__relatedURLs 
    @property
    def title(self):
        return self.__title
    @property
    def related(self):
        return self.__related
####################### MAIN ######################
def main(seed):
    print 'doing some work...'
    webpage = urlopen(seed).read()

    patFinderTitle = re.compile('<title>(.*)</title>')
    patFinderLink = re.compile('<link rel="canonical" href="([^"]*)" />')
    patFinderRelated = re.compile('<li><a href="([^"]*)"')

    findPatTitle = re.findall(patFinderTitle, webpage)
    findPatLink = re.findall(patFinderLink, webpage)
    findPatRelated = re.findall(patFinderRelated, webpage)

    print 'found the webpage', findPatTitle

    #root = Page(findPatTitle,findPatLink,findPatRelated)
    G = Graph([Page(findPatTitle,findPatLink,findPatRelated)])
    print 'crawling related...'
    G.crawlRelated(G.roots[0])
    G.crawlAll(G.roots[0])  
    print 'now saving...'
    G.saveGraph(G, 'medTwiceGraph.dat')
    print 'done'
    return G
#####################END MAIN######################

#'http://medtwice.com/am-i-pregnant/'
#'medTwiceGraph.dat'

#G = main('http://medtwice.com/menopause-overview/')
#print G.crawled


def loadGraph(filename):
    with open(filename,'r') as inf:
        return pickle.load(inf)

G = loadGraph('medTwiceGraph.dat')
print G.roots[0].title
print G.roots[0].related
print G.crawled

for key in G.crawled:
    print G.crawled[key].title
Answered 2013-05-20T03:25:14.200

Using dill can solve this problem.
dill package: https://pypi.python.org/pypi/dill
Reference: https://stackoverflow.com/a/28543378/6301132

Based on the asker's code, it becomes:

# note: the file must be opened in binary mode
import dill

#save
with open('medTwiceGraph.dat','wb') as outf:
    dill.dump(root,outf)
#load
def loadGraph(filename): #returns root
    with open(filename,'rb') as inf:
        return dill.load(inf)

root = loadGraph('medTwiceGraph.dat')

I wrote another example:

#Another example (with Python 3.x)

import dill
import os

class Employee: 

    def __init__(self, name='', contact={}):
        self.name = name
        self.contact = contact

    def print_self(self):
        print(self.name, self.contact)

#save
def save_employees():
    global emp
    with open('employees.dat','wb') as fh:
        dill.dump(emp,fh)

#load
def load_employees():
    global emp
    if os.path.exists('employees.dat'):
        with open('employees.dat','rb') as fh:
            emp=dill.load(fh)

#---
emp=[]
load_employees()
print('loaded:')
for tmpe in emp:
    tmpe.print_self()

e=Employee() #new employee
if len(emp)==0:
    e.name='Jack'
    e.contact={'phone':'+086-12345678'}
elif len(emp)==1:
    e.name='Jane'
    e.contact={'phone':'+01-15555555','email':'a@b.com'}
else:
    e.name='sb.'
    e.contact={'telegram':'x'}
emp.append(e)

save_employees()
Answered 2016-05-06T15:55:07.823