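I think I have finally found a solution that fits my problem... Here I will show how to store the relationships in two ways: using the nested set model in a relational database, and using a key-value based solution with persistence.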
Solution 1: Pull-your-hair-out code: Nested Set Model
CREATE TABLE identity
(
    id serial NOT NULL,
    identity_type_id integer NOT NULL,
    "number" character varying(50) NOT NULL,
    CONSTRAINT identity_pkey PRIMARY KEY (id),
    CONSTRAINT identity_identity_type_id_fkey FOREIGN KEY (identity_type_id)
        REFERENCES config_identity_type (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION,
    CONSTRAINT identity_1 UNIQUE (identity_type_id, number)
);
CREATE TABLE identity_related
(
    id serial NOT NULL,
    identity_id integer NOT NULL,
    is_processed boolean NOT NULL DEFAULT false,
    ref_no character varying(20) NOT NULL,
    lft integer,
    rgt integer,
    CONSTRAINT identity_related_pkey PRIMARY KEY (id),
    CONSTRAINT identity_related_identity_id_fkey FOREIGN KEY (identity_id)
        REFERENCES identity (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
);
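For every row I process, I fetch all the identity numbers on that row, generate a unique reference number, and then, using the nested set model, set the respective left and right values in identity_related. It works.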
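To make the left/right numbering concrete, here is a minimal sketch of how the values come out for one group of related records. It uses the same arithmetic as the code further down (lft counts up from 1, rgt counts down from twice the group size); the function name `assign_nested_bounds` and the sample IDs are made up for illustration.

```python
def assign_nested_bounds(member_ids):
    """Return {member_id: (lft, rgt)}, nesting each member inside the previous one."""
    total = len(member_ids)
    bounds = {}
    for position, member_id in enumerate(member_ids):
        lft = position + 1          # counts up: 1, 2, 3, ...
        rgt = total * 2 - position  # counts down: 2n, 2n - 1, ...
        bounds[member_id] = (lft, rgt)
    return bounds


bounds = assign_nested_bounds([101, 102, 103])
print(bounds)  # {101: (1, 6), 102: (2, 5), 103: (3, 4)}
```

Each record's interval sits strictly inside the interval of the record before it, so the whole group forms a single chain under one reference number.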
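The only challenge, as is often pointed out for the nested set model, is that updating the set is punishing. In my case, I need to check whether each ID number has already been saved, get the IDs it was saved with, then check the ID numbers those were saved with in turn, and so on until nothing new turns up...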
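For illustration only, that chase amounts to collecting everything reachable from a starting ID number through shared reference numbers. The sketch below does this in memory with a breadth-first search; it is not the database code from this answer, and it assumes the rows have already been loaded as hypothetical `(id_number, ref_no)` pairs.

```python
from collections import defaultdict, deque


def related_group(rows, start_number):
    """Return every ID number reachable from start_number via shared ref_no values."""
    numbers_by_ref = defaultdict(set)
    refs_by_number = defaultdict(set)
    for id_number, ref_no in rows:
        numbers_by_ref[ref_no].add(id_number)
        refs_by_number[id_number].add(ref_no)

    seen = {start_number}
    queue = deque([start_number])
    while queue:
        current = queue.popleft()
        # Visit every ID number that shares a reference number with the current one.
        for ref_no in refs_by_number[current]:
            for neighbour in numbers_by_ref[ref_no]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return seen


# Hypothetical sample data: B2 links the r1 and r2 submissions together.
rows = [("A1", "r1"), ("B2", "r1"), ("B2", "r2"), ("C3", "r2"), ("D4", "r3")]
group = related_group(rows, "A1")
print(sorted(group))  # ['A1', 'B2', 'C3']
```

Each ID number is visited once here, whereas the database version has to re-query for every step of the chase.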
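Once the iteration is done, I generate a new reference number and set lft and rgt on all the fetched IDs. The query works. But for about 1 million identity_related entries it ran for 5 days, and only stopped because I killed it on the 5th day; by then it had completed about 700,000 IDs.
The code looks like this: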
def relate_identities(self, is_processed):
    # http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
    # http://www.sitepoint.com/hierarchical-data-database-2/
    # http://www.pure-performance.com/2009/03/managing-hierarchical-data-in-sql/
    # http://www.sqlalchemy.org/trac/browser/examples/nested_sets/nested_sets.py
    identify = Identity()
    session = Session()
    entries = []
    related = []
    tbl = IdentityRelated
    log_counter = 0
    id_counter = 0
    while True:
        # Get the initial record
        identity = session.query(tbl).filter(tbl.is_processed == is_processed).order_by(tbl.id).first()
        # Stop once no unprocessed record is left (check before appending,
        # otherwise the list is never empty and the loop never ends)
        if identity is None:
            break
        entries.append({identity: 'not_processed'})
        related.append(identity)
        while True:
            # Keep going until every fetched entry has been processed
            pending = any(value == 'not_processed'
                          for entry in entries
                          for value in entry.values())
            if not pending:
                break
            for entry in entries:
                for key, value in entry.items():
                    if value == 'not_processed':
                        # Get objects which have the same identity_id as current object
                        duplicates = session.query(tbl).filter(tbl.identity_id == key.id).\
                            order_by(tbl.id).all()
                        for duplicate in duplicates:
                            if duplicate not in related:
                                related.append(duplicate)
                                entries.append({duplicate: 'not_processed'})
            # Iterate over a copy because entries is modified inside the loop
            for entry in list(entries):
                for key, value in entry.items():
                    if value == 'not_processed':
                        # Get objects that have the same reference numbers as all entries that we have fetched so far
                        ref_nos = session.query(tbl).filter(tbl.ref_no == key.ref_no).order_by(tbl.id).all()
                        for ref_no in ref_nos:
                            if ref_no not in related:
                                related.append(ref_no)
                                entries.append({ref_no: 'not_processed'})
                        # Remove current entry from entries
                        entries.remove(entry)
                        # Add the entry but change the status
                        entries.append({key: 'processed'})
        # Generate a new RelationCode
        while True:
            ref_no = get_reference_no(REFERENCE_NO.idrelation)
            params = {'key': 'relation', 'relation': ref_no}
            if identify.get_identity(session, **params) is None:
                break
        # Add each relatedID to the DB and set the Nested Set Value
        # Set is_processed as True to ensure we don't run it again
        relation_counter = 0
        for entry in entries:
            for key, value in entry.items():
                key.ref_no = ref_no
                key.lft = relation_counter + 1
                key.rgt = (len(related) * 2) - relation_counter
                key.is_processed = True
                relation_counter += 1
        # Reset values
        log_counter += 1
        id_counter += 1
        related = []
        entries = []
        # Commit the session
        session.commit()
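Even if this code were optimized and made faster, querying for related IDs still involves getting the ID I want, getting the associated reference number, and then running a SQL DISTINCT search on that reference number to get the distinct ID numbers related to that identity.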
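As a rough sketch of that lookup, the example below runs the three steps as one query. It makes several assumptions: it uses SQLite in memory instead of PostgreSQL so it can run anywhere, the tables are cut down to the columns the lookup needs, and the function name `related_numbers` and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE identity (id INTEGER PRIMARY KEY, number TEXT NOT NULL);
    CREATE TABLE identity_related (
        id INTEGER PRIMARY KEY,
        identity_id INTEGER NOT NULL REFERENCES identity (id),
        ref_no TEXT NOT NULL
    );
    INSERT INTO identity (id, number) VALUES (1, 'A1'), (2, 'B2'), (3, 'C3');
    -- B2 appears twice under R-1 to show why DISTINCT is needed.
    INSERT INTO identity_related (identity_id, ref_no)
    VALUES (1, 'R-1'), (2, 'R-1'), (2, 'R-1'), (3, 'R-2');
""")


def related_numbers(conn, number):
    """Return the distinct ID numbers that share a reference number with `number`."""
    query = """
        SELECT DISTINCT other.number
        FROM identity AS me
        JOIN identity_related AS mine ON mine.identity_id = me.id
        JOIN identity_related AS theirs ON theirs.ref_no = mine.ref_no
        JOIN identity AS other ON other.id = theirs.identity_id
        WHERE me.number = ?
        ORDER BY other.number
    """
    return [row[0] for row in conn.execute(query, (number,))]


result = related_numbers(conn, "A1")
print(result)  # ['A1', 'B2']
```

The result includes the queried number itself, since it shares the reference number with its own row.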
Solution 2: 3 lines of code: NoSQL - Redis Key:Value Sets
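Back to the drawing board, i.e. Google. Did I search for "storing related identity numbers"? Yes, I was that desperate... That turned up an Instagram article, Storing hundreds of millions of simple key-value pairs in Redis. Suffice it to say Redis is my new best friend, especially since it took me 10 minutes to read the introduction, 5 minutes to get it installed, and 40 minutes to go through 3 basic tutorials. After that, it took me 3 hours to actually solve my problem, which with Redis basically meant working out the most efficient way to store key:value pairs of my identity numbers. Now I think I have solved it in 4 lines of code using Redis Sets.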
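After fetching the three identity numbers that were submitted together, I create a list called relation. Redis Sets cannot hold duplicate values, so even if my three identity numbers are submitted many times over, the length of my set never grows, and I do not end up with duplicate values as in the relational database above. If an extra 4th ID is added, my set grows by 1. Importantly, for the same number of identities, this code took 2 hours 23 minutes, with a total memory consumption of: 'used_memory_peak_human':'143.11M'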
for related_outer in relation:
    # Create a set using the ID_Number as the key, and the other ID numbers as the values
    for related_inner in relation:
        redis_db.sadd(related_outer.number, related_inner.number)
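The set behaviour described above can be checked without a running server. In the sketch below, `FakeSetStore` is a hypothetical in-memory stand-in that only mimics the `sadd`/`smembers` call shape of a Redis client, and `store_relation` is an illustrative name for the same double loop applied to plain strings. With a real redis-py client you would pass that client instead; note that `smembers` then returns bytes unless the client was created with `decode_responses=True`.

```python
def store_relation(client, numbers):
    """For each ID number, keep a set (keyed by that number) of all co-submitted numbers."""
    for outer in numbers:
        for inner in numbers:
            client.sadd(outer, inner)


class FakeSetStore:
    """In-memory stand-in for a Redis client; an assumption made for this example only."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *values):
        members = self._sets.setdefault(key, set())
        before = len(members)
        members.update(values)
        return len(members) - before  # number of newly added members

    def smembers(self, key):
        return set(self._sets.get(key, set()))


store = FakeSetStore()
store_relation(store, ["A1", "B2", "C3"])
store_relation(store, ["A1", "B2", "C3"])        # resubmitting adds nothing
size_after_resubmit = len(store.smembers("A1"))  # 3
store_relation(store, ["A1", "B2", "C3", "D4"])  # a 4th ID grows each set by 1
members = store.smembers("A1")
print(sorted(members))  # ['A1', 'B2', 'C3', 'D4']
```

Looking up the IDs related to a number is then a single `smembers` call on that number's key.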
My new best friend. Redis...
I welcome any information that would improve the above, or completely new ways of storing relationships.