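I have a large amount of data that I access through generators/iterators. While processing the data set, I need to determine whether any other record in it has an attribute with the same value as an attribute of the record currently being processed. One way to do this is with nested for loops. For example, to process a database of students, I could do something like this: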
def fillStudentList():
    # TODO: Add some code here to fill
    # a student list
    return iter(())  # placeholder: an empty iterator so the loops below run

students = fillStudentList()
sameLastNames = list()
for student1 in students:
    # Re-create the inner iterator for every outer record
    students2 = fillStudentList()
    for student2 in students2:
        if student1.lastName == student2.lastName:
            sameLastNames.append((student1, student2))
Of course, the snippet above could be improved in many ways; its only purpose is to show the nested for loop pattern.
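One way the nested loop could be improved, as a minimal sketch: group the records by the attribute in a single pass, so the data is read once instead of once per record. The helper name `group_by_attr`, the `Record` namedtuple, and the sample data are made up for illustration, and the sketch assumes the grouped records fit in memory.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a student record; not part of the original code.
Record = namedtuple("Record", ["firstName", "lastName"])


def group_by_attr(records, attr):
    """Read `records` once; return {value: [records]} for values seen more than once."""
    groups = defaultdict(list)
    for rec in records:
        groups[getattr(rec, attr)].append(rec)
    # Keep only the values shared by at least two records.
    return {val: recs for val, recs in groups.items() if len(recs) > 1}


def student_gen():
    # Assumed sample data, yielded lazily the way a real source might.
    yield Record("James", "Smith")
    yield Record("Jill", "Jones")
    yield Record("Anna", "Smith")


shared = group_by_attr(student_gen(), "lastName")
```

This does one dictionary lookup per record instead of a full rescan, at the cost of holding every record in the grouping dictionary until the filter at the end.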
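Now suppose we have a Student class, a Students class (which is an iterator), and a Source class that provides access to the data in some memory-efficient way (say, through another iterator)...
Below I sketch out what this code might look like. Does anyone have ideas on how to improve this implementation? The goal is to be able to find records with the same attribute in a very large data set, so that the filtered set can then be processed.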
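#!/usr/bin/python
try:
    from itertools import ifilter  # Python 2
except ImportError:
    ifilter = filter  # Python 3: the built-in filter is already lazy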

class Student(object):
    """
    A class that represents the first name, last name, and
    grade of a student.
    """
    def __init__(self, firstName, lastName, grade='K'):
        """
        Initializes a Student object
        """
        self.firstName = firstName
        self.lastName = lastName
        self.grade = grade

class Students(object):
    """
    An iterator for a collection of students
    """
    def __init__(self, source):
        """
        Wrap a Source and start iterating over its data
        """
        self._source = source
        self._source_iter = source.get_iter()
        self._reset = False

    def __iter__(self):
        return self

    def next(self):
        try:
            if self._reset:
                # The previous pass finished, so start a new one
                self._source_iter = self._source.get_iter()
                self._reset = False
            return next(self._source_iter)
        except StopIteration:
            self._reset = True
            raise StopIteration

    __next__ = next  # Python 3 name for the same method

    def select(self, attr, val):
        """
        Return all of the Students with a given
        attribute
        """
        #select_iter = self._source.get_iter()
        select_iter = self._source.filter(attr, val)
        for selection in select_iter:
            # if (getattr(selection, attr) == val):
            #     yield selection
            yield selection

class Source(object):
    """
    A source of data that can provide an iterator to
    all of the data or provide an iterator to the
    data based on some attribute
    """
    def __init__(self, data):
        self._data = data

    def get_iter(self):
        """
        Return an iterator to the data
        """
        return iter(self._data)

    def filter(self, attr, val):
        """
        Return an iterator to the data filtered by some
        attribute
        """
        return ifilter(lambda rec: getattr(rec, attr) == val, self._data)

def test_it():
    """
    Print each student, followed by the other students
    who share the same first name
    """
    studentList = [Student("James","Smith","6"),
                   Student("Jill","Jones","6"),
                   Student("Bill","Deep","5"),
                   Student("Bill","Sun","5")]
    source = Source(studentList)
    students = Students(source)
    for student in students:
        print(student.firstName)
        for same_names in students.select('firstName', student.firstName):
            if same_names.lastName == student.lastName:
                continue
            else:
                print(" %s %s in grade %s has your same first name" %
                      (same_names.firstName, same_names.lastName, same_names.grade))

if __name__ == '__main__':
    test_it()
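
For data too large to group in memory, one possible direction is two passes: count how often each attribute value occurs, then stream the records again and keep only those whose value occurred more than once. This is only a sketch under assumptions: `shared_attr_records` and `Rec` are hypothetical names, `make_iter` is assumed to be a callable that returns a fresh iterator on every call (much like `Source.get_iter` above), and the set of distinct values is assumed to fit in memory even if the records do not.

```python
from collections import Counter, namedtuple

# Hypothetical record type, used only to keep the sketch self-contained.
Rec = namedtuple("Rec", ["firstName", "lastName", "grade"])


def shared_attr_records(make_iter, attr):
    """Lazily yield records whose `attr` value appears on at least two records.

    `make_iter` must return a fresh iterator over the data on every call.
    """
    # Pass 1: memory grows with the number of distinct values, not records.
    counts = Counter(getattr(rec, attr) for rec in make_iter())
    # Pass 2: stream the data again and filter on the counts.
    for rec in make_iter():
        if counts[getattr(rec, attr)] > 1:
            yield rec


# Same sample students as in test_it(), rebuilt with the stand-in type.
data = [Rec("James", "Smith", "6"), Rec("Jill", "Jones", "6"),
        Rec("Bill", "Deep", "5"), Rec("Bill", "Sun", "5")]

same_first = list(shared_attr_records(lambda: iter(data), "firstName"))
```

The function is a generator, so the filtered set can be processed record by record without being materialized; the counting pass only runs when the first record is requested. The trade-off is that the source is read twice.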