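As the end result, I have two objects named "members" and "pcps", but each is really just a bunch of separate string objects. I need to collect them into a list so that I can add them to a DataFrame and ultimately export it as an Excel sheet.
The problem arises somewhere while I scrape the text from the PDF: the data does not come out as a list of lists. I'm wondering whether, around the point where I try to create the "members" Series, I can somehow merge these separate objects into a single list.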
import PyPDF2
import pandas as pd


def PDFsearch(origFileName):
    # creating a pdf File object of original pdf
    pdfFileObj = open(origFileName, 'rb')
    # creating a pdf Reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    numPages = pdfReader.numPages
    print(numPages)
    for p in range(pdfReader.numPages):
        # creating page object
        pageObj = pdfReader.getPage(p)
        # extract text from pageObj into a unicode string object
        pages = pageObj.extractText()
        # loop through the string object line by line
        pges = []
        for page in pages.split("\n"):
            # split each line into words
            pges.append(page)
            lns = []
            for lines in page.split(" "):
                for line in lines.split(","):  # separate the ,"This" from the last name
                    lns.append(line)
            names = list()
            if lns[0] == "Dear":  # If first word in a line is "Dear"
                names.append(lns[1:4])  # Items 2 to 4: first name, last name, leftover from the comma split
                for name in names:
                    members = " ".join(name)  # These are the names we want
            PCPs = lns[78:85]
            pcps = " ".join(PCPs)
        # a new one-element Series is created on every page
        providers = pd.Series(pcps)
        members = pd.Series(members)
'''This is what I get when I print the series 'members':
0 LAILIA TAYLOR
dtype: object
0 LATASIA WILLIS
dtype: object
0 LAURYN ALLEN
dtype: object
0 LAYLA ALVARADO
dtype: object
0 LAYLA BORELAND
dtype: object
0 LEANIAH MULLIGAN
dtype: object
All separate objects! The same happens with 'providers', and when I create a DataFrame and export it to Excel I only get one row.'''
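
For reference, here is a minimal sketch of the shape I think I am after, not a confirmed fix. It assumes the per-page text has already been extracted into a list of strings. The helpers `tokenize`, `collect_columns` and `columns_to_excel` are hypothetical names invented for this illustration. The token positions (`[1:4]` for the name, `78:85` for the PCP) are copied from my code above and may not hold for other letters. The idea is to create the two lists once, before the loop, append one value per page, and build a single DataFrame at the end instead of a new one-element Series per page.

```python
def tokenize(text):
    """Split on spaces, then on commas, like the loops in PDFsearch."""
    tokens = []
    for word in text.split(" "):
        tokens.extend(word.split(","))
    return tokens


def collect_columns(page_texts, pcp_slice=slice(78, 85)):
    """Gather one member name and one PCP string per letter page.

    page_texts: iterable of strings, one per PDF page (assumed input).
    pcp_slice: token positions of the PCP name (assumption taken from
               the question's lns[78:85]).
    """
    members, providers = [], []  # created once, outside the loop
    for text in page_texts:
        tokens = tokenize(text)
        if tokens[0] != "Dear":  # skip pages that are not letters
            continue
        # strip() drops the empty token left behind by the comma split
        members.append(" ".join(tokens[1:4]).strip())
        providers.append(" ".join(tokens[pcp_slice]).strip())
    return {"member": members, "pcp": providers}


def columns_to_excel(columns, path):
    """Build one DataFrame from the collected lists and write it out."""
    import pandas as pd  # local import so the parsing part runs without pandas

    frame = pd.DataFrame(columns)  # one row per appended value
    frame.to_excel(path, index=False)
    return frame
```

With something like this, `collect_columns(page_texts)` would return two equal-length lists, and `columns_to_excel(columns, "members.xlsx")` would write one row per member (the file name is just an example, and writing `.xlsx` files needs an Excel engine such as openpyxl installed).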