
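As the end product, I have objects named "members" and "pcps", which are actually just a bunch of separate string objects. I need to combine them into a list so that I can add them to a DataFrame and eventually export them as an Excel sheet.

The problem arises somewhere while I scrape the text data from the PDF: the result does not come out as a list-of-lists data structure. I'm wondering whether, around the point where I try to create the "members" Series, I can somehow merge these separate objects into a single list.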


import PyPDF2
import pandas as pd

def PDFsearch(origFileName): 

    # creating a pdf File object of original pdf 
    pdfFileObj = open(origFileName, 'rb')  
    # creating a pdf Reader object 
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 

    numPages = pdfReader.numPages
    print(numPages)
    for p in range(pdfReader.numPages): 

        # creating page object 
        pageObj = pdfReader.getPage(p)
        #extract txt from pageObj into unicode string object
        pages = pageObj.extractText()
        # loop through string object by page
        pges = []


        for page in pages.split("\n"):
            # split the pages into words
            pges.append(page)

            lns = []            
            for lines in page.split(" "):
                for line in lines.split(","):   # separate the ,"This" from the last name
                    lns.append(line)

            names = list()
            if lns[0] == "Dear":   # If first word in a line is "Dear"
                names.append(lns[1:4]) # Get the 2nd through 4th items (these include the first and last names)
                for name in names:
                    members = " ".join(name) # These are the names we want

                PCPs = lns[78:85]        
                pcps = " ".join(PCPs)

                providers =  pd.Series(pcps)
                members = pd.Series(members)

'''This is what I get when I print the series 'members':

0    LAILIA TAYLOR 
dtype: object
0    LATASIA WILLIS 
dtype: object
0    LAURYN ALLEN 
dtype: object
0    LAYLA ALVARADO 
dtype: object
0    LAYLA BORELAND 
dtype: object
0    LEANIAH MULLIGAN 
dtype: object

All separate objects!  Same with 'providers'.  and when I create a dataframe and export to excel I only get one row'''

1 Answer

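I only took a quick look, but I believe your problem is that you are overwriting your Series on every iteration. Try something like this: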

# add at the beginning of your function 
temp = pd.DataFrame()
data = pd.DataFrame()

# this would replace where you assign to providers and members
temp['providers'] = pd.Series(pcps)
temp['members'] = pd.Series(members)
data = pd.concat([data, temp]).reset_index(drop=True)

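This way you will still overwrite temp on every iteration, but your data DataFrame will contain all of the members and providers. I hope this helps, good luck!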
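An alternative that may be worth considering is to collect one dict per matching line in a plain Python list and build the DataFrame once at the end, which avoids calling pd.concat inside the loop. The following is only a minimal sketch: the function name collect_rows, the sample lines, and the "PROVIDER A"/"PROVIDER B" values are made up for illustration, and the provider is passed in alongside each line rather than sliced out by position as in the question.

```python
import pandas as pd

def collect_rows(lines_with_providers):
    """Accumulate one row per "Dear ..." line, then build a single DataFrame.

    `lines_with_providers` is a hypothetical stand-in for the text pulled
    out of the PDF: an iterable of (line, provider) pairs.
    """
    rows = []  # created once, outside the loop, so nothing is overwritten
    for line, provider in lines_with_providers:
        # Split on spaces, then on commas, as the question's code does.
        words = [w for chunk in line.split(" ") for w in chunk.split(",")]
        if words and words[0] == "Dear":
            member = " ".join(words[1:3])  # first and last name only
            rows.append({"members": member, "providers": provider})
    # One DataFrame holding every row collected above.
    return pd.DataFrame(rows, columns=["members", "providers"])

# Made-up sample input, not real data from the PDF.
sample = [
    ("Dear LAILIA TAYLOR, This letter ...", "PROVIDER A"),
    ("Some unrelated line", "PROVIDER A"),
    ("Dear LAYLA ALVARADO, This letter ...", "PROVIDER B"),
]
df = collect_rows(sample)
print(df)
```

With the rows gathered in a single DataFrame, a call such as df.to_excel(...) would write every row instead of only the last one; that call needs an Excel writer package such as openpyxl to be installed.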

answered 2019-09-06T19:11:56.187