
I need to extract individual records from log files generated from a fairly archaic system and get them ready for database input. These flat files are all I can extract (and just formatting the query took weeks). Here is an example of a file with two records. The only delimiter I see is "/11 S11-" which is itself at a regular spot 5 characters in, but not quite at the beginning or end.

For those watching, yes, this is related to my other newb question. I have looked at the Python documentation, some Google results, and some related questions. So, my questions are:

a) how to use a delimiter that starts 5 characters into the record?

b) how to grab these big chunks of natural language?

c) how to get rid of the whitespace after newlines? This is probably the easiest part: I can specify in the query how long each field is. Right now, the accessionDate is 10 characters long, the accessionNumber is 10 characters long, and the patMedicalRecordNum is 15 characters long. So the leading whitespace on each finalDxText line is 35 characters.

01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:                           
                                   -  ADENOCARCINOMA.                                                      

                                   TOTAL GLEASON SCORE:  GLEASON 5+4=9                                     
                                   TUMOR LOCATION:  BILATERAL                                              
                                   TUMOR QUANTITATION:  15% OF PROSTATE INVOLVED BY TUMOR
                                   EXTRAPROSTATIC EXTENSION:  PRESENT AT RIGHT POSTERIOR                   
                                   SEMINAL VESICLE INVASION:  PRESENT                                      
                                   MARGINS:  UNINVOLVED                                                    
                                   LYMPHOVASCULAR INVASION:  PRESENT                                       
                                   PERINEURAL INVASION:  PRESENT                                           
                                   LYMPH NODES (SPECIMENS B AND C):                                        
                                      NUMBER EXAMINED:  25                                                 
                                      NUMBER INVOLVED:  1                                                  
                                      DIAMETER OF LARGEST METASTASIS:  1.7 mm                              
                                   ADDITIONAL FINDINGS:  HIGH-GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA,   
                                      ACUTE AND CHRONIC INFLAMMATION, INTRADUCTAL EXTENSION OF INVASIVE    
                                      CARCINOMA                                                            

                                   PATHOLOGIC STAGE:  pT3b N1 MX                                           

                               B.  LYMPH NODES, RIGHT PELVIC, EXCISION:                                    
                                   -  ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).         

                               C.  LYMPH NODES, LEFT PELVIC, EXCISION:                                     
                                   -  EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).                     
01/02/11  S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:                               
                                  - ADENOCARCINOMA.                                                        
                                    GLEASON SCORE:  3 + 3 = 6 WITH TERTIARY PATTERN OF 5.                                             
                                    TUMOR QUANTITATION:  APPROXIMATELY 10% BY VOLUME.                      
                                    TUMOR LOCATION:  BILATERAL.                                            
                                    EXTRAPROSTATIC EXTENSION:  NOT IDENTIFIED.                             
                                    MARGINS:  NEGATIVE.                                                    
                                    PERINEURAL INVASION:  IDENTIFIED.                                      
                                    LYMPH-VASCULAR INVASION:  NOT IDENTIFIED.                              
                                    SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.              
                                    LYMPH NODES:  NONE SUBMITTED.                                          
                                    OTHER:  HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.                
                                   PATHOLOGIC STAGE (pTNM):  pT2c NX. 
3 Answers


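Delimiter

I may be going out on a limb, but looking at your records, notably at 01/01/11 S11-55555 20/444-55-6666, the 01/01/11 looks a bit like a date to me.

So, judging from your input:

  • You can match mm/dd/yy with, for example, a very simple regular expression and re.match.
  • The data inside each record appears to be indented, so a line with no indentation seems to mark the start of a new record.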
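The two observations above can be sketched as follows. This is only a minimal illustration: SAMPLE is a shortened, made-up stand-in for the file in the question, and split_records and DATE_AT_START are hypothetical names, not anything from the original post.

```python
import re

# Hypothetical sample: two shortened records in the same layout as the question's file.
SAMPLE = (
    "01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE, PROSTATECTOMY:\n"
    "                                   -  ADENOCARCINOMA.\n"
    "01/02/11  S11-4444 20/111-22-3333 PROSTATE, PROSTATECTOMY:\n"
    "                                  - ADENOCARCINOMA.\n"
)

# mm/dd/yy; re.match only matches at the start of the string, so indented lines fail.
DATE_AT_START = re.compile(r"\d\d/\d\d/\d\d")

def split_records(text):
    """Group lines into records; a line that starts with a date opens a new record."""
    records = []
    for line in text.splitlines():
        if DATE_AT_START.match(line):
            records.append([line])          # start a new record
        elif records:
            records[-1].append(line)        # continuation of the current record
    return records

records = split_records(SAMPLE)
```

Each element of records is then the list of raw lines belonging to one record, ready for whitespace cleanup.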

Whitespace

my_string.strip() returns my_string with leading and trailing whitespace removed.
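As a small illustration of strip() on lines like the ones in the question (raw_lines is made-up sample data, and joining with a single space is an assumption about the desired result):

```python
# Made-up continuation lines with the heavy left and right padding seen in the file.
raw_lines = [
    "                                   TUMOR LOCATION:  BILATERAL                  ",
    "",
    "                                   MARGINS:  UNINVOLVED      ",
]

# strip() drops leading/trailing whitespace; lines that end up empty are skipped.
cleaned = [line.strip() for line in raw_lines if line.strip()]
text = " ".join(cleaned)

# split() with no arguments splits on any run of whitespace, so re-joining
# also collapses the double spaces inside each line.
collapsed = " ".join(text.split())
```

Here text keeps the internal double spaces, while collapsed has every run of whitespace reduced to a single space.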

Answered 2012-06-01T20:54:15.303

This is an idea:

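    records = []
    with open(filename) as chunky:               # filename: path to your log file
        for line in chunky:
            if line[:1].isdigit():               # a starting line begins with the date
                linedata = line.split(None, 3)   # date, accession number, record number, text
                records.append(linedata[:3] + [linedata[3].strip()])
            elif records and line.strip():       # non-blank continuation line
                records[-1][3] += ' ' + line.strip()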

And for a newb, taking part of a string: line[a:b] returns the characters from index a (counting from 0) up to, but not including, index b. Your S11 would be linedata[1][0:3].
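Slicing can also pull out the fixed-width header fields directly. The sketch below assumes the widths stated in the question (10, 10 and 15 characters, so the text starts at column 35); the variable names are only illustrative.

```python
# First line of the first record from the question.
line = "01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:"

# Assumed layout: date in columns 0-9, accession number in 10-19,
# medical record number in 20-34, diagnosis text from column 35 on.
accession_date = line[0:10].strip()
accession_number = line[10:20].strip()
medical_record_num = line[20:35].strip()
final_dx_text = line[35:].strip()
```

Note that the second sample record in the question has a shorter accession number and does not line up with these columns, so if the widths are not reliable, line.split(None, 3) is the safer choice.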

Answered 2012-06-01T21:15:55.597

I would try something like this:

import re                                # regex module

in_string = """Text from above"""

records = []                             # list to store all records in order
record = ""                              # string to store current record

for line in in_string.splitlines():      # go through each line of the input
    if re.match(r'\d\d/\d\d/\d\d', line):  # match the date at the start (raw string)
        records.append(record)           # add current record to list
        record = ""                      # start new current record
    record += line.strip()               # add line (without whitespace) to current record
records.append(record)                   # add last record to records list

This outputs the following:

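['',
'01/01/11 S11-55555 20/444-55-6666 A. PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:- ADENOCARCINOMA.TOTAL GLEASON SCORE: GLEASON 5+4=9TUMOR LOCATION: BILATERALTUMOR QUANTITATION: 15% OF PROSTATE INVOLVED BY TUMOREXTRAPROSTATIC EXTENSION: PRESENT AT RIGHT POSTERIORSEMINAL VESICLE INVASION: PRESENTMARGINS: UNINVOLVEDLYMPHOVASCULAR INVASION: PRESENTPERINEURAL INVASION: PRESENTLYMPH NODES (SPECIMENS B AND C):NUMBER EXAMINED: 25NUMBER INVOLVED: 1DIAMETER OF LARGEST METASTASIS: 1.7 mmADDITIONAL FINDINGS: HIGH-GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA,ACUTE AND CHRONIC INFLAMMATION, INTRADUCTAL EXTENSION OF INVASIVECARCINOMAPATHOLOGIC STAGE: pT3b N1 MXB. LYMPH NODES, RIGHT PELVIC, EXCISION:- ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).C. LYMPH NODES, LEFT PELVIC, EXCISION:- EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).',
'01/02/11 S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:- ADENOCARCINOMA.GLEASON SCORE: 3 + 3 = 6 WITH TERTIARY PATTERN OF 5.TUMOR QUANTITATION: APPROXIMATELY 10% BY VOLUME.TUMOR LOCATION: BILATERAL.EXTRAPROSTATIC EXTENSION: NOT IDENTIFIED.MARGINS: NEGATIVE.PERINEURAL INVASION: IDENTIFIED.LYMPH-VASCULAR INVASION: NOT IDENTIFIED.SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.LYMPH NODES: NONE SUBMITTED.OTHER: HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.PATHOLOGIC STAGE (pTNM): pT2c NX.']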

Note: this is a crude regex; it will match any line that starts with "nn/nn/nn".

You will probably want to add a space between lines, e.g. record += line.strip() + ' '

Good luck!


You can experiment with regular expressions (regex/re) here: put your regex (i.e. \d\d/\d\d/\d\d S11) in the top box and your text in the bottom box.
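Once a pattern like that works, the same idea can be extended to pull the header fields apart with named groups. This is only a sketch: the group names, the HEADER pattern and the assumption that every accession number starts with S11- are illustrative and not confirmed by the question.

```python
import re

# Hypothetical header pattern: date, accession number, record number, then free text.
HEADER = re.compile(
    r"(?P<date>\d\d/\d\d/\d\d)\s+"     # mm/dd/yy at the very start of the line
    r"(?P<accession>S11-\d+)\s+"       # assumes accession numbers look like S11-<digits>
    r"(?P<mrn>\S+)\s+"                 # medical record number: no internal whitespace
    r"(?P<text>.*)"                    # rest of the line is the diagnosis text
)

line = "01/02/11  S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:   "
m = HEADER.match(line)
# Strip each captured value; fields is None when the line is not a record header.
fields = {k: v.strip() for k, v in m.groupdict().items()} if m else None

# An indented continuation line does not match, because re.match anchors at column 0.
continuation = HEADER.match("                                  - ADENOCARCINOMA.")
```

The resulting dictionary maps naturally onto database columns, and unlike fixed-width slicing it tolerates accession numbers of different lengths.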

Answered 2012-06-01T21:10:45.460