我需要从前列腺切除术最终诊断报告的平面文件中提取 Gleason 分数。这些分数总是有 Gleason 这个词和两个数字相加成另一个数字。人类在二十多年内输入了这些。包括各种空格和修饰符的约定。以下是迄今为止我的 Backus-Naur 表格,以及两个示例记录。仅针对前列腺切除术,我们正在研究超过一千个病例。
我使用 pyparsing 是因为我正在学习 python,并且对我对正则表达式写作的有限接触没有美好的回忆。
我的问题:如何在不解析可能出现在这些最终诊断中或可能不在这些最终诊断中的所有其他可选数据的情况下提取这些 Gleason 等级?
num = Word(nums)
record ::= accessionDate + accessionNumber + patMedicalRecordNum + finalDxText
accessionDate ::= num + "/" + num + "/" num
accessionNumber ::= "S" + num + "-" + num
patMedicalRecordNum ::= num + "/" + num + "-" + num + "-" + num
finalDxText ::= listOfParts + optionalComment + optionalpTNMStage
listOfParts ::= OneOrMore(part)
part ::= <multiline idiosyncratic freetext which may contain a Gleason score I want> + optionalpTNMStage
optionalComment ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>
optionalpTNMStage ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>
01/01/11 S11-55555 20/444-55-6666 A. PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:
- ADENOCARCINOMA.
TOTAL GLEASON SCORE: GLEASON 5+4=9
TUMOR LOCATION: BILATERAL
TUMOR QUANTITATION: 15% OF PROSTATE INVOLVED BY TUMOR
EXTRAPROSTATIC EXTENSION: PRESENT AT RIGHT POSTERIOR
SEMINAL VESICLE INVASION: PRESENT
MARGINS: UNINVOLVED
LYMPHOVASCULAR INVASION: PRESENT
PERINEURAL INVASION: PRESENT
LYMPH NODES (SPECIMENS B AND C):
NUMBER EXAMINED: 25
NUMBER INVOLVED: 1
DIAMETER OF LARGEST METASTASIS: 1.7 mm
ADDITIONAL FINDINGS: HIGH-GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA,
ACUTE AND CHRONIC INFLAMMATION, INTRADUCTAL EXTENSION OF INVASIVE
CARCINOMA
PATHOLOGIC STAGE: pT3b N1 MX
B. LYMPH NODES, RIGHT PELVIC, EXCISION:
- ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).
C. LYMPH NODES, LEFT PELVIC, EXCISION:
- EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).
01/02/11 S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:
- ADENOCARCINOMA.
GLEASON SCORE: 3 + 3 = 6 WITH TERTIARY PATTERN OF 5.
TUMOR QUANTITATION: APPROXIMATELY 10% BY VOLUME.
TUMOR LOCATION: BILATERAL.
EXTRAPROSTATIC EXTENSION: NOT IDENTIFIED.
MARGINS: NEGATIVE.
PERINEURAL INVASION: IDENTIFIED.
LYMPH-VASCULAR INVASION: NOT IDENTIFIED.
SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.
LYMPH NODES: NONE SUBMITTED.
OTHER: HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.
PATHOLOGIC STAGE (pTNM): pT2c NX.
全面披露:我是一名从事研究的医生;这是我第一次真正使用 python 工作。我读过 Lutz 的 Learning Python,Shaw 的 Learning Python the Hard Way,并解决了各种问题集。我在这个论坛 pyparsing wiki 上查看了许多与 pyparsing 相关的问题,我购买并阅读了 McGuire 先生的 Pyparsing 入门。也许我在问一个问题,当我真的应该被告知我正站在“当你必须编写解析器时,这种沮丧的死亡螺旋很常见”(McGuire,17)?我不知道。到目前为止,我很高兴能够从事可能实际上是一个真正的项目。