
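I have a large pile of text files that I need to parse (see below). They contain chapter information that I want to capture and use to create associated records. Report has_many :chapters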

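Basically, I need to read each line, capture the chapter name from each BookmarkTitle (ignoring the CR), and then capture the BookmarkPageNumber. Then I bundle the pair and use it to create a new record: report.page.create(title: bookmark_title, page_number: bookmark_page_number)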

I played around a bit with IO's readline but am not sure how to capture the content... maybe a regex? Or is there a more Rails-like way?

Sample txt file:

InfoKey: Creator
InfoValue: Adobe Acrobat 9.3.4
InfoKey: Producer
InfoValue: Adobe Acrobat 9.34 Paper Capture Plug-in
InfoKey: ModDate
InfoValue: D:20110315193536-04'00'
InfoKey: CreationDate
InfoValue: D:20110208171413-05'00'
PdfID0: 2dab1ce43882a53cbc24dbb839f921f8
PdfID1: 43b19192e920f38f65de0bf0a2be
NumberOfPages: 258
BookmarkTitle: 1980 Field Service Annual Report
BookmarkLevel: 1
BookmarkPageNumber: 3
BookmarkTitle: TABLE OF CONTENTS
BookmarkLevel: 1
BookmarkPageNumber: 4
BookmarkTitle: LIST OF EXHIBITS
BookmarkLevel: 1
BookmarkPageNumber: 7
BookmarkTitle: I - INTRODUCTION
BookmarkLevel: 1
BookmarkPageNumber: 11
BookmarkTitle: II - EXECUTIVE SUMMARY
BookmarkLevel: 1
BookmarkPageNumber: 16
BookmarkTitle: III - RESULTS AND ANALYSIS OF THE MAINTENANCE USER SURVEY
BookmarkLevel: 1
BookmarkPageNumber: 45
BookmarkTitle: IV - COMPARATIVE ANALYSIS OF BIGCO AND OTHER MAINTENANCE VENDORS
BookmarkLevel: 1
BookmarkPageNumber: 102
BookmarkTitle: V - RESULTS OF VENDOR SURVEY
BookmarkLevel: 1
BookmarkPageNumber: 127
BookmarkTitle: VI - SIGNIFICANT VENDOR ACTIVITIES, 1979-1980
BookmarkLevel: 1
BookmarkPageNumber: 190
BookmarkTitle: APPENDIX A:  DEFINITIONS
BookmarkLevel: 1
BookmarkPageNumber: 199
BookmarkTitle: APPENDIX B:  RESEARCH METHODOLOGY
BookmarkLevel: 1
BookmarkPageNumber: 204
BookmarkTitle: APPENDIX C:  SUPPORTING CHARTS
BookmarkLevel: 1
BookmarkPageNumber: 211
BookmarkTitle: APPENDIX D:  USER QUESTIONNAIRE
BookmarkLevel: 1
BookmarkPageNumber: 222
BookmarkTitle: APPENDIX E:  VENDOR QUESTIONNAIRE
BookmarkLevel: 1
BookmarkPageNumber: 237

1 Answer

/^BookmarkTitle:\s*(.+?)\s*BookmarkLevel:\s*(\d+)\s*BookmarkPageNumber:\s*(\d+)\s*$/m

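Sorry, I'm not a Ruby on Rails developer, but this regex will match each bookmark and return:

  • the whole record
  • the title (as a submatch)
  • the level (as a submatch)
  • the page number (as a submatch)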
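
As a rough illustration of how such a pattern might be applied in plain Ruby, here is a minimal sketch using String#scan. The method name parse_bookmarks and the shortened sample text are made up for this example, and the pattern assumes the carriage return after each title is ordinary whitespace that \s* can absorb. Note that scan yields only the submatches, not the whole record.

```ruby
# Hypothetical helper: returns one hash per bookmark found in the dump text.
def parse_bookmarks(text)
  # In Ruby, /m lets "." match newlines; the lazy (.+?) still stops at the
  # first point where "BookmarkLevel:" follows, so the title stays on one line.
  pattern = /^BookmarkTitle:\s*(.+?)\s*BookmarkLevel:\s*(\d+)\s*BookmarkPageNumber:\s*(\d+)\s*$/m

  # With capture groups, scan returns an array of [title, level, page] arrays.
  text.scan(pattern).map do |title, level, page|
    { title: title, level: level.to_i, page_number: page.to_i }
  end
end

# Shortened sample in the same layout as the file in the question.
sample = <<~DUMP
  NumberOfPages: 258
  BookmarkTitle: 1980 Field Service Annual Report
  BookmarkLevel: 1
  BookmarkPageNumber: 3
  BookmarkTitle: TABLE OF CONTENTS
  BookmarkLevel: 1
  BookmarkPageNumber: 4
DUMP

parse_bookmarks(sample).each do |bookmark|
  # In a Rails app this is where a record could be created, for example with
  # the call from the question (assumption, not tested here):
  #   report.page.create(title: bookmark[:title], page_number: bookmark[:page_number])
  puts "#{bookmark[:page_number]}: #{bookmark[:title]}"
end
```

Reading the whole file into a string first (for instance with File.read) keeps the three lines of each bookmark together, which is what lets a single pattern pair a title with its page number.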

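The regex does assume that the level and page number are plain digits with no spaces, commas, or decimal points, but that is easy to change.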
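
One possible way to relax that assumption, sketched under the guess that a number might contain commas or a decimal point (for example "1,024"): widen the numeric groups from \d+ to a character class and normalize afterwards. The method name parse_bookmarks_loose is hypothetical, and spaces inside a number are still not handled.

```ruby
# Hypothetical variant: accepts digits, commas, and dots in the numeric fields.
def parse_bookmarks_loose(text)
  pattern = /^BookmarkTitle:\s*(.+?)\s*BookmarkLevel:\s*([\d,.]+)\s*BookmarkPageNumber:\s*([\d,.]+)\s*$/m

  text.scan(pattern).map do |title, level, page|
    # Strip thousands separators but keep the values as strings, so the
    # caller decides whether to_i or to_f is the right conversion.
    { title: title, level: level.delete(","), page_number: page.delete(",") }
  end
end

loose_sample = "BookmarkTitle: APPENDIX A:  DEFINITIONS\nBookmarkLevel: 1\nBookmarkPageNumber: 1,024\n"
parse_bookmarks_loose(loose_sample).each do |bookmark|
  puts "#{bookmark[:title]} starts on page #{bookmark[:page_number]}"
end
```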

Answered 2012-09-18T20:42:01.130