python - Parsing the Kindle "My Clippings.txt" file with regular expressions

Question

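I'm currently trying to use Python to parse my Kindle's notes file, so that I can keep my notes better organized than the chronological list the Kindle saves automatically. Unfortunately, I'm having trouble parsing the file with regular expressions. Here is my code so far: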

import re


def parse_file(in_file):
    read_file = open(in_file, 'r')
    file_lines = read_file.readlines()
    read_file.close()
    raw_note = "".join(file_lines)

    # Regex parts
    title_regex = "(.+)"
    title_author_regex = "(.+) \((.+)\)"

    loc_norange_regex = "(.+) (Location|on Page) ([0-9]+)"
    loc_range_regex = "(.+) (Location|on Page) ([0-9]+)-([0-9]+)"

    date_regex = "([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)"  # Date
    time_regex = "([0-9]+):([0-9]+) (AM|PM)"  # Time

    content_regex = "(.*)"
    footer_regex = "=+"

    nl_re = "\r*\n"

    # No author
    regex_noauthor_str =\
    title_regex + nl_re +\
    "- Your " + loc_range_regex + " | Added on " +\
    date_regex + ", " + time_regex + nl_re +\
    content_regex + nl_re +\
    footer_regex

    regex_noauthor = re.compile(regex_noauthor_str)
    print regex_noauthor.findall(raw_note)

parse_file("testnotes")

Here are the contents of "testnotes":

Title
- Your Highlight Location 3360-3362 | Added on Wednesday, March 21, 2012, 12:16 AM

Note content goes here
==========

What I want is:

[('Title', 'Highlight', 'Location', '3360', '3362', 'Wednesday', 'March', '21', '2012', '12', '16', 'AM',

But when I run the program, I get:

[('Title', 'Highlight', 'Location', '3360', '3362', '', '', '', '', '', '', '', '')]

I'm fairly new to regular expressions, but I feel like this should be fairly simple.

score 2 · Accepted Answer

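You need to escape the | in "- Your " + loc_range_regex + " | Added on " +\

Change it to: "- Your " + loc_range_regex + " \| Added on " +\

| is the OR operator in regular expressions. Left unescaped, it splits the whole pattern into two alternatives, so the match ends after the location range and every later group comes back empty.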
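The difference can be seen on a single line of the sample. The following is a minimal sketch, not the asker's full pattern: the shortened patterns and the names `line`, `unescaped` and `escaped` are illustrative assumptions.

```python
import re

line = "- Your Highlight Location 3360-3362 | Added on Wednesday, March 21, 2012, 12:16 AM"

# Unescaped: "|" is alternation, so this means
# "Location N-N " OR " Added on WORD", each matched separately.
unescaped = re.compile(r"Location ([0-9]+)-([0-9]+) | Added on ([a-zA-Z]+)")

# Escaped: "\|" matches a literal pipe character, so the pattern is one sequence.
escaped = re.compile(r"Location ([0-9]+)-([0-9]+) \| Added on ([a-zA-Z]+)")

unescaped_result = unescaped.findall(line)
escaped_result = escaped.findall(line)

print(unescaped_result)  # [('3360', '3362', ''), ('', '', 'Wednesday')]
print(escaped_result)    # [('3360', '3362', 'Wednesday')]
```

With the unescaped pipe, each alternative fills only its own groups and leaves the others as empty strings, which is the same symptom as in the question's output.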

score 2 · Accepted Answer


When you write " | Added on ", you need to escape the |. Replace that string with " \| Added on ".

Answered 2013-06-05T18:43:43.820
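For reference, here is a hedged sketch of the question's pattern with the pipe escaped, written with raw strings. Two things in it are assumptions rather than part of the answers: the sample in the question shows a blank line between the metadata line and the note text, so an optional blank-line group is added, and `[^\r\n]` is used for the title and content so a trailing `\r` is not captured. The names `sample`, `NL` and `pattern` are illustrative.

```python
import re

# The question's sample entry, with "\n" line endings.
sample = (
    "Title\n"
    "- Your Highlight Location 3360-3362 | Added on Wednesday, March 21, 2012, 12:16 AM\n"
    "\n"
    "Note content goes here\n"
    "==========\n"
)

NL = r"\r?\n"  # accepts both Unix and Windows line endings

pattern = re.compile(
    r"([^\r\n]+)" + NL                                             # title
    + r"- Your (.+) (Location|on Page) ([0-9]+)-([0-9]+)"          # kind, location range
    + r" \| Added on "                                             # literal pipe, escaped
    + r"([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+), "             # date
    + r"([0-9]+):([0-9]+) (AM|PM)" + NL                            # time
    + r"(?:" + NL + r")?"                                          # optional blank line (assumption)
    + r"([^\r\n]*)" + NL                                           # note content (single line)
    + r"=+"                                                        # footer
)

matches = pattern.findall(sample)
print(matches)
```

On the sample this yields one tuple of 13 strings, with the note text as the last element. Multi-line notes and single-location entries (no range) are not covered by this sketch.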

score 0 · Accepted Answer

If anyone needs an update on this, the following works in 2017 with Paperwhite & Voyage Kindles: https://gist.github.com/laffan/7b945d256028d2ffaacd4d99be40ca34
