regex - Can I use a regular expression to copy a heading to each entry until the next heading? (Hyperlinking endnotes in an ebook)

Question

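OK, regex ninjas. I'm trying to devise a pattern to add hyperlinks to the endnotes in an ePub ebook XHTML file. The problem is that the numbering restarts within each chapter, so I need to add a unique identifier to each anchor name so that a hash link can point to it.

Given a (much simplified) list like this: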

<h2>Introduction</h2>
<p> 1 Endnote entry number one.</p>
<p> 2 Endnote entry number two.</p>
<p> 3 Endnote entry number three.</p>
<p> 4 Endnote entry number four.</p>

<h2>Chapter 1: The Beginning</h2>
<p> 1 Endnote entry number one.</p>
<p> 2 Endnote entry number two.</p>
<p> 3 Endnote entry number three.</p>
<p> 4 Endnote entry number four.</p>

I need to turn it into something like this:

<h2>Introduction</h2>
<a name="endnote-introduction-1"></a><p> 1 Endnote entry number one.</p>
<a name="endnote-introduction-2"></a><p> 2 Endnote entry number two.</p>
<a name="endnote-introduction-3"></a><p> 3 Endnote entry number three.</p>
<a name="endnote-introduction-4"></a><p> 4 Endnote entry number four.</p>

<h2>Chapter 1: The Beginning</h2>
<a name="endnote-chapter-1-the-beginning-1"></a><p> 1 Endnote entry number one.</p>
<a name="endnote-chapter-1-the-beginning-2"></a><p> 2 Endnote entry number two.</p>
<a name="endnote-chapter-1-the-beginning-3"></a><p> 3 Endnote entry number three.</p>
<a name="endnote-chapter-1-the-beginning-4"></a><p> 4 Endnote entry number four.</p>

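Obviously, a similar search will be needed in the actual text of the book, where each endnote reference will link to endnotes.xhtml#endnote-introduction-1 and so on.

The biggest obstacle is that each match begins after the previous match ends, so unless you use recursion you can't match the same piece of text (in this case, the heading) for more than one entry. So far, however, my attempts at recursion have produced only endless loops.

I'm using TextWrangler's grep engine, but if you have a solution in a different editor (such as vim), that's fine too.

Thanks!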

score 1 · Accepted Answer

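I think this would be tricky to accomplish in a text editor, since it requires a two-step process. First you need to divide the file into chapters, then you need to process the contents of each chapter. Assuming that an "endnote paragraph" (where you wish to add an anchor) is defined as a paragraph whose first word is an integer, this PHP script will do what you need.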

<?php
$data = file_get_contents('testdata.txt');
$output = processBook($data);
file_put_contents('testdata_out.txt', $output);
echo $output;

// Main function to process book adding endnote anchors.
function processBook($text) {
    $re_chap = '%
        # Regex 1: Get Chapter.
        <h2>([^<>]+)</h2>  # $1: Chapter title.
        (                  # $2: Chapter contents.
          .+?              # Contents are everything up to
          (?=<h2>|$)       # next chapter or end of file.
        )                  # End $2: Chapter contents.
        %six';
    // Match and process each chapter using callback function.
    $text = preg_replace_callback($re_chap, '_cb_chap', $text);
    return $text;
}
// Callback function to process each chapter.
function _cb_chap($matches) {
    // Build ID from H2 title contents.
    // Trim leading and trailing whitespace from title.
    $baseid = trim($matches[1]);
    // Strip everything except spaces and alphanumerics.
    $baseid = preg_replace('/[^ A-Za-z0-9]/', '', $baseid);
    // Add prefix and convert each run of spaces to a single dash.
    $baseid = 'endnote-'. preg_replace('/ +/', '-', $baseid);
    // Convert to lowercase.
    $baseid = strtolower($baseid);
    $text = preg_replace(
                '/(<p>\s*)(\d+)\b/',
                '<a name="'. $baseid .'-$2"></a>$1$2',
                $matches[2]);
    return '<h2>'. $matches[1] .'</h2>'. $text;

}
?>

This script correctly handles your example data.
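For readers who would rather not run PHP, the same two-step idea (match each chapter, then rewrite the numbered paragraphs inside it from a callback) can be sketched in Python. This is only an illustration under the same assumption as above, namely that an endnote paragraph starts with an integer; the names `slugify` and `add_anchors` are made up for this sketch and do not come from the original script.

```python
import re

# Step 1 pattern: a heading plus everything up to the next heading or end of input.
CHAPTER_RE = re.compile(
    r"<h2>([^<>]+)</h2>"   # group 1: chapter title
    r"(.+?)"               # group 2: chapter contents, as little as possible
    r"(?=<h2>|\Z)",        # stop before the next heading or at end of input
    re.DOTALL | re.IGNORECASE,
)
# Step 2 pattern: a paragraph whose first word is an integer.
NOTE_RE = re.compile(r"(<p>\s*)(\d+)\b")


def slugify(title):
    """Turn 'Chapter 1: The Beginning' into 'chapter-1-the-beginning'."""
    cleaned = re.sub(r"[^ A-Za-z0-9]", "", title.strip())
    return re.sub(r" +", "-", cleaned).lower()


def add_anchors(text):
    """Insert a chapter-specific <a name="..."> before each numbered paragraph."""
    def per_chapter(match):
        title, body = match.group(1), match.group(2)
        base = "endnote-" + slugify(title)
        # A function replacement avoids any escaping issues in the anchor text.
        body = NOTE_RE.sub(
            lambda m: '<a name="%s-%s"></a>%s%s'
            % (base, m.group(2), m.group(1), m.group(2)),
            body,
        )
        return "<h2>" + title + "</h2>" + body

    return CHAPTER_RE.sub(per_chapter, text)
```

Calling `add_anchors` on the contents of the endnotes file returns the rewritten markup as a string. Anything before the first `<h2>` is left untouched, because the chapter pattern only starts matching at a heading.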

score 1 · Accepted Answer

A bit of awk might do the trick:

Create the following script (I've named it add_endnote_tags.awk):

/^<h2>/ {
    i = 0;
    chapter_name = $0;
    gsub(/<[^>]+>/, "", chapter_name);
    chapter_name = tolower(chapter_name);
    # Keep digits as well, so "Chapter 1" and "Chapter 2" get distinct names.
    gsub(/[^a-z0-9]+/, "-", chapter_name);
    print;
}

/^<p>/ {
    i = i + 1;
    printf("<a name=\"endnote-%s-%d\"></a>%s\n", chapter_name, i, $0);
}

$0 !~ /^<h2>/ && $0 !~ /^<p>/ {
    print;
}

Then use it to process your file:

awk -f add_endnote_tags.awk < source_file.xml > dest_file.xml

Hope that helps. If you are on a Windows platform, you may need to install awk first, either by installing cygwin with its awk package or by downloading gawk for Windows.
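If awk is not an option, roughly the same single-pass idea (remember the most recent heading, then tag each paragraph that follows it) can be sketched in Python. This is an illustration, not a port of the script above: it assumes each heading and paragraph sits on its own line, as in the sample, and it takes the note number from the paragraph text instead of counting paragraphs. The function name `tag_endnotes` is hypothetical.

```python
import re

HEADING_RE = re.compile(r"^<h2>(.*?)</h2>")
NOTE_RE = re.compile(r"^<p>\s*(\d+)\b")


def tag_endnotes(lines):
    """Yield each line, prefixing numbered <p> lines with a chapter-specific anchor."""
    chapter = ""  # slug of the most recent <h2> heading seen so far
    for line in lines:
        heading = HEADING_RE.match(line)
        note = NOTE_RE.match(line)
        if heading:
            # Lowercase, then collapse every run of non-alphanumerics into one dash.
            slug = re.sub(r"[^a-z0-9]+", "-", heading.group(1).lower())
            chapter = slug.strip("-")
        elif note and chapter:
            line = '<a name="endnote-%s-%s"></a>%s' % (chapter, note.group(1), line)
        yield line
```

Because the function accepts any iterable of lines, it can be fed an open file object and its output passed to `writelines` on another. Paragraphs that do not start with a number, and anything before the first heading, pass through unchanged.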
