regex - How to use a regular expression to add `\macro{}` around the first occurrence of each match from a list

Question

I have a list of words, list.txt, like this:

fish
squirrel
bird
tree
mountain

I also have a file, text.txt, containing paragraphs like this:

The fish ate the birds.
The squirrel lived in the tree on the mountain.
The fish did not like eating squirrels as they lived too high in the trees.

I need to mark the first occurrence in text.txt of every word from list.txt with the TeX code \macro{}, so that the output looks like this:

The \macro{fish} ate the \macro{bird}s.
The \macro{squirrel} lived in the \macro{tree} on the \macro{mountain}.
The fish did not like eating squirrels as they lived too high in the trees.

How can I add \macro{} around the first occurrence of each word from the list, using Bash?

score 2 · Accepted Answer

Code for GNU sed:

$ sed -nr 's#(\w+)#s/\1/\1/;T\1;x;s/\1/\1/;x;t\1;x;s/.*/\& \1/;x;s/\1/\\\\macro\{\1\}/;:\1;$!N#p' list.txt|sed -rf - text.txt

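$ cat list.txt
fish
squirrel
bird
tree
mountain

$ cat text.txt
The fish ate the birds.
The squirrel lived in the tree on the mountain.
The fish did not like eating squirrels as they lived too high in the trees.

$ sed -nr 's#(\w+)#s/\1/\1/;T\1;x;s/\1/\1/;x;t\1;x;s/.*/\& \1/;x;s/\1/\\\\macro\{\1\}/;:\1;$!N#p' list.txt|sed -rf - text.txt
The \macro{fish} ate the \macro{bird}s.
The \macro{squirrel} lived in the \macro{tree} on the \macro{mountain}.
The fish did not like eating squirrels as they lived too high in the trees.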
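The one-liner is a two-stage pipeline: the first sed turns every word of list.txt into one line of sed script, and the second sed (-rf -) reads that script from standard input and applies it to text.txt. As a debugging aid, the sketch below runs only the first stage so the generated script can be inspected. The function name show_generated_script is hypothetical, and GNU sed is assumed (for -r and \w).

```shell
# Hypothetical helper: print the sed script that the first stage of the
# pipeline generates from a word list (one script line per word).
# Assumes GNU sed, as in the answer above.
show_generated_script() {
    sed -nr 's#(\w+)#s/\1/\1/;T\1;x;s/\1/\1/;x;t\1;x;s/.*/\& \1/;x;s/\1/\\\\macro\{\1\}/;:\1;$!N#p' "$1"
}
```

Calling show_generated_script list.txt prints the script without touching text.txt.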

score 1 · Accepted Answer

This will preserve white space (unlike any solution that assigns to fields) and won't incorrectly match the first 3 letters of "there" when looking for "the" (unlike any solution that doesn't enclose "word" in word delimiters "\<...\>" or equivalent):

$ gawk 'NR==FNR{list[$0];next}
    {
        for (word in list)
            if ( sub("\\<"word"\\>","\\macro{&}") )
                delete list[word]
    }
1' list.txt text.txt
The \macro{fish} ate the birds.
The \macro{squirrel} lived in the \macro{tree} on the \macro{mountain}.
The fish did not like eating squirrels as they lived too high in the trees.

The only caveat with this solution is that if "word" contains any RE metacharacters (e.g. *, +), they will be evaluated by sub(). Since you seem to be using English words, that shouldn't happen, but if it can, let us know, as you would need a different solution.
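A minimal sketch of that caveat, using a made-up sample line and the made-up word "a.c" (neither comes from the question): because sub() treats its first argument as a regular expression, the "." matches any character.

```shell
# Made-up sample line: the literal text "a.c" appears after the word "abc".
line='abc a.c'

# sub() treats "a.c" as a regex, so the leftmost match is "abc", not "a.c".
out=$(printf '%s\n' "$line" | awk '{ sub("a.c", "\\macro{&}") } 1')
printf '%s\n' "$out"
```

This should print \macro{abc} a.c, that is, the wrong word gets wrapped.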

I see you posted that partial matches actually are desirable (e.g. "the" should match the start of "theory") so then you want this:

$ awk 'NR==FNR{list[$0];next}
    {
        for (word in list)
            if ( sub(word,"\\macro{&}") )
                delete list[word]
    }
1' list.txt text.txt

as long as no RE metacharacters can appear in your matching words from list.txt, or this otherwise:

$ awk 'NR==FNR{list[$0];next}
    {
        for (word in list) {
            start = index($0,word)
            if ( start > 0 ) {
                $0 = substr($0,1,start-1) \
                     "\\macro{" word "}"  \
                     substr($0,start+length(word))
                delete list[word]
            }
        }
    }
1' list.txt text.txt

That last is the most robust solution as it does a string comparison rather than an RE comparison so is unaffected by RE metacharacters and also will not affect white space (which I know you said you don't care about right now).
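To illustrate the robustness claim, here is a sketch that wraps the index()-based approach in a shell function and tries it on a word containing RE metacharacters. The function name mark_first_literal and the sample word "c++" are assumptions for illustration, not part of the question.

```shell
# mark_first_literal LIST TEXT
# Hypothetical helper: wraps the first occurrence in TEXT of each line of
# LIST in \macro{...}, using plain string comparison (no regex).
mark_first_literal() {
    awk '
        NR == FNR { if ($0 != "") list[$0]; next }  # first file: collect words, skip blank lines
        {
            for (word in list) {
                start = index($0, word)             # literal search; 0 if not found
                if (start > 0) {
                    $0 = substr($0, 1, start - 1) "\\macro{" word "}" substr($0, start + length(word))
                    delete list[word]               # each word is marked only once
                }
            }
            print
        }
    ' "$1" "$2"
}
```

With a list containing c++ and fish, and a text whose first line is "I like c++ and fish.", the first line should come out as "I like \macro{c++} and \macro{fish}." and later occurrences stay untouched.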

score 1 · Accepted Answer

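I'm still new to Awk, but this seems to work. When looking for "prop", watch out for words like "propane" (and you can't restrict matching to exact words, because then "props" would not be changed to "\macro{prop}s"). You would need a better dictionary, and probably more than just Awk, to handle cases like that.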

NR==FNR {
    #Skip empty lines.
    if ($0 ~ /^$/)
        next;
    macros[$0] = "\\macro{"$0"}";
    next;
}
{
    for (name in macros) {
        n = name;
        #Sometimes a word may have a [ in it or other special chars.
        gsub(/[.[\(*+?{|^$]/, "[&]", n);
        if (sub(n, macros[name]))
            delete macros[name];
    }
    print;
}
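The "prop" versus "propane" pitfall mentioned above can be reproduced with a short sketch. The sample sentence is made up for illustration.

```shell
# Made-up sample line in which "propane" comes before "props".
line='The propane tank sat near the props.'

# Without word boundaries, the first match of "prop" is inside "propane".
out=$(printf '%s\n' "$line" | awk '{ sub("prop", "\\macro{&}") } 1')
printf '%s\n' "$out"
```

This should print "The \macro{prop}ane tank sat near the props.", so the word that was meant ("props") is never marked.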

score 1 · Accepted Answer

What an interesting question.

I was able to come up with the following awk for you:

awk 'NR==FNR{a[$1]=$1;next} 
   {for (v in a) if (a[v] != "") {r=sub(v, "\\macro{" v "}"); if (r) a[v]=""}
   }'1 list.txt text.txt
