regex - Notepad++: Extract all words containing a pair of parentheses from a very long string

Question

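I have a large .txt file written in German. It is a transcript of many people speaking. Whenever a shortened form of a word is used, the correct form is written next to or inside it, enclosed in parentheses. I want to extract every such instance in this .txt as a list. I have tried a few regular expressions, but I can't seem to get any of them to highlight the whole "word".

Any ideas?

Here is part of the .txt; the words I want to extract are the ones with parentheses: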

Ich hab(e) am Achtundzwanzigsten achten neunzehnhundertneunzig Geburtstag. Und wenn ich mich beschreiben sollte, dann muss ich sagen freundlich, unkompliziert und bescheiden. Hallo wie gehts (geh es) dir. Na was machst (machst du) den jetzt heut(e). Hm, hm, ist noch? Stör(e) ich? Ja das is(t), hm, also, würd(e) ich das so sagen....

Thanks!

score 2 · Accepted Answer

If I have understood your needs correctly, how about:

(\w+\(\w+\))| \([\w\s]+\)

Explanation:

The regular expression:

(?-imsx:(\w+\(\w+\))| \([\w\s]+\))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \(                       '('
----------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \)                       ')'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
                           ' '
----------------------------------------------------------------------
  \(                       '('
----------------------------------------------------------------------
  [\w\s]+                  any character of: word characters (a-z, A-
                           Z, 0-9, _), whitespace (\n, \r, \t, \f,
                           and " ") (1 or more times (matching the
                           most amount possible))
----------------------------------------------------------------------
  \)                       ')'
----------------------------------------------------------------------
)                        end of grouping
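To see what this pattern actually picks up, here is a minimal sketch in Python rather than Notepad++. It assumes Python 3, where `\w` also matches letters such as ö and ü, and it uses an abbreviated excerpt of the sample text; the variable names are only illustrative.

```python
import re

# Abbreviated excerpt of the sample text from the question (assumption:
# shortened here only to keep the example small).
text = (
    "Ich hab(e) am Achtundzwanzigsten Geburtstag. "
    "Na was machst (machst du) den jetzt heut(e). "
    "Stör(e) ich? Ja das is(t), würd(e) ich das so sagen...."
)

# The pattern from the answer above, unchanged.
pattern = re.compile(r"(\w+\(\w+\))| \([\w\s]+\)")

# group(0) is the whole match, whichever alternative matched.
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
```

Note that the second alternative starts with a literal space, so a standalone group comes back as " (machst du)" with the leading space and without the word in front of it.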

score 0 · Accepted Answer

Notepad++ uses a regex flavor that may not be POSIX compliant and therefore does not support word boundaries (at least v5.9.2 does not). Try this regular expression instead:

[^\s]*\([^)]*\)[^\s\.\,\;\?\!]*

[^\s]* : finds the beginning of the word by matching any run of non-whitespace characters (no tab, space, etc.) directly before the parenthesis
\([^)]*\) : matches the parentheses and their content
[^\s\.\,\;\?\!]* : finds the end of the word by matching any trailing characters that are neither whitespace nor one of the listed punctuation marks

You can extend this by adding more punctuation marks before or after the word (like quotes).
Successfully tested this on Notepad++ v5.9.2 on your sample text.
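For anyone who wants to check the pattern outside Notepad++, the following is a rough sketch in Python. It assumes Python 3 and an abbreviated excerpt of the sample text; Python's `re` engine is not the same as the Notepad++ one, so treat it as an approximation of the behaviour described above.

```python
import re

# Abbreviated excerpt of the sample text from the question (assumption:
# shortened here only to keep the example small).
text = (
    "Ich hab(e) am Achtundzwanzigsten Geburtstag. "
    "Na was machst (machst du) den jetzt heut(e). "
    "Stör(e) ich? Ja das is(t), würd(e) ich das so sagen...."
)

# The pattern from the answer above, unchanged. It has no capture groups,
# so findall returns the full text of each match.
pattern = r"[^\s]*\([^)]*\)[^\s\.\,\;\?\!]*"

matches = re.findall(pattern, text)
print(matches)
```

The trailing punctuation after "heut(e)." and "is(t)," is left out of the matches because those characters are excluded by the last character class.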

score 0 · Accepted Answer

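This regex finds everything between ( and ), including the parentheses themselves, plus everything in front of the ( back to the preceding space character: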

[^ ]*\([^)]*\)

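Now, to turn your text into a nice list:

Open the Find/Replace dialog (Ctrl-H)
Find what: .*?([^ ]*\([^)]*\))
Replace with: \1\n
Select "Regular expression" and check ". matches newline"
Place the cursor at the beginning of the file (Ctrl-Home) and press "Replace All"
Ignore or delete the last line

Now you have a nice list with each of these words on its own line.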
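The same find-and-replace can be sketched in Python, which also shows why the last line has to be dropped. This assumes Python 3 and an abbreviated excerpt of the sample text; `re.DOTALL` stands in for the ". matches newline" checkbox, and the variable names are only illustrative.

```python
import re

# Abbreviated excerpt of the sample text from the question (assumption:
# shortened here only to keep the example small).
text = (
    "Ich hab(e) am Achtundzwanzigsten Geburtstag. "
    "Na was machst (machst du) den jetzt heut(e). "
    "Stör(e) ich? Ja das is(t), würd(e) ich das so sagen...."
)

# Same "Find what" and "Replace with" values as in the steps above.
# The lazy .*? swallows everything up to the next parenthesized word,
# and only the captured word plus a newline is kept.
replaced = re.sub(r".*?([^ ]*\([^)]*\))", r"\1\n", text, flags=re.DOTALL)

lines = replaced.split("\n")
leftover = lines[-1]   # text after the last match, left untouched
words = lines[:-1]     # the extracted list, one word per line

print("\n".join(words))
```

The leftover line is whatever follows the last parenthesized word, since nothing after it can match the pattern.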
