regex - I can't find the right regular expression

Question

I have a file like the following (same layout, but much longer):

LSE           ZTX                       
    SWX         ZURN                    
LSE           ZYT
NYSE                            CGI

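Each line contains 2 words (e.g. LSE ZTX), with optional spaces and/or tabs at the beginning, at the end, and in between. Could someone help me match these two words with a regex? Following the example, I would like LSE in $1 and ZTX in $2 for the first line, SWX in $1 and ZURN in $2 for the second line, and so on. I have tried things like: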

$line =~ /(\t|\s)*?(.*?)(\t|\s)*?(.*?)/msgi;
$line =~ /[\t*\s*]?(.*?)[\t*\s*]?(.*?)/msgi;

I don't know how to express that there may be spaces or tabs (or a mix of both, e.g. \t\s\t).

score 3 · Accepted Answer

Since there are always two words and you don't need to match the whole line, the simplest regex is:

/(\w+)\s+(\w+)/
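
As a minimal sketch of how this pattern behaves on lines shaped like the ones in the question (the parse_pair helper name is hypothetical, not part of the answer):

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first two \w+ words of a line,
# or an empty list if the line does not contain two words.
sub parse_pair {
    my ($line) = @_;
    # The pattern is unanchored, so leading whitespace is simply skipped.
    # In list context the match returns the captured groups ($1, $2).
    return $line =~ /(\w+)\s+(\w+)/;
}

for my $line ("LSE \t ZTX   ", "    SWX\t\tZURN  ", "NYSE          CGI") {
    my ($first, $second) = parse_pair($line);
    print "$first / $second\n";
}
```

Because nothing is anchored, leading and trailing whitespace never needs to be mentioned in the pattern; \s+ covers any mix of spaces and tabs between the two words.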

score 3 · Accepted Answer

If you only want to match the first two words, the most basic approach is to match any sequence of non-whitespace characters:

my ($word1, $word2) = $line =~ /\S+/g;

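This captures the first two words of $line into the variables (if they exist). Note that no parentheses are needed in the regex when the /g modifier is used. If you want to capture all matches, assign to an array instead.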
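
The array form mentioned above could look like the following sketch (the words_in helper and the sample strings are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: every run of non-whitespace characters on the line.
sub words_in {
    my ($line) = @_;
    my @found = $line =~ /\S+/g;   # list context plus /g returns all matches
    return @found;
}

my @words = words_in("  NYSE \t\t CGI  extra ");
print scalar(@words), " words: @words\n";

# Assigning to two scalars keeps only the first two matches.
my ($word1, $word2) = words_in("LSE\tZTX");
print "$word1, $word2\n";
```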

score 1 · Accepted Answer

\s also matches tabs, so your regex can look like this:

$line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;

The first word is in group 1 ($1), the second in $2.

You can change [A-Z] to whatever character class suits your data better.

Here is the explanation from YAPE::Regex::Explain:

The regular expression:

(?-imsx:^\s*([A-Z]+)\s+([A-Z]+))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

score 1 · Accepted Answer

I think this is what you want:

^\s*([A-Z]+)\s+([A-Z]+)

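See it on Regexr: you will find the first code of each line in group 1 and the second code in group 2. \s is a whitespace character; it includes, for example, spaces, tabs, and newlines.

In Perl it looks like this: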

($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/i;

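I assume you are reading the text file line by line, so you need neither the s and m modifiers nor g.

If the codes are not just ASCII letters, replace [A-Z] with \p{L}. \p{L} is a Unicode property that matches any letter from any language.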
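
A minimal sketch of the \p{L} variant, assuming the codes may contain non-ASCII letters (the parse_codes helper and the sample line are invented for illustration):

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';   # so non-ASCII letters print cleanly

# Hypothetical helper: same shape as the [A-Z] pattern, but any Unicode
# letter is accepted in a code.
sub parse_codes {
    my ($line) = @_;
    return $line =~ /^\s*(\p{L}+)\s+(\p{L}+)/;
}

# Made-up sample line; \x{D6} is the letter O with diaeresis and
# \x{DC} is the letter U with diaeresis.
my ($code1, $code2) = parse_codes("  B\x{D6}RSE \t Z\x{DC}RN ");
print "$code1 / $code2\n";
```

Note that \p{L} matches letters only, so a line made of digits such as "123 456" would not match.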

score 1 · Accepted Answer

With the "multiline" option, this regex:

^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$

will give you N matches, each containing two named groups: word1 and word2.
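
In Perl, the "multiline" option corresponds to the /m modifier, and the named groups are available through the %+ hash (Perl 5.10 or later). A sketch under those assumptions, with a hypothetical pairs_in helper that scans a whole multi-line string:

```perl
use strict;
use warnings;

# Hypothetical helper: scans a multi-line string and returns one
# [word1, word2] pair per matching line.
sub pairs_in {
    my ($text) = @_;
    my @pairs;
    # /m lets ^ and $ match at every line boundary; /g walks all matches.
    while ($text =~ /^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$/mg) {
        push @pairs, [ $+{word1}, $+{word2} ];
    }
    return @pairs;
}

my $text = "LSE \t ZTX   \n    SWX\tZURN  \nNYSE      CGI\n";
for my $pair (pairs_in($text)) {
    print "word1=$pair->[0] word2=$pair->[1]\n";
}
```

One caveat: \s also matches newlines, so a line holding a single word could be paired with the first word of the next line. If that matters for your data, a horizontal-whitespace class such as \h in place of \s is worth considering.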

score 1 · Accepted Answer

^\s*([A-Z]{3,4})\s+([A-Z]{3,4})$

What this does:

^             // Matches the beginning of a string
\s*           // Matches a space/tab character zero or more times
([A-Z]{3,4})  // Matches any letter A-Z either 3 or 4 times and captures to $1
\s+           // Then matches at least one tab or space
([A-Z]{3,4})  // Matches any letter A-Z either 3 or 4 times and captures to $2
$             // Matches the end of a string
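
One thing worth checking: the sample lines in the question end with trailing spaces, and a $ placed directly after the second group does not allow for them. A minimal sketch, assuming trailing whitespace should be tolerated, adds \s* before $ (both helper names are hypothetical):

```perl
use strict;
use warnings;

# The pattern as given: nothing may follow the second code except a newline.
sub parse_strict {
    my ($line) = @_;
    return $line =~ /^\s*([A-Z]{3,4})\s+([A-Z]{3,4})$/;
}

# Assumed variant: \s* before $ tolerates trailing spaces and tabs.
sub parse_lenient {
    my ($line) = @_;
    return $line =~ /^\s*([A-Z]{3,4})\s+([A-Z]{3,4})\s*$/;
}

my $line    = "LSE           ZTX      ";   # note the trailing spaces
my @strict  = parse_strict($line);
my @lenient = parse_lenient($line);
print "strict:  ", scalar(@strict),  " captures\n";
print "lenient: ", scalar(@lenient), " captures (@lenient)\n";
```

The {3,4} bound is a design choice as well: it rejects codes shorter than 3 or longer than 4 letters, which fits the sample but may be too narrow for the full file.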

score 0 · Accepted Answer

You can use split here:

use strict;
use warnings;

while (<DATA>) {
    my ( $word1, $word2 ) = split;
    print "($word1, $word2)\n";
}

__DATA__
LSE         ZTX                       
    SWX         ZURN                    
LSE         ZYT
NYSE                            CGI

Output:

(LSE, ZTX)
(SWX, ZURN)
(LSE, ZYT)
(NYSE, CGI)

score -1 · Accepted Answer

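Assuming the whitespace at the start of the line is what you use to identify the codes you want, try this:

Split your string at newlines, then try this regex: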

^\s+(\w+\s+){2}$

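This only matches lines that start with some whitespace, followed by a word, some whitespace, and a second word, and that then end with some whitespace.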

# ^           --> String start
# \s+         --> One or more whitespace characters
# (\w+\s+){2} --> A (word followed by some space)x2
# $           --> String end.

However, if you want to capture the codes individually, try the following:

$line =~ /^\s*(\w+)\s+(\w+)/;

# \s*   --> Zero or more whitespace,
# (\w+) --> Followed by a word (group #1),
# \s+   --> Followed by some whitespace,
# (\w+) --> Followed by a word (group #2),

score -2 · Accepted Answer

This will match all of your codes:

/[A-Z]+/

Answered 2013-01-09T08:13:38.780
