regex - How to add a dash to parts of a string in a Perl regex after matching a pattern

Question

I have data of this type. Please help me; I am new to regular expressions, so please explain each step in your answer. Thanks.

7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN

7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN

7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN

7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD

From the lines above I only want to extract this data:

7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN

7210316_W1A_MOTORTRAEGER_VORN_AUSSEN

7210243_U1A_MOTORTRAEGER_VORN_INNEN

7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD

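Then, if the code after the underscore (AX1A) contains two consecutive letters, they should be written as AX_. Where it contains a single digit and a single letter, these become -1_ and -A_. After applying this pattern it becomes AX_-1_-A_, and all other data should stay the same.

Likewise, the next line contains W1A. It starts with the single letter W, which should be converted to -W_. The next character is a single digit, so it is converted by the same pattern to -1_, and the last character is treated the same way. So it becomes -W_-1_-A_.

We are only interested in applying the regex to the part that follows the number and its underscore: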

_AX1A_

_W1A_

_U1A_

_AV21NA_

The output should be:

7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN

7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN

7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN

7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD

score 1 · Accepted Answer

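I don't know all the details of what you need to strip out, but I'll extrapolate and let you clarify if this doesn't quite do what you need.

For the first step, removing 1X50_RE_ and 1X50_LI_, you can search for these strings and replace them with nothing.

Next, to break your second letter/number code into pieces, you can use a pair of substitutions, each using a look-ahead. However, since you only want to touch the second chunk, I would first split the whole line, process the second chunk, and then join the pieces back together.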

while (<$input>) {

    # Replace the 1X50_RE/LI_ bits with nothing (i.e., delete them)
    s/1X50_(RE|LI)_//;

    my @pieces = split /_/; # split the line into pieces at each underscore

    # Just working with the second chunk. /g, means do it for all matches found
    $pieces[1] =~ s/([A-Z])(?=[0-9])/$1_-/g; # Convert AX1 -> AX_-1
    $pieces[1] =~ s/([0-9])(?=[A-Z])/$1_-/g; # Convert 1A -> 1_-A

    # Join the pieces back together again
    $_ = join '_', @pieces;

    print;
}

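$_ is the variable that many Perl operations use if you don't specify one. <$input> reads the next line of the filehandle named $input into $_. The s///, split, and print functions all act on $_ when not given anything else. The =~ operator is how you tell Perl to use $pieces[1] (or whatever variable you are working on) instead of $_ for regular expression operations. (For split or print, you pass the variable as an argument, so split /_/ is the same as split /_/, $_ and print is the same as print $_.)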
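To make the role of $_ concrete, here is a minimal sketch that runs the same steps twice, once relying on the implicit $_ and once naming the variable everywhere. The helper names strip_implicit and strip_explicit are made up for this illustration and are not part of the answer's code.

```perl
use strict;
use warnings;

# Hypothetical helper: relies on the implicit $_ the way the loop above does.
sub strip_implicit {
    local $_ = shift;          # put the line into $_ for the duration of the sub
    s/1X50_(RE|LI)_//;         # s/// acts on $_ when no =~ is given
    my @pieces = split /_/;    # same as: split /_/, $_
    return join '_', @pieces;
}

# Hypothetical helper: the same steps with the variable named explicitly.
sub strip_explicit {
    my ($line) = @_;
    $line =~ s/1X50_(RE|LI)_//;        # =~ points s/// at $line instead of $_
    my @pieces = split /_/, $line;
    return join '_', @pieces;
}

my $sample = '7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN';
print strip_implicit($sample), "\n";   # 7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN
print strip_explicit($sample), "\n";   # 7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN
```

Both forms should behave identically; the explicit one is usually easier to read once a script grows beyond a short loop.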

Oh, and to explain the regular expressions:

s/1X50_(RE|LI)_//;

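This matches anything containing 1X50_RE_ or 1X50_LI_ (the (|) is a list of alternatives) and replaces it with nothing (the // at the end is empty).

Looking at one of the other lines: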

s/([A-Z])(?=[0-9])/$1_-/g;

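The plain parentheses (...) around [A-Z] cause $1 to be set to whatever is matched inside them (in this case a letter, A-Z). The (?=...) parentheses create a zero-width positive look-ahead assertion. This means the regex only matches if the next thing in the string matches the expression inside (a digit, 0-9), but that part is not included in the text that gets replaced.

/$1_-/ causes the matched part of the string, [A-Z], to be replaced with the value captured by the parentheses (...), followed by the _- you need, inserted just before the look-ahead match [0-9].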
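As a small illustration of the two look-ahead substitutions, the sketch below applies them to the three code chunks from the question. The helper name dash_boundaries is hypothetical. Note that the result inserts _- at every letter/digit boundary, so for W1A and AV21NA it differs from the output requested in the question (no leading dash before W, and extra dashes in AV21NA).

```perl
use strict;
use warnings;

# Hypothetical helper: apply both look-ahead substitutions to one code chunk.
sub dash_boundaries {
    my ($chunk) = @_;
    $chunk =~ s/([A-Z])(?=[0-9])/$1_-/g;   # letter directly followed by a digit
    $chunk =~ s/([0-9])(?=[A-Z])/$1_-/g;   # digit directly followed by a letter
    return $chunk;
}

print dash_boundaries($_), "\n" for qw(AX1A W1A AV21NA);
# AX_-1_-A
# W_-1_-A
# AV_-21_-NA
```

Because the look-ahead consumes nothing, the digit (or letter) that follows stays in place and is still available to the second substitution.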

score 1 · Accepted Answer

Do you mean something like this:

while (<DATA>) {
    s/1X50_(LI|RE)_//;
    s/(\d+)_([A-Z])(\d)([A-Z])/$1_-$2_-$3_-$4/;
    s/(\d+)_([A-Z]{2})(\d)([A-Z])/$1_$2_-$3_-$4/;
    s/(\d+)_([A-Z]{1,2})(\d+)([A-Z]+)/$1_$2_$3_$4/;
    print;
}

__DATA__
7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD

Output:

7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN
7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD

score 1 · Accepted Answer

use strict;
use warnings;
use feature 'say'; # say() is not enabled by default

my $match 
    = qr/
    ( \d+          # group of digits
      _            # followed by an underscore
    )              # end group
    ( \p{Alpha}+ ) # group of alphas             
    ( \d+ )        # group of digits
    ( \p{Alpha}* ) # group of alphas
    ( \w+ )        # group of word characters
    /x
    ;

while ( my $record = <$input> ) { # record of input
    # match and capture
    if ( my ( $pre, $pre_alpha, $num, $post_alpha, $post ) = $record =~ m/$match/ ) {
        say $pre 
             # if the alpha has length 1, add a dash before it
          . ( length $pre_alpha == 1 ? '-' : '' )
            # then the alpha
          . $pre_alpha
            # then the underscore
          . '_'
            # test if the length of the number is 1 and the length of the 
            # trailing alpha string is 1 
          . ( length( $num ) == 1 && length( $post_alpha ) == 1
              # if true, apply a dash before each 
            ? "-$num\_-$post_alpha" 
              # otherwise treat as AV21NA in example.
            : "$num\_$post_alpha"
            )
          . $post
          ;

    }
}

score 1 · Accepted Answer

#!/usr/bin/perl -w
use strict;
while (<>) {
    next if /^\s*$/;
    chomp;
    ## Remove those parts of the line we do not want
    ## You do not specify what, if anything, is constant about
    ## the parts you do not want. One of the following cases should
    ## serve.

    ## i) Remove the string _1X50_ and the next characters between
    ## two underscores:
    s/_1X50_.+?_/_/;

    ## ii) keep the first 2 and last 3 sections of each line.
    ## Uncomment this line and comment the previous one to use this:
    #s/^(.+?_.+?)_.+_(.+_.+_.+)$/$1_$2/;

    ## The line now contains only those regions we are 
    ## interested in. Split on '_' to collect an array of the
    ## different parts (@a):
    my @a=split(/_/);

    ## $a[1] is the second string, eg AX1A,W1A etc.
    ## We search for zero or more letters, followed by zero or more digits,
    ## followed by zero or more letters. The 'i' modifier makes the match
    ## case insensitive and the 'g' modifier makes the search global; in list
    ## context the captured groups are returned into the @matches array. 
    my @matches=($a[1]=~/^([a-z]*)(\d*)([a-z]*)/ig);

    ## So, for each of the matched strings, if the length of the match
    ## is less than 2, add a '-' to the beginning of the string:
    foreach my $match (@matches) {
        if (length($match)<2) {
        $match="-" . $match;
        }
    }
    ## Now replace the original $a[1] with each string in
    ## @matches, connected by '_':
    $a[1]=join("_", @matches);

    ## Finally, build the string $kk by joining each element
    ## of the line (@a) by a '_', and print:
    my $kk=join("_", @a);
    print "$kk\n";
}

score -1 · Accepted Answer

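If you are a regex beginner, zostay's suggestion to split the line will probably make things easier. However, from a performance standpoint, avoiding the split is optimal. Here is how to do it without splitting: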

open my $in_file, '<', "filename" or die "Whoops!  Can't open file: $!";
while (<$in_file>)
{
     s/^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/-${1}-${2}-${3}/ 
          or print "line didn't match: $_";
     s/1X50_(LI|RE)_//;
     print;   # emit the rewritten line
}

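Breaking down the first pattern:

s/// is the search-and-replace operator.
^ matches the beginning of the line.
\d{7}_ matches seven digits followed by an underscore.
\K is the "keep" escape, which works like a look-behind: anything matched before it does not become part of the text being replaced.
() Each set of parentheses marks a chunk of the match to be captured. These are placed, in order, into the match variables $1, $2, and so on.
[A-Z]{1,2} matches one or two capital letters. You can probably work out what the other two parenthesized parts mean.
-${1}-${2}-${3} replaces what was matched with the first three match variables, each preceded by a dash. The only reason for the curly braces is to make it clear what the variable names are.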
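For comparison, here is a hedged sketch of a no-split variant that produces the output requested in the question. It assumes the rule is "a piece that is a single character gets a leading dash, and the pieces are joined with underscores", which is inferred from the question's examples rather than stated outright. The helper names dash_if_single and rewrite_line are hypothetical, and \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix a dash when the piece is a single character.
sub dash_if_single {
    my ($piece) = @_;
    return length($piece) == 1 ? "-$piece" : $piece;
}

# Hypothetical helper: rewrite one line without splitting it.
sub rewrite_line {
    my ($line) = @_;
    $line =~ s/1X50_(?:LI|RE)_//;
    # \K keeps the seven digits and underscore out of the replaced text;
    # /e evaluates the replacement as Perl code, so each captured piece
    # can be checked individually before the pieces are joined with '_'.
    $line =~ s{^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})(?=_)}
              {join '_', map { dash_if_single($_) } $1, $2, $3}e;
    return $line;
}

print rewrite_line($_), "\n" for (
    '7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN',
    '7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN',
    '7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD',
);
# 7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN
# 7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN
# 7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD
```

The /e modifier is a design choice here: it lets the dash decision be made per piece, which a fixed replacement string cannot express.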
