regex - Perl - Search a CSV for a specific string and extract the characters that follow it

Question

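I have a CSV file with 2 columns: an ID column and a free-text column. The ID column contains a 16-character alphanumeric id, but that may not be the only data in the cell: the cell may be blank, it may contain just the 16-character id, or it may contain a bunch of stuff with the following buried in it: "user_id=xxxxxxxxxxxxxxxxxx"

What I want is to somehow extract the 16-character id from the cells that have it. So I need to: (a) ignore blank cells, (b) extract the whole cell's contents if it holds just one contiguous 16-character string with no spaces in it, and (c) look for the pattern "user_id=" and then extract the 16 characters that immediately follow it.

I see plenty of Perl scripts for pattern matching, find/replace on strings, and so on, but I'm not sure how to apply different kinds of parsing, pattern search, and extraction one after another on the same column. As you may have guessed, I'm quite new to Perl.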

score 1 · Accepted Answer

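As I understand it, you want to (1) skip rows that contain nothing or don't meet your spec, (2) capture 16 non-space characters if they are the only content of the cell, and (3) capture the 16 non-space characters that follow the literal pattern "user_id=".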

If it is acceptable to capture space characters as well when they follow the "user_id=" literal, you can change \S to . in the appropriate place.
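
To make the effect of that change concrete, here is a minimal sketch comparing the two patterns on a made-up cell whose id contains spaces. The variable names $cell, $strict and $loose are only for illustration.

```perl
use strict;
use warnings;

# Illustrative input: the 16 characters after "user_id=" include spaces
my $cell = 'user_id=abcd fghij lmnop';

# \S{16} needs 16 consecutive non-space characters, so this does not match
my ($strict) = $cell =~ m/user_id=(\S{16})/;

# .{16} accepts any character except newline, spaces included
my ($loose) = $cell =~ m/user_id=(.{16})/;

print defined $strict ? "strict: $strict\n" : "strict: no match\n";
print defined $loose  ? "loose: $loose\n"   : "loose: no match\n";
```

Whether the looser pattern is appropriate depends on whether ids in your data can really contain spaces.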

My solution uses Text::CSV to handle the details of dealing with a CSV file. You could do it like this:

use strict;
use warnings;
use autodie;
use open ':encoding(utf8)';
use utf8;
use feature 'unicode_strings';
use Text::CSV;
binmode STDOUT, ':utf8';

my $csv = Text::CSV->new( {binary => 1} ) 
    or die "Cannot use CSV: " . Text::CSV->error_diag;

while( my $row = $csv->getline( \*DATA ) ) {
    my $column = $row->[0];
    if( $column =~ m/^(\S{16})$/ || $column =~ m/user_id=(\S{16})/ ) {
        print $1, "\n";
    }
}

__DATA__
abcdefghijklmnop
user_id=abcdefghijklmnop
abcd fghij lmnop
randomdatAuser_id=abcdefghijklmnopMorerandomdata
user_id=abcd fghij lmnop
randomdatAuser_id=abcd fghij lmnopMorerandomdata

In your own code you wouldn't use the DATA filehandle, but I assume you already know how to open a file.
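
In case it helps, a minimal sketch of the same loop reading from a real file. The helper name extract_ids and the way the file name is passed on the command line are assumptions for illustration; auto_diag is a Text::CSV option that reports parse errors as they happen.

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical helper: returns the ids found in the first column of a CSV file
sub extract_ids {
    my ($path) = @_;
    my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } )
        or die "Cannot use CSV: " . Text::CSV->error_diag;

    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";

    my @ids;
    while ( my $row = $csv->getline($fh) ) {
        my $column = $row->[0];
        next unless defined $column;    # skip rows with no first column
        if ( $column =~ m/^(\S{16})$/ || $column =~ m/user_id=(\S{16})/ ) {
            push @ids, $1;
        }
    }
    close $fh;
    return @ids;
}

# Usage (file name is an assumption): perl extract_ids.pl myfile.csv
if (@ARGV) {
    print "$_\n" for extract_ids( $ARGV[0] );
}
```

Collecting the ids in an array rather than printing them directly makes it easier to reuse them later, for example to write them to another file.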

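CSV is a deceptively simple format. Don't confuse its high readability with simplicity of parsing. When dealing with CSV, it is best to use a well-proven module to extract the columns. Other solutions may break on commas embedded in quotes, escaped commas, unbalanced quotes, and other anomalies that our brains fix for us on the fly but that make pure-regex solutions fragile.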
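
A small illustration of that point, using a made-up line whose first field contains a comma inside quotes. A naive split on commas breaks the quoted field apart, while Text::CSV's parse and fields methods return the two intended columns.

```perl
use strict;
use warnings;
use Text::CSV;

# Made-up line: the first field contains a comma inside quotes
my $line = '"note, user_id=abcdefghijklmnop",free text';

# Naive approach: split on every comma, quoted or not
my @naive = split /,/, $line;

# Text::CSV honours the quoting rules
my $csv = Text::CSV->new( { binary => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag;
$csv->parse($line) or die "Parse failed: " . $csv->error_diag;
my @fields = $csv->fields;

print scalar(@naive),  " fields from split\n";
print scalar(@fields), " fields from Text::CSV\n";
print "first field: $fields[0]\n";
```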

score 0 · Accepted Answer

Well, I can set you up with a basic file and regex commands that may do what you need (a basic format for someone unfamiliar with Perl):

use strict;
use warnings;

# open the CSV for reading as UTF-8; stop with a message if that fails
open my $fh, '<:encoding(UTF-8)', 'myfile.csv'
    or die "Cannot open myfile.csv: $!";
# "slurp" the file into an array, each element is a line
my @lines = <$fh>;
close $fh;
my @idArray;
foreach my $line (@lines) {
    # two captures: an optional "user_id=" prefix, which we can ignore,
    # and the 16-character id that must be followed by a comma
    # only read $2 when the match succeeded, otherwise it may be stale
    if ( $line =~ /^(user_id=|)([A-Za-z0-9]{16}),/ ) {
        # add the captured id to your final array
        push @idArray, $2;
    }
}

score 0 · Accepted Answer

For example, in the next example only lines 2 and 3 are valid, that is, cell 1 (column 1) contains either

a string exactly 16 characters long, or
"user=16charshere"

Anything else is invalid.

use 5.014;
use warnings;

while(<DATA>) {
    chomp;
    my($col1, @remainder) = split /\t/;
    say $2 if $col1 =~ m/^(|user=)(.{16})$/;
}
__DATA__
ToShort col2    not_valid
a123456789012345    col2    valid
user=b123456789012345   col2    valid
TooLongStringHereSoNotValidOne  col2    not_valid

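In this example the columns are TAB-separated, so the whitespace between columns in the DATA section must be literal tab characters for split /\t/ to work.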

score -1 · Accepted Answer

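Please provide (a) some sample data that can be used to test a solution, and (b) the code you have written so far for this problem.

However, you will probably want to loop over all rows of the table, split each row into fields, do all your work on the field in question, apply your business logic, and then write everything back.

Problem (c) is solved by $idField =~ /user_id=(.{16})/; my $id = $1;

If user_id always occurs at the start of a line, this does the trick: for (<FILE>) {/^user_id=(.{16})/; ...}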
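
One caveat with reading $1 right after an unchecked match: when the match fails, $1 may still hold the value from an earlier successful match. A minimal sketch of a safer form, capturing in list context instead; the field values here are made up.

```perl
use strict;
use warnings;

# Made-up field values: the second one has no user_id
my @fields = ( 'x user_id=abcdefghijklmnop y', 'nothing to see here' );

my @ids;
for my $idField (@fields) {
    # In list context a failed match returns an empty list,
    # so $id is undef rather than a leftover capture
    my ($id) = $idField =~ /user_id=(.{16})/;
    push @ids, $id if defined $id;
}

print "$_\n" for @ids;
```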
