I have a question I am hoping someone could help with...

I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).

The variable contains data such as these:

$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"

The only bits I am interested in from the above examples are:

@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger") 
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")

The problem I am having:

I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.

But what is the best way to make sure that I get the strings at the start (i.e. cat_dog) and end (i.e. chicken-pig) of the comma-separated list of animals, given that they are not prefixed/suffixed with a comma?

Also, as the variables will contain webpage content, there will inevitably be instances where a comma is immediately followed by a space and then another word, as that is the correct way to use commas in paragraphs and sentences...

For example:

Saturn was long thought to be the only ringed planet, however, this is now known not to be the case. 
                                                     ^        ^
                                                     |        |
                                    note the spaces here and here

I am not interested in any cases where the comma is followed by a space (as shown above).

I am only interested in cases where the comma DOES NOT have a space after it (ie cat_dog,horse,rabbit,chicken-pig)

I have tried a number of ways of doing this but cannot work out the best way to go about constructing the regular expression.


4 Answers

The following splits each match ($&) on a comma and pushes the results onto @matches:

my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;

while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
    push(@matches, split(/,/, $&));
}

print join("\n",@matches),"\n";
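
A variation on the loop above, as a minimal sketch: it captures each comma-joined run in $1 rather than reading $&, which as far as I know could slow down other matches on older Perls. The helper name extract_items is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): returns every item found in
# runs of two or more tokens joined by commas with no whitespace around them.
sub extract_items {
    my ($text) = @_;
    my @items;
    # $1 holds one whole run, e.g. "cat_dog,horse,rabbit,chicken-pig".
    while ($text =~ /([^,\s]+(?:,[^,\s]+)+)/g) {
        push @items, split /,/, $1;
    }
    return @items;
}

my @items = extract_items(
    "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
);
print join("\n", @items), "\n";
```

Ordinary prose commas (a comma followed by a space) never satisfy the pattern, because the token after the comma would have to start with whitespace.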
answered 2013-04-25T11:31:59.243


my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split;


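  1. split breaks the line (in $_) up on whitespace
  2. grep keeps only the elements that have no comma at either end but do have a comma inside
  3. map with split breaks each such element into its parts

This way you can easily adjust each part. For example, to reject two consecutive commas, add && !/,,/ inside the grep.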
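
The three steps, with the extra !/,,/ check included, could be sketched roughly as follows. The wrapper name comma_items and the sample call are assumptions added for illustration; the pipeline itself is the one-liner above applied to an explicit variable instead of $_.

```perl
use strict;
use warnings;

# Hypothetical wrapper (name is an assumption) around the split/grep/map chain.
sub comma_items {
    my ($line) = @_;
    my @tokens = split ' ', $line;                       # 1. split on whitespace
    my @lists  = grep { !/^,/ && !/,$/ && !/,,/ && /,/ } # 2. inner comma only,
                 @tokens;                                #    and no ",," runs
    return map { split /,/ } @lists;                     # 3. break into items
}

my @array = comma_items(
    "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"
);
print "$_\n" for @array;
```

A word such as "planet," from normal prose ends with a comma after the whitespace split, so the grep drops it.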

answered 2013-04-25T11:46:12.393


    use warnings;
    use strict;

    my @strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
        "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
        "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
        "Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
        "Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");

    my $regex = qr/
                \s #From your examples, it seems as if every
                   #comma separated list is preceded by a space.
                (
                    (?:
                        [^,\s]+ #Now, not a comma or a space for the
                                #terms of the list

                        ,       #followed by a comma
                    )+
                    [^,\s]+     #followed by one last term of the list
                )
                /x;

    # One array reference per input string; empty when no list was found.
    my @matches = map {
                      if ($_ =~ $regex) {
                          my $comma_sep_list = $1;
                          [split ',', $comma_sep_list];
                      }
                      else {
                          [];
                      }
                  } @strs;
answered 2013-04-25T11:46:36.867

$var =~ tr/ //s;    # squeeze runs of spaces into a single space
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
      push (@arr, $&);
}

(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,)      : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig
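
For readability, the three alternatives can be kept as separate named pieces and joined afterwards. This is only a sketch: the names $first, $middle, $last, $item and list_items are made up for illustration, and the patterns are copied unchanged from the breakdown above.

```perl
use strict;
use warnings;

# The three alternatives, one per position in the list (names are assumptions).
my $first  = qr/(?<!, )\b[^, ]+(?=,\S)/;   # item followed by ",<non-space>"
my $middle = qr/(?<=,)[^, ]+(?=,)/;        # item with a comma on both sides
my $last   = qr/(?<=\S,)[^, ]+\b(?! ,)/;   # item preceded by "<non-space>,"
my $item   = qr/$first|$middle|$last/;

# Hypothetical helper: returns all list items found in a string.
sub list_items {
    my ($var) = @_;
    $var =~ tr/ //s;                 # squeeze repeated spaces first
    my @arr = $var =~ /$item/g;      # list context with /g returns every match
    return @arr;
}

my @arr = list_items("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig");
print join("\n", @arr), "\n";
```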
answered 2013-04-25T11:34:03.943