perl - How to recursively search a directory and all subdirectories using Perl

Question

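I saw this link, which uses glob.

It's not quite what I want to do.

Here is my plan. To search a directory for any files whose names partially match a string, my function takes two arguments: a directory, say /home/username/sampledata, and a string, say data.

I give the user the option of passing a flag at run time that controls whether subdirectories are checked; currently, by default, the script does not include subdirectories.

The pseudocode for including subdirectories looks like this.

The array in which I save the file paths is global.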

  @fpaths;

  foo($dir);

  sub foo{
      get a tmp array of all files

      for ($i=0 ; $i<@tmp ; $i++) {
          next if ( $tmp[$i]is a hidden file and !$hidden) ; #hidden is a flag too

          if($tmp[$i] is file) {
               push (@fpaths, $dir.$tmp[$i]);
          }
          if($tmp[$i] is dir) {
               foo($dir.$tmp[$i]);
          }

       }
   }

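This seems solid.

What I hope to end up with is an array holding the full path name of every file.

The part I don't know how to do is getting the list of every file. Hopefully this can be done with glob.

I've been able to use opendir/readdir to read each file, and I could do that again if I knew how to check whether a result is a file or a directory.

So my questions are:

How do I use glob with a path name to get an array of every file/subdirectory?
How do I check whether an item in that previously found array is a directory or a file?

Thanks everyone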

score 9 · Accepted Answer

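I would use File::Find.

Note that $File::Find::name is the full path to the given file. This will include directories, since they are files too.

This is just a sample; the remaining details are left for the reader to work out.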

use warnings;
use strict;
use File::Find;

my $path = "/home/cblack/tests";

find(\&wanted, $path);

sub wanted {
   return if ! -e; 

   print "$File::Find::name\n" if $File::Find::name =~ /foo/;
   print "$File::Find::dir\n" if $File::Find::dir =~ /foo/;
}

Better yet, if you want to push all of these onto a list, you can do it like this:

use File::Find;

main();

sub main {
    my $path = "/home/cblack/Misc/Tests";
    my $dirs = [];
    my $files= [];
    my $wanted = sub { _wanted($dirs, $files) };

    find($wanted, $path);
    print "files: @$files\n";
    print "dirs: @$dirs\n";
}

sub _wanted {
   return if ! -e; 
   my ($dirs, $files) = @_;

   push( @$files, $File::Find::name ) if $File::Find::name=~ /foo/;
   push( @$dirs, $File::Find::dir ) if $File::Find::dir =~ /foo/;
}
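
As a rough sketch of how the same approach could be adapted to the question (match part of the file name, with optional flags for subdirectories and hidden entries), the following uses `$File::Find::prune` to stop File::Find from descending. The helper name `find_matching` and its option keys `recurse` and `hidden` are invented here for illustration; they are not part of File::Find.

```perl
use strict;
use warnings;
use File::Find;
use File::Basename qw(basename);

# Hypothetical helper: returns the full paths of plain files below $dir whose
# base name contains $needle. Illustrative options:
#   recurse => 1   descend into subdirectories
#   hidden  => 1   include dot-files and dot-directories
sub find_matching {
    my ($dir, $needle, %opt) = @_;
    my @fpaths;
    my $seen_top = 0;

    find({
        no_chdir => 1,
        wanted   => sub {
            return unless $seen_top++;    # first call is the start directory itself
            my $path = $File::Find::name;
            my $base = basename($path);

            if (!$opt{hidden} && $base =~ /^\./) {
                $File::Find::prune = 1;   # do not descend into a hidden directory
                return;
            }
            if (-d $path) {
                # Without the recurse flag, skip the contents of every subdirectory.
                $File::Find::prune = 1 unless $opt{recurse};
                return;
            }
            push @fpaths, $path if -f $path && index($base, $needle) >= 0;
        },
    }, $dir);

    return @fpaths;
}
```

A call such as `find_matching('/home/username/sampledata', 'data', recurse => 1)` would then return the matching paths as a list, so no global array is needed.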

score 3 · Accepted Answer

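I don't see why glob would solve your problem of how to check whether a directory entry is a file or a directory. If you've used readdir before, stick with it.
Don't forget that you have to handle links carefully, otherwise your recursion may never end.
Also remember that readdir returns . and .. as well as the real directory contents.
Use -f and -d to check whether a node name is a file or a directory, but remember that if its location isn't your current working directory, you have to fully qualify it by adding the path; otherwise you will be talking about a completely different node that probably doesn't exist.
Unless this is a learning experience, you're much better off using something ready-made and tested, like File::Find.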
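
For the learning-experience case, the points above can be put together roughly as follows. This is only a sketch: the name `list_files` is made up for illustration, and symlinks are simply skipped, which is the most conservative way to keep the recursion from looping.

```perl
use strict;
use warnings;

# Hypothetical helper: recursively collects plain files below $dir using
# only opendir/readdir and the -l, -d and -f file tests.
sub list_files {
    my ($dir) = @_;
    my @found;

    opendir(my $dh, $dir) or die "Can't open $dir: $!";
    # readdir also returns '.' and '..'; drop them before going any further.
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;

    for my $entry (sort @entries) {
        my $path = "$dir/$entry";   # qualify the name before testing it
        next if -l $path;           # neither follow nor report symlinks
        if (-d $path) {
            push @found, list_files($path);
        }
        elsif (-f $path) {
            push @found, $path;
        }
    }
    return @found;
}
```

Testing `$path` rather than the bare `$entry` is what makes `-d` and `-f` refer to the right node when `$dir` is not the current working directory.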

score 3 · Accepted Answer

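Inspired by Nima Soroush's answer, here is a generalized recursive globbing function, similar to Bash 4's globstar option, that allows matching across all levels of a subtree with **.

Examples: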

# Match all *.txt and *.bak files located anywhere in the current
# directory's subtree.
globex '**/{*.txt,*.bak}' 

# Find all *.pm files anywhere in the subtrees of the directories in the
# module search path, @INC; follow symlinks.
globex '{' . (join ',', @INC) . '}/**/*.pm', { follow => 1 }

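Note: While this function, which combines File::Find with the built-in glob function, probably works mostly as you'd expect if you're familiar with glob's behavior, there are many subtleties around sorting and symlink behavior - see the comments at the bottom.

A notable deviation from glob() is that whitespace in a given pattern argument is considered part of the pattern. To specify multiple patterns, pass them as separate pattern arguments or use a brace expression, as in the example above.

Source code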
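
The whitespace difference comes from the built-in glob() versus File::Glob's bsd_glob(), which the function below relies on. A small self-contained demonstration, assuming a Unix-like system and a temporary directory whose path contains no whitespace or pattern characters (the file names are arbitrary):

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
for my $name ('a.txt', 'b.md', 'my notes.txt') {
    open my $fh, '>', "$dir/$name" or die "Can't create $name: $!";
    close $fh;
}

# glob() splits its argument on whitespace, so this is two patterns.
my @two = glob("$dir/*.txt $dir/*.md");

# bsd_glob() keeps the space: one pattern naming a file with a space in it.
my @one = bsd_glob("$dir/my notes.txt");

# A brace expression gives several patterns without relying on whitespace.
my @braced = bsd_glob("$dir/{*.txt,*.md}");

print scalar(@two), " ", scalar(@one), " ", scalar(@braced), "\n";
```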

sub globex {

  use File::Find;
  use File::Spec;
  use File::Basename;
  use File::Glob qw/bsd_glob GLOB_BRACE GLOB_NOMAGIC GLOB_QUOTE GLOB_TILDE GLOB_ALPHASORT/;

  my @patterns = @_;
  # Set the flags to use with bsd_glob() to emulate default glob() behavior.
  my $globflags = GLOB_BRACE | GLOB_NOMAGIC | GLOB_QUOTE | GLOB_TILDE | GLOB_ALPHASORT;
  my $followsymlinks;
  my $includehiddendirs;
  if (ref($patterns[-1]) eq 'HASH') {
    my $opthash = pop @patterns;
    $followsymlinks = $opthash->{follow};
    $includehiddendirs = $opthash->{hiddendirs};
  }
  unless (@patterns) { return };

  my @matches;
  my $ensuredot;
  my $removedot;
  # Use fc(), the casefolding function for case-insensitive comparison, if available.
  my $cmpfunc = defined &CORE::fc ? \&CORE::fc : \&CORE::lc;

  for (@patterns) {
    my ($startdir, $anywhereglob) = split '(?:^|/)\*\*(?:/|$)';
    if (defined $anywhereglob) {  # recursive glob
      if ($startdir) {
        $ensuredot = 1 if m'\./'; # if pattern starts with '.', ensure it is prepended to all results
      } elsif (m'^/') { # pattern starts with root dir, '/'
        $startdir = '/';
      } else { # pattern starts with '**'; must start recursion with '.', but remove it from results
        $removedot = 1;
        $startdir = '.';
      }
      unless ($anywhereglob) { $anywhereglob = '*'; }
      my $terminator = m'/$' ? '/' : '';
      # Apply glob() to the start dir. as well, as it may be a pattern itself.
      my @startdirs = bsd_glob $startdir, $globflags or next;
      find({
          wanted => sub {
            # Ignore symlinks, unless told otherwise.
            unless ($followsymlinks) { -l $File::Find::name and return; }
            # Ignore non-directories and '..'; we only operate on 
            # subdirectories, where we do our own globbing.
            ($_ ne '..' and -d) or return;
            # Skip hidden dirs., unless told otherwise.
            unless ($includehiddendirs) {  return if basename($_) =~ m'^\..'; }
            my $globraw;
            # Glob without './', if it wasn't part of the input pattern.
            if ($removedot and m'^\./(.+)$') { 
              $_ = $1;
            }
            $globraw = File::Spec->catfile($_, $anywhereglob);
            # Ensure a './' prefix, if the input pattern had it.
            # Note that File::Spec->catfile() removes it.
            if($ensuredot) {
              $globraw = './' . $globraw if $globraw !~ m'\./';
            }
            push @matches, bsd_glob $globraw . $terminator, $globflags;
          },
          no_chdir => 1,
          follow_fast => $followsymlinks, follow_skip => 2,
          # Pre-sort the items case-insensitively so that subdirs. are processed in sort order.
          # NOTE: Unfortunately, the preprocess sub is only called if follow_fast (or follow) are FALSE.
          preprocess => sub { return sort { &$cmpfunc($a) cmp &$cmpfunc($b) } @_; }
        }, 
        @startdirs);
    } else {  # simple glob
      push @matches, bsd_glob($_, $globflags);
    }
  }
  return @matches;
}

Comments

SYNOPSIS
  globex PATTERNLIST[, \%options]

DESCRIPTION
  Extends the standard glob() function with support for recursive globbing.
  Prepend '**/' to the part of the pattern that should match anywhere in the
  subtree or end the pattern with '**' to match all files and dirs. in the
  subtree, similar to Bash's `globstar` option.

  A pattern that doesn't contain '**' is passed to the regular glob()
  function.
  While you can use brace expressions such as {a,b}, using '**' INSIDE
  such an expression is NOT supported, and will be treated as just '*'.
  Unlike with glob(), whitespace in a pattern is considered part of that
  pattern; use separate pattern arguments or a brace expression to specify
  multiple patterns.

  To also follow directory symlinks, set 'follow' to 1 in the options hash
  passed as the optional last argument.
  Note that this changes the sort order - see below.

  Traversal:
  For recursive patterns, any given directory examined will have its matches
  listed first, before descending depth-first into the subdirectories.

  Hidden directories:
  These are skipped by default, unless you set 'hiddendirs' to 1 in the
  options hash passed as the optional last argument.

  Sorting:
  A given directory's matching items will always be sorted
  case-insensitively, as with glob(), but sorting across directories
  is only ensured, if the option to follow symlinks is NOT specified.

  Duplicates:
  Following symlinks only prevents cycles, so if both a symlink and its
  target match, they will both be reported.
  (Under the hood, following symlinks activates the following
   File::Find::find() options: `follow_fast`, with `follow_skip` set to 2.)

  Since the default glob() function is at the heart of this function, its
  rules - and quirks - apply here too:
  - If literal components of your patterns contain pattern metacharacters,
    - * ? { } [ ] - you must make sure that they're \-escaped to be treated
    as literals; here's an expression that works on both Unix and Windows
    systems: s/[][{}\-~*?]/\\$&/gr
  - Unlike with glob(), however, whitespace in a pattern is considered part
    of the pattern; to specify multiple patterns, use either a brace
    expression (e.g., '{*.txt,*.md}'), or pass each pattern as a separate
    argument.
  - A pattern ending in '/' restricts matches to directories and symlinks
    to directories, but, strangely, also includes symlinks to *files*.
  - Hidden files and directories are NOT matched by default; use a separate
    pattern starting with '.' to include them; e.g., globex '**/{.*,*}'
    matches all files and directories, including hidden ones, in the 
    current dir.'s subtree.
    Note: As with glob(), .* also matches '.' and '..'
  - Tilde expansion is supported; escape as '\~' to treat a tilde that is
    the first char. as a literal.
  - A literal path (with no pattern chars. at all) is echoed as-is,
    even if it doesn't refer to an existing filesystem item.

COMPATIBILITY NOTES
  Requires Perl v5.6.0+
  '/' must be used as the path separator on all platforms, even on Windows.

EXAMPLES
  # Find all *.txt files in the subtree of a dir stored in $mydir, including
  # in hidden subdirs.
  globex "$mydir/**/*.txt", { hiddendirs => 1 };

  # Find all *.txt and *.bak files in the current subtree.
  globex '**/*.txt', '**/*.bak'; 

  # Ditto, though with different output ordering:
  # Unlike above, where you get all *.txt files across all subdirs. first,
  # then all *.bak files, here you'll get *.txt files, then *.bak files
  # per subdirectory encountered.
  globex '**/{*.txt,*.bak}';

  # Find all *.pm files anywhere in the subtrees of the directories in the
  # module search path, @INC; follow symlinks.
  # Note: The assumption is that no directory in @INC has embedded spaces
  #       or contains pattern metacharacters.
  globex '{' . (join ',', @INC) . '}/**/*.pm', { follow => 1 };
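
The escaping expression mentioned in the notes above can be tried in isolation with bsd_glob(). This is a sketch only: it assumes Perl 5.14+ (for the /r modifier), a Unix-like filesystem that allows `*` and `[` in file names, and a temporary directory path free of pattern characters; the file names are invented for the demonstration.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);
use File::Temp qw(tempdir);

my $dir     = tempdir(CLEANUP => 1);
my $literal = 'report[2024]*.txt';    # a file name full of metacharacters
for my $name ($literal, 'report2x.txt') {
    open my $fh, '>', "$dir/$name" or die "Can't create $name: $!";
    close $fh;
}

# Backslash-escape every pattern metacharacter so the name is taken literally.
my $escaped = $literal =~ s/[][{}\-~*?]/\\$&/gr;

# Unescaped, '[2024]' is a character class and '*' a wildcard.
my @unescaped_hits = bsd_glob("$dir/$literal");
# Escaped, only the file with that exact name can match.
my @escaped_hits = bsd_glob("$dir/$escaped");

print "escaped pattern: $escaped\n";
print "unescaped: @unescaped_hits\n";
print "escaped:   @escaped_hits\n";
```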

score 1 · Accepted Answer

You can use this approach as a recursive file search restricted to a specific file type:

use File::Find;

my @files;
push @files, list_dir($outputDir);

sub list_dir {
        my @dirs = @_;
        my @files;
        find({ wanted => sub { push @files, glob "\"$_/*.txt\"" } , no_chdir => 1 }, @dirs);
        return @files;
}

score 0 · Accepted Answer

I tried to do this using only readdir. I'm leaving my code here in case it's useful to anyone:

sub rlist_files{
    my @depth = ($_[0],);
    my @files;
    while ($#depth > -1){
        my $dir = pop(@depth);
        opendir(my $dh, $dir) || die "Can't open $dir: $!";
        while (readdir $dh){
            my $entry = "$dir/$_";
            if (!($entry =~ /\/\.+$/)){
                if (-f $entry){
                    push(@files,$entry);
                }
                elsif (-d $entry){
                    push(@depth, $entry);
                }
            }
        }
        closedir $dh;
    }
    return @files;
}

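Edit: As @brian d foy pointed out, that code doesn't take symlinks into account at all.

As an exercise, I tried to write a new subroutine that can (optionally) follow symlinks recursively without getting stuck in a loop and with somewhat bounded memory usage (using a hash to track visited symlinks used several GB on large runs). While I was at it, I also added the option of passing a regex to filter the files. Again, I'm leaving the code here in case it's useful to anyone: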

sub rlist_files_nohash{
    use Cwd qw(abs_path);
    my $input_path = abs_path($_[0]);
    if (!defined $input_path){
        die "Cannot find $_[0]."
    }
    my $ignore_symlinks = 0;
    if ($#_>=1){
        $ignore_symlinks = $_[1];
    }
    my $regex;
    if ($#_==2){
        $regex = $_[2];
    }   
    my @depth = ($input_path,);
    my @files;
    my @link_dirs;
    while ($#depth > -1){
        my $dir = pop(@depth);
        opendir(my $dh, $dir) or die "Can't open $dir: $!";
        while (readdir $dh){
            my $entry = "$dir/$_";
            if (!($entry =~ /\/\.+$/)){
                if (-l $entry){
                    if ($ignore_symlinks){
                        $entry = undef;
                    }
                    else{
                        while (defined $entry && -l $entry){
                            $entry = readlink($entry);
                            if (defined $entry){
                                if (substr($entry, 0, 1) ne "/"){
                                    $entry = $dir."/".$entry;
                                }
                                $entry = abs_path($entry);
                            }
                        }
                        if (defined $entry && -d $entry){
                            if ($input_path eq substr($entry,0,length($input_path))){
                                $entry = undef;
                            }
                            else{
                                for (my $i = $#link_dirs;($i >= 0 && defined $entry); $i--){
                                    if (length($link_dirs[$i]) <= length($entry) && $link_dirs[$i] eq substr($entry,0,length($link_dirs[$i]))){
                                        $entry = undef;
                                        $i = $#link_dirs +1;
                                    }
                                }
                                if(defined $entry){
                                    push(@link_dirs, $entry);
                                }
                            }
                        }
                    }
                }
                if (defined $entry){
                    if (-f $entry && (!defined $regex || $entry =~ /$regex/)){
                        push(@files, abs_path($entry));
                    }
                    elsif (-d $entry){
                        push(@depth, abs_path($entry));
                    }
                }
            }
        }
        closedir $dh;
    }
    if ($ignore_symlinks == 0){
        @files = sort @files;
        my @indices = (0,);
        for (my $i = 1;$i <= $#files; $i++){
            if ($files[$i] ne $files[$i-1]){
                push(@indices, $i);
            }
        }
        @files = @files[@indices];
    }
    return @files;
}
#Testing
my $t0 = time();
my @files = rlist_files_nohash("/home/user/", 0, qr/\.pdf$/);
my $tf = time() - $t0;
for my $file (@files){
    print($file."\n");
}
print ("Total files found: ".scalar @files."\n");
print ("Execution time: $tf\n");
