ruby - Regex problem when building a file system crawler

Question

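I'm building a crawler to search my file system for specific documents containing specific information. However, the regex part has me a bit confused. I have a test file on my desktop containing "teststring" and a test credit card number "4060324066583245", and the code below runs correctly and finds the file containing teststring: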

require 'find'
count = 0

Find.find('/') do |f|              # '/' for root directory on OS X
  if f.match(/\.doc\Z/)            # check if filename ends in desired format
    contents = File.read(f)
    if /teststring/.match(contents)
      puts f
      count += 1
    end
  end
end

puts "#{count} sensitive files were found"

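Running this confirms that the crawler works and finds the match correctly. However, when I run it to look for the test credit card number, it finds no match: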

require 'find'
count = 0

Find.find('/') do |f|              # '/' for root directory on OS X
  if f.match(/\.doc\Z/)            # check if filename ends in desired format
    contents = File.read(f)
    if /^4[0-9]{12}(?:[0-9]{3})?$/.match(contents)
      puts f
      count += 1
    end
  end
end

puts "#{count} sensitive files were found"

I checked the regex on rubular.com with 4060324066583245 as the test data, which is the number in my test document, and Rubular confirms that the number matches the regex. To summarize:

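The crawler works in the first case with teststring, which verifies that it scans my file system correctly and reads the contents of the desired file type.
Rubular verifies that my regex successfully matches my test credit card number 4060324066583245.
The crawler fails to find the test credit card number.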

Any suggestions? I don't understand why Rubular shows the regex as valid while the script doesn't work when run on my machine.

score 2 · Accepted Answer

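^ and $ are anchors. In Ruby they tie the match to the beginning and end of a line (\A and \z are the whole-string equivalents), so your pattern only matches when the number is the only thing on its line. On Rubular the test string was just the number; in the file it is presumably surrounded by other content.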

So ^[0-9]{4}$ will match "1234", but not "12345" or " 1234 ", etc.
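To make the anchor behaviour concrete, here is a small sketch that runs the question's pattern against three strings. The constant name CARD_ANCHORED and the sample strings are made up for illustration; only the regex itself comes from the question.

```ruby
# The pattern from the question, with ^ and $ line anchors.
# CARD_ANCHORED is a hypothetical name used only for this example.
CARD_ANCHORED = /^4[0-9]{12}(?:[0-9]{3})?$/

# The number is the entire string (what was tested on Rubular).
puts CARD_ANCHORED.match?("4060324066583245")

# The number shares a line with other text, so ^ and $ cannot both hold.
puts CARD_ANCHORED.match?("teststring 4060324066583245 more text")

# The number sits alone on its own line inside a longer string.
puts CARD_ANCHORED.match?("teststring\n4060324066583245\nmore text")
```

Only the second case fails to match, which is likely the situation inside the crawled file.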

You should use word boundaries instead:

if contents =~ /\b4[0-9]{12}(?:[0-9]{3})?\b/
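As a quick check of the word-boundary version, the sketch below wraps it in a helper and tries it on a few strings. The names CARD_BOUNDED and contains_card_number? are hypothetical and not part of the original script; the regex is the one from this answer.

```ruby
# Word-boundary version of the pattern from this answer.
# CARD_BOUNDED and contains_card_number? are illustrative names only.
CARD_BOUNDED = /\b4[0-9]{12}(?:[0-9]{3})?\b/

# Returns true when the text contains a 13- or 16-digit number
# starting with 4 that is not part of a longer run of word characters.
def contains_card_number?(text)
  CARD_BOUNDED.match?(text)
end

# Number embedded in a line of other text: found.
puts contains_card_number?("teststring 4060324066583245 more text")

# Same digits inside a longer 17-digit run: \b rejects it.
puts contains_card_number?("id 14060324066583245")
```

In the crawler, the condition would then read `if contains_card_number?(contents)`. Note that \b treats letters, digits and underscores as word characters, so a number glued directly to a letter would not be reported.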
