ruby - Extracting data from a string with a regular expression in Ruby

Question

I'm doing some web scraping, and this is the format of the data:

Sr.No.  Course_Code Course_Name Credit  Grade   Attendance_Grade

The actual string I receive has the following form:

1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M

I'm interested in Course_Code, Course_Name and Grade. In this example, the values are:

Course_Code : CA727
Course_Name : PRINCIPLES OF COMPILER DESIGN
Grade : A

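Is there a way to extract this information easily with a regular expression or some other technique, rather than parsing the string by hand? I'm using JRuby in 1.9 mode.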

score 42 · Accepted Answer

Let's use Ruby's named captures and a self-describing regular expression!

course_line = /
    ^                  # Starting at the front of the string
    (?<SrNo>\d+)       # Capture one or more digits; call the result "SrNo"
    \s+                # Eat some whitespace
    (?<Code>\S+)       # Capture all the non-whitespace you can; call it "Code"
    \s+                # Eat some whitespace
    (?<Name>.+\S)      # Capture as much as you can
                       # (while letting the rest of the regex still work)
                       # Make sure you end with a non-whitespace character.
                       # Call this "Name"
    \s+                # Eat some whitespace
    (?<Credit>\S+)     # Capture all the non-whitespace you can; call it "Credit"
    \s+                # Eat some whitespace
    (?<Grade>\S+)      # Capture all the non-whitespace you can; call it "Grade"
    \s+                # Eat some whitespace
    (?<Attendance>\S+) # Capture all the non-whitespace; call it "Attendance"
    $                  # Make sure that we're at the end of the line now
/x

str = "1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M"
parts = str.match(course_line)

puts "
Course Code: #{parts['Code']}
Course Name: #{parts['Name']}
      Grade: #{parts['Grade']}".strip

#=> Course Code: CA727
#=> Course Name: PRINCIPLES OF COMPILER DESIGN
#=>       Grade: A
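If several rows need the same treatment, the named captures can be collected into a Hash per row. The following is only a sketch, assuming every row follows the layout shown in the question; `parse_course` is a hypothetical helper name, and the pattern is a compact copy of the commented regex above. It builds the Hash with `Hash[...]` rather than the `to_h` that newer Rubies offer, so it should also run in the 1.9 mode mentioned in the question.

```ruby
# Hypothetical helper: returns a Hash of capture name => value,
# or nil when the line does not look like a course row.
def parse_course(line)
  # Compact form of the named-capture pattern from the answer above.
  pattern = /^(?<SrNo>\d+)\s+(?<Code>\S+)\s+(?<Name>.+\S)\s+(?<Credit>\S+)\s+(?<Grade>\S+)\s+(?<Attendance>\S+)$/
  m = pattern.match(line.strip)
  return nil unless m
  # names and captures come back in the same order, so zip pairs them up.
  Hash[m.names.zip(m.captures)]
end

course = parse_course("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")
puts course['Code']   #=> CA727
puts course['Name']   #=> PRINCIPLES OF COMPILER DESIGN
puts course['Grade']  #=> A

# The header row starts with a non-digit, so it is rejected.
p parse_course("Sr.No. Course_Code Course_Name Credit Grade Attendance_Grade")  #=> nil
```

Returning `nil` for non-matching lines lets the caller skip header rows and other noise from the scraped page without rescuing an exception.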

score 6 · Accepted Answer

Just for fun:

str = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"
tok = str.split(/\s+/)
# Peel the fixed fields off both ends; whatever is left is the course name.
data = {
  'Sr.No.'           => tok.shift,
  'Course_Code'      => tok.shift,
  'Attendance_Grade' => tok.pop,
  'Grade'            => tok.pop,
  'Credit'           => tok.pop,
  'Course_Name'      => tok.join(' ')
}
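The idea is that the fixed fields are taken from both ends of the token array, so the tokens left in the middle form the course name however many words it has. A small sketch of the same approach wrapped in a method (`split_course` is a hypothetical name), using `shift(n)` and `pop(n)` to take several tokens at once:

```ruby
# Hypothetical helper: split on whitespace, take two tokens from the front
# and three from the back; the remainder is the course name.
def split_course(line)
  tok = line.split(/\s+/)
  serial, code = tok.shift(2)              # first two tokens
  credit, grade, attendance = tok.pop(3)   # last three tokens, original order
  {
    'Sr.No.'           => serial,
    'Course_Code'      => code,
    'Course_Name'      => tok.join(' '),
    'Credit'           => credit,
    'Grade'            => grade,
    'Attendance_Grade' => attendance
  }
end

data = split_course("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")
puts data['Course_Code']  #=> CA727
puts data['Course_Name']  #=> PRINCIPLES OF COMPILER DESIGN
puts data['Grade']        #=> A
```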

score 3 · Accepted Answer

Am I seeing correctly that the separator is always 3 spaces? Then it's just:

serial_number, course_code, course_name, credit, grade, attendance_grade = 
  the_string.split('   ')
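A quick check of that assumption, as a sketch only: the split works when the separator really is exactly three spaces, but a single-space row has nothing to split on and comes back as one element. The variable names below are illustrative.

```ruby
three_spaces = "1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M"
single_space = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"

# Underscores mark the fields we do not care about.
_, course_code, course_name, _, grade, _ = three_spaces.split('   ')
puts course_code  #=> CA727
puts course_name  #=> PRINCIPLES OF COMPILER DESIGN
puts grade        #=> A

# No three-space run exists here, so the whole row is a single element.
puts single_space.split('   ').length  #=> 1
```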

score 3 · Accepted Answer

Assuming everything except the course description consists of a single word, and there is no leading or trailing whitespace:

/^(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)\s+(\w+)\s+(\w+)$/

Your example string will produce the following match groups:

1.  1
2.  CA727
3.  PRINCIPLES OF COMPILER DESIGN
4.  3
5.  A
6.  M
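For completeness, here is one way the pattern might be applied in Ruby, as a sketch rather than part of the original answer. The groups are positional, so `captures` returns them in the order listed above; the variable names are illustrative.

```ruby
pattern = /^(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)\s+(\w+)\s+(\w+)$/
m = pattern.match("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")

if m
  # captures holds groups 1..6; underscores mark the ones we ignore.
  _, course_code, course_name, _, grade, _ = m.captures
  puts course_code  #=> CA727
  puts course_name  #=> PRINCIPLES OF COMPILER DESIGN
  puts grade        #=> A
else
  warn "line did not match the expected layout"
end
```

Note that `[\w\s]+` is greedy and also matches whitespace, so if the fields are separated by more than one space the third group keeps some trailing spaces; calling `strip` on the course name is a cheap safeguard.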

score 1 · Accepted Answer

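This answer isn't especially idiomatic Ruby, because in this case I think clarity beats cleverness. To solve the problem you described, all you really need to do is split the line on whitespace: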

line = '1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M'
array = line.split(/\t|\s{2,}/)
puts array[1], array[2], array[4]

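This assumes your data is regular. If it isn't, you'll need to work harder at tuning your regular expression, and possibly handle edge cases where you don't get the expected number of fields.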
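One way to handle the wrong-field-count case is to check the length of the split result before indexing into it. A minimal sketch, assuming six fields per row as in the question's header; `extract_fields` is a hypothetical helper, and raising `ArgumentError` is just one possible policy:

```ruby
# Hypothetical helper: split on a tab or a run of 2+ whitespace characters
# and refuse rows that do not have exactly six fields.
def extract_fields(line)
  fields = line.strip.split(/\t|\s{2,}/)
  unless fields.length == 6
    raise ArgumentError, "expected 6 fields, got #{fields.length}: #{line.inspect}"
  end
  { :code => fields[1], :name => fields[2], :grade => fields[4] }
end

row = extract_fields('1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M')
puts row[:code], row[:name], row[:grade]

# A single-space row has no tab or 2+ space separator, so it is rejected.
begin
  extract_fields('1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M')
rescue ArgumentError => e
  puts "skipped: #{e.message}"
end
```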

A note for posterity

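The OP changed the input string so that fields are now separated by a single space. I'm leaving my answer to the original question as it is (including the original input string, for reference), because it may help people other than the OP in less specific situations.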
