A third-party system generates an HTML table of parent-teacher appointments:

 Blocks    Teacher 1   Teacher 2   Teacher 3
3:00 pm      Stu A       Stu B
3:10 pm      Stu B                   Stu C
...
5:50 pm      Stu D       Stu A       Stu E

The number of columns varies with how many teachers have bookings. The number of rows varies with how many slots we create.

The end result needs to be one hash per teacher, for example:

{ name: "Teacher 1", email: "teacher.1@school.edu", appointments: [
  { start: "15:00", end: "15:08", attendees: [
    { name: "Stu A Parent 1", email: "stuap1@example.com" },
    { name: "Stu A Parent 2", email: "stuap2@example.com" }
  ] },
  { start: "15:10", end: "15:18", attendees: [
    { name: "Stu B Parent", email: "stubp@example.com" }
  ] },
  ...
  { start: "17:50", end: "17:58", attendees: [
    { name: "Stu D Parent 1", email: "studp1@example.com" },
    { name: "Stu D Parent 2", email: "studp2@example.com" }
  ] },
] },

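It seemed to make the most sense for the ETL to process one teacher per row, so this time I transposed the rows and columns in Numbers and saved the result as CSV: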

Blocks,3:00 pm,3:10 pm,...,5:50 pm
Teacher 1,Stu A,Stu B,...,Stu D
Teacher 2,Stu B,,...,Stu C
Teacher 3,Stu D,Stu A,...,Stu E

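I am trying to keep the whole process as simple as possible for the office staff who will use it, so is it possible to do the row/column transposition in Kiba (or plain Ruby)? In Kiba, I assume I would have to process all the rows, accumulate a hash for each teacher, and then output each teacher's hash at the end?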

2 Answers

Kiba author here!

I see at least two ways to do this (whether you use plain Ruby or Kiba):

  • Convert your HTML to a table, then work from that data
  • Work directly with the HTML table (using Nokogiri and selectors), which only works if the HTML is mostly clean

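In all cases, because you are doing some scraping, I recommend writing very defensive code (the HTML can change and may later contain bugs or corner cases): for example, strong assertions that the rows and columns contain what you expect, validations, and so on.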
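
To make "defensive" a bit more concrete, here is a minimal sketch of the kind of checks I have in mind, run on the table once it has been parsed into an array of rows. The helper name `validate_table!` and the exact rules (a "Blocks" header cell, unique teacher names, "h:mm am/pm" time labels) are assumptions based on the sample in the question, so adjust them to what the real export contains:

```ruby
# Hypothetical helper: validate a parsed table (an array of rows, each an
# array of cell strings) before doing anything else with it.
def validate_table!(rows)
  raise ArgumentError, "table is empty" if rows.empty?

  header, *body = rows
  unless header.first == "Blocks"
    raise ArgumentError, "unexpected first header cell: #{header.first.inspect}"
  end
  raise ArgumentError, "no teacher columns found" if header.size < 2
  raise ArgumentError, "duplicate column names" unless header.uniq.size == header.size

  body.each_with_index do |row, i|
    # Every row must have one cell per header column
    unless row.size == header.size
      raise ArgumentError, "row #{i + 1} has #{row.size} cells, expected #{header.size}"
    end
    # Assumption: time labels look like "3:00 pm"
    unless row.first.match?(/\A\d{1,2}:\d{2} (am|pm)\z/i)
      raise ArgumentError, "row #{i + 1} has an unexpected time: #{row.first.inspect}"
    end
  end
  rows
end

rows = [
  ["Blocks", "Teacher 1", "Teacher 2", "Teacher 3"],
  ["3:00 pm", "Stu A", "Stu B", ""],
  ["3:10 pm", "Stu B", "", "Stu C"]
]
validate_table!(rows)
```

Failing loudly with a message that names the offending row is usually easier for office staff to act on than a silently wrong schedule.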

If you go with plain Ruby, you could do something like this (your data is modelled here as comma-separated text to keep things clear):

task :default do
  data = <<DOC
  Blocks  ,  Teacher 1  , Teacher 2  , Teacher 3
  3:00 pm  ,    Stu A   ,    Stu B   ,          
  3:10 pm   ,   Stu B   ,            ,    Stu C
DOC
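  # Split into rows, then into stripped cells (the -1 limit keeps trailing empty cells)
  data = data.split("\n").map { |line| line.split(",", -1).map(&:strip) }
  # Transposing turns columns into rows: the time blocks first, then one row per teacher
  blocks, *teachers = data.transpose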
  teachers.each do |teacher|
    pp blocks.zip(teacher)
  end
end

This will output:

[["Blocks", "Teacher 1"], ["3:00 pm", "Stu A"], ["3:10 pm", "Stu B"]]
[["Blocks", "Teacher 2"], ["3:00 pm", "Stu B"], ["3:10 pm", ""]]
[["Blocks", "Teacher 3"], ["3:00 pm", ""], ["3:10 pm", "Stu C"]]

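You can then massage this into the shape you expect (but again: be very defensive and assert on all of the data, including the number of cells in each row, otherwise you will end up with bugs, an incorrect schedule, and so on).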
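
As a rough illustration of that massaging step, the sketch below turns one teacher's list of pairs into a hash closer to the one in the question. The helper names (`to_minutes`, `format_minutes`, `appointments_for`) and the 8-minute default are assumptions taken from the sample data, and the attendee emails are left out because they are not in the table and would have to come from another source:

```ruby
# Convert a label such as "3:00 pm" to minutes since midnight.
# Raises on anything that does not look like "h:mm am/pm".
def to_minutes(label)
  m = label.match(/\A(\d{1,2}):(\d{2}) (am|pm)\z/i)
  raise ArgumentError, "bad time: #{label.inspect}" unless m

  hour = m[1].to_i % 12              # "12:xx" becomes hour 0 before the pm shift
  hour += 12 if m[3].downcase == "pm"
  hour * 60 + m[2].to_i
end

# Format minutes since midnight as a 24-hour "HH:MM" string.
def format_minutes(mins)
  "%02d:%02d" % mins.divmod(60)
end

# pairs is one teacher's blocks.zip(teacher) result.
# mins_per_meeting defaults to 8, as in the question's sample (an assumption).
def appointments_for(pairs, mins_per_meeting: 8)
  (_, teacher), *slots = pairs
  booked = slots.reject { |_, student| student.empty? }   # skip unbooked slots
  appointments = booked.map do |label, student|
    start = to_minutes(label)
    { start: format_minutes(start),
      end: format_minutes(start + mins_per_meeting),
      student: student }
  end
  { name: teacher, appointments: appointments }
end

pairs = [["Blocks", "Teacher 2"], ["3:00 pm", "Stu B"], ["3:10 pm", ""]]
pp appointments_for(pairs)
```

Empty cells are dropped here rather than emitted as empty appointments; whether that is right depends on what the consumer of the hash expects.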

If you want to use Kiba and CSS selectors, you could do it like this:

task :default do
  html = <<HTML
    <table>
      <tr>
        <th>Blocks</th>
        <th>Teacher 1</th>
        <th>Teacher 2</th>
        <th>Teacher 3</th>
      </tr>
      <tr>
        <td>3:00 pm</td>
        <td>Stu A</td>
        <td>Stu B</td>
        <td></td>
      </tr>
      <tr>
        <td>3:10 pm</td>
        <td>Stu B</td>
        <td></td>
        <td>Stu C</td>
      </tr>
    </table>
HTML
  require 'nokogiri'
  require 'kiba'
  require 'kiba-common/sources/enumerable'
  require 'kiba-common/transforms/enumerable_exploder'
  Kiba.run do
    # just one doc here, but we could have a sequence instead
    source Kiba::Common::Sources::Enumerable, -> { [html] }

    transform { |r| Nokogiri::HTML(r) }

    transform do |doc|
      Enumerator.new do |y|
        blocks, *teachers = doc.search("table tr:first th").map(&:text)
        # you'd have to add more defensive checks here!!! important!
        teachers.each_with_index do |t, i|
          headers = doc.search("table>tr>:nth-child(1)").map(&:text)
          data = doc.search("table>tr>:nth-child(#{i + 2})").map(&:text)
          y << { teacher: t, data: headers.zip(data) }
        end
      end
    end

    transform Kiba::Common::Transforms::EnumerableExploder

    transform { |r| pp r }
  end
end

This gives:

{:teacher=>"Teacher 1",
 :data=>[["Blocks", "Teacher 1"], ["3:00 pm", "Stu A"], ["3:10 pm", "Stu B"]]}
{:teacher=>"Teacher 2",
 :data=>[["Blocks", "Teacher 2"], ["3:00 pm", "Stu B"], ["3:10 pm", ""]]}
{:teacher=>"Teacher 3",
 :data=>[["Blocks", "Teacher 3"], ["3:00 pm", ""], ["3:10 pm", "Stu C"]]}

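I think I would prefer a mix of the two approaches: first convert the HTML into a proper CSV file or in-memory table, then transpose from there as a second step.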
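
The second step of that mix could look roughly like the sketch below. It assumes the first step has already produced CSV text in the original orientation (one row per time block), so nobody has to transpose anything in Numbers by hand. The variable names and sample data are illustrative only:

```ruby
require "csv"

# Assumption: step one already turned the HTML table into CSV text,
# still in its original orientation (one row per time block).
csv_text = <<~CSV
  Blocks,Teacher 1,Teacher 2,Teacher 3
  3:00 pm,Stu A,Stu B,
  3:10 pm,Stu B,,Stu C
CSV

# CSV.parse returns nil for empty unquoted fields, so normalise to ""
rows = CSV.parse(csv_text).map { |row| row.map { |cell| cell.to_s.strip } }

# Array#transpose raises IndexError on ragged rows; check first for a clearer message
widths = rows.map(&:size).uniq
raise "rows have differing cell counts: #{widths.inspect}" unless widths.size == 1

blocks, *teachers = rows.transpose
# One entry per teacher: name => [[time, student], ...]
by_teacher = teachers.to_h do |name, *students|
  [name, blocks.drop(1).zip(students)]
end
pp by_teacher
```

Each value in `by_teacher` is then a small, easy-to-validate list of time and student pairs that a later transform can turn into appointments.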

answered 2021-03-10T08:46:43.423

Suppose we have the following schedule.

schedule =<<~END
Blocks,15:00,15:10,15:55
Teacher 1,Stu A,Stu B,Stu C
Teacher 2,Stu B,Stu C,Stu A
Teacher 3,Stu C,Stu A,Stu B
END

To produce the desired array of hashes we need additional information. Suppose we are also given the following.

teacher_emails = {
  "Teacher 1"=>"teacher.1@school.edu",
  "Teacher 2"=>"teacher.2@school.edu",
  "Teacher 3"=>"teacher.3@school.edu"
}
parent_emails = {
  "Stu A"=> { "Parent 1"=>"stuap1@example.com",
              "Parent 2"=>"stuap2@example.com" },
  "Stu B"=> { "Parent"=>"stubp@example.com" },
  "Stu C"=> { "Parent 1"=>"stuapc@example.com",
              "Parent 2"=>"stuapc@example.com" }
}
mins_per_meeting = 8

We can then proceed as follows.

blks, *sched = schedule.split(/\n/)
blks
  #=> "Blocks,15:00,15:10,15:55"
sched
  #=> ["Teacher 1,Stu A,Stu B,Stu C",
  #    "Teacher 2,Stu B,Stu C,Stu A",
  #    "Teacher 3,Stu C,Stu A,Stu B"]
time_blocks = blks.scan(/\d{1,2}:\d{2}/).map do |s|
  hr, min = s.split(':')
  mins_from_midnight = 60*(hr.to_i) + min.to_i
  { start: "%d:%02d" % mins_from_midnight.divmod(60),
    end: "%d:%02d" % (mins_from_midnight + mins_per_meeting).divmod(60) }
end
  #=> [{:start=>"15:00", :end=>"15:08"},
  #    {:start=>"15:10", :end=>"15:18"},
  #    {:start=>"15:55", :end=>"16:03"}]
sched.map do |s|
  teacher, *students = s.split(',')
  { name: teacher,
    email: teacher_emails[teacher],
    appointments: time_blocks.zip(students).map do |tb,stud|
      tb.merge(
        { student: stud,
          attendees: parent_emails[stud].map do |par_name, par_email|
            { name: par_name, email: par_email }
          end
        }
      )
    end    
  }
end
  #=> [{:name=>"Teacher 1", :email=>"teacher.1@school.edu",
  #     :appointments=>[
  #       {:start=>"15:00", :end=>"15:08",
  #        :student=>"Stu A",
  #        :attendees=>[
  #          {:name=>"Parent 1", :email=>"stuap1@example.com"},
  #          {:name=>"Parent 2", :email=>"stuap2@example.com"}
  #        ]
  #       },
  #       {:start=>"15:10", :end=>"15:18",
  #        :student=>"Stu B",
  #        :attendees=>[
  #          {:name=>"Parent", :email=>"stubp@example.com"}
  #        ]
  #       },
  #       {:start=>"15:55", :end=>"16:03",
  #        :student=>"Stu C",
  #        :attendees=>[
  #          {:name=>"Parent 1", :email=>"stuapc@example.com"},
  #          {:name=>"Parent 2", :email=>"stuapc@example.com"}
  #        ]
  #       }
  #     ]
  #    },
  #    {:name=>"Teacher 2", :email=>"teacher.2@school.edu",
  #     :appointments=>[
  #       {:start=>"15:00", :end=>"15:08",
  #        :student=>"Stu B",
  #        :attendees=>[
  #          {:name=>"Parent", :email=>"stubp@example.com"}
  #        ]
  #       },
  #       ....
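
One thing to watch: in the schedule from the question some slots are empty, and with an empty cell `parent_emails[stud]` returns `nil`, so calling `map` on it raises. A possible variation for a single teacher row, assuming empty cells mean "no booking" and that an unknown student should be treated as an error, is sketched below. The sample data is repeated in trimmed form so the snippet runs on its own, and the empty cell in `row` is made up for illustration:

```ruby
teacher_emails = { "Teacher 1" => "teacher.1@school.edu" }
parent_emails = {
  "Stu A" => { "Parent 1" => "stuap1@example.com",
               "Parent 2" => "stuap2@example.com" },
  "Stu B" => { "Parent" => "stubp@example.com" }
}
time_blocks = [
  { start: "15:00", end: "15:08" },
  { start: "15:10", end: "15:18" },
  { start: "15:55", end: "16:03" }
]
row = "Teacher 1,Stu A,,Stu B"   # hypothetical row with an unbooked 15:10 slot

# The -1 limit keeps trailing empty cells so columns stay aligned with time_blocks
teacher, *students = row.split(",", -1).map(&:strip)

appointments = time_blocks.zip(students).filter_map do |tb, stud|
  next if stud.nil? || stud.empty?       # unbooked slot: nothing to emit
  parents = parent_emails.fetch(stud)    # KeyError if the student is unknown
  tb.merge(student: stud,
           attendees: parents.map { |name, email| { name: name, email: email } })
end

result = { name: teacher,
           email: teacher_emails.fetch(teacher),
           appointments: appointments }
pp result
```

Using `fetch` instead of `[]` makes a misspelled student or teacher name fail immediately instead of producing an appointment with no attendees.
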
answered 2021-03-10T09:16:09.003