ruby - How do I create an array from a txt file converted from an HTML file (Ruby)?

Question

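I'm trying to complete the first assignment we were given:

Receive 5 regular emails and 5 advance-fee fraud emails (also known as spam). Convert them all to text files, then turn each one into an array of words (split may help here). Then use a bunch of regular expressions to search the array of words for keywords, to classify which files are spam. If you want to get fancy, you could give each array a spam score (out of 10).

Open the HTML page and read the file.
Strip scripts, links, etc. from the file.
Keep the body/paragraphs on their own.
Open a text file (file2) and write to it (UTF-8).
Pass in the content from the HTML document (file1).
Now put the words from the text file (file2) into an array, then split.
Go through the array looking for any words that are considered spam and print a message to the screen saying whether the email is spam or not.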

Here is my code:

require 'nokogiri'
file = File.open("EMAILS/REG/Membership.htm", "r")
doc = Nokogiri::HTML(file)
#What ever is passed from elements to the newFile is being put into the new array however the euro sign doesn't appear correctly
elements = doc.xpath("/html/body//p").text
#puts elements

newFile = File.open("test1.txt", "w")
newFile.write(elements)
newFile.close()


#I want to open the file again and print the lines to the screen
#
array_of_words = {}
puts "\n\tRetrieving test1.txt...\n\n"
File.open("test1.txt", "r:UTF-8").each_line do |line|
    words = line.split(' ')
    words.each do |word|
        puts "#{word}"
        #array_of_words[word] = gets.chomp.split(' ')
    end
end

Edit: here is my edited file. However, I still can't get the euro sign to come through as UTF-8 in the array (see image).

require 'nokogiri'

doc = Nokogiri::HTML(File.open("EMAILS/REG/Membership.htm", "r:UTF-8"))

#What ever is passed from elements to the newFile is being put into the new 
#array however the euro sign doesn't appear correctly
elements = doc.xpath("//p").text
#puts elements

File.write("test1.txt", elements)

puts "\n\tRetrieving test1.txt...\n\n"

#I want to open the file again and print the lines to the screen
#
word_array = Array.new
File.read("test1.txt").each_line do |line|
    line.split(' ').each do |word|
        puts "#{word}"
        word_array << word
    end
end

score 0 · Accepted Answer

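You're making things harder for yourself. You already have the paragraph text in elements, so there is no need to read test1.txt back after writing it. Just use String#split with no arguments to split on all whitespace.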
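As a minimal sketch of that suggestion: the string below is made-up sample text standing in for whatever doc.xpath("//p").text returns, so the example runs without Nokogiri or any input file.

```ruby
# Made-up sample text standing in for the result of doc.xpath("//p").text
elements = "Dear member,\n\tYour fee is  50 euros.\nThank you"

# With no arguments, split breaks on any run of whitespace
# (spaces, tabs, newlines) and drops leading/trailing whitespace.
word_array = elements.split

puts word_array.inspect
```

Because the words come straight from the string in memory, the round trip through test1.txt is only needed if the assignment requires the text file to exist.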

score 0 · Accepted Answer

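Because this is an assignment, I'm not going to try to answer how you should do it; you should figure that out yourself.

What I will do is show you how you should have written what you've already done, and point you in a direction: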

require 'nokogiri'

doc = Nokogiri::HTML(File.read("EMAILS/REG/Membership.htm"))

# What ever is passed from elements to the newFile is being put into the new
# array however the euro sign doesn't appear correctly
elements = doc.xpath("//p").text

File.write("test1.txt", elements)

print "\n\tRetrieving test1.txt...\n\n"

# I want to open the file again and print the lines to the screen
word_hash = {}
File.open("test1.txt", "r:UTF-8").each_line do |line|
  line.split(' ').each do |word|
    puts "#{word}"
    #word_hash[word] = gets.chomp.split(' ')
  end
end

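Many of Ruby's IO methods, and the methods File inherits from IO, can take a block and will automatically close the stream when the block exits. Use that feature, because keeping files open for the entire run of an application is bad practice.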
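A small sketch of the block form, assuming a throwaway file in the system temp directory (the file name is hypothetical, chosen only so the example is self-contained):

```ruby
require 'tmpdir'

# Hypothetical file name in the temp directory, used only for this sketch
path = File.join(Dir.tmpdir, "block_form_example.txt")

# Block form of File.open: the file is closed when the block exits
File.open(path, "w:UTF-8") do |f|
  f.write("first line\nsecond line\n")
end

lines = []
handle = nil
File.open(path, "r:UTF-8") do |f|
  handle = f                                  # keep a reference to inspect afterwards
  f.each_line { |line| lines << line.chomp }
end

puts lines.inspect
puts handle.closed?   # the stream was closed for us when the block ended
```

By contrast, File.open(path).each_line without a block on File.open leaves the handle open until it is garbage collected or the program exits.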

array_of_words = {} does not define an array; it defines a hash.
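The difference between the two literals can be checked directly; this is only an illustration, reusing the variable names from the question:

```ruby
array_of_words = {}   # {} is an empty Hash literal
word_array = []       # [] is an empty Array literal, equivalent to Array.new

puts array_of_words.class   # Hash
puts word_array.class       # Array

# << appends to an Array; Hash has no << method
word_array << "hello"
puts word_array.inspect
```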

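#array_of_words[word] = gets.chomp.split(' ') won't work because of where gets wants to read from. By default that is STDIN, which is the console, which is the keyboard. You already have word at that point, so do something with it.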

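But think about it: you're basically building the foundation for a Bayesian filter. You need to count occurrences of words, so merely assigning the words to a hash won't get you the information you want; you need to know how many times a particular word was seen. Stack Overflow has answered many questions about how to count the words found in a string, so search for those.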
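One common way to count occurrences is a hash with a default value of 0. This is a sketch of that general technique, not the solution to the assignment; the sample text is made up, and lowercasing the words is an assumption about how you want to match them.

```ruby
# Made-up sample text standing in for an email body
text = "Free money now claim your free money"

# Hash.new(0) returns 0 for any key that has not been seen yet,
# so += 1 works on the first occurrence of a word.
word_counts = Hash.new(0)
text.downcase.split.each do |word|
  word_counts[word] += 1
end

puts word_counts.inspect
```

From counts like these, a spam score could be derived by checking how often known keywords appear, which is the part the assignment leaves to you.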
