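I'm trying to complete the first task we were assigned:
We are given 5 regular emails and 5 advance-fee fraud emails (a.k.a. spam). Convert them all to text files, then convert each one into an array of words (split might help here). Then use a bunch of regular expressions to search the word arrays for keywords and classify which files are spam. If you want to get fancy, you can give each array a spam score out of 10.
- Open the HTML page and read the file.
- Strip scripts, links, etc. from the file.
- Keep the body/paragraphs on their own.
- Open a text file (file2) and write to it (UTF-8).
- Pass the content from the HTML document (file 1) into it.
- Now put the words from the text file (file2) into an array, using split.
- Go through the array looking for any words considered spam, and print a message to the screen saying whether or not the email is spam.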
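For reference, the last step in the list (scanning the word array with regular expressions and giving it a score out of 10) might look roughly like the sketch below. The keyword patterns, the `spam_score` and `spam?` helper names, and the threshold of 3 are all assumptions made up for illustration; none of them come from the assignment. With ten patterns, counting how many of them match at least one word gives a score from 0 to 10.

```ruby
# Hypothetical keyword patterns; a real list would be tuned to the sample emails.
SPAM_PATTERNS = [
  /\bwinner\b/i,
  /\blottery\b/i,
  /\binheritance\b/i,
  /\bbeneficiary\b/i,
  /\btransfer\b/i,
  /\burgent(ly)?\b/i,
  /\bconfidential\b/i,
  /\bbank\b/i,
  /\bmillion\b/i,
  /\bfunds?\b/i
].freeze

# Score = number of patterns that match at least one word (0..10).
def spam_score(words)
  SPAM_PATTERNS.count { |pattern| words.any? { |word| word.match?(pattern) } }
end

# The threshold is an arbitrary assumption, not part of the task description.
def spam?(words, threshold = 3)
  spam_score(words) >= threshold
end

words = "URGENT: you are the beneficiary of a 5 million dollar bank transfer".split(' ')
puts "Spam score: #{spam_score(words)}/10"
puts spam?(words) ? "This email looks like spam." : "This email looks regular."
```

Counting matched patterns rather than matched words keeps the score capped at 10 no matter how long the email is.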
Here is my code:
require 'nokogiri'

file = File.open("EMAILS/REG/Membership.htm", "r")
doc = Nokogiri::HTML(file)

# Whatever is passed from elements to newFile ends up in the new array,
# but the euro sign doesn't appear correctly.
elements = doc.xpath("/html/body//p").text
#puts elements

newFile = File.open("test1.txt", "w")
newFile.write(elements)
newFile.close()

# I want to open the file again and print the lines to the screen.
array_of_words = {}
puts "\n\tRetrieving test1.txt...\n\n"
File.open("test1.txt", "r:UTF-8").each_line do |line|
  words = line.split(' ')
  words.each do |word|
    puts "#{word}"
    #array_of_words[word] = gets.chomp.split(' ')
  end
end
Edit: I've edited the file here; however, I still can't get the euro sign to come through as UTF-8 in the array (see image).
require 'nokogiri'

doc = Nokogiri::HTML(File.open("EMAILS/REG/Membership.htm", "r:UTF-8"))

# Whatever is passed from elements to the new file ends up in the new
# array, but the euro sign doesn't appear correctly.
elements = doc.xpath("//p").text
#puts elements
File.write("test1.txt", elements)

puts "\n\tRetrieving test1.txt...\n\n"

# I want to open the file again and print the lines to the screen.
word_array = Array.new
File.read("test1.txt").each_line do |line|
  line.split(' ').each do |word|
    puts "#{word}"
    word_array << word
  end
end
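One thing that may be worth checking, and this is only a guess since the encoding of Membership.htm isn't stated here, is whether the HTML file was saved as Windows-1252 rather than UTF-8. In that encoding the euro sign is the single byte 0x80, which is not valid UTF-8 on its own, so opening the file with "r:UTF-8" only relabels the bytes without converting them. The stdlib-only sketch below illustrates that situation with a made-up sample string and a hypothetical temp file name; it is not taken from the real email.

```ruby
require 'tmpdir'

# Bytes as they might sit on disk if the file were Windows-1252 (an assumption).
# 0x80 is the euro sign in Windows-1252.
raw = "Price: \x80 50".b

# Tag the bytes with their actual encoding, then transcode to UTF-8.
utf8 = raw.force_encoding("Windows-1252").encode("UTF-8")

# Hypothetical output path, used only for this demonstration.
path = File.join(Dir.tmpdir, "euro_demo.txt")
File.write(path, utf8, encoding: "UTF-8")

# Read the text back as UTF-8 and split it into words.
word_array = File.read(path, encoding: "UTF-8").split(' ')

puts word_array.inspect
puts "All words valid UTF-8? #{word_array.all?(&:valid_encoding?)}"
```

If that turns out to be the cause, Nokogiri::HTML also accepts an encoding as its third argument, so the source encoding could be named when parsing instead. It is also possible that the array is already correct and only the console cannot display the character.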