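I have a method that counts the frequency of words in a string. I manually included a list of words that should be removed. For short strings, "the" gets removed, but for longer strings, such as the one below, the method still prints "the". Any ideas on why this happens and how to fix it?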

def count_words(string)
    words = string.downcase.split(' ')

    delete_list = ['the']
    delete_list.each do |del|
        words.delete_at(words.index(del))
    end

    frequency = Hash.new(0)
    words.each do |word|
        frequency[word.downcase] += 1
    end

    return frequency.sort_by {|k,v| v}.reverse
end

puts count_words('Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking
politics have gotten out of control since gates left the building..
sales metrics often do not reflect the contributions of the role, which demonstrates that line management is out of touch of what the individual contributors role really does
middle management does not care about the career of his/her directs, 90% of the time management competes directly with their people, or takes credit for their work
lots of back stabbing going on
Microsoft changes the organization or commitment or comp model, faster than the average deal cycle, making it next to near impossible to develop momentum in role or a rhythm of success
execs promote themselves in years when they freeze employees merit increases
only way to advance is to step on your peers/colleagues and take credit for work you had no impact on, beat your chest loud enough and you get "visibility" you need to advance
visibility is not based on performance by enlarge, it is based on being in your manager\'s swim lane for advancement
I have observed people get promoted in years when they did not meet their quota, nor did the earn the highest performance on the team, they kissed their way to the promotion
Advice to Senior Management 1, get back to risk taking and teaming, less politics please, you are killing the company
2, set realistic commitments and stick to them for multiple years, stop changing the game faster than your people can react
3, stop over engineering commitments and over segmenting the company, people are not willing to collaborate or be corporate citizens
4, too many empty suits in middle management, keep flattening out the company and getting rid of middle managers that run reports all day, get back to a culture where managers also sell and drives wins
5, keep your word microsoft, you said stability, but you keep tinkering with the org too much for any changes to take affect A great Culture
Limitless opportunities
Supportive Management team who are passionate about people
A company that really does want you to have a good work life balance and backs it up with policies that enable you to manage how and where you work.
Cons Support resources are constrained
Can be overly competitve and hard to get noticed
Sales rewards are definitely prioritised and marketing cuts are always prioritised.
Consumer organisation is still far from ideal.
Advice to Senior Management Focus on getting the internal organisation simplified to improve performance and increase empowerment.
Get some REAL consumer focus and invest for the long term
Start connecting with people, focussing on telling stories rather than selling products.')

2 Answers

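Just use words.delete("the"); all you need to do is give it the word (or, for a hash, the key). Your version removes only the first "the": words.index(del) returns the position of the first match and delete_at removes just that one element, so a longer string with several occurrences still prints it.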
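
To make the difference concrete, here is a minimal sketch on a made-up word list (the sample words and variable names are illustrative assumptions, not taken from the question). It compares delete_at(index(...)) with Array#delete, and also shows what happens when the word is absent.

```ruby
words = %w[the cat saw the dog chase the ball]

# delete_at(index(...)) removes only the first matching element.
first_only = words.dup
first_only.delete_at(first_only.index('the'))
# first_only still holds two occurrences of "the".

# Array#delete removes every element equal to its argument.
all_gone = words.dup
all_gone.delete('the')

# When the word is absent, index returns nil and delete_at(nil) raises
# TypeError, whereas delete just returns nil.
missing_word_error = nil
begin
  no_match = %w[cat dog]
  no_match.delete_at(no_match.index('the'))
rescue TypeError => e
  missing_word_error = e.class
end
```

So the original loop has two failure modes: leftover duplicates in long strings, and an exception when a listed word never appears.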

A simpler version of your program would be:

def count_words(string)
  words = string.downcase.split(' ').each_with_object(Hash.new(0)) { |w,o| o[w] += 1 }

  delete_list = ['the']

  delete_list.each { |del| words.delete(del) }

  words.sort_by { |k, v| v }.reverse
end
Answered 2013-04-10T01:57:54.760

This is a very common problem when analyzing web pages for SEO. Here is a quick version I wrote:

require 'pp'

STOP_WORDS = %w[a and of the]

def count_words(string)

  word_count = string
    .downcase
    .gsub(/[^a-z ]+/, '')
    .split
    .group_by{ |w| w }

  STOP_WORDS.each do |stop_word|
    word_count.delete(stop_word)
  end

  word_count
    .map{ |k,v| [k, v.size]}
    .sort_by{ |k, c| [-c, k] }
end

pp count_words(<<EOT)
Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking
politics have gotten out of control since gates left the building..
Start connecting with people, focussing on telling stories rather than selling products.
EOT

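I deliberately truncated the sample data for readability.

On that subject, when you have to pass in a large amount of text you can use a heredoc ("<<") to improve the formatting of your code. An alternative is to insert an __END__ marker, put all the text after it, and then use the special IO object DATA to read that trailing block: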

pp count_words(DATA.read)

__END__
Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking
politics have gotten out of control since gates left the building..
Start connecting with people, focussing on telling stories rather than selling products.

In either case, the code outputs:

[["are", 1],
 ["benefits", 1],
 ["buildingstart", 1],
 ["compensation", 1],
 ["connecting", 1],
 ["cons", 1],
 ["control", 1],
 ["empty", 1],
 ["fair", 1],
 ["focussing", 1],
 ["gates", 1],
 ["gotten", 1],
 ["great", 1],
 ["have", 1],
 ["left", 1],
 ["little", 1],
 ["management", 1],
 ["middle", 1],
 ["off", 1],
 ["on", 1],
 ["out", 1],
 ["people", 1],
 ["products", 1],
 ["pros", 1],
 ["rather", 1],
 ["reasonable", 1],
 ["risk", 1],
 ["selling", 1],
 ["since", 1],
 ["stories", 1],
 ["suits", 1],
 ["takingpolitics", 1],
 ["telling", 1],
 ["than", 1],
 ["time", 1],
 ["very", 1],
 ["vision", 1],
 ["void", 1],
 ["with", 1]]

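gsub(/[^a-z ]+/, '') strips anything that isn't a lowercase letter or a space. A newline is neither, so it is stripped too, and the words on either side of a line break are fused together ("taking\npolitics" becomes "takingpolitics"); using /[^a-z\s]+/ instead keeps line breaks as separators. Enumerable's group_by does the heavy lifting, and Enumerable's sort_by makes it easy to sort by descending count and then alphabetically by word.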
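
As a rough sketch of that variant, the following keeps the same group_by / sort_by pipeline but swaps the character class to /[^a-z\s]+/ so that newlines survive until split. The method name count_words_keeping_line_breaks, the stop_words parameter, and the sample string are hypothetical and were chosen only for illustration.

```ruby
# Hypothetical helper, not from the answer: same pipeline, but \s keeps
# newlines and tabs so that split can still separate the words.
def count_words_keeping_line_breaks(string, stop_words = %w[a and of the])
  counts = string
    .downcase
    .gsub(/[^a-z\s]+/, '') # strip punctuation and digits, keep all whitespace
    .split                 # split on any run of whitespace
    .group_by { |w| w }    # builds { "word" => ["word", "word", ...] }

  # One hash deletion per stop word.
  stop_words.each { |stop_word| counts.delete(stop_word) }

  counts
    .map { |word, occurrences| [word, occurrences.size] }
    .sort_by { |word, count| [-count, word] } # highest count first, ties alphabetical
end

sample = "risk taking\npolitics have gotten out of control, out of touch"
result = count_words_keeping_line_breaks(sample)
# result starts with ["out", 2]; "taking" and "politics" stay separate words.
```

Negating the count in the sort key avoids the separate reverse call and keeps ties in alphabetical order.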

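I remove the stop words from the hash rather than from the array, because iterating over the STOP_WORDS list is usually faster than iterating over every word in the corpus. A large corpus is likely to contain far more words than there are stop words.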
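
The two orderings can be put side by side in a small sketch. The word list and variable names below are assumptions made for illustration, and transform_values requires Ruby 2.4 or later. Both approaches yield the same counts; they differ in how many items they touch.

```ruby
require 'set'

stop_words = Set.new(%w[a and of the])
words = %w[the end of the road and the start of a path]

# Approach described above: count everything first, then do one hash
# deletion per stop word.
counts_then_delete = words.group_by { |w| w }.transform_values(&:size)
stop_words.each { |stop_word| counts_then_delete.delete(stop_word) }

# Alternative: test every word in the corpus against the stop list
# before counting.
filter_then_count = words
  .reject { |w| stop_words.include?(w) }
  .group_by { |w| w }
  .transform_values(&:size)
```

The first approach performs as many deletions as there are stop words, while the second performs one membership check per word in the text, which is where the difference shows up on a large corpus.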

Answered 2013-04-10T03:26:11.567