12

我有大约 4000 篇博客文章。我想根据以下值对所有帖子进行排名

Upvote Count => P
Comments Recieved => C
Share Count => S
Created time in Epoch => E
Follower Count of Category which post belongs to => F (one post has one category)
User Weight => U (User with most number of post have biggest weight)

我期待伪代码的答案。

4

2 回答 2

25

Your problem falls into the category of regression (link). In machine learning terms, you have a collection of features (link) (which you list in your question) and you have a score value that you want to predict given those features.

What Ted Hopp has suggested is basically a linear predictor function (link). That might be too simple a model for your scenario.

Consider using logistic regression (link) for your problem. Here's how you would go about using it.

1. create your model-learning dataset

Randomly select some m blog posts from your set of 4000. It should be a small enough set that you can comfortably look through these m blog posts by hand.

For each of the m blog posts, score how "good" it is with a number from 0 to 1. If it helps, you can think of this as using 0, 1, 2, 3, 4 "stars" for the values 0, 0.25, 0.5, 0.75, 1.

You now have m blog posts that each have a set of features and a score.

You can optionally expand your feature set to include derived features - for example, you could include the logarithm of the "Upvote Count," the "Comments Recieved", the "Share Count," and the "Follower Count," and you could include the logarithm of the number of hours between "now" and "Created Time."

2. learn your model

Use gradient descent to find a logistic regression model that fits your model-learning dataset. You should partition your dataset into training, validation, and test sets so that you can carry out those respective steps in the model-learning process.

I won't elaborate any more on this section because the internet is full of the details and it's a canned process.

Wikipedia links:

3. apply your model

Having learned your logistic regression model, you can now apply it to predict the score for how "good" a new blog post is! Simply compute the set of features (and derived features), then use your model to map those features to a score.

Again, the internet is full of the details for this section, which is a canned process.


If you have any questions, make sure to ask!

If you're interested in learning more about machine learning, you should consider taking the free online Stanford Machine Learning course on Coursera.org. (I'm not affiliated with Stanford or Coursera.)

于 2013-06-11T07:09:11.860 回答
13

我建议对每篇博客文章的个人分数进行加权平均。分配一个既反映每个值的相对重要性又反映值尺度差异的权重(例如,E与其他值相比将是一个非常大的数字)。然后计算:

rank = wP * P + wC * C + wS * S + wE * E + wF * F + wU * U;

您没有提供有关每个值的相对重要性的任何信息,甚至没有提供这些值在排名方面的含义。所以不可能更具体地说明这一点。(较早的创建时间是否会使帖子的排名上升或下降?如果下降,那么wE应该是负数。)

于 2013-06-11T04:59:34.107 回答