Questions tagged [data-mining]


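0 votes
1 answer
461 views

ssas - What are the best resources for learning how to implement a naive Bayes classifier in SSAS?

After asking this question, I decided to try implementing some naive Bayes classifiers with SQL Server Analysis Services.

Can anyone point me to a good book, website, or any other resource on implementing naive Bayes classifiers in SSAS? Likewise, I'd be interested in learning about decision trees.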

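0 votes
8 answers
4039 views

python - Smoothing irregularly sampled time data

Given a table where the first column is seconds past a certain reference point and the second one is an arbitrary measurement:

As you can see, the measurements are sampled at irregular time points. I need to smooth the data by averaging the readings from the 100 seconds before each measurement (in Python). Since the data table is huge, an iterator-based method is really preferred. Unfortunately, after two hours of coding I can't figure out an efficient and elegant solution.

Can anyone help me?

Edits

  1. I want one smoothed reading for each raw reading, and the smoothed reading is the arithmetic mean of the raw reading and any other readings in the previous 100 (delta) seconds. (John, you are right)

  2. Huge: ~1e6 to 10e6 rows, plus the need to work with tight RAM

  3. The data is approximately a random walk

  4. The data is sorted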
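For illustration only, the requirements listed above (sorted input, one output per raw reading, trailing 100-second window, small memory footprint) can be sketched as a generator that keeps a running sum over a deque. This is not either of the answers tested in the thread; the name `smooth`, the `window` parameter, and the choice to treat a reading exactly 100 seconds old as still inside the window are assumptions.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def smooth(readings: Iterable[Tuple[float, float]],
           window: float = 100.0) -> Iterator[Tuple[float, float]]:
    """Yield (time, trailing-window mean) for each reading of a time-sorted input.

    Hypothetical helper: assumes `readings` is sorted by time and that a
    reading exactly `window` seconds old still counts as inside the window.
    """
    buf = deque()   # readings currently inside the window
    total = 0.0     # running sum of the values held in buf
    for t, value in readings:
        buf.append((t, value))
        total += value
        # Drop readings taken more than `window` seconds before t.
        # buf is never empty here because it always holds the current reading.
        while buf[0][0] < t - window:
            total -= buf.popleft()[1]
        yield t, total / len(buf)
```

Each reading is appended once and removed at most once, so the work is linear in the number of rows, and memory is bounded by the number of readings that fall inside one window. On very long inputs the running sum can accumulate floating-point error, so recomputing it from the buffer every so often may be worth considering.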

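Resolution

I have tested the solutions proposed by J Machin and yairchu. Both gave the same results; however, on my data set J Machin's version grows exponentially, while yairchu's is linear. The following are the execution times as measured by IPython's %timeit (in microseconds):

Thank you all for the help.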

0 votes
1 answer
53 views

mysql - mysql search prepending "the" or "and/&" ambiguity

I'm trying to do a title search in mysql across two different databases to match up data from separate sources. A title will sometimes appear as "The first title" in one db and just "first title" in the other, or as "far and away" vs "far & away".

Mysql fulltext search doesn't seem very effective at figuring this out. I currently do just a straight match "WHERE title1=title2", but this of course misses any connection where there are slight differences in the title.

The only solution I have come up with is to run through a series of if statements checking if either of the titles contains "the" or "&".

This isn't a horrible way of doing it, but I assume there is a more efficient method to write my query to handle these issues.

Any ideas? So far my online searches have been fruitless. Thanks
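One possible direction, sketched here in Python rather than SQL: reduce every title to a comparison key once, then match on the key instead of the raw title. The helper name `normalize_title` and the exact rules (lowercase, "&" treated as "and", a leading "the" dropped, punctuation collapsed) are assumptions for illustration, not something taken from the question.

```python
import re


def normalize_title(title: str) -> str:
    """Reduce a title to a comparison key (hypothetical helper).

    Assumed rules: lowercase, "&" becomes "and", a leading "the" is dropped,
    and anything other than a-z or 0-9 collapses to a single space.
    Non-ASCII letters are therefore discarded, which may not suit every data set.
    """
    key = title.lower().strip()
    key = key.replace("&", " and ")
    key = re.sub(r"^the\s+", "", key)        # only a whole leading word "the"
    key = re.sub(r"[^a-z0-9]+", " ", key)    # punctuation and space runs -> one space
    return key.strip()


print(normalize_title("The first title"))   # first title
print(normalize_title("Far & Away"))        # far and away
```

If the keys were stored in an extra indexed column in each database, the match could stay a plain equality comparison on that column, without the chain of if statements.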

0 votes
4 answers
693 views

data-mining - Data mining and Business Intelligence Technologies

I've noticed an increasing number of jobs that are asking for experience with data mining and business intelligence technologies. This sounds like an incredibly broad topic but where would one go if they wanted to develop at least a partial understanding of this stuff if it were to come up in an interview?

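0 votes
7 answers
2088 views

data-mining - What is data mining from a developer's point of view?

I can find a technical explanation of what data mining is in a book or on Wikipedia, but I'd like to know what kind of development it actually involves. Is it more about using tools or more about writing them? On the R&D side, is it really any different from other fields?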

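0 votes
1 answer
809 views

c# - How to data mine various news sources?

I'm working on a free web application that will analyze top news stories throughout the day and provide statistics. Most news sites offer RSS feeds, which work fine for finding out which stories to retrieve. However, the problem arises when trying to get the full story from the news site itself. Currently, I have a separate NewsSource class for each source (CNN, The New York Times, etc.) that reads the appropriate RSS feed, follows each link, and strips out the body. This seems tedious and becomes very hard to manage when a news site decides to change the HTML structure of its articles.

Is there a service (preferably free) that already aggregates multiple news sources with full article content (not just the summary)? If not, do you have any suggestions for handling multiple sources with different HTML structures that could change without notice?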

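0 votes
4 answers
8402 views

java - What is Java Data Mining, JDM?

I'm looking at JDM. Is this simply an API for interacting with other tools that do the actual data mining? Or is it a set of packages that contain the actual data mining algorithms?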

0 votes
3 answers
170 views

ruby-on-rails - Calculating item counters for a set of selected categories

In our Ruby on Rails project we have a lot of categorization criteria for recipes, such as cook method, occasion etc. Every recipe belongs to one or several of these categories. When someone starts browsing for recipes, he/she can narrow down to a set of particular categories. Then we need to calculate the number of recipes in all categories accessible from this set ("accessible" means there are recipes in that category that also belong to selected categories). This is similar to how Amazon search works: someone enters 'Software' and there is a menu on the left which says "Books (200)", "Movies (300)" etc, so user can go deeper by clicking on these links.

Right now we've implemented it roughly like this:

  1. Build a set of selected categories from URL;
  2. Perform a query that fetches category ids from all recipes that fall into currently selected criteria;
  3. Build the index which maps all category ids to counts of recipes, and render only those that have non-zero counters;
  4. Store this index in memcached for 24 hours, so we only calculate it once per day for a particular page.

My concern is that if there is a cache miss, building the index can take a lot of time. Do you have any suggestions on how to solve this problem or improve the current solution?
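To make the counting step concrete, here is a minimal in-memory sketch in Python (not the project's Rails code). It assumes the data fits in a dict mapping recipe ids to sets of category ids; the name `facet_counts` and the decision to leave the already selected categories out of the result are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def facet_counts(recipes: Dict[int, Set[int]],
                 selected: Iterable[int]) -> Dict[int, int]:
    """Count recipes per category among recipes matching every selected category.

    Hypothetical helper: `recipes` maps recipe id -> set of category ids.
    Already selected categories are left out of the result (an assumption).
    """
    chosen = set(selected)
    counts = Counter()
    for categories in recipes.values():
        if chosen <= categories:                 # recipe is in all selected categories
            counts.update(categories - chosen)   # one hit for each other category
    return dict(counts)


recipes = {1: {10, 20}, 2: {10, 30}, 3: {20, 30}, 4: {10, 20, 30}}
print(facet_counts(recipes, [10]))   # {20: 2, 30: 2}
```

Only categories with at least one matching recipe appear in the result, which corresponds to rendering only non-zero counters. This version is a single pass over all recipes, so whether it is fast enough on a cache miss depends on how many recipes there are.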

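0 votes
1 answer
323 views

data-mining - Netflix-like competitions

Does anyone know of any competitions or tasks similar to the Netflix Prize? It's not just about the money, but also about the dimensionality of the data and a close connection to a challenging task.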

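0 votes
7 answers
1360 views

open-source - Open source data mining software

I was wondering: what is the best open source software I can use for non-binary association rule generation? I need a non-binary implementation because converting my current non-binary data to binary data does not produce the desired results.

Thanks, can't wait to see your comments here!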