20

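I'm going to develop a program in Java that provides diagnoses. The data set is divided into two parts, one for training and one for testing. My program should learn to classify from the training data and then make predictions on the test data :/ (BTW, the training data contains answers to 30 questions, each question in its own column and each record on its own row. The last column is the diagnosis, 0 or 1. In the test part of the data the diagnosis column will be empty. The data set contains about 1000 records.)

I've never done anything like this before, so I'd appreciate any advice or information on solving similar problems.

I was thinking about the Java Machine Learning Library or the Java Data Mining Package, but I'm not sure whether that is the right direction...? I'm still not sure how to tackle this challenge...

Please advise.

All the best!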

4

5 Answers

13

I strongly recommend you use Weka for your task.
It's a collection of machine learning algorithms with a user-friendly front end that supports many different feature and model selection strategies.
You can do a lot of really complicated stuff with it without having to do any coding or math.
The makers have also published a pretty good textbook that explains the practical aspects of data mining.
Once you get the hang of it, you can use its API to integrate any of its classifiers into your own Java programs.

answered 2009-12-03T01:16:43.243
7

Hi. As Gann Bierner said, this is a classification problem. The best classification algorithm I know of for your needs is Ross Quinlan's decision tree algorithm. It's conceptually very easy to understand.
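To give a feel for why the algorithm is easy to understand, here is a minimal sketch of the split criterion, assuming the answer refers to Quinlan's ID3 family of decision tree learners. ID3 repeatedly picks the question whose answers reduce the entropy of the diagnosis the most (the information gain). The class and method names are hypothetical, and the sketch assumes every answer and every diagnosis is coded as 0 or 1.

```java
final class InfoGain {
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Shannon entropy (in bits) of a 0/1 label distribution.
    static double entropy(int positives, int total) {
        if (total == 0 || positives == 0 || positives == total) {
            return 0.0; // an empty or pure group carries no uncertainty
        }
        double p = (double) positives / total;
        return -(p * log2(p) + (1 - p) * log2(1 - p));
    }

    // Information gain from splitting the records on one yes/no question.
    // answers[i] is record i's answer (0 or 1); labels[i] is its diagnosis (0 or 1).
    static double gain(int[] answers, int[] labels) {
        int n = labels.length;
        if (n == 0) {
            return 0.0;
        }
        int pos = 0, yes = 0, yesPos = 0;
        for (int i = 0; i < n; i++) {
            pos += labels[i];
            if (answers[i] == 1) {
                yes++;
                yesPos += labels[i];
            }
        }
        int no = n - yes;
        int noPos = pos - yesPos;
        // Weighted average entropy of the two groups after the split.
        double remainder = (yes * entropy(yesPos, yes) + no * entropy(noPos, no)) / n;
        return entropy(pos, n) - remainder;
    }

    // Index of the question (column) with the highest information gain.
    static int bestQuestion(int[][] records, int[] labels) {
        int best = -1;
        double bestGain = -1.0;
        for (int q = 0; q < records[0].length; q++) {
            int[] column = new int[records.length];
            for (int i = 0; i < records.length; i++) {
                column[i] = records[i][q];
            }
            double g = gain(column, labels);
            if (g > bestGain) {
                bestGain = g;
                best = q;
            }
        }
        return best;
    }
}
```

A full tree learner would call `bestQuestion` on the training records, split them by that answer, and recurse on each half until a group is pure or no questions remain.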

For off-the-shelf implementations of classification algorithms, the best bet is Weka: http://www.cs.waikato.ac.nz/ml/weka/. I have studied Weka but not used it, as I discovered it a little too late.

I used a much simpler implementation called jaDTi. It works pretty well for smaller data sets such as yours. I have used it quite a bit, so I can say that with confidence. jaDTi can be found at:

http://www.run.montefiore.ulg.ac.be/~francois/software/jaDTi/

Having said all that, your challenge will be building a usable interface over the web. For that, the data set on its own will be of limited use. The data set approach works on the premise that you already have the training set, you feed in the new test data in one step, and you get the answer(s) immediately.

But my application, and probably yours too, was a step-by-step discovery process for the user, with features to go back and forth between the decision tree nodes.

To build such an application, I created a PMML document from my training set and built a Java engine that traverses the tree node by node, asking the user for an input (text/radio/list) at each node and using the value to evaluate the predicates of the possible next nodes.
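The traversal idea can be sketched roughly as follows. This is not the engine described above and it does not parse PMML: the `Node` and `Answerer` types are hypothetical, the tree is built by hand, and every question is assumed to be yes/no. The point is only that the engine asks one question per node and follows the branch that matches the answer.

```java
final class TreeWalker {
    // A node is either a leaf carrying a diagnosis, or a yes/no question with two children.
    static final class Node {
        final String question; // null for leaves
        final Node ifYes;      // null for leaves
        final Node ifNo;       // null for leaves
        final int diagnosis;   // meaningful only for leaves

        private Node(String question, Node ifYes, Node ifNo, int diagnosis) {
            this.question = question;
            this.ifYes = ifYes;
            this.ifNo = ifNo;
            this.diagnosis = diagnosis;
        }

        static Node leaf(int diagnosis) {
            return new Node(null, null, null, diagnosis);
        }

        static Node ask(String question, Node ifYes, Node ifNo) {
            return new Node(question, ifYes, ifNo, -1);
        }

        boolean isLeaf() {
            return question == null;
        }
    }

    // Supplies the user's answer to a question; a web form or console prompt would implement this.
    interface Answerer {
        boolean answer(String question);
    }

    // Walks from the root to a leaf, asking one question per visited node.
    static int diagnose(Node root, Answerer user) {
        Node current = root;
        while (!current.isLeaf()) {
            current = user.answer(current.question) ? current.ifYes : current.ifNo;
        }
        return current.diagnosis;
    }
}
```

Going back and forth, as mentioned above, could be added by keeping the visited nodes on a stack and popping one when the user steps back.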

The PMML standard can be found here: http://www.dmg.org/. You only need the TreeModel element. The NetBeans XML plugin is a good schema-aware editor for PMML authoring. Altova XML can do a better job, but costs $$.

It is also possible to use an RDBMS to store your dataset and create the PMML automagically! I have not tried that.

Good luck with your project. Please feel free to let me know if you need further input.

answered 2009-12-03T01:30:06.670
6

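There are various algorithms that fall into the category of "machine learning", and which one is right for your situation depends on the type of data you're dealing with.

If your data essentially consists of a mapping from a set of questions to a set of diagnoses, each of which can be yes/no, then I think methods that could work include neural networks and methods for automatically building a decision tree from the training data.

I'd have a look at some of the standard texts, such as Russell & Norvig ("Artificial Intelligence: A Modern Approach") and other introductions to AI/machine learning, and see whether you can easily adapt the algorithms they mention to your particular data. See also O'Reilly's "Programming Collective Intelligence" for some sample Python code for one or two algorithms that might apply to your case.

If you can read Spanish, the Mexican publishing house Alfaomega has also published various good AI-related introductions in recent years.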

answered 2009-12-03T01:10:40.117
6

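This is a classification problem, not really data mining. The general approach is to extract features from each data instance and let the classification algorithm learn a model from the features and the outcome (which for you is 0 or 1). Presumably each of your 30 questions would be its own feature.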
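As a rough illustration of the feature extraction step, the sketch below turns one row of the data set into a feature vector plus a label. The file layout is an assumption: the question only says that the answers are in columns and the diagnosis is last, so the comma delimiter, the numeric coding of the answers, and the `PatientRecord` name are all hypothetical.

```java
final class PatientRecord {
    final double[] features; // one entry per question
    final Integer label;     // 0 or 1, or null when the diagnosis column is empty (test data)

    PatientRecord(double[] features, Integer label) {
        this.features = features;
        this.label = label;
    }

    // Parses one comma-separated row: the answers followed by the diagnosis column.
    static PatientRecord parse(String line) {
        // The -1 limit keeps a trailing empty column, which is how unlabeled test rows look.
        String[] cols = line.split(",", -1);
        int n = cols.length - 1;
        double[] features = new double[n];
        for (int i = 0; i < n; i++) {
            features[i] = Double.parseDouble(cols[i].trim());
        }
        String last = cols[n].trim();
        Integer label = last.isEmpty() ? null : Integer.valueOf(last);
        return new PatientRecord(features, label);
    }
}
```

Rows with a non-null `label` go to the learner as training examples; rows with a null `label` are the ones to predict.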

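There are many classification techniques you could use. Support vector machines are popular, as is maximum entropy. I haven't used the Java Machine Learning Library, but at a glance I don't see either of these in it. The OpenNLP project has a maximum entropy implementation. LibSVM has a support vector machine implementation. You'll almost certainly have to convert your data into something the library can understand.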
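As one example of such a conversion, LibSVM reads a sparse text format in which each line is a label followed by `index:value` pairs, with 1-based indices in ascending order and zero values optionally left out. The helper below is a minimal sketch of writing one record in that format; the class and method names are hypothetical, and it assumes the answers have already been turned into numbers.

```java
final class LibSvmFormat {
    // Formats one record as "<label> <index>:<value> ...", skipping zero-valued features.
    static String toLine(int label, double[] features) {
        StringBuilder sb = new StringBuilder();
        sb.append(label);
        for (int i = 0; i < features.length; i++) {
            if (features[i] != 0.0) {
                // LibSVM feature indices start at 1, not 0.
                sb.append(' ').append(i + 1).append(':').append(features[i]);
            }
        }
        return sb.toString();
    }
}
```

For instance, `LibSvmFormat.toLine(1, new double[]{1, 0, 1})` produces `1 1:1.0 3:1.0`.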

Good luck!

Update: I agree with the other commenter that Russell and Norvig is a great AI book that discusses some of this. Bishop's "Pattern Recognition and Machine Learning" discusses classification in depth if you're interested in the down-and-dirty details.

answered 2009-12-03T01:11:10.867
3

Your task is a classic fit for neural networks, which are designed first and foremost for exactly this kind of classification task. A neural network is fairly simple to implement in any language, and it is the "mainstream" of "machine learning", closer to AI than anything else. You just implement a standard neural network (or take an existing implementation), for example a multilayer network trained by error backpropagation, and feed it the training examples in a loop. After training for some time, it will work on real examples. You can read more about neural networks starting here: http://en.wikipedia.org/wiki/Neural_network http://en.wikipedia.org/wiki/Artificial_neural_network You can also find links to many ready-made implementations here: http://en.wikipedia.org/wiki/Neural_network_software
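To make "a multilayer network trained by error backpropagation" concrete, here is a minimal sketch with one hidden layer and a single sigmoid output, which suits a 0/1 diagnosis. Everything about it is an illustrative assumption rather than a recommended setup: the `TinyMlp` name, the squared-error loss, the weight initialisation, and the fixed learning rate passed in by the caller.

```java
import java.util.Random;

final class TinyMlp {
    final int nIn;
    final int nHid;
    final double[][] w1; // [nHid][nIn + 1]; the last column holds each hidden unit's bias
    final double[] w2;   // [nHid + 1]; the last entry holds the output bias

    TinyMlp(int nIn, int nHid, long seed) {
        this.nIn = nIn;
        this.nHid = nHid;
        Random rnd = new Random(seed);
        w1 = new double[nHid][nIn + 1];
        w2 = new double[nHid + 1];
        // Small random starting weights in [-0.5, 0.5).
        for (double[] row : w1) {
            for (int i = 0; i < row.length; i++) row[i] = rnd.nextDouble() - 0.5;
        }
        for (int j = 0; j < w2.length; j++) w2[j] = rnd.nextDouble() - 0.5;
    }

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    double[] hidden(double[] x) {
        double[] h = new double[nHid];
        for (int j = 0; j < nHid; j++) {
            double z = w1[j][nIn];
            for (int i = 0; i < nIn; i++) z += w1[j][i] * x[i];
            h[j] = sigmoid(z);
        }
        return h;
    }

    double output(double[] h) {
        double z = w2[nHid];
        for (int j = 0; j < nHid; j++) z += w2[j] * h[j];
        return sigmoid(z);
    }

    // Network output in (0, 1); values above 0.5 can be read as diagnosis 1.
    double predict(double[] x) {
        return output(hidden(x));
    }

    // One backpropagation step on a single example, using squared-error loss.
    void trainOne(double[] x, double target, double rate) {
        double[] h = hidden(x);
        double y = output(h);
        double deltaOut = (y - target) * y * (1 - y);
        for (int j = 0; j < nHid; j++) {
            // The hidden delta is computed from w2[j] before w2[j] is updated below.
            double deltaHid = deltaOut * w2[j] * h[j] * (1 - h[j]);
            for (int i = 0; i < nIn; i++) w1[j][i] -= rate * deltaHid * x[i];
            w1[j][nIn] -= rate * deltaHid;
            w2[j] -= rate * deltaOut * h[j];
        }
        w2[nHid] -= rate * deltaOut;
    }

    // Presents every training example once per epoch ("learning examples in a loop").
    void train(double[][] xs, double[] targets, int epochs, double rate) {
        for (int e = 0; e < epochs; e++) {
            for (int k = 0; k < xs.length; k++) trainOne(xs[k], targets[k], rate);
        }
    }

    double meanSquaredError(double[][] xs, double[] targets) {
        double sum = 0.0;
        for (int k = 0; k < xs.length; k++) {
            double d = predict(xs[k]) - targets[k];
            sum += d * d;
        }
        return sum / xs.length;
    }
}
```

For the data set in the question, this would mean something like `new TinyMlp(30, 10, 42L)` trained on the labeled rows, where the hidden layer size, seed, epoch count, and learning rate are all values to experiment with.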

answered 2009-12-16T06:57:16.780