
As a student of computational linguistics, I frequently do machine learning experiments where I have to prepare training data from all kinds of different resources like raw or annotated text corpora or syntactic treebanks. For every new task and every new experiment I write programs (normally in Python and sometimes Java) to extract the features and values I need and transform the data from one format to another. This usually results in a very large number of very large files and a very large number of small programs which process them in order to get the input for some machine learning framework (like the ARFF files for Weka).
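To make this concrete, here is a minimal sketch of the kind of throwaway conversion script I mean, turning a labelled plain-text corpus into an ARFF file for Weka (the input layout, the two toy features and the label set are just assumptions for the example):

import re

# Hypothetical input: one instance per line, "label<TAB>sentence".
def write_arff(in_path, out_path):
    with open(in_path) as src, open(out_path, "w") as arff:
        arff.write("@RELATION sentences\n\n")
        arff.write("@ATTRIBUTE token_count NUMERIC\n")
        arff.write("@ATTRIBUTE avg_token_length NUMERIC\n")
        arff.write("@ATTRIBUTE class {pos,neg}\n\n")
        arff.write("@DATA\n")
        for line in src:
            label, sentence = line.rstrip("\n").split("\t", 1)
            tokens = re.findall(r"\w+", sentence)
            if tokens:
                avg_len = sum(len(t) for t in tokens) / float(len(tokens))
                arff.write("%d,%.2f,%s\n" % (len(tokens), avg_len, label))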

One needs to be extremely well organised to deal with that and to program with great care not to miss any important peculiarities, exceptions or errors in the tons of data. Many principles of good software design like design patterns or refactoring paradigms are of no big use for these tasks because things like security, maintainability or sustainability are of no real importance - once the program has successfully processed the data, one doesn't need it any longer. This has gone so far that I stopped bothering about using classes or functions at all in my Python code and just program in a simple procedural way. The next experiment will require different data sets with unique characteristics and in a different format, so their preparation will likely have to be programmed from scratch anyway. My experience so far is that it's not unusual to spend 80-90% of a project's time on preparing training data. Hours and days go by just thinking about how to get from one data format to another. At times, this can become quite frustrating.

Well, you probably guessed that I'm exaggerating a bit, on purpose even, but I'm positive you understand what I'm trying to say. My question, actually, is this:

Are there any general frameworks, architectures, best practices for approaching these tasks? How much of the code I write can I expect to be reusable given optimal design?


2 Answers


I find myself mostly using the textutils from GNU coreutils and flex for corpus preparation, chaining things together in simple scripts, at least when the preparation I need to do is simple enough to be handled with regular expressions and trivial filtering and such.

It is still possible to make things reusable, and the general rules apply here too. If you don't think about best practices and the like while programming and just program procedurally, IMHO it's really no wonder that you have to do everything from scratch when starting a new project.

Even though the format requirements will vary a lot, there are still many common tasks, i.e. tag stripping, tag translation, selection, tabulation, and some trivial data collection such as the number of tokens, sentences and the like. Programming these tasks with high reusability in mind will pay off, even though it takes longer at first.
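To sketch what I mean (the SGML-style tags and the whitespace tokenization here are just assumptions for the example), a couple of such reusable helpers might look like this:

import re

TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(text):
    # Tag stripping: drop SGML/XML-style markup from a line.
    return TAG_RE.sub("", text)

def translate_tags(text, mapping):
    # Tag translation: rewrite tags according to an {old: new} mapping,
    # leaving unknown tags untouched.
    return TAG_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)

def corpus_counts(lines):
    # Trivial data collection: number of tokens and sentences.
    n_tokens = n_sentences = 0
    for line in lines:
        tokens = strip_tags(line).split()
        if tokens:
            n_sentences += 1
            n_tokens += len(tokens)
    return n_tokens, n_sentences

Written like this, the same few functions can be chained into whatever one-off script the next experiment needs.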

Answered 2010-01-14T18:33:14.173

I'm not aware of any such frameworks, which doesn't mean they don't exist. I prefer to use my own, which is just a collection of code snippets I've refined/tweaked/borrowed over time and that I can chain together in various configurations depending on the problem. If you already know Python, then I strongly recommend doing all of your data preparation in NumPy. As you know, ML data sets tend to be large: thousands of row vectors packed with floats, and NumPy is brilliant for that sort of thing. In addition, I would suggest that there are a few tasks which come up in nearly every effort to prepare training data for ML and which don't vary much from one problem to the next. I've given you snippets for them below.

Normalization (scaling and mean-centering your data to avoid over-weighting). As I'm sure you know, you can scale to -1 to 1 or 0 to 1; I usually choose the latter so that I can take advantage of sparsity patterns. In Python, using the NumPy library:

import numpy as NP

# Toy data: a 4x3 matrix of the values 1..12.
data = NP.linspace(1, 12, 12).reshape(4, 3)
# Min-max scale each column to [0, 1] (note the range in the denominator).
data_norm = NP.apply_along_axis(
    lambda x: (x - x.min()) / float(x.max() - x.min()), 0, data)
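A quick sanity check on the snippet above, just confirming that every column now spans [0, 1]:

print(data_norm.min(axis=0))   # each column's minimum is now 0.0
print(data_norm.max(axis=0))   # each column's maximum is now 1.0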

Cross-validation (here I set the default argument to '5', so five rows are held out as the test set and the rest kept for training; putting this in a function makes k-fold much simpler):

def divide_data(data, testset_size=5):
    # Randomly hold out `testset_size` rows as a test set.
    # Sample distinct row indices (no repeats).
    ndx2 = NP.random.choice(data.shape[0], testset_size, replace=False)
    TE = data[ndx2]                      # test set
    TR = NP.delete(data, ndx2, axis=0)   # training set: the remaining rows
    return TR, TE
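Following the remark about k-fold above, a minimal sketch of how the same idea extends to k folds (kfold_indices is just a name made up for this example):

def kfold_indices(n_rows, k=5):
    # Shuffle the row indices once, then hand out each fold as the
    # test set in turn; everything else is the training set.
    ndx = NP.arange(n_rows)
    NP.random.shuffle(ndx)
    folds = NP.array_split(ndx, k)
    for i in range(k):
        test_ndx = folds[i]
        train_ndx = NP.concatenate(folds[:i] + folds[i + 1:])
        yield train_ndx, test_ndx

Inside the loop you would then index as before, e.g. TR, TE = data[train_ndx], data[test_ndx].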

Finally, here's an excellent case study (IMHO), both clear and complete, which literally shows the whole process from raw data collection through to input into an ML algorithm (an MLP in this case). They also provide their code.

Answered 2010-01-15T01:00:52.107