I'm looking for information on how a Python machine learning project should be organized. For regular Python projects there is Cookiecutter, and for R there is ProjectTemplate.

This is my current folder structure, but I'm mixing Jupyter Notebooks with actual Python code and it does not seem very clear.

.
├── cache
├── data
├── my_module
├── logs
├── notebooks
├── scripts
├── snippets
└── tools

I work in the scripts folder and currently add all the functions to files under my_module, but that leads to errors when loading data (relative vs. absolute paths) and other problems.

I could not find proper best practices or good examples on this topic besides this Kaggle competition solution and some notebooks that have all their functions condensed at the start.

2 Answers

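We've started a cookiecutter-data-science project designed for Python data scientists that might be of interest to you; check it out here. The structure is explained here.

I'd love feedback if you have any! Feel free to respond here, open PRs, or file issues.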


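In response to your question about re-using code by importing .py files into notebooks, the most effective way our team has found is to append to the system path. This may make some people cringe, but it seems like the cleanest way of importing code into a notebook without lots of module boilerplate and a pip install -e.

One tip is to use the %autoreload and %aimport magics with the above. Here's an example: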

# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1

import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# import my method from the source code
%aimport preprocess.build_features

The above code comes from section 3.5 of this notebook, for some context.
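One caveat with the snippet above is that it depends on the notebook's working directory being exactly one level below the project root. A minimal sketch of a more tolerant variant follows, assuming the project keeps its code in a `src` directory as in the example; the helper names `find_project_root` and `add_src_to_path` are hypothetical and not part of cookiecutter-data-science or any library.

```python
import sys
from pathlib import Path


def find_project_root(start, marker="src"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no '{marker}' directory found at or above {start}")


def add_src_to_path(start=None, marker="src"):
    """Append <project root>/<marker> to sys.path once and return the project root."""
    root = find_project_root(start or Path.cwd(), marker)
    src_dir = str(root / marker)
    if src_dir not in sys.path:  # avoid duplicate entries when a cell is re-run
        sys.path.append(src_dir)
    return root


# Typical use in the first notebook cell (not executed here):
#   root = add_src_to_path()
#   data_dir = root / "data"   # build data paths from the root, not from the cwd
```

Because the function returns the project root, the same value can be used to build paths to the data folder, which may help with the relative/absolute path errors mentioned in the question.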

answered 2016-04-19T16:21:47.667

You might want to look at:

http://tshauck.github.io/Gloo/

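Gloo's goal is to tie together many of the data analysis operations that happen regularly and make those processes easy: automatically loading data into the IPython environment, running scripts, making utility functions available, and so on. These are things that have to be done often, but they aren't the fun part.

It's not actively maintained, but the basics are there.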

answered 2016-02-17T09:53:07.797