python-2.7 - Python in Knime: Downloading files and dynamically pressing them into workflow

Question

I'm using Knime 3.1.2 on OSX and Linux for OPENMS analysis (Mass Spectrometry).

Currently, it uses static filename.mzML files manually put in a directory. It usually has more than one file pressed in at a time ('Input FileS' module not 'Input File' module) using a ZipLoopStart.

I want these files to be downloaded dynamically and then pressed into the workflow...but I'm not sure the best way to do that.

Currently, I have a Python script that downloads .gz files (from AWS S3) and then unzips them. I already have variations that can unzip the files into memory using StringIO (and maybe pass them into the workflow from there as data??).

It can also download them to a directory...which maybe can then be used as the source? But I don't know how to tell the ZipLoop to wait and check the directory after the python script is run.

I also could have the python script run as a separate entity (outside of knime) and then, once the directory is populated, call knime...HOWEVER there will always be a different number of files (maybe 1, maybe three)...and I don't know how to make the 'Input Files' knime node to handle an unknown number of input files.

I hope this makes sense. Thanks!

score 3 · Accepted Answer

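Thanks to Gábor for putting me on the right track. After a lot of experimenting, though, I ended up taking a slightly different route.

===

As a Knime newbie, I don't know whether this is a valid use of Knime or a complete kludge... but it works.

So, part of the problem is a few Knime-specific objects, one of which is called URIDataValue.

Apparently, Python Pandas DataFrames are interchangeable with Knime tables. However, I don't know whether there is a way to import one of these URIDataValue objects into Python. So here is what I did...

1. I wrote a Python script that creates a Pandas DataFrame and populates it with a single column. Everything is a string, including the column header: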

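from pandas import DataFrame

# Build a one-column table of file URIs (plain strings at this point)
T = DataFrame(
    [
        ['file:///Users/.../copy/lfq_spikein_dilution_1.mzML'],
        ['file:///Users/.../copy/lfq_spikein_dilution_2.mzML'],
    ],
    columns=['URIDataValue'],
)
# print T  # uncomment to inspect the table
# The Python Source node passes output_table on to the next node
output_table = T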

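This creates a DataFrame with one column, URIDataValue, holding the two file URIs.

Note: the column name and the values are just strings. But (apparently) it matters that the column header is "URIDataValue"... even though here it is just text. If the column name is not "URIDataValue", the next node doesn't know what to do.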
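
The hardcoded list could also be built from whatever is in the download directory, which would cope with a varying number of files. A minimal sketch, assuming POSIX paths (OSX/Linux, as in the question); `to_file_uri` and `uri_rows` are hypothetical helper names, not part of Knime or Pandas, and '/path/to/downloads' is a placeholder:

```python
import glob
import os

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2.7


def to_file_uri(path):
    """Turn a local POSIX path into a file:// URI string."""
    return 'file://' + quote(os.path.abspath(path))


def uri_rows(directory, pattern='*.mzML'):
    """Return one single-element row per matching file, sorted by path."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    return [[to_file_uri(p)] for p in paths]


# Inside the Python Source node this would replace the literal list, e.g.:
#   T = DataFrame(uri_rows('/path/to/downloads'), columns=['URIDataValue'])
```

Each row is a one-element list, so the result has the same shape as the hand-written list, whether the directory holds one file or several.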

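Next, the "output_table" from the "Python Source" node is patched into a "String to URI" node, which (apparently, and magically) knows to change the whole column of string values into URIDataValues (possibly based on the name of the first column... not sure).

Finally, the new table with the correct data objects goes to a "URI to PORT" node... because apparently "Port" objects and "URI" objects are different.

This then matches the input the ZipLoop expects... which is normally the output of a static (hardcoded) "Input Files" node.

Now, to actually solve the problem above, I just add code to my "Python Source" node to download and unzip the S3 files, then fill the DataFrame with their locations, and carry on.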
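
The unzip step might look roughly like this. It is a sketch that assumes the .gz files have already been downloaded to a local directory (the S3 download itself is left out) and uses only the standard library; `gunzip_all` is a hypothetical name:

```python
import glob
import gzip
import os
import shutil


def gunzip_all(src_dir, dest_dir):
    """Decompress every *.gz file in src_dir into dest_dir.

    Returns the decompressed paths, one per archive, so the caller can
    turn them into the strings for the URIDataValue column.
    """
    out_paths = []
    for gz_path in sorted(glob.glob(os.path.join(src_dir, '*.gz'))):
        # 'sample.mzML.gz' -> 'sample.mzML'
        name = os.path.basename(gz_path)[:-len('.gz')]
        out_path = os.path.join(dest_dir, name)
        # Stream the data so large files are not held in memory
        with gzip.open(gz_path, 'rb') as fin, open(out_path, 'wb') as fout:
            shutil.copyfileobj(fin, fout)
        out_paths.append(out_path)
    return out_paths
```

The returned list can be any length, so the rest of the workflow does not need to know in advance how many files were downloaded.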

I have no idea what I'm doing, but it worked.

score 1 · Accepted Answer

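There are several options to make this work:

1. Use Python to convert the in-memory files into binary object cells, which you can use later in KNIME. (I'm not sure this is supported, but I remember it being demoed at the last KNIME meetup.)
2. Use Python to save the files to a temporary folder (Create Temp Dir), then pick them up with List Files and Iterate List of Files.
3. Maybe S3 remote file handling is already supported in KNIME, so you could do the downloading and unzipping within KNIME. (I don't know whether it is, but it would be nice.)

I would go with option 2, but I'm not that familiar with Python, so option 1 might be the best for you. (If option 3 is supported, I think that is the best.)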
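
For option 1, the in-memory decompression part could be sketched as below. It only shows getting the decompressed bytes without touching disk; whether and how those bytes can be handed to KNIME as binary object cells is not confirmed here. `gunzip_bytes` is a hypothetical helper name.

```python
import gzip
import io


def gunzip_bytes(gz_bytes):
    """Decompress gzip-compressed bytes entirely in memory."""
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes), mode='rb') as fin:
        return fin.read()
```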
