7

我是使用 python 数据表的新手,这是我正在关注的教程

如何过滤掉某个列中的值包含在列表中的行?

本质上这是我正在使用的代码:

    import datatable as dt
    sfr = dt.fread(os.path.join(dirName, 'Results.csv'))

sfr
Out[25]: 
         |       ioid  itemtype  date       itemid           tid  value
-------- + ----------  --------  --------  -------  ------------  -------
       0 |          1         1  7-1-2022     9015           531   0.0283
       1 |          1         1  7-1-2022     9015           532   0.0071
       2 |          1         1  7-1-2022     9016           534   0.0065
       3 |          1         1  7-1-2022     9017          1018   0.0005

我正在尝试执行以下操作

ids = [9016, 9017]
sft[dt.f.itemid.isin(ids)]

但是,我无法使语法正常工作。

期望的输出是:

         |       ioid  itemtype  date       itemid           tid  value
-------- + ----------  --------  --------  -------  ------------  -------
       2 |          1         1  7-1-2022     9016           534   0.0065
       3 |          1         1  7-1-2022     9017          1018   0.0005
4

1 回答 1

3

Because Pydatatable doesn't explicitly support filtering against the list of values (see feature request), the following solution may appear not quite intuitive. Still, it does exactly what such filtering would do by utilizing datatable join function:

ids = [9016, 9017]
f = dt.Frame(itemid = ids)
sfr.key = "itemid"
sfr_filtered = f[:, :, dt.join(sfr)]

These steps consist of:

  1. creating a frame (datable) that contains single column with the same name as the column being filtered and storing list of values inside this frame
  2. keying original frame on the column being filtered
  3. joining frames to perform filtering using datatable join

Here is a catch though: currently join has limitations - it's only left outer join and the join column (itemid in sfr in the solution above) must have unique values, otherwise sfr.key = "itemid" throws this error:

ValueError: Cannot set a key: the values are not unique

Because of these strong assumptions it won't work when itemid contains not unique values. In that case we turn join around and add additional filter on a dummy column added to the frame containing list:

ids = [9016, 9017]
f = dt.Frame(itemid = ids, dummy = [0, 0])
f.key = "itemid"
sfr_filtered = sfr[:, :, dt.join(f)][~dt.isna(dt.f.dummy), :]

This solution will always work but obviously less efficient due to always joining all rows and extra filtering to remove unmatched rows. For more details on current state of Pydatatable join you can find docs here.

UPDATE

For completeness I added another (arguably more straight forward) solution borrowed from this answer by Pasha (the creator and maintainer of Pydatatable):

import functools
import operator

filter = functools.reduce(operator.or_, (dt.f.itemid == id for id in ids))
sfr[filter, :]

Until this feature request https://github.com/h2oai/datatable/issues/699 is implemented one of the solutions shown above does filtering against list of values.

于 2021-06-01T06:55:29.837 回答