python - How to lump factor levels of a string-type column into another level in pydatatable?

Question

I have a datatable,

DT_X = dt.Frame({'variety': ['Caturra',
  'Bourbon',
  'Typica',
  'Catuai',
  'Hawaiian Kona',
  'Yellow Bourbon',
  'Mundo Novo',
  'Catimor',
  'SL14',
  'SL28',
  'Pacas',
  'Gesha',
  'Pacamara',
  'SL34',
  'Arusha',
  'Peaberry',
  'Mandheling',
  'Sumatra',
  'Blue Mountain',
  'Ethiopian Yirgacheffe',
  'Java',
  'Ruiru 11',
  'Ethiopian Heirlooms',
  'Marigojipe',
  'Moka Peaberry',
  'Pache Comun',
  'Sulawesi',
  'Sumatra Lintong'],
 'count': [256,
  226,
  211,
  74,
  44,
  35,
  33,
  20,
  17,
  15,
  13,
  12,
  8,
  8,
  6,
  5,
  3,
  3,
  2,
  2,
  2,
  2,
  1,
  1,
  1,
  1,
  1,
  1]})

It can be viewed as,

Out[8]: 
   | variety                count
-- + ---------------------  -----
 0 | Caturra                  256
 1 | Bourbon                  226
 2 | Typica                   211
 3 | Catuai                    74
 4 | Hawaiian Kona             44
 5 | Yellow Bourbon            35
 6 | Mundo Novo                33
 7 | Catimor                   20
 8 | SL14                      17
 9 | SL28                      15
10 | Pacas                     13
11 | Gesha                     12
12 | Pacamara                   8
13 | SL34                       8
14 | Arusha                     6
15 | Peaberry                   5
16 | Mandheling                 3
17 | Sumatra                    3
18 | Blue Mountain              2
19 | Ethiopian Yirgacheffe      2
20 | Java                       2
21 | Ruiru 11                   2
22 | Ethiopian Heirlooms        1
23 | Marigojipe                 1
24 | Moka Peaberry              1
25 | Pache Comun                1
26 | Sulawesi                   1
27 | Sumatra Lintong            1

I now want to keep only the top 4 levels "Caturra", "Bourbon", "Typica", "Catuai" in the variety column, and the remaining levels should be treated as "Others".

The expected output is:

Out[9]: 
   | variety  count
-- + -------  -----
 0 | Caturra    256
 1 | Bourbon    226
 2 | Typica     211
 3 | Catuai      74
 4 | Others     236

[5 rows x 2 columns]

Case 2:

I have a datatable,

DT_X_1 = dt.Frame({'variety': ['Bourbon',
  'Catimor',
  'Ethiopian Yirgacheffe',
  'Caturra',
  'Bourbon',
  'SL14',
  'Caturra',
  'Sumatra',
  'Bourbon',
  'Caturra',
  'SL34',
  'Hawaiian Kona',
  'Caturra',
  'Yellow Bourbon',
  'Yellow Bourbon',
  'Bourbon',
  'SL28',
  'Bourbon',
  'Caturra',
  'SL28',
  'Bourbon',
  'SL14',
  'Caturra',
  'Gesha',
  'Bourbon',
  'Catuai',
  'Caturra',
  'Bourbon',
  'Bourbon',
  'Hawaiian Kona']})

It can be viewed as

Out[7]: 
   | variety              
-- + ---------------------
 0 | Bourbon              
 1 | Catimor              
 2 | Ethiopian Yirgacheffe
 3 | Caturra              
 4 | Bourbon              
 5 | SL14                 
 6 | Caturra              
 7 | Sumatra              
 8 | Bourbon              
 9 | Caturra              
10 | SL34                 
11 | Hawaiian Kona        
12 | Caturra              
13 | Yellow Bourbon       
14 | Yellow Bourbon       
15 | Bourbon              
16 | SL28                 
17 | Bourbon              
18 | Caturra              
19 | SL28                 
20 | Bourbon              
21 | SL14                 
22 | Caturra              
23 | Gesha                
24 | Bourbon              
25 | Catuai               
26 | Caturra              
27 | Bourbon              
28 | Bourbon              
29 | Hawaiian Kona        

[30 rows x 1 column]

The variety column has 12 distinct values,

Out[8]: 
   | variety                count
-- + ---------------------  -----
 0 | Bourbon                    9
 1 | Catimor                    1
 2 | Catuai                     1
 3 | Caturra                    7
 4 | Ethiopian Yirgacheffe      1
 5 | Gesha                      1
 6 | Hawaiian Kona              2
 7 | SL14                       2
 8 | SL28                       2
 9 | SL34                       1
10 | Sumatra                    1
11 | Yellow Bourbon             2

[12 rows x 2 columns]

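Here, I want to collapse the levels of the variety column from 12 down to the 2 most frequent ones, with everything else treated as "Others".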

The expected output would be,

Out[13]: 
   | variety
-- + -------
 0 | Bourbon
 1 | Others 
 2 | Others 
 3 | Caturra
 4 | Bourbon
 5 | Others 
 6 | Caturra
 7 | Others 
 8 | Bourbon
 9 | Caturra
10 | Others 
11 | Others 
12 | Caturra
13 | Others 
14 | Others 
15 | Bourbon
16 | Others 
17 | Bourbon
18 | Caturra
19 | Others 
20 | Bourbon
21 | Others 
22 | Caturra
23 | Others 
24 | Bourbon
25 | Others 
26 | Caturra
27 | Bourbon
28 | Bourbon
29 | Others 

[30 rows x 1 column]

score 3 · Accepted Answer

One way is to first replace all values of variety from row 4 onward (zero-based) with the string "Other", and then group by variety:

>>> from datatable import dt, f, by, sum
>>> DT_X[4:, f.variety] = "Other"
>>> DT_X = DT_X[:, sum(f.count), by(f.variety)]
>>> DT_X
   | variety  count
-- + -------  -----
 0 | Bourbon    226
 1 | Catuai      74
 2 | Caturra    256
 3 | Other      236
 4 | Typica     211

[5 rows x 2 columns]

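Another possibility is to take the original table (before any in-place replacement), split it row-wise into two parts, collapse the second part, and rbind it back onto the first: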

>>> dt.rbind(DT_X[:4, :],
...          dt.Frame(variety=["Other"], count=[DT_X[4:, f.count].sum1()]))
   | variety  count
-- + -------  -----
 0 | Caturra    256
 1 | Bourbon    226
 2 | Typica     211
 3 | Catuai      74
 4 | Other      236

[5 rows x 2 columns]
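
As a library-independent sanity check on the 236 figure, the same lumping can be reproduced in plain Python. This is only a sketch: `lump_tail` is a hypothetical helper name, not part of datatable, and it assumes the rows are already sorted by count in descending order, as they are in DT_X.

```python
# Plain-Python cross-check of Case 1.
# lump_tail is a hypothetical helper, not a datatable API.
rows = [
    ("Caturra", 256), ("Bourbon", 226), ("Typica", 211), ("Catuai", 74),
    ("Hawaiian Kona", 44), ("Yellow Bourbon", 35), ("Mundo Novo", 33),
    ("Catimor", 20), ("SL14", 17), ("SL28", 15), ("Pacas", 13), ("Gesha", 12),
    ("Pacamara", 8), ("SL34", 8), ("Arusha", 6), ("Peaberry", 5),
    ("Mandheling", 3), ("Sumatra", 3), ("Blue Mountain", 2),
    ("Ethiopian Yirgacheffe", 2), ("Java", 2), ("Ruiru 11", 2),
    ("Ethiopian Heirlooms", 1), ("Marigojipe", 1), ("Moka Peaberry", 1),
    ("Pache Comun", 1), ("Sulawesi", 1), ("Sumatra Lintong", 1),
]


def lump_tail(rows, keep, label="Other"):
    """Keep the first `keep` rows and fold the rest into one (label, total) row."""
    head, tail = rows[:keep], rows[keep:]
    if not tail:
        return list(head)
    return list(head) + [(label, sum(n for _, n in tail))]


lumped = lump_tail(rows, keep=4)
print(lumped[-1])  # ('Other', 236)
```

The total of the 24 folded rows matches the count shown for "Other" in both outputs above.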

Case 2

You have already created the table of counts by variety, so now you only need to sort it by count and select the 2 most frequent varieties:

>>> from datatable import by, sort, count, join, update, f, g
>>> counts = DT_X_1[:, count(), by(f.variety)]
>>> frequent = counts[-2:, :, sort(f.count)]
>>> frequent
   | variety  count
-- + -------  -----
 0 | Caturra      7
 1 | Bourbon      9

[2 rows x 2 columns]

(Alternatively, you could filter by the count value).
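
The filtering alternative can be illustrated without datatable. The sketch below works on a plain dict copied from the 12-row counts table in the question; the threshold `MIN_COUNT = 5` is an assumption, chosen here only so that the two most frequent varieties pass.

```python
# Counts copied from the question's 12-row table (Case 2).
counts = {
    "Bourbon": 9, "Catimor": 1, "Catuai": 1, "Caturra": 7,
    "Ethiopian Yirgacheffe": 1, "Gesha": 1, "Hawaiian Kona": 2,
    "SL14": 2, "SL28": 2, "SL34": 1, "Sumatra": 1, "Yellow Bourbon": 2,
}

MIN_COUNT = 5  # assumed threshold, not taken from the original answer

# Keep only the varieties whose count reaches the threshold.
frequent = {variety: n for variety, n in counts.items() if n >= MIN_COUNT}
print(frequent)  # {'Bourbon': 9, 'Caturra': 7}
```

Unlike taking the last 2 rows of a sorted table, a threshold keeps however many levels happen to reach it, so the number of surviving levels depends on the data.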

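Now, the next step is to join this table back to the original one, so that we have an indicator of which values are "frequent". The join operation can be combined with update, so in the same operation we set all values that were not matched during the join to "others":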

>>> frequent.key = "variety"
>>> DT_X_1[g.variety==None, update(variety="others"), join(frequent)]
>>> DT_X_1
   | variety
-- + -------
 0 | Bourbon
 1 | others 
 2 | others 
 3 | Caturra
 4 | Bourbon
 5 | others 
 6 | Caturra
 7 | others 
 8 | Bourbon
 9 | Caturra
10 | others 
11 | others 
12 | Caturra
13 | others 
14 | others 
15 | Bourbon
16 | others 
17 | Bourbon
18 | Caturra
19 | others 
20 | Bourbon
21 | others 
22 | Caturra
23 | others 
24 | Bourbon
25 | others 
26 | Caturra
27 | Bourbon
28 | Bourbon
29 | others 

[30 rows x 1 column]
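
To cross-check the result above independently of datatable, the same relabelling can be sketched with `collections.Counter` from the standard library. This is not the join/update technique used in the answer, only a plain-Python equivalent for verification; the label "others" follows the answer's spelling.

```python
from collections import Counter

# The 30 values of DT_X_1's variety column, in row order.
values = [
    "Bourbon", "Catimor", "Ethiopian Yirgacheffe", "Caturra", "Bourbon",
    "SL14", "Caturra", "Sumatra", "Bourbon", "Caturra",
    "SL34", "Hawaiian Kona", "Caturra", "Yellow Bourbon", "Yellow Bourbon",
    "Bourbon", "SL28", "Bourbon", "Caturra", "SL28",
    "Bourbon", "SL14", "Caturra", "Gesha", "Bourbon",
    "Catuai", "Caturra", "Bourbon", "Bourbon", "Hawaiian Kona",
]

# The 2 most frequent levels; everything else is lumped together.
top = {variety for variety, _ in Counter(values).most_common(2)}
lumped = [v if v in top else "others" for v in values]

print(Counter(lumped))
```

Note that `most_common` orders tied counts by first appearance, so with ties at the cut-off the choice of levels would depend on row order. There are no such ties in this data.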
