I have a large dataframe (500K rows x 100 cols) and want to do the following search-and-masking operation efficiently, but I can't find the right pandas/numpy incantation; better still if it can be vectorized:
- on each row, the N columns `m1,m2,...,m6` can contain distinct values from 1..9, or else trailing NaNs. (The NaNs are there for a very good reason: to prevent aggregation/taking sum/mean/etc. on nonexistent records when we process the output from this step; it is very strongly desirable that you preserve the NaNs)
- distinctness: it is guaranteed that the columns `m<i>` will contain at most one occurrence of each of the values 1..9
- columns `x1,x2,...,x6` are associated with the columns `m<i>`, and contain some integer values
- for each possible value `v` in range 1..9 (I will manually sweep `v` from 1:9 at the top level of my analysis, don't worry about that part), I want to do the following:
  - on each row where that value `v` occurs in one of the `m<i>`, find which column `m<i>` equals `v` (either as boolean mask/array/indices/anything else you prefer)
  - on rows where `v` doesn't occur in `m<i>`, preferably I don't want any result for that row, not even NaN
  - then I want to use that intermediate boolean mask/array/indices/whatever to slice the corresponding value from the `x<i>` (`x1,x2,...,x6`) on that row
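To make the target output concrete, here is a deliberately slow, loop-based reference implementation of the result I'm after, on a hypothetical 3-row, N=3 toy frame (column names mirror the real data; this is only to pin down the semantics, not a candidate solution):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: m-columns hold distinct values from 1..9 or
# trailing NaNs; x-columns hold the associated integers.
df = pd.DataFrame({
    'm1': [9.0, 2.0, 3.0], 'm2': [1.0, 3.0, 4.0], 'm3': [4.0, np.nan, 1.0],
    'x1': [21, 31, 11], 'x2': [22, 32, 12], 'x3': [23, 33, 13],
})
v, N = 1, 3
out = []
for _, row in df.iterrows():
    for i in range(1, N + 1):
        if row[f'm{i}'] == v:          # NaN == v is False, so NaNs are skipped
            out.append(row[f'x{i}'])   # slice the associated x<i>
# out == [22, 13]; row 1 contributes nothing at all, not even NaN
```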
Here's my current code; I tried `iloc`, `melt`, `stack`/`unstack`, `mask`, `np.where`, `np.select` and other things but can't get the desired result:
import io
import numpy as np
from numpy import nan
import pandas as pd
N = 6 # the width of our column-slices of interest
# Sample dataframe (io.StringIO instead of the removed pd.compat.StringIO)
dat = io.StringIO("""text,m1,m2,m3,m4,m5,m6,x1,x2,x3,x4,x5,x6
'foo',9,3,4,2,1,, 21,22,23,24,25,26
'bar',2,3,4,6,5,, 31,32,33,34,35,36
'baz',7,3,4,1,,, 11,12,13,14,15,16
'qux',2,6,3,4,7,, 41,42,43,44,45,46
'gar',3,1,4,7,,, 51,52,53,54,55,56
'wal',3,,,,,, 11,12,13,14,15,16
'fre',2,3,4,6,5,, 61,62,63,64,65,66
'plu',2,3,4,9,1,, 71,72,73,74,75,76
'xyz',2,3,4,9,6,1, 81,82,83,84,85,86
'thu',1,3,6,4,5,, 51,52,53,54,55,56""".replace(' ',''))
df = pd.read_csv(dat)
v = 1 # For example; Actually we want to sweep v from 1:9 ...
# On each row, find the index 'i' of column 'm<i>' which equals v; or NaN if v doesn't occur
df.iloc[:, 1:N+1] == v
(df.iloc[:, 1:N+1] == v).astype(np.int64)
# m1 m2 m3 m4 m5 m6
# 0 0 0 0 0 1 0
# 1 0 0 0 0 0 0
# 2 0 0 0 1 0 0
# 3 0 0 0 0 0 0
# 4 0 1 0 0 0 0
# 5 0 0 0 0 0 0
# 6 0 0 0 0 0 0
# 7 0 0 0 0 1 0
# 8 0 0 0 0 0 1
# 9 1 0 0 0 0 0
# np.where() seems useful...
_ = np.where((df.iloc[:, 1:N+1] == v).astype(np.int64))
# (array([0, 2, 4, 7, 8, 9]), array([4, 3, 1, 4, 5, 0]))
# But you can't directly use df.iloc[ np.where((df.iloc[:, 1:N+1] == 1).astype(np.int64)) ]
# Feels like you want something like df.iloc[ *... ] where we can pass in our intermediate result as separate vectors of row- and col-indices
# can't unpack the np.where output into separate row- and col- indices vectors
irow,icol = *np.where((df.iloc[:, 1:N+1] == 1).astype(np.int64))
SyntaxError: can't use starred expression here
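(As an aside, the unpacking itself is fine once the stray leading `*` is dropped: `np.where` on a 2-D condition returns a 2-tuple of index arrays, which tuple-unpacks directly. A minimal standalone check on a made-up mask:)

```python
import numpy as np

# np.where on a 2-D boolean mask returns (row_indices, col_indices);
# plain tuple unpacking works without any star.
mask = np.array([[False, True],
                 [True, False]])
irow, icol = np.where(mask)
# irow -> array([0, 1]); icol -> array([1, 0])
```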
# ...so unpack manually...
irow = _[0]
icol = _[1]
# ... but now can't manage to slice the `x<i>` with those...
df.iloc[irow, 7:13] [:, icol.tolist()]
TypeError: unhashable type: 'slice'
# Want to get numpy-type indexing, rather than pandas iloc[]
# This also doesn't work:
df.iloc[:, 7:13] [list(zip(*_))]
# Want to slice into the x<i> which are located in df.iloc[:, N+1:2*N+1]
# Or any alternative faster numpy/pandas implementation...
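For what it's worth, here is one candidate I pieced together that does seem to produce the desired slice on the sample data (a sketch, not benchmarked at 500K×100; it assumes converting the m- and x-blocks to plain ndarrays via `.to_numpy()` is acceptable, which sidesteps the `iloc` limitations above):

```python
import io
import numpy as np
import pandas as pd

N = 6
dat = io.StringIO("""text,m1,m2,m3,m4,m5,m6,x1,x2,x3,x4,x5,x6
'foo',9,3,4,2,1,,21,22,23,24,25,26
'bar',2,3,4,6,5,,31,32,33,34,35,36
'baz',7,3,4,1,,,11,12,13,14,15,16
'qux',2,6,3,4,7,,41,42,43,44,45,46
'gar',3,1,4,7,,,51,52,53,54,55,56
'wal',3,,,,,,11,12,13,14,15,16
'fre',2,3,4,6,5,,61,62,63,64,65,66
'plu',2,3,4,9,1,,71,72,73,74,75,76
'xyz',2,3,4,9,6,1,81,82,83,84,85,86
'thu',1,3,6,4,5,,51,52,53,54,55,56""")
df = pd.read_csv(dat)
v = 1
# Paired row/col indices of cells where m<i> == v; rows without v simply
# don't appear here, so there are no NaN placeholders in the output.
irow, icol = np.where(df.iloc[:, 1:N + 1].to_numpy() == v)
# Numpy-style paired fancy indexing into the x-block (columns N+1..2N):
result = df.iloc[:, N + 1:2 * N + 1].to_numpy()[irow, icol]
# result -> array([25, 14, 52, 75, 86, 51]) for v = 1
```

Sweeping `v` at top level then just repeats the last two lines per value; the NaNs in the m-columns never match `== v`, so they are excluded for free.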