
I have a large dataframe (500K rows x 100 cols) and want to do the following search-and-masking operation efficiently, but I can't find the right pandas/numpy incantation; better still if it can be vectorized:

  • on each row, the N columns m1,m2,...,m6 can contain distinct values from 1..9, or else trailing NaNs. (The NaNs are there for a good reason: they prevent aggregation (sum/mean/etc.) over nonexistent records when we process the output of this step, so it is strongly desirable that you preserve them.)
    • distinctness: it is guaranteed that the columns m<i> contain at most one occurrence of each of the values 1..9
  • columns x1,x2,...,x6 are associated with the columns m<i>, and contain some integer values
  • For each possible value v in the range 1..9 (I will manually sweep v from 1:9 at the top level of my analysis, so don't worry about that part), I want to do the following:
    • on each row where the value v occurs in one of the m<i>, find which column m<i> equals v (as a boolean mask/array/indices/anything else you prefer)
    • on rows where v doesn't occur in the m<i>, I preferably don't want any result for that row, not even NaN
    • then I want to use that intermediate mask/array/indices/whatever to slice the corresponding value from the x<i> (x1,x2,...,x6) on that row (a slow reference loop illustrating these semantics follows this list)
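
To make the desired semantics concrete, here is a naive reference loop that produces exactly what I want (illustration only; the whole point of the question is to replace it with something vectorized):

def slice_x_by_m(df, v, N=6):
    # reference semantics: for each row where v occurs in m1..mN,
    # emit the x<i> from the matching column; rows without v emit nothing
    out = []
    for _, row in df.iterrows():
        for i in range(1, N + 1):
            if row[f"m{i}"] == v:
                out.append(row[f"x{i}"])
                break  # the m<i> values are distinct, so at most one match per row
    return out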

Here's my current code; I tried iloc, melt, stack/unstack, mask, np.where, np.select and other things but can't get the desired result:

import numpy as np
from numpy import nan
import pandas as pd
from io import StringIO  # pd.compat.StringIO has been removed from modern pandas

N = 6 # the width of our column-slices of interest

# Sample dataframe
dat = StringIO("""
text,m1,m2,m3,m4,m5,m6,x1,x2,x3,x4,x5,x6
'foo',9,3,4,2,1,,      21,22,23,24,25,26
'bar',2,3,4,6,5,,      31,32,33,34,35,36
'baz',7,3,4,1,,,       11,12,13,14,15,16
'qux',2,6,3,4,7,,      41,42,43,44,45,46
'gar',3,1,4,7,,,       51,52,53,54,55,56
'wal',3,,,,,,          11,12,13,14,15,16
'fre',2,3,4,6,5,,      61,62,63,64,65,66
'plu',2,3,4,9,1,,      71,72,73,74,75,76
'xyz',2,3,4,9,6,1,     81,82,83,84,85,86
'thu',1,3,6,4,5,,      51,52,53,54,55,56""".replace(' ',''))

df = pd.read_csv(dat, header=0)
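
Quick sanity check on the parsed frame (10 data rows; 13 columns: text, m1..m6, x1..x6):

assert df.shape == (10, 13)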

v = 1 # For example; Actually we want to sweep v from 1:9 ...

# On each row, find the index 'i' of column 'm<i>' which equals v; or NaN if v doesn't occur

df.iloc[:, 1:N+1] == v

(df.iloc[:, 1:N+1] == 1).astype(np.int64)
#    m1  m2  m3  m4  m5  m6
# 0   0   0   0   0   1   0
# 1   0   0   0   0   0   0
# 2   0   0   0   1   0   0
# 3   0   0   0   0   0   0
# 4   0   1   0   0   0   0
# 5   0   0   0   0   0   0
# 6   0   0   0   0   0   0
# 7   0   0   0   0   1   0
# 8   0   0   0   0   0   1
# 9   1   0   0   0   0   0
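
One way I found to turn this mask into per-row column labels, dropping the rows where v is absent (a sketch of an intermediate step, not yet the final slice I'm after):

mask = df.iloc[:, 1:N+1] == v
hits = mask.any(axis=1)
mask[hits].idxmax(axis=1)   # e.g. row 0 -> 'm5', row 2 -> 'm4', ..., row 9 -> 'm1'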

# np.where() seems useful...
_ = np.where((df.iloc[:, 1:N+1] == 1).astype(np.int64))
# (array([0, 2, 4, 7, 8, 9]), array([4, 3, 1, 4, 5, 0]))

# But you can't directly use df.iloc[ np.where((df.iloc[:, 1:N+1] == 1).astype(np.int64)) ]
# Feels like you want something like df.iloc[ *... ] where we can pass in our intermediate result as separate vectors of row- and col-indices

# naive unpacking with a starred expression is a syntax error
irow,icol = *np.where((df.iloc[:, 1:N+1] == 1).astype(np.int64))
SyntaxError: can't use starred expression here
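
Plain tuple unpacking does work, though, since np.where returns a 2-tuple of index arrays:

irow, icol = np.where(df.iloc[:, 1:N+1] == v)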

# ...or unpack manually from the `_` result above...
irow = _[0]
icol = _[1]
# ... but now can't manage to slice the `x<i>` with those...
df.iloc[irow, 7:13] [:, icol.tolist()] 
TypeError: unhashable type: 'slice'

# Want to get numpy-type indexing, rather than pandas iloc[]
# This also doesn't work:
df.iloc[:, 7:13] [list(zip(*_))]

# Want to slice into the x<i> which are located in df.iloc[:, N+1:2*N+1]

# Or any alternative faster numpy/pandas implementation...

1 Answer


For readability, and to avoid float notation in df, I first used the following instruction to change NaN values to 0 and change their type to int:

df.fillna(0, downcast='infer', inplace=True)
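
Note: in recent pandas (2.1+) the downcast argument of fillna is deprecated. An equivalent that changes the dtypes explicitly, column by column (only the m<i> columns contain NaNs in this data), is:

for c in [f"m{i}" for i in range(1, N+1)]:
    df[c] = df[c].fillna(0).astype(np.int64)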

SOLUTION 1

Now let's get down to the main task, for v == 1. Start with:

x1 = np.argwhere(df.iloc[:, 1:N+1].values == v)

The result is:

[[0 4]
 [2 3]
 [4 1]
 [7 4]
 [8 5]
 [9 0]]

These are the (row, column) indices of the elements equal to v within the m-column slice of df.

Then, to "shift" these to the indices of the target elements in the whole df, we have to add 7 (that is, N+1) to each column index:

x2 = x1 + [0, N+1]

The result is:

[[ 0 11]
 [ 2 10]
 [ 4  8]
 [ 7 11]
 [ 8 12]
 [ 9  7]]

And to get the result (for v == 1), execute:

df.values[tuple(x2.T)]

The result is:

array([25, 14, 52, 75, 86, 51], dtype=object)
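
The object dtype appears because df.values mixes the text column with the numeric ones. If you prefer a plain integer array, cast the same expression (an optional extra step):

df.values[tuple(x2.T)].astype(np.int64)
# array([25, 14, 52, 75, 86, 51])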

Alternative: If you want the above result in a single instruction, run:

df.values[tuple((np.argwhere(df.iloc[:, 1:N+1].values == v) + [0, N+1]).T)]

The procedure described above gives the result for v == 1. It is up to you how to assemble the results of each pass (for v = 1..9) into the final result; you didn't describe this detail in your question (or I failed to see and understand it).

One possible solution is:

pd.DataFrame([df.values[tuple((np.argwhere(df.iloc[:, 1:N+1].values == v)
    + [0, N+1]).T)].tolist() for v in range(1, 10)],
    index=range(1, 10)).fillna('-')

giving the following result:

    0   1   2   3   4   5   6   7   8   9
1  25  14  52  75  86  51   -   -   -   -
2  24  31  41  61  71  81   -   -   -   -
3  22  32  12  43  51  11  62  72  82  52
4  23  33  13  44  53  63  73  83  54   -
5  35  65  55   -   -   -   -   -   -   -
6  34  42  64  85  53   -   -   -   -   -
7  11  45  54   -   -   -   -   -   -   -
8   -   -   -   -   -   -   -   -   -   -
9  21  74  84   -   -   -   -   -   -   -

Index values are taken from the current value of v. It is up to you whether you are happy with the default column names (consecutive numbers from 0).
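
If you are not, renaming is cheap. Here res stands for the frame built by the comprehension above (my placeholder name), and the hit<i> scheme is just an example:

res.columns = [f"hit{i}" for i in res.columns]   # res = the frame built above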

Additional remark: remove the apostrophes surrounding values in the first column (e.g. change 'foo' to just foo). Otherwise the apostrophes become part of the column content, which I suppose you don't want. Note that the column names in the first row of your source have no apostrophes, and read_csv is clever enough to recognize them as string values anyway.
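
If you would rather fix this in code than in the source data, one line does it (assuming the column keeps the name text):

df['text'] = df['text'].str.strip("'")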

EDIT - SOLUTION 2

Another, maybe simpler solution:

Since we will operate on the underlying NumPy array (instead of calling .values in a number of places), start with:

tbl = df.values

Then, for a single v value, instead of argwhere use nonzero:

tbl[:, N+1:][np.nonzero(tbl[:, 1:N+1] == v)]

Details:

  • tbl[:, 1:N+1] - the slice for the m<i> columns.
  • np.nonzero(tbl[:, 1:N+1] == v) - a tuple of arrays - the indices of the "wanted" elements, grouped by axis, so it can be used directly for indexing.
  • tbl[:, N+1:] - the slice for the x<i> columns.

An important difference between nonzero and argwhere is that nonzero returns a tuple, which makes adding a "shift" value to the column indices more awkward, so I decided to index into a different slice (the x<i> columns) instead.
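
For completeness, a sketch of sweeping v with this variant, mirroring the comprehension from SOLUTION 1 (the dict-of-arrays shape is my choice; assemble the per-v results however suits your analysis):

tbl = df.values
results = {v: tbl[:, N+1:][np.nonzero(tbl[:, 1:N+1] == v)] for v in range(1, 10)}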

answered 2019-05-25T15:35:24.677