python - Is there a more efficient way to retrieve rows whose list column contains values from a list? (subset, union, or superset)

Question

Given a pandas DataFrame such as:

<class 'pandas.core.frame.DataFrame'>
Index: 685 entries, 7789285 to 8009947
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   sourcedId         685 non-null    string             
 1   status            685 non-null    string             
 2   dateLastModified  685 non-null    datetime64[ns, UTC]
 3   username          685 non-null    string             
 4   userIds           685 non-null    object             
 5   enabledUser       685 non-null    string             
 6   givenName         685 non-null    string             
 7   familyName        685 non-null    string             
 8   middleName        685 non-null    string             
 9   role              685 non-null    string             
 10  identifier        685 non-null    string             
 11  email             685 non-null    string             
 12  sms               685 non-null    string             
 13  phone             685 non-null    string             
 14  agents            685 non-null    object             
 15  orgs              685 non-null    object             
 16  grades            685 non-null    object             
 17  password          685 non-null    string             
dtypes: datetime64[ns, UTC](1), object(4), string(13)
memory usage: 101.7+ KB
df.head()

The 'grades' column contains lists of integers stored as strings, e.g. ['9','10']. I can filter on a single value with:

mask = df.grades.apply(lambda x: '10' in x)

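In my test dataset, which was created from a list of lists that I populated by hand, I used integer values, so the following works (?). For the sake of argument, assume the data are lists of integers rather than lists of integer strings.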

gradeList = [9, 10]
mask = df.grades.apply(lambda x: any(map(lambda x, y: x == y, x, gradeList)))
df[mask].head()

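I'm relatively new to Python (over the past five years I've accumulated what I'd estimate to be about 6 to 8 months of Python experience, if that) and completely new to pandas. I have only a rudimentary understanding of list comprehensions and the map function.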

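My intent is to be able to retrieve any record where a subset of the list of grades exists in the grades column. For a single integer in grade, this is accomplished by: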

mask = df.grades.apply(lambda x: grade in x)

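Instead of achieving my goal with the nested lambda and map above, I have created something in which the order of the terms in the query parameter (gradeList) matters. Below is the output of my test script, which operates on the test data included in the output. I am trying not to assume any order for either list...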

--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   id         10 non-null     object
 1   email      10 non-null     object
 2   fullName   10 non-null     object
 3   jobTitles  10 non-null     object
 4   grades     10 non-null     object
dtypes: object(5)
memory usage: 528.0+ bytes
--------------------------------------------------------------------------------
          id                 email          fullName                                          jobTitles           grades
0    smithsm    smithsm@aplace.com         Stu Smith  [developer, licensed pretend nurse, worthless ...  [9, 10, 11, 12]
1   mullenjb   mullenjb@aplace.com      Jason Mullen               [printer guy, supervisor, senior it]         [11, 12]
2    swainrl    swainrl@aplace.com        Ryan Swain                      [nap taker, goof-off, goober]          [9, 10]
3  rankinsns  rankinsns@aplace.com  Nicholas Rankins                           [manual tesla autopilot]          [9, 10]
4  carlsonrm  carlsonrm@aplace.com      Ryan Carlson                     [technician, snarky so-and-so]         [10, 11]
5     ragomv     ragomv@aplace.com         Mike Rago                                  [nice guy, swole]             [10]
6    smithdl    smithdl@aplace.com       David Smith                                         [old hand]              [9]
7  kappleraj  kappleraj@aplace.com   Allison Kappler      [girl coder, definitely not prettier than me]             [11]
8   iresonss   iresonss@aplace.com      Sandy Ireson                                      [hard worker]             [12]
9  conklincc  conklincc@aplace.com     Caleb Conklin                              [millenial magnum pi]          [12, 9]
--------------------------------------------------------------------------------
query for 'developer'
        id               email   fullName                                          jobTitles           grades
0  smithsm  smithsm@aplace.com  Stu Smith  [developer, licensed pretend nurse, worthless ...  [9, 10, 11, 12]
--------------------------------------------------------------------------------
query for 11
          id                 email         fullName                                          jobTitles           grades
0    smithsm    smithsm@aplace.com        Stu Smith  [developer, licensed pretend nurse, worthless ...  [9, 10, 11, 12]
1   mullenjb   mullenjb@aplace.com     Jason Mullen               [printer guy, supervisor, senior it]         [11, 12]
4  carlsonrm  carlsonrm@aplace.com     Ryan Carlson                     [technician, snarky so-and-so]         [10, 11]
7  kappleraj  kappleraj@aplace.com  Allison Kappler      [girl coder, definitely not prettier than me]             [11]
--------------------------------------------------------------------------------
query for 10
          id                 email          fullName                                          jobTitles           grades
0    smithsm    smithsm@aplace.com         Stu Smith  [developer, licensed pretend nurse, worthless ...  [9, 10, 11, 12]
2    swainrl    swainrl@aplace.com        Ryan Swain                      [nap taker, goof-off, goober]          [9, 10]
3  rankinsns  rankinsns@aplace.com  Nicholas Rankins                                       [technician]          [9, 10]
4  carlsonrm  carlsonrm@aplace.com      Ryan Carlson                     [technician, snarky so-and-so]         [10, 11]
5     ragomv     ragomv@aplace.com         Mike Rago                                  [nice guy, swole]             [10]
--------------------------------------------------------------------------------
query for 11,12
          id                 email         fullName                                      jobTitles    grades
1   mullenjb   mullenjb@aplace.com     Jason Mullen           [printer guy, supervisor, senior it]  [11, 12]
7  kappleraj  kappleraj@aplace.com  Allison Kappler  [girl coder, definitely not prettier than me]      [11]
--------------------------------------------------------------------------------
query for 10,11
          id                 email      fullName                       jobTitles    grades
4  carlsonrm  carlsonrm@aplace.com  Ryan Carlson  [technician, snarky so-and-so]  [10, 11]
5     ragomv     ragomv@aplace.com     Mike Rago               [nice guy, swole]      [10]
--------------------------------------------------------------------------------
query for 9,10
          id                 email          fullName                                          jobTitles           grades
0    smithsm    smithsm@aplace.com         Stu Smith  [developer, licensed pretend nurse, worthless ...  [9, 10, 11, 12]
2    swainrl    swainrl@aplace.com        Ryan Swain                      [nap taker, goof-off, goober]          [9, 10]
3  rankinsns  rankinsns@aplace.com  Nicholas Rankins                                       [technician]          [9, 10]
6    smithdl    smithdl@aplace.com       David Smith                                         [old hand]              [9]
--------------------------------------------------------------------------------
query for 10,9
          id                 email       fullName                       jobTitles    grades
4  carlsonrm  carlsonrm@aplace.com   Ryan Carlson  [technician, snarky so-and-so]  [10, 11]
5     ragomv     ragomv@aplace.com      Mike Rago               [nice guy, swole]      [10]
9  conklincc  conklincc@aplace.com  Caleb Conklin           [millenial magnum pi]   [12, 9]

Can anyone identify what I'm missing (hopefully a core concept), or point me to documentation that would help me untangle what is happening?

score 1 · Accepted Answer

I used a lighter DataFrame:

>>> df
          id           grades
0    smithsm     [1, 9, 2, 6]  # <- 9
1   mullenjb  [1, 5, 8, 4, 7]
2    swainrl        [4, 2, 9]  # <- 9
3  rankinsns           [5, 2]
4  carlsonrm  [7, 4, 6, 3, 2]  # <- 3
5     ragomv        [6, 1, 5]
6    smithdl  [2, 9, 6, 7, 3]  # <- 3 & 9
7  kappleraj        [9, 5, 8]  # <- 9
8   iresonss  [8, 6, 7, 5, 4]
9  conklincc           [8, 6]

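How do you find the rows whose grades include any of [3, 9]?

Explode the grades column and check whether each grade is in the list of wanted grades.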

>>> df.loc[df['grades'].explode().isin([3, 9]).groupby(level=0).any()]
          id           grades
0    smithsm     [1, 9, 2, 6]
2    swainrl        [4, 2, 9]
4  carlsonrm  [7, 4, 6, 3, 2]
6    smithdl  [2, 9, 6, 7, 3]
7  kappleraj        [9, 5, 8]
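
For readers who want to run this, here is a minimal, self-contained sketch of the same steps, split up so each intermediate result can be inspected. The frame is rebuilt from the rows shown above; the variable names (`wanted`, `exploded`, `hits`, `mask`) are only illustrative.

```python
import pandas as pd

# Same ids and grades as the small frame shown above.
df = pd.DataFrame({
    "id": ["smithsm", "mullenjb", "swainrl", "rankinsns", "carlsonrm",
           "ragomv", "smithdl", "kappleraj", "iresonss", "conklincc"],
    "grades": [[1, 9, 2, 6], [1, 5, 8, 4, 7], [4, 2, 9], [5, 2],
               [7, 4, 6, 3, 2], [6, 1, 5], [2, 9, 6, 7, 3], [9, 5, 8],
               [8, 6, 7, 5, 4], [8, 6]],
})

wanted = [3, 9]

# One row per list element; the original row label is repeated in the index.
exploded = df["grades"].explode()

# True where that single element is one of the wanted grades.
hits = exploded.isin(wanted)

# Collapse back to one boolean per original row: True if any element matched.
mask = hits.groupby(level=0).any()

result = df.loc[mask]
print(result)
```

Because `explode` keeps the original index label on every element, `groupby(level=0)` is what ties the per-element booleans back to their source rows.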

The explode/isin one-liner is equivalent to:

>>> df.loc[df['grades'].explode() \
       .apply(lambda x: x in [3, 9]) \
       .groupby(level=0).any()]
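
On the order dependence described in the question: `map` called with two iterables pairs their elements by position and stops at the shorter one, so the nested lambda compares `row[0]` with `gradeList[0]`, `row[1]` with `gradeList[1]`, and so on, rather than testing membership. The sketch below reproduces that behaviour on the grades from the question's test data and then shows one order-independent alternative using Python sets. It assumes the grades are hashable values; the helper `positional` and the names `any_match`, `contains_all`, and `within` are hypothetical.

```python
import pandas as pd

# Grades column from the question's 10-row test data.
grades = pd.Series([[9, 10, 11, 12], [11, 12], [9, 10], [9, 10], [10, 11],
                    [10], [9], [11], [12], [12, 9]])

def positional(row, query):
    # What the two-iterable map does: compare row[0] with query[0],
    # row[1] with query[1], ... stopping at the end of the shorter list.
    return any(map(lambda x, y: x == y, row, query))

# The same two grades in a different order select different rows.
pos_9_10 = grades[grades.apply(lambda r: positional(r, [9, 10]))].index.tolist()
pos_10_9 = grades[grades.apply(lambda r: positional(r, [10, 9]))].index.tolist()

wanted = {9, 10}

# Rows sharing at least one grade with the query (non-empty intersection).
any_match = grades.apply(lambda row: not wanted.isdisjoint(row))

# Rows containing every queried grade (the query is a subset of the row).
contains_all = grades.apply(lambda row: wanted.issubset(row))

# Rows whose grades all fall inside the query (the row is a subset of the query).
within = grades.apply(lambda row: set(row).issubset(wanted))

print(pos_9_10, pos_10_9)
print(grades[any_match].index.tolist())
print(grades[contains_all].index.tolist())
print(grades[within].index.tolist())
```

The positional version selects rows 0, 2, 3, 6 for `[9, 10]` and rows 4, 5, 9 for `[10, 9]`, matching the output in the question. The set-based masks do not depend on the order of either list, and each can be used directly as `df[mask]` on a frame with the same index.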
