python - How to modify multiple values in a column but skip others in pandas

Question

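I have been learning Python for two months and am now focusing on pandas. In my current position I use VBA on data frames, so I am learning pandas to gradually replace it and advance my career. So far, I believe my real problem is a lack of understanding of key concepts. Any help would be greatly appreciated.

Here is my problem:

Where can I learn more about how to do this kind of thing for more precise filtering? I am very close, but I am missing one key piece.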

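Goal

Main goal: I need to skip certain values in my ID column. The code below removes the dashes “-” and keeps only the first 9 digits. However, I need to skip certain IDs because they are unique.

After that, I will start comparing multiple sheets.

The main DataFrame's IDs are formatted as 000-000-000-000.
The other DataFrames I will compare it against have no dashes “-”, as in 000000000, and drop the trailing 000, for nine digits in total.

The unique IDs I need to skip are the same in both DataFrames but are formatted completely differently, for example 000-000-000_#12, 000-000-000_35 or 000-000-000_z.

The code I will use on every ID (except the unique IDs):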

 dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]

But I want to use something like an if statement (this doesn't work):

lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs"]

if ~dfSS["ID"].isin(lst).any():
    dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
    pass

To be clearer, my input DataFrame looks like this:

            ID               Street #   Street Name 
0   004-330-002-000         2272        Narnia  
1   021-521-410-000_128     2311        Narnia  
2   001-243-313-000         2235        Narnia  
3   002-730-032-000         2149        Narnia
4   000-000-000_a           1234        Narnia

I want this as the output:

            ID               Street #   Street Name 
0   004330002               2272        Narnia  
1   021-521-410-000_128     2311        Narnia  
2   001243313               2235        Narnia  
3   002730032               2149        Narnia
4   000-000-000_a           1234        Narnia

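Notes:

dfSS is my DataFrame variable name, i.e. the Excel sheet I am working with. "ID" is my column header. I will make it the index afterwards.
My DataFrame for this job is small, with (rows, columns) of (2500, 125).
I am not getting an error message, so I am guessing I may need some kind of loop. I have started testing loops too. No luck there... yet.

Here is where I have been researching this: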

score 1 · Accepted Answer

There are many ways to do this. The first approach here does not involve writing a function.

# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
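
As a minimal, self-contained sketch of this masked assignment, here are the sample rows from the question run through it. The contents of `lst` and the name `skip` are assumptions made for illustration, and the transformed values are written straight into `ID` rather than through a placeholder column.

```python
import pandas as pd

# Sample rows taken from the question.
dfSS = pd.DataFrame({
    "ID": ["004-330-002-000", "021-521-410-000_128", "001-243-313-000",
           "002-730-032-000", "000-000-000_a"],
    "Street #": [2272, 2311, 2235, 2149, 1234],
    "Street Name": ["Narnia"] * 5,
})

# Assumed skip list: the IDs that must stay untouched.
lst = ["021-521-410-000_128", "000-000-000_a"]

# Boolean Series with one flag per row: True where the ID is in the skip list.
skip = dfSS["ID"].isin(lst)

# Transform only the rows where skip is False; the other rows keep their value.
dfSS.loc[~skip, "ID"] = (
    dfSS.loc[~skip, "ID"].str.replace("-", "", regex=False).str[:9]
)

print(dfSS["ID"].tolist())
```

The `if ~dfSS["ID"].isin(lst).any():` attempt in the question reduces the whole column to a single True/False value, so it cannot treat rows differently. The mask keeps one flag per row, which is what lets some IDs be skipped.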

The second approach is to write a function that conditionally transforms the ID. It is not as fast as the first approach.

def transform_ID(ID_val):
    # Leave IDs in the skip list unchanged; transform everything else
    if ID_val in lst:
        return ID_val
    return ID_val.replace("-", "")[:9]

dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
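
A hedged sketch of the function-based approach on a few rows from the question follows. The names `skip_ids` and `transform_id` are hypothetical, and the skip list is assumed. The function has to return the original value for skipped IDs: a function that only returns inside the `if` branch would produce `None` for them.

```python
import pandas as pd

dfSS = pd.DataFrame(
    {"ID": ["004-330-002-000", "021-521-410-000_128", "000-000-000_a"]}
)

# Assumed skip list, stored as a set so each membership test is cheap.
skip_ids = {"021-521-410-000_128", "000-000-000_a"}

def transform_id(id_val, skip=skip_ids):
    """Return skipped IDs unchanged; otherwise drop dashes and keep 9 characters."""
    if id_val in skip:
        return id_val
    return id_val.replace("-", "")[:9]

# apply calls the function once per value in the column.
dfSS["ID_trans"] = dfSS["ID"].apply(transform_id)
print(dfSS)
```

With 1000+ skipped IDs, a set is likely a better fit than a list here, since the lookup runs once per row.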

score 0 · Accepted Answer

This is based on @xyzxyzjayne's answer, but I have two issues I cannot figure out.

First issue

I get this warning: (see edit)

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

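Documentation for this warning

You will see in the code below that I tried putting in .loc, but I cannot seem to work out how to eliminate this warning by using .loc correctly. Still learning. And no, I will not ignore it even though the code works. I see this as a learning opportunity.

Second issue

I do not understand this part of the code. I know the left side is supposed to be rows and the right side columns. That said, why does this work? ID is a column, not a row, when this code is run. This is the line that sets the ID: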

df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]

The part I do not understand yet is the left side of the comma (,) in this expression:

df.loc[~df["ID "].isin(uniqueID), "ID "]
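
One way to read that expression, sketched on a tiny assumed DataFrame: the part left of the comma is not the ID column itself but a boolean Series built from it, with one True/False per row. `.loc` uses those flags to choose rows, and the label right of the comma chooses the column. The names `row_mask` and `selected` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "ID ": ["004-330-002-000", "000-000-000_a", "001-243-313-000"],
    "Street #": [2272, 1234, 2235],
})
uniqueID = ["000-000-000_a"]  # assumed skip list

# Left of the comma: a boolean Series, True for rows NOT in the skip list.
row_mask = ~df["ID "].isin(uniqueID)
print(row_mask.tolist())        # [True, False, True]

# .loc[rows, column]: the mask picks the rows, "ID " picks the column.
selected = df.loc[row_mask, "ID "]
print(selected.index.tolist())  # [0, 2]
```

So the rows/columns rule still holds: the column is only used to compute which rows qualify.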

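That said, this is the final result. As I said, XYZ's help got me here, but I keep adding more .loc calls and working through the documentation until I can eliminate the warning.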

import time

# IDs to skip. I had to enter the whole list (1000+ entries) by hand.
uniqueID = ["032-234-987_#4256"]  # example entry; the rest of the list is omitted

# keep only the columns I need, to make the DataFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']]

# "Place Holder" is a new column holding the transformed IDs
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]

# the filter: rows whose ID is in uniqueID are skipped. Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID), "ID "] = df.loc[~df["ID "].isin(uniqueID), "Place Holder"]

# make the ID our index
df = df.set_index("ID ")

# add the date to the file name (this is why time is imported)
todaysDate = time.strftime("%m-%d-%y")

# write it out as an Excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")

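Once I get rid of the warning and figure out the left side, I will edit this so I can explain it to everyone who needs or sees this post.

Edit: SettingWithCopyWarning:

I fixed this chained-indexing problem by making a copy of the original DataFrame before the filter and then using .loc, as XYZ helped me do. Use DataFrame.copy() before you start filtering, where DataFrame is the name of your own data frame.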
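
A minimal sketch of that fix, assuming a small stand-in for the sheet read from Excel. The names `raw` and `to_transform` and the sample values are hypothetical. Calling `.copy()` on the column selection gives an independent DataFrame, so the later assignments are not made on a slice of the original.

```python
import pandas as pd

# Hypothetical stand-in for the sheet read from Excel.
raw = pd.DataFrame({
    "ID ": ["004-330-002-000", "032-234-987_#4256"],
    "Street #": [2272, 1234],
    "County": ["A", "B"],
})
uniqueID = ["032-234-987_#4256"]

# Select the needed columns and copy them before any filtering or assignment.
df = raw[["ID ", "Street #"]].copy()

# New column with the transformed IDs.
df["Place Holder"] = df["ID "].str.replace("-", "", regex=False).str[:9]

# Overwrite the ID only for rows that are not in the skip list.
to_transform = ~df["ID "].isin(uniqueID)
df.loc[to_transform, "ID "] = df.loc[to_transform, "Place Holder"]

print(df["ID "].tolist())
```

The original `raw` frame is left unchanged by these assignments, which is the behavior the warning is asking you to make explicit.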
