python - ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32')
Asked 2021-09-17T03:16:33.793 · Viewed 777 times

python pandas data-science outliers iqr

First answer posted 2021-09-17T13:11:56.360

Second answer posted 2021-09-17T03:59:44.977

Question

Context

I'm trying to find outliers in all columns of a DataFrame with Python.

Steps:

Created a function to find outliers via IQR
Tested the function on one column.
Implemented the function on all columns with a for loop.

My level

I'm completely new to machine learning and data science. I only know Python and pandas, so I'm currently expanding my knowledge of machine learning. I don't know much theory yet, such as which data types machine learning algorithms can handle or why missing values are a problem.

Overview of the data

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2768 entries, 14421 to 98025
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   date                   2768 non-null   datetime64[ns]
 1   location               2768 non-null   object        
 2   new_deaths             2768 non-null   float64       
 3   female_smokers         2768 non-null   float64       
 4   male_smokers           2768 non-null   float64       
 5   population             2768 non-null   float64       
 6   people_vaccinated      2768 non-null   float64       
 7   cardiovasc_death_rate  2768 non-null   float64       
 8   aged_65_older          2768 non-null   float64       
 9   gdp_per_capita         2768 non-null   float64     
..... #The rest are indicator columns with dummy values that were categorical columns before.  
dtypes: datetime64[ns](1), float64(8), object(1)

Code to find outliers in one column

I wrote a function that computes the IQR and returns the indices and values of the outliers.

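import numpy as np

def find_outliers_tukey(x):
  # Quartiles of the column
  q1 = np.percentile(x, 25)
  q3 = np.percentile(x, 75)

  # Tukey fences: 1.5 * IQR beyond each quartile
  iqr = q3 - q1
  floor = q1 - 1.5*iqr
  ceiling = q3 + 1.5*iqr

  # Index labels and values of the points outside the fences
  outlier_indices = list(x.index[ (x < floor)|(x > ceiling) ])
  outlier_values = list(x[outlier_indices])

  return outlier_indices, outlier_values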

When I call the function:

tukey_indices, tukey_values = find_outliers_tukey(df.new_deaths)
print(f"Outliers in new deaths are {np.sort(tukey_values)}")

output:

Outliers in new deaths are []

Question 1

Why does this return no outliers? See the column statistics below.

# Statistics of the new deaths column

Mean = 145.745266
std = 796.284067    
min = -1918.000000
25% = 0.000000
50% = 2.000000
75% = 18.000000
max = 18000.000000

Note: Looking at the stats, there's probably something seriously wrong with the data
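As a sanity check, the Tukey fences can be worked out by hand from the statistics above. This is only a sketch that plugs in the posted quartiles, minimum, and maximum; it assumes the column passed to the function is the same one those statistics describe.

```python
# Quartiles, minimum and maximum copied from the statistics above.
q1, q3 = 0.0, 18.0
col_min, col_max = -1918.0, 18000.0

iqr = q3 - q1
floor = q1 - 1.5 * iqr      # lower Tukey fence
ceiling = q3 + 1.5 * iqr    # upper Tukey fence

print(floor, ceiling)                       # -27.0 45.0
print(col_min < floor, col_max > ceiling)   # True True
```

With these numbers both the minimum and the maximum fall outside the fences, so an empty result suggests that the function did not see the same data that the statistics summarize (for example, a different version of df).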

Code to find outliers in all columns (for loop)

for feature in df.columns:
  tukey_indices, tukey_values = find_outliers_tukey(feature)
  print(f"Outliers in {feature} are {tukey_values} \n")

output:

UFuncTypeError                            Traceback (most recent call last)
<ipython-input-16-b01dad9e55a2> in <module>()
      1 for feature in df.columns:
----> 2   tukey_indices, tukey_values = find_outliers_tukey(feature)
      3   print(f"Outliers in {feature} are {tukey_values} \n")

4 frames
<__array_function__ internals> in percentile(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py in _quantile_ureduce_func(a, q, axis, out, overwrite_input, interpolation, keepdims)
   3965             n = np.isnan(ap[-1:, ...])
   3966 
-> 3967         x1 = take(ap, indices_below, axis=axis) * weights_below
   3968         x2 = take(ap, indices_above, axis=axis) * weights_above
   3969 

UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32')

Question 2

What does this error mean, and why am I getting it?

score 0 · Accepted Answer

The problem was probably with the NumPy function percentile and with how I passed my argument to the find_outliers_tukey function. These changes worked for me:

step 1

Take two arguments: one for the DataFrame and another for the name of the feature.
Index the DataFrame with the feature name explicitly.
Access the column with bracket indexing (df[feature]) rather than attribute access, and use quantile instead of percentile.

import pandas as pd

def find_outliers_tukey(df: pd.DataFrame, feature: str) -> list:
  """Return the index labels of rows whose value in `feature` lies outside the Tukey fences."""

  q1 = df[feature].quantile(0.25)
  q3 = df[feature].quantile(0.75)

  iqr = q3 - q1
  floor = q1 - 1.5*iqr
  ceiling = q3 + 1.5*iqr

  outlier_indices = list(df.index[ (df[feature] < floor) | (df[feature] > ceiling) ])
  return outlier_indices
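One documented difference between np.percentile and Series.quantile is how they treat missing values: np.percentile returns NaN if the input contains a NaN, while Series.quantile skips NaN. The sketch below uses a made-up toy column (not the real data) to show the effect. Whether this explains the empty result in Question 1 is only a guess, since the info() output above reports no nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value (not the real data).
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])

q3_np = np.percentile(x, 75)   # NaN in the input propagates to the result
q3_pd = x.quantile(0.75)       # pandas skips NaN before computing

print(np.isnan(q3_np))         # True
print(q3_pd)                   # 4.0

# Every comparison against NaN is False, so a NaN fence flags nothing.
mask = (x < q3_np) | (x > q3_np)
print(mask.any())              # False
```

If missing values are a possibility, np.nanpercentile is the NumPy counterpart that ignores them.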

step 2

I put all the columns I wanted to remove outliers from into a list.

df_columns = list(df.columns[1:56])

step 3

No change here; I just pass two arguments to find_outliers_tukey instead of one. I also store the indices of the outliers for future use.

index_list = []

for feature in df_columns: 
  index_list.extend(find_outliers_tukey(df, feature))

This gave me better statistical results for the columns.
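Because extend is called once per column, a row that is an outlier in several columns appears in index_list more than once. A minimal sketch of one way to use the collected indices follows; the small DataFrame and the contents of index_list are hypothetical stand-ins, and it assumes the index labels are unique.

```python
import pandas as pd

# Hypothetical stand-ins for the real DataFrame and the collected indices.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 300.0, 4.0], "b": [5.0, -900.0, 7.0, 8.0]},
    index=[10, 11, 12, 13],
)
index_list = [12, 11, 12]  # row 12 was flagged by two columns

# Deduplicate the labels, then drop the flagged rows.
unique_outliers = sorted(set(index_list))
df_clean = df.drop(index=unique_outliers)

print(unique_outliers)        # [11, 12]
print(list(df_clean.index))   # [10, 13]
```

Dropping rows is only one option; the same deduplicated list could equally be used to inspect the flagged rows with df.loc[unique_outliers] before deciding what to do with them.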

score 0 · Accepted Answer

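For Question 1, your code seems to work fine on my end, but of course I don't have your original data.

For Question 2, there are two problems. The first is that you are passing the column names to find_outliers_tukey instead of the columns themselves. Iterating over df.columns yields strings, so np.percentile receives text rather than numbers, which is why the error mentions the string dtype <U32. Use items (iteritems on pandas versions before 2.0, where it is an alias) to iterate over pairs of (column name, column Series):

for feature, column in df.items():
    tukey_indices, tukey_values = find_outliers_tukey(column)
    print(f"Outliers in {feature} are {tukey_values} \n")

The second problem, which you'll run into after solving the first, is that your location column is not numeric (its dtype is object), so you won't be able to find outliers for it. Make sure to iterate only over the columns that you actually want to perform the calculation on.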
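Both points can be combined in a short sketch. The toy DataFrame below is hypothetical, and select_dtypes(include="number") is just one way to restrict the loop to numeric columns; an explicit list of column names would work as well.

```python
import numpy as np
import pandas as pd

def find_outliers_tukey(x):
    # Same Tukey-fence logic as the function in the question.
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    mask = (x < floor) | (x > ceiling)
    return list(x.index[mask]), list(x[mask])

# Hypothetical toy frame with one text column and one numeric column.
df = pd.DataFrame({
    "location": ["A", "B", "C", "D", "E"],
    "new_deaths": [0.0, 2.0, 3.0, 4.0, 500.0],
})

results = {}
# Iterate over (name, Series) pairs of the numeric columns only.
for feature, column in df.select_dtypes(include="number").items():
    results[feature] = find_outliers_tukey(column)
    print(f"Outliers in {feature} are {results[feature][1]}")
```

Here the location column is skipped entirely, and the value 500.0 in new_deaths is the only one outside the fences.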