
Context

I'm trying to find outliers in all columns of a pandas dataframe with Python.

Steps:

  1. Created a function to find outliers via IQR
  2. Tested the function on one column.
  3. Implemented the function on all columns with a for loop.

My level

I'm completely new to machine learning and data science. I only know Python and pandas, so I'm currently expanding my knowledge of machine learning. I don't yet know much of the theory, such as which data types machine learning algorithms can handle or why missing values are a problem.

Overview of the data

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2768 entries, 14421 to 98025
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   date                   2768 non-null   datetime64[ns]
 1   location               2768 non-null   object        
 2   new_deaths             2768 non-null   float64       
 3   female_smokers         2768 non-null   float64       
 4   male_smokers           2768 non-null   float64       
 5   population             2768 non-null   float64       
 6   people_vaccinated      2768 non-null   float64       
 7   cardiovasc_death_rate  2768 non-null   float64       
 8   aged_65_older          2768 non-null   float64       
 9   gdp_per_capita         2768 non-null   float64     
..... #The rest are indicator columns with dummy values that were categorical columns before.  
dtypes: datetime64[ns](1), float64(8), object(1)

Code to find outliers in one column

I created a function that computes the IQR and returns the indices and values of the outliers.

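import numpy as np

def find_outliers_tukey(x):
  # First and third quartiles of the column
  q1 = np.percentile(x, 25)
  q3 = np.percentile(x, 75)

  # Tukey fences: 1.5 * IQR beyond the quartiles
  iqr = q3-q1
  floor = q1 -1.5*iqr
  ceiling = q3 +1.5*iqr

  # Index labels and values of the points outside the fences
  outlier_indices = list(x.index[ (x < floor)|(x > ceiling) ])
  outlier_values = list(x.loc[outlier_indices])

  return outlier_indices, outlier_values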

When I call the function:

tukey_indices, tukey_values = find_outliers_tukey(df.new_deaths)
print(f"Outliers in new deaths are {np.sort(tukey_values)}")

output:

Outliers in new deaths are []

Question 1

Why does this return no outliers? The column statistics below suggest there should be some.

# Statistics of the new deaths column

Mean = 145.745266
std = 796.284067    
min = -1918.000000
25% = 0.000000
50% = 2.000000
75% = 18.000000
max = 18000.000000

Note: Looking at the stats, there's probably something seriously wrong with the data

Code to find outliers in all columns (for loop)

for feature in df.columns:
  tukey_indices, tukey_values = find_outliers_tukey(feature)
  print(f"Outliers in {feature} are {tukey_values} \n")

output:

UFuncTypeError                            Traceback (most recent call last)
<ipython-input-16-b01dad9e55a2> in <module>()
      1 for feature in df.columns:
----> 2   tukey_indices, tukey_values = find_outliers_tukey(feature)
      3   print(f"Outliers in {feature} are {tukey_values} \n")

4 frames
<__array_function__ internals> in percentile(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py in _quantile_ureduce_func(a, q, axis, out, overwrite_input, interpolation, keepdims)
   3965             n = np.isnan(ap[-1:, ...])
   3966 
-> 3967         x1 = take(ap, indices_below, axis=axis) * weights_below
   3968         x2 = take(ap, indices_above, axis=axis) * weights_above
   3969 

UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32')

Question 2

What does this error mean, and why am I getting it?


For Question 1, your code seems to work fine on my end, but of course I don't have your original data.
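As a sanity check, plugging the quartiles from your summary into Tukey's rule gives IQR = 18 - 0 = 18, floor = 0 - 1.5*18 = -27 and ceiling = 18 + 1.5*18 = 45, so the minimum of -1918 and the maximum of 18000 should both be flagged. Here is a minimal sketch on made-up data (the values and index labels below are invented to match those quartiles, they are not taken from your dataframe), with your function reproduced so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

def find_outliers_tukey(x):
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    outlier_indices = list(x.index[(x < floor) | (x > ceiling)])
    outlier_values = list(x.loc[outlier_indices])
    return outlier_indices, outlier_values

# Invented sample: 25th percentile is 0 and 75th percentile is 18,
# as in the summary statistics from the question.
new_deaths = pd.Series(
    [0.0, 0.0, 1.0, 2.0, 3.0, 18.0, 20.0, -1918.0, 18000.0],
    index=range(100, 109),
)

tukey_indices, tukey_values = find_outliers_tukey(new_deaths)
print(f"Outliers at {tukey_indices}: {tukey_values}")
```

If the same call returns an empty list on your real column, the difference is more likely in the data passed in than in the function itself.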

For Question 2, there are two problems. The first is that you are passing the column names (plain strings) to find_outliers_tukey instead of the columns themselves, which is why np.percentile complains about a string dtype ('<U32'). Use items (iteritems in older pandas; it was removed in pandas 2.0) to iterate over pairs of (column name, column Series):

for feature, column in df.items():
    tukey_indices, tukey_values = find_outliers_tukey(column)
    print(f"Outliers in {feature} are {tukey_values} \n")

The second problem, which you'll run into after solving the first, is that your location column is not a numeric column, so you won't be able to find outliers for it. Make sure to iterate only over the columns that you actually want to perform the calculation on.
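One way to restrict the loop to numeric columns is select_dtypes. The sketch below is only an illustration: the frame is invented, the compact find_outliers_tukey is a condensed stand-in for yours, and the outliers dict is a hypothetical name for collecting the results.

```python
import numpy as np
import pandas as pd

def find_outliers_tukey(x):
    # Condensed version of the function from the question
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
    return list(x.index[mask]), list(x[mask])

# Made-up frame for illustration only
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=9),
    "location": ["A"] * 9,
    "new_deaths": [0.0, 0.0, 1.0, 2.0, 3.0, 18.0, 20.0, -1918.0, 18000.0],
    "population": [1_000_000.0] * 9,
})

# Iterating over df.columns yields the column names (strings), not the data.
# select_dtypes drops the datetime and object columns before the loop.
numeric_df = df.select_dtypes(include="number")

outliers = {}
for feature, column in numeric_df.items():
    outliers[feature] = find_outliers_tukey(column)
    print(f"Outliers in {feature} are {outliers[feature][1]}")
```

Listing the wanted column names by hand works just as well; select_dtypes simply avoids having to maintain that list.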

2 Answers

The problem was probably with the NumPy function percentile and with how I passed my argument to the find_outliers_tukey function. These changes worked for me:

step 1

  1. Take two arguments: one for the dataframe, another for the name of the feature (column).
  2. Index the dataframe with the feature argument explicitly.
  3. Access the column with df[feature] instead of attribute access (df.feature), and use quantile instead of percentile.

import pandas as pd

def find_outliers_tukey(df: pd.DataFrame, feature: str) -> list:
  """Return the index labels of the rows where df[feature] is a Tukey (1.5 * IQR) outlier."""

  q1 = df[feature].quantile(0.25)
  q3 = df[feature].quantile(0.75)

  iqr = q3-q1
  floor = q1 -1.5*iqr
  ceiling = q3 +1.5*iqr

  outlier_indices = list(df.index[ (df[feature] < floor) | (df[feature] > ceiling) ])
  # The outlier values, if needed, are df.loc[outlier_indices, feature]
  return outlier_indices
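One possible reason the switch to quantile matters (this is an assumption, since the info() output in the question reports no nulls for new_deaths): np.percentile returns nan when its input contains NaN, whereas Series.quantile skips NaN. With nan fences, both comparisons are False for every row, so the outlier list comes back empty. A minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

# Made-up column containing one missing value
s = pd.Series([0.0, 2.0, 18.0, np.nan, 18000.0])

q_numpy = np.percentile(s, 25)   # nan, because the input contains NaN
q_pandas = s.quantile(0.25)      # computed on the non-missing values only

# Comparing against nan is always False, so nothing is flagged
mask_with_nan_fences = (s < q_numpy) | (s > q_numpy)

print(q_numpy, q_pandas, mask_with_nan_fences.any())
```

np.nanpercentile is the NumPy counterpart that ignores NaN, should the NumPy route be preferred.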

step 2

I put all the columns I wanted to remove outliers from into a list.

df_columns = list(df.columns[1:56])

step 3

No change here, except that the find_outliers_tukey function now takes two arguments instead of one. I also stored the indices of the outliers for future use.

index_list = []

for feature in df_columns: 
  index_list.extend(find_outliers_tukey(df, feature))

This gave me better statistical results for the columns.
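As a rough sketch of what the stored indices can be used for, assuming the goal is to drop every row that is an outlier in at least one column (the frame below is invented, and outlier_rows and df_clean are hypothetical names): because extend is called once per column, a row that is an outlier in several columns appears in index_list more than once, so it helps to deduplicate before dropping.

```python
import pandas as pd

def find_outliers_tukey(df, feature):
    q1 = df[feature].quantile(0.25)
    q3 = df[feature].quantile(0.75)
    iqr = q3 - q1
    mask = (df[feature] < q1 - 1.5 * iqr) | (df[feature] > q3 + 1.5 * iqr)
    return list(df.index[mask])

# Made-up data: row 108 is an outlier in both columns
df = pd.DataFrame(
    {
        "new_deaths": [0, 0, 1, 2, 3, 18, 20, -1918, 18000],
        "population": [5, 5, 5, 5, 5, 5, 500, 5, 900],
    },
    index=range(100, 109),
)

index_list = []
for feature in ["new_deaths", "population"]:
    index_list.extend(find_outliers_tukey(df, feature))

# Deduplicate the collected labels, then drop those rows
outlier_rows = sorted(set(index_list))
df_clean = df.drop(index=outlier_rows)
print(f"Dropped {len(outlier_rows)} of {len(df)} rows")
```

Whether dropping is appropriate depends on the data; values such as a negative death count may be reporting corrections rather than noise.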

Answered 2021-09-17T13:11:56.360

Answered 2021-09-17T03:59:44.977 (the answer to Questions 1 and 2 above)