Context
I'm trying to find outliers in every column of a pandas DataFrame with Python.
Steps:
- Created a function to find outliers via the IQR.
- Tested the function on one column.
- Implemented the function on all columns with a for loop.
My level
I'm completely new to machine learning and data science. I only know Python and pandas, so I'm currently expanding my knowledge of machine learning. I don't yet know much of the theory, such as which data types machine learning algorithms can handle or why missing values are a problem.
Overview of the data
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2768 entries, 14421 to 98025
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   date                   2768 non-null   datetime64[ns]
 1   location               2768 non-null   object
 2   new_deaths             2768 non-null   float64
 3   female_smokers         2768 non-null   float64
 4   male_smokers           2768 non-null   float64
 5   population             2768 non-null   float64
 6   people_vaccinated      2768 non-null   float64
 7   cardiovasc_death_rate  2768 non-null   float64
 8   aged_65_older          2768 non-null   float64
 9   gdp_per_capita         2768 non-null   float64
 ..... # The rest are indicator columns with dummy values that were categorical columns before.
dtypes: datetime64[ns](1), float64(8), object(1)
Code to find outliers in one column
I wrote a function that computes the IQR and returns the indices and values of the outliers.
import numpy as np

def find_outliers_tukey(x):
    # Quartiles and interquartile range of the column
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1
    # Tukey fences: anything outside [floor, ceiling] counts as an outlier
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    outlier_indices = list(x.index[(x < floor) | (x > ceiling)])
    outlier_values = list(x[outlier_indices])
    return outlier_indices, outlier_values
When I call the function:
tukey_indices, tukey_values = find_outliers_tukey(df.new_deaths)
print(f"Outliers in new deaths are {np.sort(tukey_values)}")
output:
Outliers in new deaths are []
Question 1
Why does this give me no outliers, given the statistics below?
# Statistics of the new deaths column
Mean = 145.745266
std = 796.284067
min = -1918.000000
25% = 0.000000
50% = 2.000000
75% = 18.000000
max = 18000.000000
Note: Judging by these statistics (for example, the negative minimum for new deaths), there is probably something seriously wrong with the data.
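As a sanity check on the rule itself, here is a minimal sketch of the same Tukey fences applied to a small made-up Series whose quartiles match the statistics above (25% = 0, 75% = 18). The values and index labels are invented for illustration and are not taken from my data, and the function uses a boolean mask rather than the index list from my version.

```python
import numpy as np
import pandas as pd


def find_outliers_tukey(x):
    """Return index labels and values that fall outside the Tukey fences."""
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    # Boolean mask marking values below the lower or above the upper fence
    mask = (x < floor) | (x > ceiling)
    outliers = x[mask]
    return outliers.index.tolist(), outliers.tolist()


# Hypothetical toy data: quartiles are 0 and 18, so the fences are -27 and 45
toy = pd.Series(
    [-1918.0, 0.0, 0.0, 0.0, 2.0, 2.0, 18.0, 18.0, 18000.0],
    index=range(100, 109),
)
toy_indices, toy_values = find_outliers_tukey(toy)
print(f"Outlier indices: {toy_indices}")
print(f"Outlier values: {toy_values}")
```

On this toy input the rule flags both extreme values, so an empty result on the real column may point to the data being passed in rather than to the formula.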
Code to find outliers in all columns (for loop)
for feature in df.columns:
    tukey_indices, tukey_values = find_outliers_tukey(feature)
    print(f"Outliers in {feature} are {tukey_values} \n")
output:
UFuncTypeError                            Traceback (most recent call last)
<ipython-input-16-b01dad9e55a2> in <module>()
      1 for feature in df.columns:
----> 2     tukey_indices, tukey_values = find_outliers_tukey(feature)
      3     print(f"Outliers in {feature} are {tukey_values} \n")

4 frames
<__array_function__ internals> in percentile(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py in _quantile_ureduce_func(a, q, axis, out, overwrite_input, interpolation, keepdims)
   3965         n = np.isnan(ap[-1:, ...])
   3966
-> 3967     x1 = take(ap, indices_below, axis=axis) * weights_below
   3968     x2 = take(ap, indices_above, axis=axis) * weights_above
   3969

UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32')
Question 2
What does this error mean, and why am I getting it?
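For comparison, here is a minimal sketch of a loop that passes each column's data rather than its label, and only for numeric columns. It assumes the error arises because iterating over df.columns yields column names (strings), which would be consistent with the '<U32' string dtype in the message. The names toy_df and outliers_by_column and all the values are made up for illustration.

```python
import numpy as np
import pandas as pd


def find_outliers_tukey(x):
    """Return index labels and values that fall outside the Tukey fences."""
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)
    iqr = q3 - q1
    floor = q1 - 1.5 * iqr
    ceiling = q3 + 1.5 * iqr
    outliers = x[(x < floor) | (x > ceiling)]
    return outliers.index.tolist(), outliers.tolist()


# Hypothetical frame with one text column and two numeric columns
toy_df = pd.DataFrame(
    {
        "location": ["A", "B", "C", "D", "E"],
        "new_deaths": [0.0, 2.0, 3.0, 4.0, 500.0],
        "population": [10.0, 11.0, 12.0, 13.0, 14.0],
    }
)

outliers_by_column = {}
# select_dtypes keeps only numeric columns, so text and dates are skipped
for feature in toy_df.select_dtypes(include="number").columns:
    # Pass the column itself (a Series), not its name (a string)
    indices, values = find_outliers_tukey(toy_df[feature])
    outliers_by_column[feature] = values
    print(f"Outliers in {feature} are {values}")
```

Filtering to numeric columns first is one design choice among several; the date and location columns could also be skipped by name.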