python - Overlaying multiple histograms using pandas

Question

I have two or three csv files with the same header and would like to draw the histograms for each column overlaying one another on the same plot.

The following code gives me two separate figures, each containing all histograms for each of the files. Is there a compact way to go about plotting them together on the same figure using pandas/matplot lib? I imagine something close to this but using dataframes.

Code:

import pandas as pd
import matplotlib.pyplot as plt

df =  pd.read_csv('input1.csv')
df2 = pd.read_csv('input2.csv')
df.hist(bins=20)
df2.hist(bins=20)

plt.show()

score 5 · Accepted Answer

In [18]: from pandas import DataFrame

In [19]: from numpy.random import randn

In [20]: df = DataFrame(randn(10, 2))

In [21]: df2 = DataFrame(randn(10, 2))

In [22]: axs = df.hist()

In [23]: for ax, (colname, values) in zip(axs.flat, df2.iteritems()):
   ....:     values.hist(ax=ax, bins=10)
   ....:

In [24]: draw()

gives enter image description here

score 2 · Accepted Answer

The main issue of overlaying the histograms of two (or more) dataframes containing the same variables in side-by-side plots within a single figure has been already solved in the answer by Phillip Cloud.

This answer provides a solution to the issue raised by the author of the question (in the comments to the accepted answer) regarding how to enforce the same number of bins and range for the variables common to both dataframes. This can be accomplished by creating a list of bins common to all variables of both dataframes. In fact, this answer goes a little bit further by adjusting the plots for cases where the different variables contained in each dataframe cover slightly different ranges (but still within the same order of magnitude), as illustrated in the following example:

import numpy as np                   # v 1.19.2
import pandas as pd                  # v 1.1.3
import matplotlib.pyplot as plt      # v 3.3.2
from matplotlib.lines import Line2D

# Set seed for random data
rng = np.random.default_rng(seed=1)

# Create two similar dataframes each containing two random variables,
# with df2 twice the size of df1
df1_size = 1000
df1 = pd.DataFrame(dict(var1 = rng.exponential(scale=1.0, size=df1_size),
                        var2 = rng.normal(loc=40, scale=5, size=df1_size)))
df2_size = 2*df1_size
df2 = pd.DataFrame(dict(var1 = rng.exponential(scale=2.0, size=df2_size),
                        var2 = rng.normal(loc=50, scale=10, size=df2_size)))

# Combine the dataframes to extract the min/max values of each variable
df_combined = pd.concat([df1, df2])
vars_min = [df_combined[var].min() for var in df_combined]
vars_max = [df_combined[var].max() for var in df_combined]

# Create custom bins based on the min/max of all values from both
# dataframes to ensure that in each histogram the bins are aligned
# making them easily comparable
nbins = 30
bin_edges, step = np.linspace(min(vars_min), max(vars_max), nbins+1, retstep=True)

# Create figure by combining the outputs of two pandas df.hist() function
# calls using the 'step' type of histogram to improve plot readability
htype = 'step'
alpha = 0.7
lw = 2
axs = df1.hist(figsize=(10,4), bins=bin_edges, histtype=htype,
               linewidth=lw, alpha=alpha, label='df1')
df2.hist(ax=axs.flatten(), grid=False, bins=bin_edges, histtype=htype,
         linewidth=lw, alpha=alpha, label='df2')

# Adjust x-axes limits based on min/max values and step between bins, and
# remove top/right spines: if, contrary to this example dataset, var1 and
# var2 cover the same range, setting the x-axes limits with this loop is
# not necessary
for ax, v_min, v_max in zip(axs.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

# Edit legend to get lines as legend keys instead of the default polygons:
# use legend handles and labels from any of the axes in the axs object
# (here taken from first one) seeing as the legend box is by default only
# shown in the last subplot when using the plt.legend() function.
handles, labels = axs.flatten()[0].get_legend_handles_labels()
lines = [Line2D([0], [0], lw=lw, color=h.get_facecolor()[:-1], alpha=alpha)
         for h in handles]
plt.legend(lines, labels, frameon=False)

plt.suptitle('Pandas', x=0.5, y=1.1, fontsize=14)
plt.show()

It is worth noting that the seaborn package provides a more convenient way to create this kind of plot, where contrary to pandas the bins are automatically aligned. The only downside is that the dataframes must first be combined and reshaped to long format, as shown in this example using the same dataframes and bins as before:

import seaborn as sns    # v 0.11.0

# Combine dataframes and convert the combined dataframe to long format
df_concat = pd.concat([df1, df2], keys=['df1','df2']).reset_index(level=0)
df_melt = df_concat.melt(id_vars='level_0', var_name='var_id')

# Create figure using seaborn displot: note that the bins are automatically
# aligned thanks the 'common_bins' parameter of the seaborn histplot function
# (called here with 'kind='hist'') that is set to True by default. Here, the
# bins from the previous example are used to make the figures more comparable.
# Also note that the facets share the same x and y axes by default, this can
# be changed when var1 and var2 have different ranges and different
# distribution shapes, as it is the case in this example.
g = sns.displot(df_melt, kind='hist', x='value', col='var_id', hue='level_0',
                element='step', bins=bin_edges, fill=False, height=4,
                facet_kws=dict(sharex=False, sharey=False))

# For some reason setting sharex as above does not automatically adjust the
# x-axes limits (even when not setting a bins argument, maybe due to a bug
# with this package version) which is why this is done in the following loop,
# but note that you still need to set 'sharex=False' in displot, or else
# 'ax.set.xlim' will have no effect.
for ax, v_min, v_max in zip(g.axes.flatten(), vars_min, vars_max):
    ax.set_xlim(v_min-2*step, v_max+2*step)

# Additional formatting
g.legend.set_bbox_to_anchor((.9, 0.75))
g.legend.set_title('')
plt.suptitle('Seaborn', x=0.5, y=1.1, fontsize=14)

plt.show()

As you may notice, the histogram line is cut off at the limits of the list of bin edges (not visible on the maximum side due to scale). To get a line more similar to the example with pandas, an empty bin can be added at each extremity of the list of bins, like this:

bin_edges = np.insert(bin_edges, 0, bin_edges.min()-step)
bin_edges = np.append(bin_edges, bin_edges.max()+step)

This example also illustrates the limits to this approach of setting common bins for both facets. Seeing as the ranges of var1 var2 are somewhat different and that 30 bins are used to cover the combined range, the histogram for var1 contains rather few bins and the histogram for var2 has slightly more bins than necessary. To my knowledge, there is no straightforward way of assigning a different list of bins to each facet when calling the plotting functions df.hist() and displot(df). So for cases where variables cover significantly different ranges, these figures would have to be created from scratch using matplotlib or some other plotting library.

python - Overlaying multiple histograms using pandas

2 回答 2

Related

Reference