How to Create a Histogram in Matplotlib?

Histogram in Matplotlib

Histograms are used in data visualization to show the distribution of numerical data. They can summarize discrete or continuous data. Each bar in a histogram represents the frequency (the number of occurrences) of data points within a specific range of values called a bin.

Histograms are particularly useful for understanding the shape of the data, such as whether the distribution is symmetric or skewed, or whether there are any outliers or unusual patterns. They let us quickly summarize large datasets and spot patterns and potential issues within the data.

How to Create a Histogram in Matplotlib?

To create a histogram in Matplotlib, we use the plt.hist() function. This function takes one required parameter, x, which holds the values of the variable you want to plot. x can be either a single array or a sequence of arrays, and the arrays do not need to be the same length.
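To make the "sequence of arrays" case concrete, here is a minimal sketch. It uses synthetic data rather than the datasets in this post, and the names sample_a and sample_b are made up for illustration. It also uses the values that hist() returns (counts, bin edges, and the drawn patches), which the rest of this post does not rely on.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (an assumption for illustration): two samples of different lengths
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.0, scale=1.0, size=200)
sample_b = rng.normal(loc=2.0, scale=1.0, size=500)

fig, ax = plt.subplots(figsize=(8, 6))

# Passing a list of arrays draws one set of bars per array over shared bins
counts, edges, patches = ax.hist([sample_a, sample_b], bins=10,
                                 label=['sample_a', 'sample_b'])
ax.legend()

print(counts.shape)        # one row of counts per input array
print(counts.sum(axis=1))  # each row sums to the length of its array
# In a script, call plt.show() to display the figure
```

Because both arrays share the same bin edges, the bars are drawn side by side in each bin, which is one more way to compare two groups without overlap.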

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import matplotlib.pyplot as plt

# read Nvidia Share market data
nvidia = pd.read_csv('../data/NVDA.csv', parse_dates=['Date'])
nvidia.head()

Nvidia stock market data

Let’s create a histogram of the Adj Close price column.

# create a histogram
plt.figure(figsize=(8, 6))
plt.hist(nvidia['Adj Close'])
plt.title('Histogram of Nvidia Adj Close Prices')
plt.xlabel('Adjusted Close Price')
plt.ylabel('Frequency')
plt.show()

Histogram of Nvidia close prices

We can see that the data is right-skewed: most of the values are between about $40 and $300, and there is a long tail on the right side of the plot. To get a better sense of the data, you can supplement the histogram with summary statistics, which are easy to calculate with the pandas describe() method.

# summary statistics
nvidia['Adj Close'].describe()

Summary statistics

The minimum close price is $33 and the maximum is $739. The mean close price is around $192 and the median is $153.
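A mean that sits above the median is what you would expect from right-skewed data, because the long right tail pulls the mean up. The sketch below checks this on synthetic, log-normally distributed values. These are an assumption for illustration, not the Nvidia prices.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (illustrative only, not the NVDA.csv values)
rng = np.random.default_rng(1)
prices = pd.Series(rng.lognormal(mean=5.0, sigma=0.6, size=2000), name='price')

stats = prices.describe()
print(stats[['min', '50%', 'mean', 'max']])  # '50%' is the median

# For a right-skewed sample the mean is larger than the median
skewed_right = stats['mean'] > stats['50%']
print(skewed_right)
```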

Number of Bins

The number of bins in the histogram can be adjusted using the bins parameter. By default, Matplotlib uses 10 equal-width bins. Let’s try 20 and 30 bins.

# histograms with a custom number of bins
for n_bins in [20, 30]:
    plt.hist(nvidia['Adj Close'], bins=n_bins, label=f'bins={n_bins}')
    plt.title('Histogram of Nvidia Adj Close Prices')
    plt.xlabel('Adjusted Close Price')
    plt.ylabel('Frequency')
    plt.legend()
    plt.show()

Histogram with bins=20

Histogram with bins=30
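When bins is an integer, Matplotlib splits the data range into that many equal-width bins. The bins parameter also accepts an explicit sequence of bin edges, which is handy when you want round bin boundaries. The sketch below shows both forms on synthetic data. The $50 bin width is an arbitrary choice for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic right-skewed "prices" (illustrative only)
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=5.0, sigma=0.6, size=1000)

# An integer gives that many equal-width bins, so there are bins + 1 edges
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(prices, bins=20)
width = edges[1] - edges[0]
print(len(edges), width)

# A sequence gives the bin edges directly: here, fixed 50-wide bins from 0
fixed_edges = np.arange(0, prices.max() + 50, 50)
fig2, ax2 = plt.subplots()
counts2, edges2, _ = ax2.hist(prices, bins=fixed_edges)
print(edges2[:4])
```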

Color

To change the color of the bars, use the color parameter. Since Nvidia’s brand color is green, let’s use a shade of green.

plt.figure(figsize=(8, 6))
plt.hist(nvidia['Adj Close'], color='seagreen', bins=30)
plt.title('Histogram of Nvidia Adj Close Prices')
plt.xlabel('Adjusted Close Price')
plt.ylabel('Frequency')
plt.show()

Histogram with custom color

Multiple Histograms

You can also draw multiple histograms on the same plot. Let’s read the Framingham Heart Study data to illustrate this.

framingham = pd.read_csv('../data/framingham.csv')
framingham.head()

Framingham Heart Study data

Now, let’s plot histograms of systolic blood pressure for males and females.

male_df = framingham[framingham['sex']=='Male']
female_df = framingham[framingham['sex']=='Female']

# Plot histograms for Systolic blood pressure for Male and Female
plt.figure(figsize=(8, 6))
plt.hist(male_df['sysBP'], bins=30, label='Male', color='tab:blue')       # Male
plt.hist(female_df['sysBP'], bins=30, label='Female', color='tab:red')    # Female
plt.title('Histogram of Systolic Blood Pressure by Gender')
plt.xlabel('Systolic Blood Pressure (sysBP)')
plt.ylabel('Frequency')
plt.legend()
plt.show()

Systolic blood pressure of Male and Female

Although we plotted separate histograms for males and females, the male bars are partly hidden behind the female bars because the second histogram is drawn on top of the first. There are several ways to solve this problem. One simple solution is to use the alpha parameter to make the histogram on top semi-transparent.

# histograms of Systolic blood pressure for Male and Female
plt.figure(figsize=(8, 6))
plt.hist(male_df['sysBP'], bins=30, label='Male', color='tab:blue') 
plt.hist(female_df['sysBP'], bins=30, alpha=0.5, label='Female', color='tab:red')    
plt.title('Histogram of Systolic Blood Pressure by Gender')
plt.xlabel('Systolic Blood Pressure (sysBP)')
plt.ylabel('Frequency')
plt.legend()
plt.show()

Histogram with alpha parameter

Histogram Types

Another way to solve this problem is the histtype parameter of plt.hist(). Setting it to 'step' draws only the outline of the histogram, which is unfilled by default. The other options are 'bar' (the default), a traditional bar-type histogram; 'barstacked', a bar-type histogram in which multiple datasets are stacked on top of each other; and 'stepfilled', a variation of 'step' that is filled by default.

# histograms for Systolic blood pressure for Male and Female
plt.figure(figsize=(8, 6))

plt.hist(male_df['sysBP'], bins=30, histtype='step', 
         label='Male', color='blue') 

plt.hist(female_df['sysBP'], bins=30, histtype='step', 
         label='Female', color='red')  

plt.title('Histogram of Systolic Blood Pressure by Gender')
plt.xlabel('Systolic Blood Pressure (sysBP)')
plt.ylabel('Frequency')
plt.legend()
plt.show()

Histogram with step hist type
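The 'barstacked' type only makes sense when several datasets are passed in a single hist() call. The following sketch compares it with the default 'bar' type on synthetic data. The names group_a and group_b are placeholders, not columns from the Framingham file. One thing to note is that for stacked histograms the returned counts are cumulative across datasets.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic blood-pressure-like values (illustrative only)
rng = np.random.default_rng(2)
group_a = rng.normal(130, 20, size=300)
group_b = rng.normal(135, 25, size=400)

fig, (ax_bar, ax_stack) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# 'bar': bars for each dataset are placed side by side within each bin
side_counts, edges, _ = ax_bar.hist([group_a, group_b], bins=15,
                                    histtype='bar', label=['A', 'B'])
ax_bar.set_title("histtype='bar'")
ax_bar.legend()

# 'barstacked': bars for the second dataset sit on top of the first
stack_counts, stack_edges, _ = ax_stack.hist([group_a, group_b], bins=15,
                                             histtype='barstacked',
                                             label=['A', 'B'])
ax_stack.set_title("histtype='barstacked'")
ax_stack.legend()

# The last row of the stacked counts is the running total over both datasets
print(stack_counts[-1].sum())
```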

Density

The plt.hist() function also has a density parameter that plots probability density instead of frequency. With frequency, the height of each bar is the number of observations in that bin. With density=True, Matplotlib normalizes the histogram so that the total area of the bars is 1: the height of each bar is the bin count divided by the total number of observations times the bin width. The area of a bar (height × bin width), not its height, is therefore the proportion of observations that fall in that bin.

The density parameter is useful when you want to compare the shape of distribution rather than absolute counts, especially when dealing with datasets of different sizes.

# histograms of sysBP for Male and Female with density=True
plt.figure(figsize=(8, 6))

plt.hist(male_df['sysBP'], bins=30, histtype='step', 
         label='Male', color='blue', density=True)

plt.hist(female_df['sysBP'], bins=30, histtype='step', 
         label='Female', color='red', density=True)

plt.title('Normalized Histogram of Systolic Blood Pressure by Gender')
plt.xlabel('Systolic Blood Pressure (sysBP)')
plt.ylabel('Density')
plt.legend()
plt.show()

Normalized Histogram
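To see what the normalization does, the sketch below recomputes the density by hand from plain counts and compares it with what hist() returns for density=True. The data is synthetic and the variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (illustrative only)
rng = np.random.default_rng(3)
values = rng.normal(loc=130, scale=20, size=1000)

# Plain counts per bin, computed with NumPy
counts, edges = np.histogram(values, bins=30)
widths = np.diff(edges)

# Density as drawn by Matplotlib
fig, ax = plt.subplots()
density, _, _ = ax.hist(values, bins=30, density=True)

# density = count / (total observations * bin width)
manual = counts / (counts.sum() * widths)
print(np.allclose(density, manual))

# The bar areas sum to 1, but the bar heights generally do not
area = (density * widths).sum()
print(area, density.sum())
```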

Orientation

The orientation parameter of plt.hist() sets the direction of the histogram bars. By default, histograms are plotted vertically. Setting orientation='horizontal' rotates the histogram so that the data values run along the y-axis and the bars extend horizontally, with counts or densities on the x-axis.

# horizontal histogram
plt.figure(figsize=(8, 6))

plt.hist(male_df['sysBP'], bins=30, histtype='step', 
         label='Male', color='blue', orientation='horizontal')

plt.hist(female_df['sysBP'], bins=30, histtype='step', 
         label='Female', color='red', orientation='horizontal')

plt.title('Horizontal Histogram of Systolic Blood Pressure by Gender')
plt.xlabel('Frequency')
plt.ylabel('Systolic Blood Pressure (sysBP)')
plt.legend()
plt.show()

Horizontal Histogram

2D Histogram

A 2D histogram is used to represent the joint distribution of two variables by dividing the plane into bins and counting the number of observations in each bin. It is useful for visualizing the relationship between two variables, similar to a scatter plot but with a focus on the density of points.

To create a 2D histogram in Matplotlib, use the plt.hist2d() function. Let’s create a 2D histogram to visualize the relationship between systolic and diastolic blood pressure.

# 2D histogram of blood pressure
plt.hist2d(framingham['sysBP'], framingham['diaBP'], bins=30)
plt.colorbar() 
plt.xlabel('Systolic Blood Pressure')
plt.ylabel('Diastolic Blood Pressure')
plt.title('2D Histogram of Blood Pressure')
plt.show()

2D Histogram of blood pressure
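Like hist(), hist2d() returns the binned data along with the plot, which is useful if you want to inspect the counts behind the colors. Here is a minimal sketch on synthetic, loosely correlated values. The numbers are made up and are not the Framingham measurements.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, positively correlated pairs (illustrative only)
rng = np.random.default_rng(4)
systolic = rng.normal(130, 20, size=2000)
diastolic = 0.5 * systolic + rng.normal(18, 8, size=2000)

fig, ax = plt.subplots(figsize=(8, 6))

# h[i, j] is the number of points in x-bin i and y-bin j
h, xedges, yedges, image = ax.hist2d(systolic, diastolic, bins=30)
fig.colorbar(image, ax=ax, label='Count')
ax.set_xlabel('Systolic (synthetic)')
ax.set_ylabel('Diastolic (synthetic)')

print(h.shape)   # one cell per (x-bin, y-bin) pair
print(h.sum())   # every point falls into exactly one cell
```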

Exercise 4.1

  1. Create a Histogram of BMI (Body Mass Index)

  2. Create a Multiple Histogram of BMI based on Gender.

  3. Apply various strategies to rectify the plot if there is too much overlap.

  4. Create a density histogram of BMI.

  5. Create a 2D histogram of BMI and totChol (Total cholesterol).

Summary

  • To create a histogram in Matplotlib, use the plt.hist() function.

  • To change the number of bins, use the bins parameter.

  • To change the color of the histogram, use the color parameter.

  • To draw another histogram on the same plot, call plt.hist() again.

  • Use the alpha parameter to add transparency to the histogram.

  • Use the histtype parameter to create different types of histograms.

  • Use the density parameter to create a density histogram.

  • Use orientation='horizontal' to create a horizontal histogram.

  • To create a 2D histogram, use the plt.hist2d() function.

Solution

Exercise 4.1

# 1. Create a Histogram of BMI
plt.figure(figsize=(8, 6))
plt.hist(framingham['BMI'], color='crimson')
plt.title('Histogram of Body Mass Index')
plt.xlabel('Body Mass Index')
plt.ylabel('Frequency')
plt.show()

Histogram of body mass index

# 2. Create a Multiple Histogram of BMI based on Gender.
male_df = framingham[framingham['sex']=='Male']
female_df = framingham[framingham['sex']=='Female']

plt.figure(figsize=(8, 6))
plt.hist(male_df['BMI'], color='tab:blue', label='Male')
plt.hist(female_df['BMI'], color='seagreen', label='Female')
plt.title('Histogram of BMI by Gender')
plt.xlabel('BMI')
plt.ylabel('Frequency')
plt.legend()
plt.show()

Histogram of BMI by Gender

# 3. Rectify the problem of overlapping by using alpha

plt.figure(figsize=(8, 6))
plt.hist(male_df['BMI'], color='tab:blue', label='Male')
plt.hist(female_df['BMI'], color='seagreen', alpha=0.5, label='Female')
plt.title('Histogram of BMI by Gender')
plt.xlabel('BMI')
plt.ylabel('Frequency')
plt.legend()
plt.show()

Histogram with alpha parameter

# 3. Rectify the problem of overlapping by changing histogram type

plt.figure(figsize=(8, 6))

plt.hist(male_df['BMI'], color='tab:blue', 
         histtype='step', label='Male')

plt.hist(female_df['BMI'], color='seagreen', 
         histtype='step', label='Female')

plt.title('Histogram of BMI by Gender')
plt.xlabel('BMI')
plt.ylabel('Frequency')
plt.legend()
plt.show()

Histogram with step hist type

# 4. Create a density histogram.
plt.figure(figsize=(8, 6))

plt.hist(male_df['BMI'], color='crimson', density=True,
         histtype='step', label='Male')

plt.hist(female_df['BMI'], color='green', density=True,
         histtype='step', label='Female')

plt.title('Histogram of BMI by Gender')
plt.xlabel('BMI')
plt.ylabel('Density')
plt.legend()
plt.show()

density histogram

# 5. Create a 2D histogram of BMI and totChol 

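# plt.hist2d() relies on np.histogram2d(), which raises an error if the data
# contains NaN, so drop incomplete rows first (a no-op if nothing is missing)
bmi_chol = framingham[['BMI', 'totChol']].dropna()

plt.figure(figsize=(8, 6))
plt.hist2d(bmi_chol['BMI'], bmi_chol['totChol'])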
plt.colorbar()
plt.title('2D Histogram of BMI and Total Cholesterol')
plt.xlabel('BMI')
plt.ylabel('Total Cholesterol')
plt.show()

2D histogram of BMI and Cholesterol
