NumPy is the main Python library for numerical computing.
Pandas is built on top of NumPy for working with tabular data (DataFrames).
Matplotlib is used for plotting charts and graphs; Seaborn, built on top of Matplotlib, provides enhanced statistical visualizations.
Python's built-in statistics module can also be used for basic calculations.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statistics as stats
Before we proceed, we need some numbers to work with. We can generate random numbers using a function from the numpy library. The function below returns an array of 100 random integers between 1 and 10 (excluding 10).
# Generating an array of random numbers between 1 and 9, of size 100
np.random.randint(1,10,100)
Note: NumPy can also sample from other distributions, including the normal, uniform, exponential, and Poisson distributions (np.random.normal, np.random.uniform, np.random.exponential, np.random.poisson). For more advanced distribution work, such as density functions, CDFs, and fitting, you'll need a library like SciPy.
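As a quick illustration, here is how to draw samples from a few of these distributions using NumPy's recommended Generator interface; the seed, sizes, and parameters are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducibility

normal_sample = rng.normal(loc=0, scale=1, size=100)   # mean 0, std 1
uniform_sample = rng.uniform(low=0, high=1, size=100)  # values in [0, 1)
poisson_sample = rng.poisson(lam=3, size=100)          # average event rate of 3

print(normal_sample.shape, uniform_sample.shape, poisson_sample.shape)
```

Seeding the generator makes the "random" output repeatable, which is handy when following along with a tutorial.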
Measuring Central Tendencies: Mean, Median, Mode
Numpy has a set of helper functions to calculate the mean, median, variance, and standard deviation. However, it lacks built-in functions for the mode, skewness, and kurtosis. Pandas, on the other hand, has all of the above. For that reason, Pandas is preferable: it tabulates the data, making it easy to inspect, and it comes with more statistical methods as well.
Note: Pandas is built on top of NumPy, but its defaults sometimes differ. For example, df.var() and df.std() use the sample formula (ddof=1), while np.var() and np.std() default to the population formula (ddof=0).
# Calculating mean, median, mode
df = pd.DataFrame({ # Table with column 'nums'
'nums': [1,2,3]
})
mean = df['nums'].mean()
median = df['nums'].median()
mode = df['nums'].mode() # returns a Series, since a dataset can have more than one mode
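A small runnable example with made-up numbers, showing these methods in action, including the multi-valued result of .mode():

```python
import pandas as pd

df = pd.DataFrame({'nums': [1, 2, 2, 3, 3, 9]})

print(df['nums'].mean())    # 20 / 6 = 3.33...
print(df['nums'].median())  # middle two values are 2 and 3, so 2.5
print(df['nums'].mode())    # 2 and 3 each appear twice, so mode() returns both
```

Because mode() returns a Series rather than a scalar, use df['nums'].mode()[0] if you only want the first (smallest) mode.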
Measuring Spread & Dispersion
Variance measures how widely values are scattered from the mean, expressed in squared units. Standard deviation, on the other hand, measures the typical distance of values from the mean, expressed in the same units as the original data. Standard deviation is simply the square root of the variance; even though the two metrics are closely related, they differ in interpretation because of their units.
# Calculating variance and standard deviation
variance = df['nums'].var()
std_dev = df['nums'].std()
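One caveat worth checking yourself: pandas and NumPy disagree here by default. A small comparison on sample data showing that s.var() (ddof=1) and np.var() (ddof=0) differ unless you align the ddof argument:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])  # mean is 3, squared deviations sum to 10

print(s.var())                        # 10 / 4 = 2.5 (sample variance, ddof=1)
print(np.var(s.to_numpy()))           # 10 / 5 = 2.0 (population variance, ddof=0)
print(np.var(s.to_numpy(), ddof=1))   # 2.5, matching the pandas result
```

For data treated as a sample from a larger population, the ddof=1 (pandas) convention is usually the one you want.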
When interpreting these values, consider the following:
Larger values indicate greater dispersion or spread in the dataset.
Smaller values suggest smaller dispersion or less spread (closely clustered).
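To make this interpretation concrete, here is a toy comparison (numbers chosen for illustration) of two datasets with the same mean but very different spread:

```python
import pandas as pd

clustered = pd.Series([9, 10, 10, 10, 11])  # values close to the mean of 10
scattered = pd.Series([1, 5, 10, 15, 19])   # values far from the mean of 10

print(clustered.mean(), scattered.mean())   # both 10.0
print(clustered.std())  # small: tightly clustered
print(scattered.std())  # large: widely dispersed
```

The mean alone cannot tell these two datasets apart; the standard deviation is what distinguishes them.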
Dealing with outliers
Sometimes you may encounter values that are dramatically distant from the rest of the dataset. These are outliers; they may indicate accidental or deliberate errors in the data, or they may be genuine but extreme observations. It is important to identify and handle outliers because they can distort our analyses (e.g. the mean and variance) and lead to inaccurate conclusions.
There are several methods to detect and handle outliers:
Visualization: plotting histograms, box plots, or scatter plots. A box plot, for example, displays outliers as individual points beyond its whiskers.
Statistical methods: e.g. determine thresholds and remove data points exceeding these limits. Commonly used rules include 1.5 times the interquartile range (IQR) beyond the quartiles, or 2-3 standard deviations from the mean.
Trimming: remove the records with outliers from the dataset before doing the analysis.
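These steps can be combined into a short sketch of IQR-based trimming; the 1.5 multiplier is the conventional choice rather than a fixed rule, and the numbers are fabricated so that one value is an obvious outlier:

```python
import pandas as pd

df = pd.DataFrame({'nums': [8, 9, 10, 10, 11, 12, 95]})  # 95 is the outlier

q1 = df['nums'].quantile(0.25)
q3 = df['nums'].quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr  # lower fence
upper = q3 + 1.5 * iqr  # upper fence

# Keep only the rows that fall within the fences (trimming)
trimmed = df[(df['nums'] >= lower) & (df['nums'] <= upper)]
print(trimmed['nums'].tolist())
```

Whether to trim, cap, or keep an outlier depends on whether it is an error or a genuine extreme value, so inspect flagged points before dropping them.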
Skew and Kurtosis
Skewness and kurtosis are statistical measures that describe the shape of a dataset's probability distribution. They help quantify the degree of symmetry and peakedness in a dataset, respectively.
# Calculating skew and kurtosis
skewness = df['nums'].skew()
kurtosis = df['nums'].kurt() # excess kurtosis: approximately 0 for normally distributed data
A negatively skewed distribution has a long tail on the left, so most values sit above the mean; a positively skewed distribution has a long tail on the right, with most values below the mean.
High kurtosis relative to a normal distribution indicates heavy tails and a sharp peak (many data points clustered around the mean, plus more extreme values), whereas low kurtosis suggests light tails and a flatter peak (values spread more evenly, with few extremes).
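For instance, a sample with one large value creating a long right tail (numbers made up for illustration) should report positive skewness:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])  # the 20 creates a long right tail

print(s.skew())       # positive: the distribution is right-skewed
print(s.skew() > 0)   # True
```

Mirroring the data (e.g. negating every value) would flip the sign of the skewness, giving a left-skewed distribution.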
Correlation
Correlation is a measure of the strength and direction of the relationship between two variables. It indicates how much one variable changes in relation to another variable and it only applies to numeric data.
Positive correlation implies that as one variable increases, the other tends to increase too. Conversely, negative correlation means that as one variable increases, the other tends to decrease.
# Calculate the pairwise correlation between numeric columns
# NOTE: Drop non-numeric columns first, build a numeric-only dataframe,
# or pass numeric_only=True (pandas 1.5+)
df.corr(numeric_only=True)
Note: Correlation does not imply causation. A strong correlation between two variables does not prove that one causes the other to change.
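Here is a self-contained example on a small dataframe with fabricated column names and numbers, one column rising with the first and one falling, so the correlation matrix shows both signs:

```python
import pandas as pd

df = pd.DataFrame({
    'hours_studied': [1, 2, 3, 4, 5],
    'exam_score':    [52, 60, 71, 80, 88],  # rises with hours_studied
    'errors_made':   [30, 24, 18, 11, 5],   # falls as hours_studied rises
})

corr = df.corr(numeric_only=True)
print(corr)
# hours_studied vs exam_score is close to +1; vs errors_made, close to -1
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or no linear relationship.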