Covariance Matrix

In my first machine learning class, in order to learn the theory behind PCA (Principal Component Analysis), we had to learn about the variance-covariance matrix. I was concurrently taking a basic course in theoretical probability and statistics, so even the idea of variance was still vague to me. Despite repeated attempts to understand covariance, I had trouble fully capturing the intuition behind the covariance between two random variables. Even now, applying the mathematical properties of covariance and verifying that I am using them correctly requires intensive googling. I want to use this post to review covariance and its properties.

Basic Idea: The covariance of two random variables \(X\) and \(Y\), also written as \(cov[X,Y]\) or \(\sigma_{XY}\), measures the strength and direction of the linear relationship between the two random variables. Covariance is easily visualized by looking at samples from a bivariate normal distribution.

Consider a (non-degenerate) bivariate normal distribution with random variables X and Y, where \(\mu_X = \mu_Y = 0\), \(\sigma_X = \sigma_Y = 1\), and \(\sigma_{XY}\) takes different values.

library(MASS)
library(mixtools)
set.seed(1234)
# Function that draws N samples from a bivariate normal distribution,
# plots them, and overlays constant-probability ellipses
gen_binormplot <- function(N, mu_vec, Sigma, ...){
  biv_norm <- mvrnorm(
    n = N,
    mu = mu_vec,
    Sigma = Sigma
  )
  plot(x = biv_norm[, 1],
       y = biv_norm[, 2],
       main = bquote('bivariate normal distribution when' ~ sigma['xy'] == .(Sigma[1, 2])),
       xlab = 'x',
       ylab = 'y',
       col = 'dark grey',
       ...)
  # draw the 95%, 80%, 60%, 40% and 20% coverage ellipses
  # (mixtools::ellipse covers 1 - alpha of the distribution)
  for (ellipse_alpha in c(0.05, 0.2, 0.4, 0.6, 0.8)){
    ellipse(mu_vec,
            Sigma,
            alpha = ellipse_alpha,
            type = 'l',
            col = 'red',
            lwd = 1.5)
  }
}

N <- 1000
mu_x <- 0; mu_y <- 0
var_x <- 1; var_y <- 1
cov_xy <- c(-0.9, 0.9, -0.7, 0.7, -0.3, 0.3, 0)

# one panel per covariance value, all on the same scale
par(mfrow = c(2,2), mai = c(0.5, 0.5, 0.5, 0.5))
for (plt in seq_along(cov_xy)){
  Sig <- matrix(c(var_x, cov_xy[plt], cov_xy[plt], var_y), nrow = 2, byrow = TRUE)
  gen_binormplot(N, c(mu_x, mu_y), Sig, xlim = c(-3, 3), ylim = c(-3, 3))
}

The ellipses show the regions into which a certain portion of the points from the bivariate normal distribution fall; the outermost ellipse contains 95% of the points. The ellipses also make it easier to see the general trend between the two random variables. As you can see, negative covariance is associated with an inverse relationship between X and Y, and positive covariance with a direct relationship. Furthermore, the ellipses get wider (less elongated) as the absolute value of the covariance decreases, because a lower absolute value of covariance corresponds to a “weaker” linear relationship between the two random variables. In fact, in the case where \(\sigma_{XY} = 0\) the ellipses would be perfect circles had the plot region been square (which is also due to the variances of X and Y being equal).

Mathematical Definition: \(Cov[X, Y] = E\big[(X-E[X])(Y-E[Y])\big]\)

In words, the covariance of random variables \(X\) and \(Y\) is the expected value of the product of the two variables’ deviations from their means.

From this definition, I like to understand the previously seen trend (where negative covariance corresponds to an inverse relationship and positive covariance to a direct relationship) by thinking of extreme cases. If X and Y have an inverse (linear) relationship, where increasing X is associated with decreasing Y, then values of X above its mean will tend to be paired with values of Y below its mean. The deviations of X and Y will then have opposite signs, so the expected value of their product will be negative. A similar argument applies for positive covariance.

Covariance can also be written as: \(Cov[X, Y] = E[XY] - E[X]E[Y]\).
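
This follows from expanding the product and using linearity of expectation:

\[E\big[(X-E[X])(Y-E[Y])\big] = E\big[XY - XE[Y] - E[X]Y + E[X]E[Y]\big] = E[XY] - E[X]E[Y]\]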

With this formula, it is easier to see why two independent random variables have a covariance of 0: for independent random variables X and Y, \(E[XY] = E[X]E[Y]\). However, the converse is not always true.

A simple example:

\(X \sim unif(-1, 1)\)

\(Y = X^2\)

Then \(Cov[X, Y] = E[XY] - E[X]E[Y] = E[X^3] - E[X]E[X^2] = 0 - 0 \times E[X^2] = 0\), but X and Y are clearly dependent.
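
The key step, \(E[X^3] = 0\), holds because the density of \(unif(-1, 1)\) is symmetric about 0:

\[E[X^3] = \int_{-1}^{1} x^3 \cdot \tfrac{1}{2}\, dx = \Big[\tfrac{x^4}{8}\Big]_{-1}^{1} = 0\]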

A plot of samples from these two random variables looks like this:

x <- runif(n = 100, min = -1, max = 1); y <- x^2
plot(x = x, y = y, xlim = c(-1.5, 1.5), ylim = c(-0.5, 1.5), xlab = "X", ylab = "Y")

Clearly the relationship between the two random variables is not linear, so covariance alone is not a good way to measure the strength of the relationship between them.
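
As a quick numerical check (using the x and y sampled above, so the exact values depend on the random draws), the sample covariance is close to zero even though Y is completely determined by X:

cov(x, y)  # close to 0 despite the deterministic relationship
cor(x, y)  # the correlation is likewise close to 0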

The strength of the linear relationship between two random variables is hard to judge from the covariance alone because its value depends on how the random variables are scaled. Hence the covariance is often normalized into the Pearson correlation coefficient (\(\rho\)), which ranges from -1 to 1 and is therefore easier to interpret.

The correlation coefficient is defined as: \[\rho_{_{XY}} = \dfrac{Cov[X,Y]}{Sd[X]Sd[Y]} = \dfrac{\sigma_{_{XY}}}{\sigma_{_X}\sigma_{_Y}}\] where \(Sd\) is the standard deviation. This is easier to interpret than covariance because it is scaled to lie between -1 and 1; the closer its absolute value is to 1, the stronger the linear relationship.
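
As a quick sanity check of this identity, here is a sketch on simulated data (x_sim and y_sim are made up just for this check; any linearly related pair would do):

set.seed(42)
x_sim <- rnorm(1000)
y_sim <- 0.5 * x_sim + rnorm(1000)           # linearly related, plus noise
cov(x_sim, y_sim) / (sd(x_sim) * sd(y_sim))  # correlation computed by hand
cor(x_sim, y_sim)                            # matches the built-in correlation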

Properties of covariance

\(Cov[X, a] = 0\)

where \(a\) is a constant. In words, the covariance between a random variable and a constant is 0.

\(Cov[X,X] = Var[X]\)

The covariance of a random variable with itself is its variance.

\(Cov[X,Y] = Cov[Y,X]\)

The covariance of two random variables is the same when the order is flipped. This property leads to the variance-covariance matrix being symmetric.

\(Cov[aX,bY] = abCov[X,Y]\)

This property tells me that covariance scales multiplicatively with the random variables X and Y. It follows from the linearity of expectation: since covariance is an expected value of a product of deviations, the constant attached to each random variable can be pulled out individually. Scaling doesn’t change the strength of the relationship, even though a larger covariance deceptively suggests that it does (I could be the only person that had this misconception).

# Scaling X by 10 stretches the point cloud but does not change the
# strength of the linear relationship between X and Y
N <- 1000
mu_x <- 5; mu_y <- 5
var_x <- 1; var_y <- 1
cov_xy <- 0.7

bivnorm <- mvrnorm(n = N, mu = c(mu_x, mu_y), Sigma = matrix(c(var_x, cov_xy, cov_xy, var_y), nrow = 2, byrow = TRUE))
par(mfrow = c(1, 2))
plot(bivnorm[,1], bivnorm[,2], xlim = c(1, 9), ylim = c(1, 9), xlab = 'X', ylab = 'Y')
plot(10*bivnorm[,1], bivnorm[,2], xlim = c(10, 90), ylim = c(1, 9), xlab = '10 * X', ylab = 'Y')
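
The sample covariances reflect the same scaling (the estimates only approximate the true \(\sigma_{XY} = 0.7\) and vary with the simulation):

cov(bivnorm[, 1], bivnorm[, 2])       # roughly 0.7
cov(10 * bivnorm[, 1], bivnorm[, 2])  # roughly 10 * 0.7 = 7
cor(10 * bivnorm[, 1], bivnorm[, 2])  # correlation is unaffected by the scaling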

\(Cov[X + a, Y + b] = Cov[X,Y]\)

Like variance, covariance is invariant to location shifts.
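
For sample covariances the invariance is exact (up to floating-point error); a quick check on the bivnorm sample with arbitrary shifts of +3 and -5:

all.equal(cov(bivnorm[, 1] + 3, bivnorm[, 2] - 5),
          cov(bivnorm[, 1], bivnorm[, 2]))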

\(Cov[aX + bY,cW + dZ] = acCov[X,W] + bcCov[Y,W] + adCov[X,Z] + bdCov[Y,Z]\)

If each argument is a linear combination of random variables, then each random variable ‘term’ can be taken out individually. Hence the covariance expands into a sum of the covariances of each ‘pair’ of random variables; in other words, covariance is bilinear.
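
Sample covariance is bilinear in exactly the same way, so the identity can be verified numerically (a sketch with simulated variables x1, y1, w1, z1 and arbitrary constants introduced just for this check):

set.seed(7)
x1 <- rnorm(500); y1 <- rnorm(500); w1 <- rnorm(500); z1 <- rnorm(500)
a <- 2; b <- -1; cc <- 0.5; d <- 3  # arbitrary constants (cc used to avoid masking c())

all.equal(cov(a * x1 + b * y1, cc * w1 + d * z1),
          a * cc * cov(x1, w1) + b * cc * cov(y1, w1) +
            a * d * cov(x1, z1) + b * d * cov(y1, z1))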

Sample Covariance

Usually the population parameters will be unknown and I’ll have to estimate the covariance from a sample. The formula for the sample covariance is as follows:

\[Cov[X,Y] = \dfrac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})\]
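
In R this is a one-liner; checking it against the built-in cov() on the bivnorm sample from earlier (x_s and y_s are just shorthand for its two columns), the two agree up to floating-point error:

x_s <- bivnorm[, 1]; y_s <- bivnorm[, 2]
sum((x_s - mean(x_s)) * (y_s - mean(y_s))) / (length(x_s) - 1)  # manual sample covariance
cov(x_s, y_s)                                                   # built-in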

This blog gives a nice geometric interpretation of the covariance calculation.

Moving to the variance-covariance matrix

The variance-covariance matrix collects the covariances of all pairs of components of a random vector. For a random vector \(\textbf{X}\) of length \(N\), the variance-covariance matrix of \(\textbf{X}\) (denoted \(Var(\textbf{X})\), \(Cov(\textbf{X})\), or \(\Sigma_\textbf{X}\)) is an \(N \times N\) matrix whose entry in the \(i^{th}\) row and \(j^{th}\) column is the covariance between \(X_i\) and \(X_j\).

\(\textbf{X} = \begin{pmatrix} X_1 \\ \vdots \\ X_N \end{pmatrix}\)

\(\Sigma_\textbf{X} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1N} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{N1} & \sigma_{N2} & \cdots & \sigma_{NN} \end{pmatrix}\)

Applying the definition of covariance, the entry in the \(i^{th}\) row and \(j^{th}\) column is \(\Sigma_{ij} = E\big[(X_i-E[X_i])(X_j-E[X_j])\big]\).

The following are some properties of \(\Sigma\):

The \(i^{th}\) diagonal entry is the variance of the \(i^{th}\) random variable.

\(\Sigma_\textbf{X}\) is a symmetric matrix due to the third property of covariance listed above.

\(\Sigma_\textbf{X}\) is positive semi-definite. In other words, for all \(\textbf{v} \in \mathbb{R}^N\), \[\textbf{v}^T \Sigma_\textbf{X} \textbf{v} \geq 0\] I’ll provide a simple proof as practice:

\[\textbf{v}^T \Sigma_\textbf{X} \textbf{v} = \textbf{v}^TE\big[(\textbf{X}-E[\textbf{X}])(\textbf{X}-E[\textbf{X}])^T\big]\textbf{v} \] \[= E\big[\textbf{v}^T(\textbf{X}-E[\textbf{X}])(\textbf{X}-E[\textbf{X}])^T\textbf{v}\big] \] \[= E\big[\textbf{v}^T (\textbf{X}-E[\textbf{X}])\big(\textbf{v}^T(\textbf{X}-E[\textbf{X}])\big)^T\big] \] \[= E\big[ \big( \textbf{v}^T(\textbf{X}-E[\textbf{X}]) \big)^2 \big] \geq 0 \text{, } \forall \text{ } \textbf{v} \in \mathbb{R}^N\] Note: if there exists a nonzero \(\textbf{v} \in \mathbb{R}^N\) such that \(\textbf{v}^T\Sigma_\textbf{X}\textbf{v} = 0\), then some component of the random vector \(\textbf{X}\) can (almost surely) be expressed as a linear combination of the other components plus a constant.

\(\Sigma_{A\textbf{X}+b} = A\Sigma_\textbf{X}A^T\), where \(A\) is a constant matrix and \(b\) a constant vector. As with the location-shift property of covariance, the shift \(b\) drops out.
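
For sample covariance matrices the identity holds exactly (up to floating-point error). A quick check on the bivnorm sample, with an arbitrary matrix A and shift vector b chosen just for illustration:

A <- matrix(c(1, 2, 0, -1), nrow = 2, byrow = TRUE)  # arbitrary transformation matrix
b <- c(3, -4)                                        # arbitrary shift
transformed <- sweep(bivnorm %*% t(A), 2, b, "+")    # each row becomes A %*% x + b

all.equal(cov(transformed), A %*% cov(bivnorm) %*% t(A))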

The correlation matrix (denoted \(\vec{\rho}\)), whose entry in the \(i^{th}\) row and \(j^{th}\) column is the correlation coefficient between the respective random variables, can be obtained from the variance-covariance matrix.

We know that \(\vec{\rho}_{ij} = \dfrac{Cov(X_i, X_j)}{\sqrt{Var(X_i)Var(X_j)}}\) and the diagonal entries of \(\Sigma\) are variances. Hence, \(\vec{\rho} = \big(diag(\Sigma)\big)^{-\frac{1}{2}} \Sigma \big(diag(\Sigma)\big)^{-\frac{1}{2}}\), where \(diag(\Sigma)\) is a diagonal matrix whose diagonal entries are the diagonal entries of \(\Sigma\) and whose other entries are 0.
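
In R, the built-in cov2cor() does this conversion; a quick check against the formula using the bivnorm sample covariance matrix:

S <- cov(bivnorm)
D_inv_sqrt <- diag(1 / sqrt(diag(S)))  # (diag(Sigma))^(-1/2)
all.equal(D_inv_sqrt %*% S %*% D_inv_sqrt, cov2cor(S))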

Calculating the variance-covariance matrix from a sample

Suppose we have a dataset whose \(N\) observations are the rows and whose \(p\) features (variables) are the columns. Denote the data by \(\textbf{D}\). To calculate the variance-covariance matrix, the values first need to be centered around their column means, then multiplied and summed accordingly.

Let \(\textbf{D}_c\) be the centered data. Then the variance-covariance matrix \(V\) is:

\(V = \dfrac{1}{N-1} \textbf{D}^T_c\textbf{D}_c\)

Example using the built-in attitude dataset:

summary(attitude)
##      rating        complaints     privileges       learning         raises     
##  Min.   :40.00   Min.   :37.0   Min.   :30.00   Min.   :34.00   Min.   :43.00  
##  1st Qu.:58.75   1st Qu.:58.5   1st Qu.:45.00   1st Qu.:47.00   1st Qu.:58.25  
##  Median :65.50   Median :65.0   Median :51.50   Median :56.50   Median :63.50  
##  Mean   :64.63   Mean   :66.6   Mean   :53.13   Mean   :56.37   Mean   :64.63  
##  3rd Qu.:71.75   3rd Qu.:77.0   3rd Qu.:62.50   3rd Qu.:66.75   3rd Qu.:71.00  
##  Max.   :85.00   Max.   :90.0   Max.   :83.00   Max.   :75.00   Max.   :88.00  
##     critical        advance     
##  Min.   :49.00   Min.   :25.00  
##  1st Qu.:69.25   1st Qu.:35.00  
##  Median :77.50   Median :41.00  
##  Mean   :74.77   Mean   :42.93  
##  3rd Qu.:80.00   3rd Qu.:47.75  
##  Max.   :92.00   Max.   :72.00
att_c <- sapply(attitude, function(x){x - mean(x)})            # center each column
cov_att_manual <- 1/(nrow(att_c) - 1) * (t(att_c) %*% att_c)   # (1/(N-1)) * Dc^T Dc
cov_att <- cov(attitude)                                       # built-in covariance matrix

print(cov_att)
##               rating complaints privileges  learning    raises critical
## rating     148.17126  133.77931   63.46437  89.10460  74.68851 18.84253
## complaints 133.77931  177.28276   90.95172  93.25517  92.64138 24.73103
## privileges  63.46437   90.95172  149.70575  70.84598  56.67126 17.82529
## learning    89.10460   93.25517   70.84598 137.75747  78.13908 13.46782
## raises      74.68851   92.64138   56.67126  78.13908 108.10230 38.77356
## critical    18.84253   24.73103   17.82529  13.46782  38.77356 97.90920
## advance     19.42299   30.76552   43.21609  64.19770  61.42299 28.84598
##              advance
## rating      19.42299
## complaints  30.76552
## privileges  43.21609
## learning    64.19770
## raises      61.42299
## critical    28.84598
## advance    105.85747
print(all.equal(cov_att_manual, cov_att))
## [1] TRUE

Although there will almost never be an occasion where I have to calculate the covariance matrix or other statistics by hand, I thought doing so would be good practice with matrices and a way to stay aware of the mathematical side of statistics.

There is also another type of covariance matrix, called the cross-covariance matrix, defined for two random vectors, but I will not go beyond writing the definition.

\[cov(\textbf{X},\textbf{Y}) = E\big[(\textbf{X}-E[\textbf{X}])(\textbf{Y}-E[\textbf{Y}])^T\big]\]

There is a lot more to covariance and the variance-covariance matrix, but this should be enough to lay a solid foundation for understanding covariance.
