Statistics

Correlation Coefficients and Bivariate Data Analysis

Published on August 21, 2023
360 Admin

Table of Contents

1 Introduction
2 Bivariate Data
3 Scatter diagram
4 Correlation Coefficients

Introduction

Bivariate data analysis is a fundamental aspect of statistics that involves studying the relationship between two variables. It provides insights into how changes in one variable are associated with changes in another. In this article, we will explore the concepts of bivariate data, scatter diagrams, correlation coefficients, and rank correlation.

Bivariate Data

Bivariate data consists of pairs of observations or measurements taken on two different variables for each individual or item. For example, consider a dataset where we record the hours studied $x$ and the corresponding test score $y$ of several students. Each student's data point is represented as the ordered pair $(x, y)$.
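As a minimal sketch of how such paired observations might be stored, the snippet below keeps each student's pair as a tuple and then splits the pairs into the two variables. The numbers and the names `observations`, `hours`, and `scores` are made up for illustration and do not come from a real dataset.

```python
# Hypothetical illustrative data: (hours studied, test score) for five students.
observations = [(1.0, 52.0), (2.0, 58.0), (3.0, 65.0), (4.0, 70.0), (5.0, 78.0)]

# Split the ordered pairs into the two variables x (hours) and y (scores).
hours, scores = zip(*observations)

# Each observation stays paired: the i-th hour belongs with the i-th score.
for x, y in observations:
    print(f"hours studied = {x}, test score = {y}")
```

Keeping the pairs together matters because every analysis of bivariate data depends on which $x$ value goes with which $y$ value.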

Scatter diagram

A scatter diagram (also known as a scatter plot) visually represents bivariate data points on a graph. Each pair of $x$ and $y$ values is plotted as a single point, allowing us to observe patterns, trends, and relationships between the two variables. A scatter diagram helps us identify whether there is a positive, negative, or no correlation between the variables.

Correlation Coefficients

Correlation coefficients quantify the strength and direction of the linear relationship between two variables. Three common types of correlation coefficients are:

A. Simple Correlation

Definition: Simple correlation measures the strength and direction of the linear relationship between two variables ( $x$ and $y$ ). It is represented by the Pearson correlation coefficient ( $r$ ).

Mathematical Notation: The Pearson correlation coefficient ( $r$ ) between two variables $x$ and $y$ is calculated as:

$r = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum{(x_i - \bar{x})^2} \sum{(y_i - \bar{y})^2}}}$

Where $x_i$ and $y_i$ are individual data points, $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$ respectively.

Real-Life Example: Consider a dataset that records the hours studied ( $x$ ) and the corresponding test scores ( $y$ ) of several students. The Pearson correlation coefficient ( $r$ ) will quantify the strength and direction of the linear relationship between study hours and test scores. The value of $r$ always lies between $-1$ and $1$. If $r$ is close to 1, it indicates a strong positive correlation, implying that as study hours increase, test scores tend to increase as well. If $r$ is close to $-1$, it indicates a strong negative correlation, and a value near 0 indicates little or no linear relationship.
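The Pearson formula can be sketched directly in plain Python. The function name `pearson_r` and the sample values are assumptions made for illustration only; the data are hypothetical, not real measurements.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations of each observation from its variable's mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("r is undefined when one variable is constant")
    return numerator / denominator

# Hypothetical data: hours studied and test scores for five students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 65.0, 70.0, 78.0]
r = pearson_r(hours, scores)
print(f"r = {r:.3f}")
```

For these made-up values the scores rise almost in a straight line with the hours, so $r$ comes out very close to 1.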

B. Partial Correlation

Definition: Partial correlation examines the relationship between two variables ( $x$ and $y$ ) while controlling for the influence of a third variable ( $z$ ).

Mathematical Notation: The partial correlation coefficient ( $r_{xy.z}$ ) between variables $x$ and $y$ , controlling for variable $z$ , is calculated as:

$r_{xy.z} = \frac{r_{xy} - r_{xz} \cdot r_{yz}}{\sqrt{(1 - r_{xz}^2) \cdot (1 - r_{yz}^2)}}$

Where $r_{xy}$ is the simple correlation coefficient between $x$ and $y$ , $r_{xz}$ is the simple correlation coefficient between $x$ and $z$ , and $r_{yz}$ is the simple correlation coefficient between $y$ and $z$ .

Real-Life Example: Consider a study analyzing the relationship between the hours spent studying ( $x$ ) and exam scores ( $y$ ), while controlling for the effect of sleep hours ( $z$ ). The partial correlation coefficient $r_{xy.z}$ will provide insight into the direct relationship between studying and exam scores while accounting for the influence of sleep.
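The partial correlation formula needs only the three simple correlations, so it can be sketched as a small function. The name `partial_corr` and the input values below are hypothetical and chosen only to illustrate the arithmetic.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation of x and y, controlling for z."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("undefined when x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denominator

# Hypothetical simple correlations:
# study hours vs. score, study hours vs. sleep, score vs. sleep.
r_xy_z = partial_corr(r_xy=0.6, r_xz=0.5, r_yz=0.4)
print(f"r_xy.z = {r_xy_z:.3f}")
```

When $z$ is uncorrelated with both $x$ and $y$, the function returns $r_{xy}$ unchanged, which matches the intuition that there is nothing to control for.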

C. Multiple Correlation

Definition: Multiple correlation measures how strongly a single dependent variable ( $y$ ) is linearly related to a set of predictor variables ( $x_1, x_2, \ldots, x_n$ ) taken together. It equals the Pearson correlation between $y$ and its best linear prediction from the predictors, so it always lies between 0 and 1.

Mathematical Notation: The multiple correlation coefficient ( $R_{y.x_1x_2\ldots x_n}$ ) between the dependent variable $y$ and the predictor variables $x_1, x_2, \ldots, x_n$ is calculated as:

$R_{y.x_1x_2\ldots x_n} = \sqrt{\mathbf{c}^{\top} \mathbf{R}_{xx}^{-1} \mathbf{c}}$

Where $\mathbf{c} = (r_{yx_1}, r_{yx_2}, \ldots, r_{yx_n})^{\top}$ is the vector of simple correlation coefficients between $y$ and each predictor, and $\mathbf{R}_{xx}$ is the matrix of simple correlation coefficients among the predictors themselves. For two predictors, this reduces to:

$R_{y.x_1x_2} = \sqrt{\frac{r_{yx_1}^2 + r_{yx_2}^2 - 2 r_{yx_1} r_{yx_2} r_{x_1x_2}}{1 - r_{x_1x_2}^2}}$

Real-Life Example: Suppose you want to predict a student's final exam score ( $y$ ) based on the number of hours studied ( $x_1$ ), sleep hours ( $x_2$ ), and attendance in review sessions ( $x_3$ ). The multiple correlation coefficient $R_{y.x_1x_2x_3}$ will help assess the collective impact of studying, sleep, and attendance on the final exam score. Its square, $R^2$, is the proportion of the variance in exam scores explained by the three predictors together.
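For brevity, the sketch below restricts the exam example to two predictors and uses the standard two-predictor identity $R^2 = (r_{y1}^2 + r_{y2}^2 - 2 r_{y1} r_{y2} r_{12}) / (1 - r_{12}^2)$, where $r_{y1}$ and $r_{y2}$ are the correlations of the outcome with each predictor and $r_{12}$ is the correlation between the two predictors. The function name `multiple_corr_two_predictors` and the input values are hypothetical.

```python
import math

def multiple_corr_two_predictors(r_y1, r_y2, r_12):
    """Multiple correlation of y with two predictors, from simple correlations."""
    if abs(r_12) >= 1:
        raise ValueError("undefined when the two predictors are perfectly correlated")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    # For a mutually consistent set of correlations, r_squared lies in [0, 1].
    return math.sqrt(r_squared)

# Hypothetical simple correlations:
# score vs. study hours, score vs. sleep hours, study hours vs. sleep hours.
R = multiple_corr_two_predictors(r_y1=0.6, r_y2=0.4, r_12=0.5)
print(f"R = {R:.3f}, R^2 = {R ** 2:.3f}")
```

The result is never smaller than the larger of the two simple correlations in absolute value, since adding a predictor cannot reduce how well the outcome is explained.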