Let's take another illustration. For many years, it was known that people who smoked had a higher risk of developing lung cancer; that is, there was a correlation between smoking and lung cancer. But the office of the Surgeon General couldn't quite say yet that smoking causes lung cancer. Why not? The reason is that there were no controlled experiments showing a cause-and-effect relationship between smoking and lung cancer. To establish such a cause-and-effect relationship, you must have a controlled experiment, and no such long-term controlled experiments were available yet. (Now we do have such long-term controlled experiments.) So how else could we explain the correlation, the relationship, between smoking and lung cancer? Perhaps people who smoked engaged in other types of risky behavior, and that's what caused them to have lung cancer, not the smoking per se.
When we talk about correlations, we talk about (1) the direction and (2) the strength of the correlation coefficient.
Example #1: Income and education have a positive correlation. People with higher incomes also tend to have more years of education. People with fewer years of education tend to have lower income.
Example #2: SAT scores and college achievement have a positive correlation. Among college students, those with higher SAT scores generally have higher GPAs.
Example #3: Number of hours studied and Regents scores have a positive correlation. Students who study more generally get higher scores on Regents exams than those who study less.
When we make a scatter plot, we don't connect the dots. Instead, we draw the best fitting straight line.
Negative Correlation
In a negative correlation, the best-fitting line has a negative slope and falls downhill from left to right. It looks as follows:
In a negative correlation, as one variable increases the other decreases. Example: As the number of hours a student watches TV the night before a test increases, the score on the test usually decreases.
Zero Correlation
A zero correlation is a line with a zero slope, a horizontal line.
A vertical line has undefined slope and also indicates that no relationship exists between two variables.
REQUIREMENTS FOR CALCULATING THE PEARSON r
The 2 main requirements for calculating the Pearson r are that
(1) the sample of paired data (x,y) be a random sample of collected data; and
(2) the underlying relationship between x and y is a linear one. This means that a visual examination of the scatter plot reveals that the points approximate a straight line.
FORMULA TO CALCULATE THE PEARSON r
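Written in the same form that is used in the worked example at the end of this section, the formula for r is:

$$
r = \frac{n(\Sigma xy) - (\Sigma x)(\Sigma y)}{\sqrt{n(\Sigma x^2) - (\Sigma x)^2}\,\sqrt{n(\Sigma y^2) - (\Sigma y)^2}}
$$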
* n represents the number of pairs of data present.
* Σ denotes the addition of the items indicated.
* Σx denotes the sum of all x-values.
* Σx² indicates that each x-value should be squared and then those squares added.
* (Σx)² indicates that the x-values should be added and the total then squared. It is extremely important to avoid confusing Σx² and (Σx)².
* Σxy indicates that each x-value should first be multiplied by its corresponding y-value. After obtaining all such products, find their sum.
* r represents the linear correlation coefficient for a sample.
* ρ (the Greek letter rho) represents the linear correlation coefficient for a population.
Notice that what we have here is the formula for calculating r, the sample correlation coefficient, because it is based on only a sample of data (x,y). We haven't sampled the entire population! Had we been able to sample the entire population (which is usually impossible to do) what we would have is the population correlation coefficient, ρ.
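To make the notation concrete, here is a minimal Python sketch of the sample formula. The function name `pearson_r` and the three data pairs are hypothetical, chosen only for illustration; they do not come from the examples in this section. The sketch also shows how Σx² differs from (Σx)².

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r, computed from the raw sums."""
    if len(xs) != len(ys):
        raise ValueError("x and y must contain the same number of values")
    n = len(xs)                                   # number of (x, y) pairs
    sum_x = sum(xs)                               # Σx
    sum_y = sum(ys)                               # Σy
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # Σxy
    sum_x2 = sum(x * x for x in xs)               # Σx²
    sum_y2 = sum(y * y for y in ys)               # Σy²
    numerator = n * sum_xy - sum_x * sum_y
    # Note: the denominator is zero (division fails) if x or y never varies.
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Hypothetical illustrative data: y is exactly twice x.
xs = [1, 2, 3]
ys = [2, 4, 6]

sum_of_squares = sum(x * x for x in xs)   # Σx²  : square first, then add
square_of_sum = sum(xs) ** 2              # (Σx)²: add first, then square

r = pearson_r(xs, ys)
print(sum_of_squares, square_of_sum, r)
```

For these made-up values, Σx² is 14 while (Σx)² is 36, and r comes out at 1 (up to floating-point rounding) because the points lie exactly on an upward-sloping line.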
EXAMPLE CALCULATING r
Let's say we randomly select a group of students in a class and give them two quizzes, a social studies and a math quiz. Let's say there are four students and their scores on the quizzes are as follows:
We want to find out the correlation between performance on one quiz and performance on the other. Do people who do better on one quiz necessarily do better on the other, or is the opposite true? Or perhaps there's no relationship between performance on one quiz and performance on the other?
Let's see if we meet the criteria for doing a Pearson r. We meet requirement #1 because our sample was randomly selected. We meet requirement #2 because if we look at a scatter plot of our data (given below), we see the data approximate a straight line.
So, we can now proceed to calculate the Pearson r. We can either do this by entering our values into Excel or we can organize our data and perform the necessary calculations using a calculator as follows:
For our given sample of paired data, n = 4 because there are 4 students. We can now use the formula to evaluate r as follows:
In cases where we have not set up a true experiment, the correlation coefficient can be used to answer interesting questions about real-world relationships. We consider now an example taken from the nuptials section of the New York Times.
1. In 2007, do highly educated men marry highly educated women? Or, otherwise stated, is there a correlation between the bride's and groom's highest educational level?
Try to answer the first question. You'll have to use some rating scale, such as the one below, in order to code level of education:
Educational Level | Score |
---|---|
High School | 1 |
Some College Credits | 2 |
College Degree | 3 |
In a Master's Program | 3.5 |
Master's Degree | 4 |
PhD, MD, or Law Degree | 5 |
* A postdoctoral fellow is an individual who already holds a PhD and would receive a rating of 5.
* A professor usually holds a PhD.
* A nephrologist is a kidney doctor and would receive a rating of 5.
* An MBA is a master's degree in business and would receive a rating of 4.
In your calculations, let "X" be the bride's highest educational level and let "Y" be the groom's highest educational level.
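x | y | xy | x² | y² |
---|---|---|---|---|
3.5 | 4 | 14 | 12.25 | 16 |
3 | 3 | 9 | 9 | 9 |
4 | 4 | 16 | 16 | 16 |
3 | 1 | 3 | 9 | 1 |
4 | 5 | 20 | 16 | 25 |
3 | 3 | 9 | 9 | 9 |
4 | 2 | 8 | 16 | 4 |
3 | 5 | 15 | 9 | 25 |
5 | 4 | 20 | 25 | 16 |
5 | 5 | 25 | 25 | 25 |
Σx = 37.5 | Σy = 36 | Σxy = 139 | Σx² = 146.25 | Σy² = 146 |

With n = 10 pairs:

$$
\begin{aligned}
r &= \frac{n(\Sigma xy) - (\Sigma x)(\Sigma y)}{\sqrt{n(\Sigma x^2) - (\Sigma x)^2}\,\sqrt{n(\Sigma y^2) - (\Sigma y)^2}} \\
  &= \frac{10(139) - 37.5(36)}{\sqrt{10(146.25) - (37.5)^2}\,\sqrt{10(146) - (36)^2}} \\
  &= \frac{40}{(7.5)(12.81)} \approx \frac{40}{96.05} \approx .42
\end{aligned}
$$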
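The arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names `bride` and `groom` are ours, and the values are the ten coded (x, y) pairs from the table.

```python
from math import sqrt

# Coded education levels from the table: x = bride, y = groom.
bride = [3.5, 3, 4, 3, 4, 3, 4, 3, 5, 5]
groom = [4, 3, 4, 1, 5, 3, 2, 5, 4, 5]

n = len(bride)                                       # 10 couples
sum_x = sum(bride)                                   # Σx  = 37.5
sum_y = sum(groom)                                   # Σy  = 36
sum_xy = sum(x * y for x, y in zip(bride, groom))    # Σxy = 139
sum_x2 = sum(x * x for x in bride)                   # Σx² = 146.25
sum_y2 = sum(y * y for y in groom)                   # Σy² = 146

numerator = n * sum_xy - sum_x * sum_y               # 1390 - 1350 = 40
denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator

print(round(r, 2))  # 0.42
```

The result agrees with the hand calculation: a moderate positive correlation of about .42 between the bride's and groom's coded educational levels in this sample.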