Lab 5: Frequency Distributions
Outline
Frequency distribution table
Frequency distribution graphs
Make a frequency distribution table and graph using SPSS
Examination of properties of distributions: Shape; outliers
Download Lab 5 Worksheet.
For this lab we'll use a new data file that includes hypothetical course grade information. Download the file Students.sav. To get this file, click on Students.sav and then select "save". You may need to name the file as "students.sav" so that SPSS will recognize it (all SPSS data files must have a .sav ending in the name). Put the file someplace to save it for future use. After saving the file you should open it up in SPSS. Open SPSS and then open the data file students.sav from where you saved it.
In this file there are a number of variables. For now we'll just look at quiz1 and quiz2. Your task for this part of the lab is to create a frequency distribution table for each of these variables and to compare them to get a feel for some of the features of distributions.
Go to SPSS (which should already be open if you followed the instructions above). The students.sav file should be open already.
We'll start by looking at quiz1. Think about each of the following questions.
What's the variable of interest?
- Scores on quiz 1, which correspond to the data in the column labeled quiz1.
What kind of variable is quiz1?
- The quiz scores are numerical, so the variable is quantitative.
What is the most typical score on quiz 1?
What is the range of scores on quiz 1?
Did some people do a lot worse or better than the rest of the class?
Overall, was the quiz easy or hard?
It is hard to answer these last 4 questions just by looking at the raw numbers. Instead, we can use statistical procedures to organize the data and make them easier to understand.
The first thing that we can do is to "sort" the datafile by quiz1.
Step 1: To do this select "sort cases" under the "data" menu.
Step 2: Then select "quiz1" for the sort variable field.
By sorting the file you can begin to see the pattern of the distribution. For example, it is now easy to see what the lowest and highest scores are (they sit at the top and bottom of the column). However, sorting the variable usually isn't enough on its own. Another statistical tool that helps you "see" the distribution is a frequency distribution table.
A frequency distribution is an organized tabulation of the number of individuals located in each category on the scale of measurement.
STEP 1: What is the range of responses (highest and lowest numbers)? The X column has been filled in for you based on the range of responses. (Your book often uses Y to refer to scores in a data set. X and Y can be used interchangeably.)
(1)
_________________________________
X f p % c%
10
9
8
7
6
5
4
3
2
1
0
________________________________
Scores on quiz1 range from 0 to 10, so we list these values in the X column starting with the highest value and listing each value down to the lowest.
STEP 2: How many of each did we get?
Fill in the f column. This is the frequency of occurrence. For each X value, write in the f column how many times that score appears in the quiz1 column of the students.sav file.
This tells you how many of each response we got. Note that there may be 0's in the f column if no one got that particular score.
Notice that if you add up the frequency column, you get the total number of observations.
Σf = N
If you wanted to know what the total of all of the X's was, how would you do it? The easiest way would be to multiply the X and f columns row by row and then add (sum) the results.
Σ(Xf)
Calculate the sum of all the scores using this formula.
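The two sums above can be checked with a short script. This is only a sketch in Python (not part of the SPSS procedure), and the scores listed are made up for illustration; they are not the values in students.sav.

```python
from collections import Counter

# Hypothetical quiz scores, invented for this example (NOT from students.sav).
scores = [7, 9, 10, 7, 4, 8, 7, 9, 5, 7]

# f column: how many times each X value occurs.
f = Counter(scores)

# List X from highest to lowest, as in the hand-built table.
# A Counter returns 0 for scores that nobody got.
for x in range(10, -1, -1):
    print(x, f[x])

n = sum(f.values())               # Σf = N, the number of observations
total = sum(x * f[x] for x in f)  # Σ(Xf), the sum of all the scores

print("N =", n)
print("Sum of scores =", total)
```

Multiplying each X by its f and summing gives the same answer as adding up every individual score, which is why Σ(Xf) works as a shortcut.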
Now let's work on the other columns in the table.
STEP 3: Proportions. How much of the total group got this value for X? How do you get this information?
p = f / N
Recall that N = the total number of observations.
Fill in the p column for each X value by calculating the proportion of all scores from the value you listed in the f column.
STEP 4: Percentages. What percentage of the group got each value of X? To get this, convert the proportions to percentages.
p * 100
Fill these values into the % column in the table.
STEP 5: c%. The c% column is the cumulative percentage. To fill it in, start from the lowest score and work up the table, adding the percentages together as you go. Think back to getting your ACT scores. You may remember something like "your score is in the 76th percentile." This means that 76% of the people who took the test got your score or worse. Notice that the final c% (at the top of the table) should always equal 100 (because 100% of the people got the maximum score or worse).
Fill in the c% column in the table by adding each c% to the next % value, starting from the bottom (X value of 0).
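Steps 2 through 5 can be put together in one small routine. The following Python sketch is an illustration only: the function name frequency_table and the scores are hypothetical, and the data are not taken from students.sav. It accumulates counts from the bottom of the scale (rather than adding rounded percentages) so that the top c% comes out at exactly 100.

```python
from collections import Counter

def frequency_table(scores, low, high):
    """Return rows (X, f, p, %, c%), listed from the highest X to the lowest."""
    counts = Counter(scores)
    n = len(scores)
    rows = []
    cumulative = 0  # number of scores at or below the current X
    for x in range(low, high + 1):  # accumulate from the lowest score up
        f = counts[x]
        cumulative += f
        p = f / n                   # p = f / N
        pct = 100 * f / n           # % = p * 100
        cpct = 100 * cumulative / n # c% = percentage at or below X
        rows.append((x, f, p, pct, cpct))
    rows.reverse()  # highest X first, as in the hand-built table
    return rows

# Hypothetical quiz scores, invented for this example.
scores = [7, 9, 10, 7, 4, 8, 7, 9, 5, 7]
table = frequency_table(scores, 0, 10)

print(" X  f    p     %    c%")
for x, f, p, pct, cpct in table:
    print(f"{x:>2} {f:>2} {p:.2f} {pct:5.1f} {cpct:5.1f}")
```

Reading the c% value in the row for X = 7 answers a question of the form "what percentage of scores is at or below 7?" for these made-up data.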
From a frequency distribution table you can "see" the distribution more easily. At a glance you can see what the highest and lowest scores are, whether some scores are "outside" of the rest (that is, did a few people really bomb the test, or did a few ace it?), what the most common score was, where most of the scores were, etc.
Click here to see what your quiz1 frequency distribution table should look like when it's completed. Note that this table is in reverse order to the one you were asked to make (i.e., 10 is at the bottom). This is a feature specific to SPSS so be careful when interpreting frequency distribution tables created by SPSS.
Now look at your finished frequency distribution table and answer the following questions:
(2) What percentage of the scores is at or below a score of 7?
(3) Where does it appear that most of the scores are located?
(4) What does your answer to (2) tell you about the difficulty of the quiz?
When there are too many different response categories to list every category in a frequency distribution table, we can group the scores into class intervals and use the intervals as the X values in our table. For example, think of a percentage grading scale (A = 90-100, B = 80-89, ...). Percentage grades can be any value between 0% and 100%. We'll use the percent variable in the students.sav file to make a grouped frequency distribution table. We'll group the scores into typical grading categories (i.e., A = 90%-100%, B = 80%-89%, etc.).
I've set up the table below with class intervals as the X values.
(5) Please finish the table below for the variable percent, which represents final course grades for the students.sav file. You will need to count frequencies for each interval and then follow the steps you did above for filling in the p, %, and c% columns. Note that you may need to round some scores to place them into categories. Round values below .5 down to the full percentage below, and round values of .5 and above up to the full percentage above.
________________________________________
X f p % c%
90-100
80-89
70-79
60-69
50-59
40-49
________________________________________
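The rounding and counting for a grouped table can be sketched in Python as follows. This is an illustration under stated assumptions: the percent values are made up (not from students.sav), and the helper names round_half_up and grouped_frequencies are hypothetical. Note that Python's built-in round() rounds exact halves to the nearest even number, so the sketch uses floor(value + 0.5) to match the "round .5 and above up" rule given above.

```python
import math

def round_half_up(value):
    """Round .5 and above up, and anything below .5 down."""
    return math.floor(value + 0.5)

# Class intervals from the table, highest first: (label, lower, upper).
INTERVALS = [
    ("90-100", 90, 100),
    ("80-89", 80, 89),
    ("70-79", 70, 79),
    ("60-69", 60, 69),
    ("50-59", 50, 59),
    ("40-49", 40, 49),
]

def grouped_frequencies(percents):
    """Count how many rounded scores fall into each class interval."""
    counts = {label: 0 for label, _, _ in INTERVALS}
    for value in percents:
        whole = round_half_up(value)
        for label, lower, upper in INTERVALS:
            if lower <= whole <= upper:
                counts[label] += 1
                break
    return counts

# Hypothetical final percentages, invented for this example.
percents = [89.5, 89.4, 92.0, 71.3, 58.0, 79.5, 45.2]
counts = grouped_frequencies(percents)

for label, f in counts.items():
    print(label, f)
```

In this made-up data, 89.5 rounds up into the 90-100 interval while 89.4 stays in 80-89, which is exactly the kind of borderline case the rounding rule is meant to settle.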
SPSS will also create this table for you. Go to the "Analyze" menu, select "Descriptive statistics", and within that sub menu select "Frequencies".
SPSS will then ask you for which variable you want the table for.
For quiz 1 the frequency table output should look something like this:
(6) Please create a frequency distribution table using SPSS for quiz2. You may either print out the output created and staple it to your lab worksheet to hand in, or you can try to "cut" and "paste" the graphs directly into this worksheet.
Compare the two frequency distributions (quiz 1 and quiz 2) and answer the following questions:
(7) For which quiz do the scores appear to be more evenly distributed across the scale?
(8) Which quiz appeared to be harder? How do you know this?
In the previous sections we saw that one way to summarize and simplify an entire distribution of scores is to organize the scores in a frequency distribution table. In this section we will learn about several other ways to represent distributions, focusing primarily on graphic displays: bar charts, histograms, and stem-and-leaf plots.
Bar graphs
To display the distribution of a categorical variable, one should use a bar graph (pie charts are also used, but we won't be discussing these in this lab). SPSS will make a number of different kinds of bar charts (simple, clustered, and stacked); we'll focus on simple and clustered.
Bar chart (simple, clustered, and stacked): These are used most often to display the distribution of subjects or cases in certain categories, such as the number of A, B, C, D, and F grades in a given class.
Let's start by looking at the distribution of ethnicity (variable: ethnicit) in our students.sav datafile. Our graph will show the counts (or frequencies) for each ethnic category.
Step 1: First select bar graph from the "graphs" menu.
Step 2: Then select "simple" from the bar chart box.
Step 3: Then click define.
Step 4: Then select your variable and insert it into the category field.
You should get a bar chart that looks something like this.
Bar charts are also useful for presenting distributions that are "broken into" different categories.
For example, suppose that we wanted to know the mean scores on quiz 1 (the mean is basically the arithmetic average; we'll talk more about means next week), broken down by the three different sections. Here the quiz 1 scores are the response variable and section is the categorical explanatory variable.
Step 1: We'd select bar graph, select simple, but then we need to make some different selections in the bar graph window.
Step 2: We need to click on "other summary function", and then select the variable for which we want to plot the means (quiz 1 in this case). The default summary function is mean. Then we need to put the category variable in the category field.
We should end up with a graph that looks like this.
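The height of each bar in a chart like this is a group mean. As a rough sketch of that calculation in Python, assuming made-up (section, quiz 1 score) pairs rather than the real students.sav data, and a hypothetical helper named mean_by_group:

```python
from collections import defaultdict

def mean_by_group(pairs):
    """Return {group: mean of the values recorded for that group}."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of values, count]
    for group, value in pairs:
        totals[group][0] += value
        totals[group][1] += 1
    return {group: s / c for group, (s, c) in totals.items()}

# Hypothetical (section, quiz 1 score) pairs, invented for this example.
records = [(1, 8), (1, 6), (2, 9), (2, 7), (2, 8), (3, 5), (3, 7)]
means = mean_by_group(records)

# One row per section; the row of # marks stands in for the bar.
for section in sorted(means):
    print(section, f"{means[section]:.1f}", "#" * round(means[section]))
```

Each section contributes one bar, and sections with different numbers of students are still comparable because the mean divides by the group size.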
Suppose that we want to look at the same means by section but broken down by ethnicity. To do this we must use a clustered bar graph.
So select bar graph, then choose clustered. Now enter things as we did in the example above, except we must also select ethnicity for the 'cluster bars by' field.
We should end up with a graph that looks like this:
(9) Make a bar graph of the counts of the final grades (variable "grades" in the file) in the class (i.e. A, B, C,...). What was the most common grade in the course? Copy into worksheet.
(10) Make a bar graph of the counts of the final grades in the class (i.e. A, B,
C,...), further broken down by whether they attended the review session or not.
Copy into worksheet. Based on the graph, would you conclude that attending the
review session had an impact on final grades? Why?
Histograms
Suppose that we wish to know how the students did on quiz 1. We could try looking at all of the scores, but that's a lot of numbers. Instead, it is better to try to look at the entire distribution, rather than all of the individual scores. In the last lab we did this by creating frequency distribution tables. Another way to do it is to construct a histogram to represent the entire distribution. We should use a histogram because our variable (score on quiz1) is a continuous variable.
Histogram: A histogram is a pictorial representation of the distribution of values for a particular variable. The bars represent the number of occurrences of each value. Histograms look similar to bar graphs, except that they are used more often to indicate the number of subjects or cases in ranges of values for a continuous variable.
Using SPSS to create a histogram:
Creating a histogram of the students' scores on quiz1.
Step 1. At the top of the data window is a row of menus. To make graphs we will use the 'graphs' menu.
Step 2. Under this menu a large number of graphing options will appear. On the bottom third of the list is 'histogram'. This is the option that we'll use to look at distributions (for this lab at least).
Step 3. Select histogram. Now you'll get a window that looks like this:
Step 4. Select 'quiz 1' as your variable and then click okay. This should result in a new window (the output window) opening up, and it should have your histogram in it.
The histogram of quiz 1 is basically just a picture of the frequency distribution table. Below is a frequency distribution table and a histogram for quiz 1.
For quiz 1 the frequency table output should look something like this:
In this case the histogram is a little different than you might expect after comparing it to the frequency distribution table above. Why?
Because the histogram above is based on a grouped frequency distribution table of quiz 1 (see previous lab for discussion). Go ahead and group scores 10 & 9, 8 & 7, 6 & 5, etc., and see whether the histogram now looks as you'd expect.
An important lesson from this is that the size of the interval that you plot may influence the overall shape of the histogram. Below is a histogram of the quiz 1 scores. Use the sliding arrow to change the bin width and observe how the apparent shape of the distribution changes.
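The effect of interval size can also be seen by counting the same scores with different bin widths. The Python sketch below is an illustration only: the scores are made up (not from students.sav) and histogram_counts is a hypothetical helper. Setting low=1 with a width of 2 reproduces the pairing suggested above (9 & 10, 7 & 8, 5 & 6, and so on).

```python
def histogram_counts(scores, bin_width, low=0):
    """Count scores per bin; each bin starts at low + k * bin_width
    and covers bin_width consecutive whole-number scores."""
    counts = {}
    for s in scores:
        start = low + ((s - low) // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical quiz scores, invented for this example.
scores = [7, 9, 10, 7, 4, 8, 7, 9, 5, 7]

ungrouped = histogram_counts(scores, 1)        # one bar per score
paired = histogram_counts(scores, 2, low=1)    # bars for 1-2, 3-4, ..., 9-10

for name, counts in (("width 1", ungrouped), ("width 2", paired)):
    print(name)
    for start, f in counts.items():
        print(f"  {start:>2} {'#' * f}")
```

With a width of 1 the made-up scores show a gap at 6 and a single tall bar at 7; with a width of 2 the gap disappears and the picture looks smoother, even though the underlying data are identical.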
(11) Make histograms of quiz 2, 3, and 4.
Note: They should all be added to the same output window (so don't close the output window until you're done with the lab).
Now that we have a feel for how to look at distributions of variables let's return to our three quizzes (quiz 2, 3, & 4).
(12) Which quiz was the hardest? Which was the easiest? Why do you come to that conclusion?
(13) Which quiz(zes) was/were positively skewed? Which quiz(zes) was/were negatively skewed? Are there any that are not skewed (i.e. are roughly symmetric)?
(14) Are there any scores that may be potential outliers?
Hint: Three characteristics are used together to describe a distribution: shape, central tendency, and variability (we'll also consider outliers). We'll be talking about central tendency (roughly, the center of the distribution) and variability (how spread out the distribution is) in future labs.