What are the main measures and graphs of descriptive statistics? - Chapter 3

3.1 Which tables and graphs display data?

Descriptive statistics serve to create an overview or summary of data. There are two kinds of data, quantitative and categorical, and each has its own descriptive statistics.

To create an overview of categorical data, it is easiest to list the categories together with the frequency of each category. To compare the categories, the relative frequencies are listed too. The relative frequency of a category shows how often a subject falls within this category relative to the whole sample. It can be expressed as a proportion or as a percentage. The proportion is the number of observations within a category divided by the total number of observations; the percentage is the proportion multiplied by 100. The proportions should sum to 1.00 and the percentages to 100 (apart from rounding).

Frequencies can be shown in a frequency distribution: a list of all possible values of a variable together with the number of observations for each value. A relative frequency distribution also shows each frequency relative to the sample size.

Example (relative) frequency distribution:

Gender     Frequency    Proportion   Percentage
Male       150          0.43         43%
Female     200          0.57         57%
Total      350 (= n)    1.00         100%
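The proportions and percentages in the example table can be recomputed from the frequencies. The following is a minimal Python sketch; the function name `relative_frequencies` is only an illustrative choice.

```python
# Frequencies taken from the example table (gender of 350 respondents).
frequencies = {"Male": 150, "Female": 200}

n = sum(frequencies.values())  # total number of observations


def relative_frequencies(freqs):
    """Return {category: (proportion, percentage)} for a frequency table."""
    total = sum(freqs.values())
    return {cat: (count / total, 100 * count / total) for cat, count in freqs.items()}


table = relative_frequencies(frequencies)
for category, (proportion, percentage) in table.items():
    print(f"{category:<8}{frequencies[category]:>5}{proportion:>7.2f}{percentage:>6.0f}%")
```

The proportions add up to 1 and the percentages to 100, which is a quick check that no category was forgotten.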

Besides tables, other visual displays are used as well, such as bar graphs, pie charts, histograms and stem-and-leaf plots.

A bar graph is used for categorical variables and uses a bar for each category. The bars are separated to indicate that the graph doesn't display quantitative variables but categorical variables.

A pie chart is also used for categorical variables. Each slice represents a category. When the values are close together, bar graphs show the differences more clearly than pie charts.

Frequency distributions and other visual displays are also used for quantitative variables. In that case, the categories are replaced by intervals. Each interval has a frequency, a proportion and a percentage.

A histogram is a graph of the frequency distribution for a quantitative variable. Each value is represented by a bar. When there are many distinct values, it is easier to group them into intervals and draw one bar per interval.
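Grouping a quantitative variable into intervals can be sketched in a few lines of Python. The ages below are made-up data and the interval width of 10 is an arbitrary choice; the `#` bars give a rough text version of a histogram.

```python
from collections import Counter


def interval_frequencies(values, width):
    """Count observations per interval [lower, lower + width)."""
    counts = Counter((value // width) * width for value in values)
    return dict(sorted(counts.items()))


ages = [18, 21, 22, 25, 29, 31, 34, 35, 38, 42, 47, 51]  # made-up data
width = 10

for lower, count in interval_frequencies(ages, width).items():
    # One text "bar" per interval; its length is the frequency.
    print(f"{lower}-{lower + width - 1}: {'#' * count}")
```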

A stem-and-leaf plot represents each observation by a stem and a leaf: two numbers that together form the observation. For example, the observation 64 has stem 6 and leaf 4. This kind of plot is only useful when there are few observations and you want to display the data quickly.
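A stem-and-leaf plot for two-digit numbers can be built by splitting each observation into its tens digit (stem) and units digit (leaf). This is a small sketch with made-up scores; it assumes non-negative whole numbers.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens) to the ordered list of its leaves (units)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


scores = [34, 55, 64, 58, 61, 47, 52, 39, 66, 50]  # made-up data

for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```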

When visual displays are given for a population, then they're called population distributions. When they're given for samples, they're called sample distributions.

The data can be shown using a curve in a graph. The bigger the sample, the more the sample graph resembles the curve of the population. The shape of a graph contains information on the distribution of the data. The most commonly used shape is the normal distribution, which is bell-shaped and symmetrical. If the x-axis indicates the value of a variable, then the y-axis indicates the relative frequency of that value. The highest point is in the middle, so the value in the middle is the most prevalent.

[Figure: normal distribution]

Another possibility is a U-shaped graph. The most prevalent values are then the lowest and the highest scores, which indicates polarization.

The two ends of a curve are called tails. If one tail is longer than the other, the distribution is not symmetrical but skewed: skewed to the right if the right tail is longer, skewed to the left if the left tail is longer.

3.2 How do you describe the center of data using mean, median and mode?

The average is the best-known measure to describe the center of a frequency distribution of a quantitative variable. The average is also called the mean and it is calculated as the sum of the observations divided by the total number of observations. For example, if a variable (y) has the values 34 (y1), 55 (y2) and 64 (y3), then the mean (ȳ) is (34 + 55 + 64)/3 = 51. The symbol ȳ is pronounced y-bar.

The formula for calculating the mean is: ȳ = (∑ yi) / n

The symbol ∑ is the Greek capital letter sigma and means "the sum of what follows". The index i runs from 1 to n (the sample size). So ∑ yi means y1 + y2 + … + yn (the sum of all observations).
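The calculation of the mean translates directly into code. A minimal Python sketch, reusing the three observations from the example:

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


y = [34, 55, 64]   # y1, y2, y3 from the example
y_bar = mean(y)
print(y_bar)       # 51.0
```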

The mean can only be used for quantitative data and is very sensitive to outliers, that is, exceptionally high or low values.

For multiple samples (n1 and n2), multiple means can be found (ȳ1 and ȳ2).

Another way to describe the center is the median. The median is the observation that falls in the middle of the ordered sample. If a variable has the values 1, 3, 5, 8 and 10, then the median is 5. With an even number of observations, such as 1, 3, 8 and 10, the median is the mean of the two middle observations: (3 + 8)/2 = 5.5.
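The two cases (odd and even number of observations) can be written out as follows. This is a plain Python sketch of the rule described above; the standard library function `statistics.median` follows the same rule.

```python
def median(values):
    """Middle observation of the ordered sample (mean of the two middle ones if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([1, 3, 5, 8, 10]))  # 5
print(median([1, 3, 8, 10]))     # 5.5
```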

Important rules about the median are:

  • Apart from quantitative data, the median can also be found for categorical data on an ordinal scale, because the median only requires that the observations can be ordered.

  • For completely symmetrical data the median and the mean are the same.

  • In a skewed distribution, the mean lies closer to the longer tail than the median does.

  • The median is not sensitive to outliers. This is both positive and negative. On the one hand, if there is just one outlier in the data, the median doesn't give a biased portrayal of the data. On the other hand, very different sets of data, with very different variability, can have the same median.
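The behaviour of the mean and the median under symmetry and skew can be checked with a small experiment. The two data sets below are made up for illustration; only the largest value differs.

```python
from statistics import mean, median

symmetric = [1, 3, 5, 7, 9]
skewed = [1, 3, 5, 7, 90]   # same data, but with one extreme value in the right tail

print(mean(symmetric), median(symmetric))  # mean and median coincide
print(mean(skewed), median(skewed))        # the mean is pulled toward the long tail
```

The median stays at 5 in both cases, while the mean moves from 5 to 21.2.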

Compared to the mean, the median represents the sample better when there are outliers, and it is more informative when the distribution is very skewed. However, there are also cases where the median is less suitable for representing the data. When the data are binary (only 0 or 1), the median is simply the more common of the two values, whereas the mean equals the proportion of observations that are 1. In other cases where the data are highly discrete, the mean also represents the data better than the median does.

Another measure of the center is the mode: the value that occurs most often. The mode is useful for highly discrete variables, mostly categorical data.
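Finding the mode comes down to counting. The sketch below returns all values that share the highest frequency, since a variable can have more than one mode; the transport data are made up.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


transport = ["car", "bike", "car", "bus", "car", "bike"]  # made-up categorical data
print(modes(transport))        # ['car']
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
```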

3.3 How can you measure the variability of data?

The variability of data refers to how much the values of a variable differ from one another, for instance how far the incomes of the respondents are spread out. The variability can be described in several ways.

First, the range can be calculated: the difference between the highest and the lowest observation. For example, for the values 4, 10, 16 and 20 the range is 20 – 4 = 16.

However, the most commonly used measure of variability is the standard deviation (s). A deviation is the difference between an observed value (yi) and the sample mean (ȳ), so it is (yi – ȳ). Every observation has its own deviation: positive when the observation is higher than the mean, negative when it is lower. The deviations themselves always sum to 0, so the standard deviation summarizes them for a variable by using the sum of the squared deviations. The formula for the standard deviation is:

s = √( ∑ (yi – ȳ)² / (n – 1) )

The upper part of the formula, ∑ (yi – ȳ)², is called the sum of squares. It squares the deviation of each observation and adds the results. The standard deviation indicates how far an observation typically deviates from the mean, so how much the data vary. When the standard deviation is 0, there is no variability at all.

The variance is:

s² = ∑ (yi – ȳ)² / (n – 1)

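The variance is approximately the mean of the squared deviations (the sum of squares is divided by n – 1 rather than n). The standard deviation is used more often as an indication of the variability than the variance, because it is expressed in the same units as the data.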

When data are available for the entire population, the population size (N) is used in the denominator instead of n – 1 when calculating the standard deviation.
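The steps for the sum of squares, the variance and the standard deviation can be followed in a short Python sketch. It reuses the values 4, 10, 16 and 20 from the range example; the function names are illustrative only.

```python
from math import sqrt


def sum_of_squares(values):
    """Sum of the squared deviations from the mean."""
    y_bar = sum(values) / len(values)
    return sum((y - y_bar) ** 2 for y in values)


def sample_variance(values):
    """Sum of squares divided by n - 1 (sample data)."""
    return sum_of_squares(values) / (len(values) - 1)


def sample_sd(values):
    return sqrt(sample_variance(values))


def population_sd(values):
    """Same idea, but dividing by the population size instead of n - 1."""
    return sqrt(sum_of_squares(values) / len(values))


y = [4, 10, 16, 20]                # mean is 12.5
print(sum_of_squares(y))           # 147.0
print(sample_variance(y))          # 49.0
print(sample_sd(y))                # 7.0
print(round(population_sd(y), 2))  # 6.06
```

Dividing by the population size gives a slightly smaller value than dividing by n – 1; the difference shrinks as the number of observations grows.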

For interpreting s, the so-called empirical rule can be used for bell-shaped distributions:

  • About 68% of the data lies between ȳ – s and ȳ + s.

  • About 95% of the data lies between ȳ – 2s and ȳ + 2s.

  • All or nearly all observations lie between ȳ – 3s and ȳ + 3s.
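The empirical rule can be checked on simulated bell-shaped data. The sketch below draws 10,000 values from a normal distribution with an assumed mean of 50 and standard deviation of 10 (both arbitrary choices) and counts how many fall within 1, 2 and 3 standard deviations of the sample mean.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)                             # fixed seed so the run is repeatable
data = [rng.gauss(50, 10) for _ in range(10_000)]  # simulated bell-shaped data

y_bar, s = mean(data), stdev(data)


def share_within(k):
    """Proportion of observations between y_bar - k*s and y_bar + k*s."""
    return sum(y_bar - k * s <= y <= y_bar + k * s for y in data) / len(data)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The printed shares should come out close to 68%, 95% and nearly 100%; for data that are not bell-shaped the rule need not hold.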

Outliers have a big effect on the standard deviation.

3.4 How can you measure quartiles and other positions on a distribution?

Distributions can be interpreted using several kinds of positions. One way to divide a distribution into parts is to use percentiles. The pth percentile is the point such that p% of the observations fall at or below it and the remaining (100 – p)% fall above it. A percentile is a point on the distribution, not a portion of it.
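The definition of a percentile can be tried out in code. The sketch below uses the "nearest rank" convention, which is only one of several conventions in use, so statistical software may return slightly different values. The data are made up.

```python
import math


def percent_at_or_below(values, point):
    """Percentage of observations that fall at or below the given point."""
    return 100 * sum(v <= point for v in values) / len(values)


def percentile(values, p):
    """Nearest-rank percentile: smallest observation with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


data = [1, 3, 5, 8, 10, 12, 15, 20]  # made-up data
print(percentile(data, 25))          # 3
print(percentile(data, 75))          # 12
print(percent_at_or_below(data, 3))  # 25.0
```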

Another way is to divide a distribution into four parts. The 25th percentile is then called the lower quartile and the 75th percentile the upper quartile. The middle half of the data lies between these two quartiles, and the distance between them (upper quartile minus lower quartile) is called the interquartile range (IQR). The median splits this middle half in two. The lower quartile is the median of the lower half of the data and the upper quartile is the median of the upper half. An advantage of the IQR over the range and the standard deviation is that it is insensitive to outliers.

Five positions are often used to give a summary of a distribution: minimum, lower quartile, median, upper quartile and maximum. The positions can be shown in a boxplot, a graph that indicates the variability of data. The box of a boxplot contains the central 50% of the distribution.

The lines that extend from the box toward the minimum and maximum are called the whiskers. Outliers are marked with separate points beyond the whiskers. An observation is regarded as an outlier when it falls more than 1.5 × IQR below the lower quartile or above the upper quartile. A boxplot makes outliers very explicit, which should prompt the researcher to check again whether the research methods were applied properly.
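The five positions and the 1.5 × IQR criterion can be computed as follows. This sketch takes the quartiles as the medians of the lower and upper half of the ordered data (leaving out the middle observation when n is odd); other quartile conventions exist and can give slightly different results. The data are made up.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_quartile = median(ordered[:half])   # median of the lower half
    upper_quartile = median(ordered[-half:])  # median of the upper half
    return ordered[0], lower_quartile, median(ordered), upper_quartile, ordered[-1]


def outliers(values):
    """Observations more than 1.5 * IQR below the lower or above the upper quartile."""
    _, lower_quartile, _, upper_quartile, _ = five_number_summary(values)
    iqr = upper_quartile - lower_quartile
    low_fence = lower_quartile - 1.5 * iqr
    high_fence = upper_quartile + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


data = [1, 3, 5, 8, 10, 12, 15, 60]  # made-up data with one extreme value
print(five_number_summary(data))     # (1, 4.0, 9.0, 13.5, 60)
print(outliers(data))                # [60]
```

Here the IQR is 13.5 – 4 = 9.5, so every observation above 13.5 + 1.5 × 9.5 = 27.75 counts as an outlier.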

Several sorts of graphs help to compare two or more groups, for instance a relative frequency distribution, histogram or two boxplots next to each other.

Another measure of position is the z-score: the number of standard deviations that a value lies from the mean. The formula is: z = (observation – mean) / standard deviation. Unlike the other positions, the z-score gives information about a specific value.
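A z-score is a one-line calculation once the mean and standard deviation are known. The sketch below uses the sample standard deviation and the values 4, 10, 16 and 20, for which the mean is 12.5 and s is 7.

```python
from statistics import mean, stdev


def z_score(observation, values):
    """Number of sample standard deviations that the observation lies from the mean."""
    return (observation - mean(values)) / stdev(values)


y = [4, 10, 16, 20]    # mean 12.5, s = 7
print(z_score(16, y))  # 0.5
print(z_score(9, y))   # -0.5
```

A positive z-score means the value lies above the mean, a negative one that it lies below.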

3.5 What are statistics for multiple variables called?

Statistics is often about the association between two variables: whether one variable has an influence on another. Analyzing two variables together is called bivariate analysis.

Most often, a study examines the effect of an explanatory variable (also called independent variable) on a response variable (also called dependent variable). The outcome of the response variable is thought to depend on the explanatory variable.

The influence of one variable on another can be displayed in several ways. A contingency table lists the number of observations for each combination of values of the two variables. A scatterplot is a graph with the explanatory variable on the x-axis and the response variable on the y-axis. Each observation is shown as a dot at its pair of values. The strength of an association is described by the correlation. Regression analysis predicts the value of y for a given value of x. When an association exists between variables, this doesn't necessarily mean that there is causality. For more than two variables, multivariate analysis is used.
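The text does not give formulas for correlation or regression, so the sketch below is only an illustration under two assumptions: it uses the Pearson correlation coefficient as the measure of correlation and a least-squares straight line as the regression. The study hours and grades are made-up data.

```python
from math import sqrt


def summarize_association(x, y):
    """Return (Pearson correlation, slope, intercept) for paired observations."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    r = sxy / sqrt(sxx * syy)           # strength and direction of the linear association
    slope = sxy / sxx                   # least-squares slope
    intercept = y_bar - slope * x_bar   # the line passes through (x_bar, y_bar)
    return r, slope, intercept


hours = [1, 2, 3, 4, 5]    # explanatory variable (made-up data)
grades = [2, 4, 5, 4, 5]   # response variable (made-up data)

r, slope, intercept = summarize_association(hours, grades)
print(round(r, 2))                      # 0.77
print(round(intercept + slope * 6, 1))  # predicted grade for 6 hours: 5.8
```

A correlation near 1 or –1 indicates a strong linear association, a correlation near 0 a weak one. Neither the correlation nor the fitted line says anything about causality.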

3.6 Which letters are used in formulas to mark the difference between the sample and the population?

In statistics it's important not to lose sight of the difference between a statistic, which describes only the sample, and a parameter, which describes the entire population. Greek letters are used for population parameters, Roman letters for sample statistics. For a sample, ȳ indicates the mean and s the standard deviation. For a population, μ indicates the mean and σ the standard deviation. The sample mean and the sample standard deviation can also be regarded as variables, because their values differ from sample to sample. For the population parameters this isn't possible, because there is only one population.
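The difference between a fixed parameter and a varying statistic can be made visible with a small simulation. The population below (the whole numbers 1 to 100) and the sample size of 10 are arbitrary choices for illustration.

```python
import random
from statistics import mean

population = list(range(1, 101))  # a made-up population of 100 values
mu = mean(population)             # population mean: a single fixed number
print(mu)                         # 50.5

rng = random.Random(1)            # fixed seed so the run is repeatable
sample_means = [mean(rng.sample(population, 10)) for _ in range(5)]
print(sample_means)               # five different values of y-bar
```

Each sample gives its own ȳ, while μ stays the same no matter how many samples are drawn.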

Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)

Follow the author: Annemarie JoHo