What are the main measures and graphs of descriptive statistics? - Chapter 3

3.1 Which tables and graphs display data?
3.2 How do you describe the center of data using mean, median and mode?
3.3 How can you measure the variability of data?
3.4 How can you measure quartiles and other positions on a distribution?
3.5 How do you call statistics for multiple variables?
3.6 Which letters are used in formulas to mark the difference between the sample and the population?

3.1 Which tables and graphs display data?

Descriptive statistics create an overview or summary of data. There are two kinds of data, quantitative and categorical, and each has its own descriptive statistics.

To create an overview of categorical data, it is easiest to list the categories together with the frequency of each category. To compare the categories, the relative frequencies are listed too. The relative frequency of a category shows how often a subject falls within this category relative to the whole sample. It can be expressed as a proportion or as a percentage. The proportion is the number of observations within a category divided by the total number of observations. The percentage is the proportion multiplied by 100. The sum of all proportions should be 1.00, and the sum of all percentages should be 100.

Frequencies can be shown using a frequency distribution: a list of all possible values of a variable and the number of observations for each value. A relative frequency distribution also shows the proportion or percentage of the sample for each value.

Example (relative) frequency distribution:

Gender	Frequency	Proportion	Percentage
Male	150	0.43	43%
Female	200	0.57	57%
Total	350 (=n)	1.00	100%
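
The table above can be reproduced with a few lines of Python. This is only a minimal sketch: the list `observations` is a hypothetical raw data set built to match the counts in the table, and the function name `relative_frequencies` is made up for this illustration.

```python
from collections import Counter

# Hypothetical raw data matching the table: 150 "Male" and 200 "Female" observations.
observations = ["Male"] * 150 + ["Female"] * 200

def relative_frequencies(values):
    """Return {category: (frequency, proportion, percentage)}."""
    counts = Counter(values)
    n = len(values)
    return {
        category: (count, count / n, 100 * count / n)
        for category, count in counts.items()
    }

table = relative_frequencies(observations)
for category, (freq, prop, pct) in table.items():
    print(f"{category}\t{freq}\t{prop:.2f}\t{pct:.0f}%")

# The proportions add up to 1 and the percentages to 100 (up to rounding).
print(sum(prop for _, prop, _ in table.values()))
```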

Aside from tables, other visual displays are also used, such as bar graphs, pie charts, histograms and stem-and-leaf plots.

A bar graph is used for categorical variables and uses a bar for each category. The bars are separated to indicate that the graph doesn't display quantitative variables but categorical variables.

A pie chart is also used for categorical variables. Each slice represents a category. When the values are close together, bar graphs show the differences more clearly than pie charts.

Frequency distributions and other visual displays are also used for quantitative variables. In that case, the categories are replaced by intervals. Each interval has a frequency, a proportion and a percentage.

A histogram is a graph of the frequency distribution for a quantitative variable. Each value is represented by a bar. When there are many distinct values, it is easier to divide them into intervals and draw one bar per interval.
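
Dividing a quantitative variable into intervals can be sketched as follows. The ages, the interval width of 10 and the function name `interval_frequencies` are assumptions chosen for illustration, and the text bars are only a crude stand-in for a real histogram.

```python
from collections import Counter

def interval_frequencies(values, width):
    """Count observations per interval [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical ages, not taken from the text.
ages = [18, 21, 22, 25, 29, 31, 34, 35, 42, 47]
freq = interval_frequencies(ages, 10)

n = len(ages)
for start, count in freq.items():
    # Label assumes integer data; one "#" per observation gives a crude text histogram.
    print(f"{start}-{start + 9}\t{count}\t{count / n:.2f}\t{'#' * count}")
```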

A stem-and-leaf plot represents each observation using a stem and a leaf: two numbers that form the observation when put together. For example, the observation 34 has stem 3 and leaf 4. This kind of graph is only useful when the data set is small and you want to show the data quickly.
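
A stem-and-leaf plot for two-digit numbers can be sketched like this, taking the tens digit as stem and the units digit as leaf. The scores and the function name `stem_and_leaf` are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit integers by tens digit (stem); the unit digits are the leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

scores = [34, 55, 64, 58, 37, 61, 52]  # hypothetical observations
plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```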

When visual displays are given for a population, they are called population distributions. When they are given for a sample, they are called sample distributions.

The data can be shown using a curve in a graph. The bigger the sample, the more the sample graph resembles the curve of the population. The shape of a graph contains information on the distribution of the data. The most commonly used shape is the normal distribution, which is bell-shaped and symmetrical. If the x-axis indicates the value of a variable, then the y-axis indicates the relative frequency of that value. The highest point is in the middle, so the value in the middle is the most prevalent.

[Figure: normal distribution]

Another possibility is a U-shaped graph. The most prevalent values are then the lowest and the highest scores, which indicates polarization.

The two ends of a curve are called tails. If one tail is longer than the other, the distribution is not symmetrical but skewed. It is skewed to the right when the right tail is longer and skewed to the left when the left tail is longer.

3.2 How do you describe the center of data using mean, median and mode?

The mean, also called the average, is the best-known measure of the center of a frequency distribution of a quantitative variable. It is calculated as the sum of the observations divided by the total number of observations. For example, if a variable (y) has the values 34 (y₁), 55 (y₂) and 64 (y₃), then the mean (ȳ) is (34 + 55 + 64)/3 = 51. The symbol ȳ is pronounced y-bar.

The formula for calculating the mean is: $\bar{y} = \frac{\sum y_i}{n}$

The symbol ∑ is the Greek capital letter sigma and means the sum of what follows it. The subscript i runs from 1 to n (the sample size). So ∑ y_i means y₁ + y₂ + … + y_n (the sum of all observations).
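
The calculation of the mean for the three example values can be written out as a short sketch. The variable names `y`, `n` and `y_bar` simply mirror the symbols used in the text.

```python
y = [34, 55, 64]      # the example values y1, y2, y3
n = len(y)            # sample size
y_bar = sum(y) / n    # sum of all observations divided by n
print(y_bar)          # 51.0
```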

The mean can only be used for quantitative data and is very sensitive to outliers, that is, exceptionally high or low values.

For multiple samples (n₁ and n₂), multiple means can be found (ȳ₁ and ȳ₂).

Another way to describe the center is the median. The median is the observation that falls in the middle of the ordered sample. If a variable has the values 1, 3, 5, 8 and 10, then the median is 5. With an even number of observations, such as 1, 3, 8 and 10, the median is the mean of the two middle observations: (3 + 8)/2 = 5.5.
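
Both cases (odd and even number of observations) can be handled in one small function. This is a sketch for illustration; Python's standard library offers the same calculation as `statistics.median`.

```python
def median(values):
    """Middle observation of the ordered sample (mean of the two middle ones if n is even)."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

print(median([1, 3, 5, 8, 10]))  # 5
print(median([1, 3, 8, 10]))     # 5.5
```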

Important rules about the median are:

Apart from quantitative data, the median can also be found for categorical data on an ordinal scale, because the median only requires that the observations can be ordered.
For completely symmetrical data the median and the mean are the same.
For a skewed distribution the mean lies closer to the longer tail than the median does.
The median is not sensitive to outliers. This is both positive and negative. On the one hand, if there is just one outlier in the data, the median doesn't give a distorted picture of the data. On the other hand, very different data sets, with little or with huge variability, can share the same median.
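
The different sensitivity of the mean and the median to outliers can be checked with a small experiment. The second data set is hypothetical: it is the earlier example with the highest value replaced by an extreme one.

```python
from statistics import mean, median

data = [1, 3, 5, 8, 10]
with_outlier = [1, 3, 5, 8, 1000]   # hypothetical: highest value replaced by an outlier

print(mean(data), median(data))                  # mean 5.4, median 5
print(mean(with_outlier), median(with_outlier))  # mean 203.4, median still 5
```

The mean is pulled towards the outlier, while the median does not move at all.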

Compared to the mean, the median represents the sample better when there are outliers, and it is more informative when the distribution is very skewed. However, there are also cases where the median is less suitable. When the data are binary (only 0 or 1), the median is simply the more common of the two values, whereas the mean equals the proportion of observations that are 1. In other cases where the data are highly discrete, the mean also represents the data better than the median does.

Another measure is the mode: the value that occurs most often. The mode is useful for highly discrete variables, mostly categorical data.
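
Finding the mode amounts to counting. The sketch below uses `statistics.multimode` from the Python standard library (Python 3.8 or later), which returns every value that shares the highest frequency. The transport data are hypothetical.

```python
from statistics import multimode

# Hypothetical categorical data: means of transport of six respondents.
transport = ["car", "bike", "car", "train", "bike", "car"]
print(multimode(transport))     # ['car']

# With a tie there is more than one mode.
print(multimode([1, 1, 2, 2]))  # [1, 2]
```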

3.3 How can you measure the variability of data?

The variability of data refers to how much the values of a variable differ from each other, for instance how much the incomes of the respondents vary. The variability can be described in several ways.

First, the range can be calculated: the difference between the highest and the lowest observation. For example, for the values 4, 10, 16 and 20 the range is 20 – 4 = 16.

However, the most commonly used measure of variability is the standard deviation (s). A deviation is the difference between an observed value (y_i) and the sample mean (ȳ), so it is (y_i – ȳ). Every observation has its own deviation: positive when the observation is higher than the mean, negative when it is lower. The deviations can be inspected for each observation separately, but they can also be combined into one number for the whole variable, the standard deviation, which is based on the sum of all squared deviations. The formula for the standard deviation is:

$s = \sqrt{\frac{\sum (y_i-\bar{y})^2}{n-1}}$

The numerator under the square root, ∑ (y_i – ȳ)², is called the sum of squares. It squares the deviation of each observation and adds the squares up. The standard deviation indicates how much an observation typically deviates from the mean, so how much the data vary. When the standard deviation is 0, there is no variability at all.
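
The formula can be followed step by step for the values 4, 10, 16 and 20 from the range example. This is a sketch that spells out each intermediate quantity; the variable names are chosen for this illustration.

```python
from math import sqrt

y = [4, 10, 16, 20]
n = len(y)
y_bar = sum(y) / n                                 # mean: 12.5
deviations = [yi - y_bar for yi in y]              # [-8.5, -2.5, 3.5, 7.5]
sum_of_squares = sum(d ** 2 for d in deviations)   # 147.0
s = sqrt(sum_of_squares / (n - 1))                 # sqrt(147 / 3) = 7.0

print(sum(deviations))  # 0.0: the raw deviations cancel out, hence the squaring
print(s)                # 7.0
```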

The variance is:

$s^2 = \frac{\sum (y_i-\bar{y})^2}{n-1}$

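The variance is approximately the mean of the squared deviations (the sum of squares is divided by n – 1 rather than n). The standard deviation is the square root of the variance and is used more often as an indication of the variability, because it is expressed in the same units as the data.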

When data are available for the entire population, the population size is used in the denominator instead of n – 1 when calculating the standard deviation.
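
The Python standard library has separate functions for the two denominators: `statistics.variance` and `statistics.stdev` divide by n – 1, while `statistics.pvariance` and `statistics.pstdev` divide by the number of values. In the sketch below the same four values are treated once as a sample and once as a complete population, purely for comparison.

```python
from statistics import pstdev, pvariance, stdev, variance

y = [4, 10, 16, 20]

sample_var = variance(y)   # sum of squares / (n - 1) = 147 / 3
pop_var = pvariance(y)     # sum of squares / n       = 147 / 4

print(sample_var, stdev(y))   # 49 and 7.0
print(pop_var, pstdev(y))     # 36.75 and roughly 6.06
```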

For interpreting s, the so-called empirical rule can be used for bell-shaped distributions:

About 68% of the data lies between ȳ – s and ȳ + s.
About 95% of the data lies between ȳ – 2s and ȳ + 2s.
Nearly all observations lie between ȳ – 3s and ȳ + 3s.
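
For an exactly normal distribution these percentages can be computed rather than memorized. The sketch below uses `statistics.NormalDist` (Python 3.8 or later) with mean 0 and standard deviation 1; the helper name `share_within` is made up for this illustration. Real bell-shaped samples will only approximate these values.

```python
from statistics import NormalDist

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))  # percentage within k standard deviations
```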

Outliers have a big effect on the standard deviation.

3.4 How can you measure quartiles and other positions on a distribution?

Distributions can be interpreted with several kinds of positions. One way to divide a distribution into parts is to use percentiles. The pth percentile is the point such that p% of the observations fall at or below it and the remaining (100 – p)% fall above it. A percentile indicates a point on a distribution, not a part of it.

Another way is to divide a distribution into four parts. The 25th percentile is then called the lower quartile and the 75th percentile the upper quartile. The middle half of the data lies between these two quartiles, and the distance between them (upper quartile minus lower quartile) is called the interquartile range (IQR). The median splits this middle half in two. The lower quartile is the median of the lower half of the observations and the upper quartile is the median of the upper half. An advantage of the IQR compared to the range and the standard deviation is that the IQR is insensitive to outliers.
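
The "median of each half" description of the quartiles can be sketched as follows. The data and the function name `quartiles_by_halves` are hypothetical. For an odd number of observations this sketch leaves the median out of both halves, which is one common convention; statistical software may use slightly different rules and give slightly different quartiles.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (lower quartile, median, upper quartile) using the median of each half."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # With an odd n, skip the middle observation itself (assumed convention).
    upper_half = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower_half), median(data), median(upper_half)

values = [1, 3, 5, 8, 10, 12, 15, 20]   # hypothetical observations
q1, med, q3 = quartiles_by_halves(values)
iqr = q3 - q1
print(q1, med, q3, iqr)   # 4.0 9.0 13.5 9.5
```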

Five positions are often used to give a summary of a distribution: minimum, lower quartile, median, upper quartile and maximum. The positions can be shown in a boxplot, a graph that indicates the variability of data. The box of a boxplot contains the central 50% of the distribution.

The lines that extend from the box of a boxplot towards the minimum and maximum are called the whiskers. Outliers are marked separately as points beyond the whiskers. An observation is regarded as an outlier when it falls more than 1.5 × IQR below the lower quartile or above the upper quartile. A boxplot makes outliers very explicit, which should prompt the researcher to check whether the research methods were applied properly.
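
The five-number summary and the 1.5 × IQR rule can be combined in a short sketch. The data are hypothetical, and the quartiles are computed as the medians of the lower and upper half of an even-sized sample, which is only one of several conventions.

```python
from statistics import median

values = sorted([1, 3, 5, 8, 10, 12, 15, 60])   # hypothetical, even number of observations
half = len(values) // 2
q1 = median(values[:half])    # lower quartile: 4.0
q3 = median(values[half:])    # upper quartile: 13.5
iqr = q3 - q1                 # 9.5

lower_fence = q1 - 1.5 * iqr  # -10.25
upper_fence = q3 + 1.5 * iqr  # 27.75
outliers = [v for v in values if v < lower_fence or v > upper_fence]

five_number_summary = (values[0], q1, median(values), q3, values[-1])
print(five_number_summary)    # (1, 4.0, 9.0, 13.5, 60)
print(outliers)               # [60]
```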

Several sorts of graphs help to compare two or more groups, for instance a relative frequency distribution, histogram or two boxplots next to each other.

Another position is the z-score. This is the number of standard deviations that a value differs from the mean. The formula is: z = (observation – mean) / standard deviation. Contrary to other positions, the z-score can give information about a specific value.
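
As a small sketch, the z-score formula applied to the values 4, 10, 16 and 20 (mean 12.5, standard deviation 7) looks like this. The function name `z_score` is chosen for this illustration.

```python
from statistics import mean, stdev

def z_score(observation, mean_value, standard_deviation):
    """Number of standard deviations that an observation lies from the mean."""
    return (observation - mean_value) / standard_deviation

y = [4, 10, 16, 20]
y_bar, s = mean(y), stdev(y)   # 12.5 and 7.0

print(z_score(16, y_bar, s))   # 0.5: half a standard deviation above the mean
print(z_score(4, y_bar, s))    # negative: below the mean
```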

3.5 How do you call statistics for multiple variables?

Statistics is often about the association between two variables; whether one variable has an influence on another. This is called bivariate analysis.

Most research studies the effect of an explanatory variable (also called the independent variable) on a response variable (also called the dependent variable). The value of the response variable is thought to depend on the explanatory variable.

The association between two variables can be displayed in several ways. A contingency table lists the number of observations for each combination of categories of the two variables. A scatterplot is a graph with the explanatory variable on the x-axis and the response variable on the y-axis, with one dot for every observation, placed at its values on both variables. The strength of an association is measured by the correlation. Regression analysis predicts the value of y for a given value of x. When an association exists between variables, this doesn't necessarily mean that there is causality. For more than two variables, multivariate analysis is used.
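
This chapter does not give formulas for correlation or regression, so the sketch below rests on an assumption: it uses the standard Pearson correlation and the least-squares regression line. The data (hours studied and grade) and the function name `correlation_and_line` are hypothetical.

```python
from math import sqrt

def correlation_and_line(x, y):
    """Return (Pearson correlation r, slope, intercept) of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return r, slope, intercept

hours = [1, 2, 3, 4, 5]    # hypothetical explanatory variable (x)
grade = [2, 4, 5, 4, 5]    # hypothetical response variable (y)

r, slope, intercept = correlation_and_line(hours, grade)
print(round(r, 2))                      # strength of the association
print(round(intercept + slope * 6, 1))  # predicted y for x = 6
```

A high correlation in such data still says nothing about causality.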

3.6 Which letters are used in formulas to mark the difference between the sample and the population?

In statistics it's important not to lose sight of the difference between a statistic, which describes only the sample, and a parameter, which describes the entire population. Greek letters are used for population parameters, Roman letters for sample statistics. For a sample, ȳ indicates the mean and s the standard deviation. For a population, μ indicates the mean and σ the standard deviation. Sample statistics such as ȳ and s can themselves be regarded as variables, because their values differ from sample to sample. Population parameters cannot, because there is only one population.

This content is related to:

Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)

Author: Annemarie JoHo
