Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 2 summary

DIFFERENT TYPES OF DATA
A variable is any characteristic observed in a study. The data values that we observe for a variable are called observations. A variable can be categorical or quantitative.

  • A variable is categorical if each observation belongs to one of a set of distinct categories (e.g. religion, favourite sport). A categorical variable can have numerical values when the numbers only serve as labels and do not vary in quantity (e.g. bank account numbers, area codes).
  • Quantitative variables are variables that have numerical values and represent different magnitudes (e.g. weight, height, hours spent watching TV every day).

The key features to describe quantitative variables are the centre and the variability (spread) of the data (e.g. the average number of hours spent watching TV every day). The key feature to describe categorical variables is the relative number of observations in the various categories (e.g. the percentage of days in a year that it was sunny).

Quantitative variables can be discrete or continuous. A quantitative variable is discrete if its possible values form a set of separate numbers, such as 0, 1, 2, 3 (e.g. the number of pets in a household). A quantitative variable is continuous if its possible values form an interval, so that it can take values such as 0.13, 0.16 or 2.32 (e.g. weight: 68.3 kg).

The distribution of a variable describes how the observations fall (are distributed) across the range of possible values. The modal category is the category with the largest frequency.

A frequency table is a listing of possible values for a variable, together with the number of observations for each value.

Category      A        B        C
Frequency     17       23       9
Proportion    0.347    0.469    0.184
Percentage    34.7%    46.9%    18.4%

*an example of a frequency table*

The proportion of observations falling in a certain category is the number of observations in that category divided by the total number of observations. The percentage is the proportion multiplied by 100. Proportions and percentages are also called relative frequencies.

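$\text{Proportion} = \dfrac{\text{number of observations in the category}}{\text{total number of observations}}$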
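The calculation of frequencies, proportions and percentages can be sketched in Python. This is a minimal illustration, not part of the book: the function name `frequency_table` is hypothetical, and the raw observations are made up so that they reproduce the counts in the example table (17, 23 and 9).

```python
from collections import Counter

def frequency_table(observations):
    """Return {category: (frequency, proportion, percentage)}."""
    counts = Counter(observations)
    total = sum(counts.values())
    table = {}
    for category, frequency in sorted(counts.items()):
        proportion = frequency / total  # relative frequency
        table[category] = (frequency, round(proportion, 3), round(100 * proportion, 1))
    return table

# Hypothetical raw data matching the example table: 17 A's, 23 B's, 9 C's.
observations = ["A"] * 17 + ["B"] * 23 + ["C"] * 9
table = frequency_table(observations)

# The modal category is the one with the largest frequency.
modal_category = max(table, key=lambda category: table[category][0])

for category, (frequency, proportion, percentage) in table.items():
    print(f"{category}: frequency={frequency}, proportion={proportion}, percentage={percentage}%")
print("Modal category:", modal_category)
```

The proportions add up to 1 (apart from rounding), and the modal category here is B.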

GRAPHICAL SUMMARIES OF DATA
The two primary graphical displays for summarizing a categorical variable are the pie chart and the bar graph. A bar graph with categories ordered by their frequency is called a Pareto chart. The Pareto Principle states that a small subset of categories often contains most of the observations.

 

 

There are three common graphs for summarizing quantitative variables and visualizing their distributions.

  1. Dot plot
    A dot plot shows a dot for each observation, placed just above the value on the number line for that observation.
  2. Stem-and-Leaf plots
    A stem-and-leaf plot represents each observation by a stem and a leaf. The stem usually consists of all the digits except for the final one, which is the leaf. It is possible to truncate the data values: cut off the final digit without having to round it.
  3. Histogram
    A histogram is a graph that uses bars to portray the frequencies or the relative frequencies of the possible outcomes for a quantitative variable. The distribution shown can be unimodal or bimodal. If the distribution has a single mound or peak, it is called unimodal; if it has two distinct mounds or peaks, it is called bimodal.
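The stem-and-leaf construction described above can be sketched as follows. This is a minimal illustration that assumes non-negative integer observations; the function name `stem_and_leaf` and the exam scores are made up for the example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits except the last) to its sorted leaves (the last digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 73 -> stem 7, leaf 3
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical exam scores, made up for illustration.
scores = [52, 67, 68, 71, 73, 73, 75, 80, 84, 89, 91]
plot = stem_and_leaf(scores)

# Print every stem in the range, so that gaps in the data stay visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Turned on its side, the printed plot has the same shape as a histogram with intervals of width 10, while still showing the individual values.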

It is wise to always plot a histogram when summarizing the data. If the number of observations is small (fewer than 50), the histogram should be supplemented with a stem-and-leaf plot or a dot plot to show the numerical values of the observations. A unimodal distribution can be symmetric or skewed. The distribution is skewed if one side of the distribution stretches out longer than the other side. It is skewed to the right if the right tail is longer (the peak is at the left side) and skewed to the left if the left tail is longer (the peak is at the right side).

A data set collected over time is called a time series. A common pattern to look for is a trend over time, indicating a tendency of the data to either rise or fall. Time series can be displayed in either a time plot or a bar graph.

The mean is the sum of observations divided by the number of observations. It is interpreted as the balance point of the distribution. The median is the middle value of the observations when observations are ordered from smallest to largest. Here are some basic properties of the mean:

  • The mean is the balance point of data.
  • The mean is often not equal to any value that was observed in the sample.
  • For a skewed distribution, the mean is pulled in the direction of the longer tail, relative to the median.
  • The mean can be highly influenced by an outlier, an unusually small or unusually large observation.

The mean and the median can be compared. The shape of a distribution influences whether the mean is larger or smaller than the median.

  • If the distribution is perfectly symmetric, the mean equals the median.
  • If the distribution is skewed to the left, the mean is smaller than the median.
  • If the distribution is skewed to the right, the mean is larger than the median.

A numerical summary of the observations is called resistant if extreme observations have little, if any, influence on its value. The median is resistant, the mean is not. If a distribution is highly skewed, the median is usually preferred over the mean. If the distribution is close to symmetric or only mildly skewed, the mean is usually preferred over the median.
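A small sketch can illustrate why the median is resistant and the mean is not. The incomes below are hypothetical numbers chosen for the example; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical yearly incomes (in thousands), made up for illustration.
incomes = [28, 30, 31, 33, 35]
with_outlier = incomes + [400]  # add one unusually large observation

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"Without outlier: mean={mean_before:.1f}, median={median_before}")
print(f"With outlier:    mean={mean_after:.1f}, median={median_after}")
```

The single large value pulls the mean from about 31 to about 93, in the direction of the longer right tail, while the median only moves from 31 to 32.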

The mode is the value that occurs most frequently. The mode is often used with categorical variables. With a continuous variable, it is possible that no value occurs more than once, so that there is no mode.

MEASURING THE VARIABILITY OF QUANTITATIVE DATA
The deviation of an observation x from the mean is the difference between the observation and the sample mean. The sum of the deviations always equals zero. The variance is an average of the squared deviations: their sum divided by n − 1. The square root of the variance is called the standard deviation, s. It represents a typical distance, or a type of average distance, of an observation from the mean. The greater the standard deviation s, the greater the variability in the data. s equals 0 only when all the observations take the same value.

The standard deviation: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

This means: the square root of (the sum of squared deviations divided by sample size – 1)
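The formula can be followed step by step in a short sketch. The function name `sample_standard_deviation` and the data are hypothetical; the result is compared with `statistics.stdev` from Python's standard library, which also divides by n − 1.

```python
from math import sqrt
from statistics import stdev

def sample_standard_deviation(observations):
    """Square root of (sum of squared deviations divided by n - 1)."""
    n = len(observations)
    x_bar = sum(observations) / n                      # sample mean
    squared_deviations = [(x - x_bar) ** 2 for x in observations]
    variance = sum(squared_deviations) / (n - 1)       # sample variance
    return sqrt(variance)

# Hypothetical observations, made up for illustration (mean = 5).
data = [2, 4, 4, 5, 7, 8]
deviations = [x - sum(data) / len(data) for x in data]
s = sample_standard_deviation(data)

print("Sum of deviations:", sum(deviations))
print("Standard deviation:", round(s, 3))
print("statistics.stdev:  ", round(stdev(data), 3))
```

For these data the squared deviations sum to 24, so the variance is 24 / 5 = 4.8 and s is the square root of 4.8.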

The mean and median describe the centre of the distribution. The standard deviation and the range describe the variability of the distribution.

USING MEASURES OF POSITION TO DESCRIBE VARIABILITY
The median is a special case of a more general set of measures of position called percentiles. The pth percentile is a value such that p percent of the observations fall below or at that value. Three useful percentiles are the quartiles (1st quartile: p = 25; 2nd quartile: p = 50, the median; 3rd quartile: p = 75).

The quartiles are also used to define a measure of variability that is more resistant than the range and the standard deviation. The distance from Q1 to Q3 is called the interquartile range. It is possible to identify possible outliers using the interquartile range. An observation is a potential outlier if the observation falls more than 1.5 x IQR below the first quartile or more than 1.5 x IQR above the third quartile.
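The 1.5 × IQR criterion can be sketched as follows. This illustration assumes one common textbook convention for the quartiles: Q1 is the median of the lower half of the ordered data and Q3 the median of the upper half, leaving out the median itself when the number of observations is odd. Software packages may use slightly different conventions. The function names and the data are hypothetical.

```python
from statistics import median

def quartiles(observations):
    """Return (Q1, median, Q3), with Q1 and Q3 as medians of the lower and upper halves."""
    data = sorted(observations)
    half = len(data) // 2
    lower = data[:half]
    # When n is odd, the middle observation belongs to neither half.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

def potential_outliers(observations):
    """Observations more than 1.5 x IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(observations)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in observations if x < low or x > high]

# Hypothetical observations, made up for illustration.
data = [3, 5, 6, 7, 8, 9, 10, 11, 30]
q1, q2, q3 = quartiles(data)
print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={q3 - q1}")
print("Potential outliers:", potential_outliers(data))
```

Here Q1 = 5.5 and Q3 = 10.5, so the IQR is 5 and any observation below −2 or above 18 is flagged; only 30 qualifies.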

The five-number summary (minimum, first quartile, median, third quartile and maximum) is the basis of a graphical display called the box plot. The box of a box plot contains the central 50% of the distribution, from the first quartile to the third quartile.

A box plot does not portray certain features of a distribution, such as distinct mounds and possible gaps, as clearly as a histogram does. Box plots are useful for identifying potential outliers. Side-by-side box plots are useful for comparing groups, as they show differences in centres, variability and potential outliers.

The z-score of an observation is the number of standard deviations that it falls from the mean: z = (observation − mean) / standard deviation.
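As a minimal sketch, z-scores can be computed with the sample mean and the sample standard deviation. The function name `z_scores` and the data are hypothetical; `mean` and `stdev` come from Python's standard `statistics` module.

```python
from statistics import mean, stdev

def z_scores(observations):
    """Number of standard deviations each observation falls from the mean."""
    x_bar = mean(observations)
    s = stdev(observations)  # sample standard deviation (divides by n - 1)
    return [(x - x_bar) / s for x in observations]

# Hypothetical observations, made up for illustration (mean = 5).
data = [2, 4, 4, 5, 7, 8]
scores = z_scores(data)
for x, z in zip(data, scores):
    print(f"observation={x}, z-score={z:+.2f}")
```

A positive z-score means the observation lies above the mean, a negative one means it lies below, and an observation equal to the mean has a z-score of 0.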

RECOGNIZING AND AVOIDING MISUSES OF GRAPHICAL SUMMARIES
The following guidelines are useful when constructing a graph:

  • Label both axes and provide a heading to make clear what the graph is intended to portray.
  • The vertical axis should usually start at 0.
  • Make sure the graph does not misrepresent the relative percentages.
  • Sometimes it is useful to use multiple graphs, for example when the groups being compared differ greatly in size.

 

Follow the author: JesperN