Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Book summary
DIFFERENT TYPES OF DATA
A variable is any characteristic observed in a study. The data values that we observe for a variable are called observations. A variable can be categorical or quantitative.
The key features for describing a quantitative variable are the centre and the variability (spread) of the data (e.g. the average number of hours spent watching TV every day). The key feature for describing a categorical variable is the relative number of observations in the various categories (e.g. the percentage of days in a year that were sunny).
Quantitative variables can be discrete or continuous. A quantitative variable is discrete if its possible values form a set of separate numbers, such as 0, 1, 2, 3 (e.g. the number of pets in a household). A quantitative variable is continuous if its possible values form an interval, so that it can take values such as 0.13, 0.16 or 2.32 (e.g. weight: 68.3 kg).
The distribution of a variable describes how the observations fall (are distributed) across the range of possible values. The modal category is the category with the largest frequency.
A frequency table is a listing of possible values for a variable, together with the number of observations for each value.
Category | A | B | C |
Frequency | 17 | 23 | 9 |
Proportion | 0.347 | 0.469 | 0.184 |
Percentage | 34.7% | 46.9% | 18.4% |
*an example of a frequency table*
The proportion of observations falling in a certain category is the number of observations in that category divided by the total number of observations. The percentage is the proportion multiplied by 100. Proportions and percentages are also called relative frequencies.
$\text{Proportion} = \frac{\text{number of observations in the category}}{\text{total number of observations}}$
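The proportions and percentages in the example frequency table can be reproduced with a few lines of Python. This is only a minimal sketch: the frequencies 17, 23 and 9 come from the table, while the variable names are chosen here for illustration.

```python
# Frequencies taken from the example frequency table.
frequencies = {"A": 17, "B": 23, "C": 9}

total = sum(frequencies.values())  # total number of observations

# Proportion = category frequency / total; percentage = proportion * 100.
proportions = {cat: count / total for cat, count in frequencies.items()}
percentages = {cat: 100 * p for cat, p in proportions.items()}

# The modal category is the category with the largest frequency.
modal_category = max(frequencies, key=frequencies.get)

for cat in frequencies:
    print(f"{cat}: {frequencies[cat]:>3}  {proportions[cat]:.3f}  {percentages[cat]:.1f}%")
print("Modal category:", modal_category)
```

The proportions always add up to 1, which is a quick check that no category was left out.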
GRAPHICAL SUMMARIES OF DATA
The two primary graphical displays for summarizing a categorical variable are the pie chart and the bar graph. A bar graph with categories ordered by their frequency is called a Pareto chart. The Pareto Principle states that a small subset of categories often contains most of the observations.
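As a rough illustration of the ordering behind a Pareto chart, the sketch below sorts categories from most to least frequent and tracks the cumulative percentage. The frequencies are reused from the example frequency table; the text bars and the names `ordered` and `pareto_rows` are assumptions made for this example, not part of the book.

```python
frequencies = {"A": 17, "B": 23, "C": 9}
total = sum(frequencies.values())

# Order categories from most to least frequent, as a Pareto chart does.
ordered = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

cumulative = 0
pareto_rows = []
for category, count in ordered:
    cumulative += count
    pareto_rows.append((category, count, 100 * cumulative / total))

for category, count, cum_pct in pareto_rows:
    bar = "#" * count  # crude text bar: one '#' per observation
    print(f"{category} {bar} {count} (cumulative {cum_pct:.1f}%)")
```

The cumulative column shows how quickly the first few categories account for most of the observations.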
There are three common ways of displaying a quantitative variable and visualizing its distribution: the dot plot, the stem-and-leaf plot and the histogram.
It is wise to always plot a histogram when summarizing the data. If the number of observations is small (fewer than 50), the histogram should be supplemented with a stem-and-leaf plot or a dot plot to show the numerical values of the observations. A unimodal distribution can be symmetric or skewed. The distribution is skewed if one side of the distribution stretches out longer than the other side, and it is described by the direction of that longer tail: skewed to the right or skewed to the left. If the peak is at the left side and the tail stretches to the right, the distribution is skewed to the right.
A data set collected over time is called a time series. A common pattern to look for is a trend over time, indicating a tendency of the data to either rise or fall. Time series can be displayed in either a time plot or a bar graph.
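A stem-and-leaf plot for a small data set can be sketched in plain Python. The version below assumes non-negative whole numbers and uses the tens as the stem and the units as the leaf. The function name `stem_and_leaf` and the data are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    # Include empty stems so that gaps in the data remain visible.
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_text = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {leaf_text}")
    return rows


scores = [12, 15, 21, 23, 23, 28, 30, 31, 35, 47]  # hypothetical data
for row in stem_and_leaf(scores):
    print(row)
```

Each row shows the individual observations, which a histogram hides once values are grouped into intervals.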
The mean is the sum of observations divided by the number of observations. It is interpreted as the balance point of the distribution. The median is the middle value of the observations when observations are ordered from smallest to largest. Here are some basic properties of the mean:
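The mean and the median can be compared. The shape of a distribution influences whether the mean is larger or smaller than the median: in a distribution skewed to the right the mean usually lies above the median, in a distribution skewed to the left it usually lies below the median, and in a symmetric distribution the two are roughly equal.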
A numerical summary of the observations is called resistant if extreme observations have little, if any, influence on its value. The median is resistant, the mean is not. If a distribution is highly skewed, the median is usually preferred over the mean. If the distribution is close to symmetric or only mildly skewed, the mean is usually preferred over the median.
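A small experiment shows what resistance means in practice. The sketch below uses Python's standard `statistics` module on hypothetical values and adds a single extreme observation.

```python
from statistics import mean, median

values = [20, 22, 25, 27, 30]         # hypothetical, roughly symmetric data
values_with_outlier = values + [400]  # the same data plus one extreme observation

print("without outlier:", mean(values), median(values))
print("with outlier:   ", mean(values_with_outlier), median(values_with_outlier))
```

The single extreme value pulls the mean from 24.8 to roughly 87.3, while the median only moves from 25 to 26.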
The mode is the value that occurs most frequently. The mode is often used with categorical variables. For a continuous variable it is possible that no value occurs more than once, in which case there is no meaningful mode.
MEASURING THE VARIABILITY OF QUANTITATIVE DATA
The deviation of an observation x from the mean is the difference between the observation and the sample mean, x − x̄. The sum of the deviations always equals zero. The variance is an average of the squared deviations, obtained by dividing their sum by n − 1. The square root of the variance is called the standard deviation, s. It represents a typical distance, or a type of average distance, of an observation from the mean. The greater the standard deviation s, the greater the variability in the data. s equals 0 only when all the observations take the same value.
The standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
This means: the square root of (the sum of squared deviations divided by (the sample size minus 1)).
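The calculation can be followed step by step in Python. This sketch uses a hypothetical sample and compares the manual result with `statistics.stdev`, which also divides by n − 1.

```python
from math import sqrt
from statistics import mean, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

x_bar = mean(data)                           # sample mean
deviations = [x - x_bar for x in data]       # x - x_bar for each observation
sum_sq = sum(d ** 2 for d in deviations)     # sum of squared deviations
variance = sum_sq / (len(data) - 1)          # divide by n - 1
s = sqrt(variance)                           # standard deviation

print("sum of deviations:", sum(deviations))
print("manual s:", s)
print("statistics.stdev:", stdev(data))
```

The deviations sum to zero, which is why they are squared before being averaged.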
The mean and median describe the centre of the distribution. The standard deviation and the range describe the variability of the distribution.
USING MEASURES OF POSITION TO DESCRIBE VARIABILITY
The median is a special case of a more general set of measures of position called percentiles. The pth percentile is a value such that p percent of the observations fall below or at that value. Three useful percentiles are the quartiles (1st quartile: p = 25; 2nd quartile: p = 50, the median; 3rd quartile: p = 75).
The quartiles are also used to define a measure of variability that is more resistant than the range and the standard deviation. The distance from Q1 to Q3 is called the interquartile range (IQR). The interquartile range can be used to identify potential outliers: an observation is a potential outlier if it falls more than 1.5 x IQR below the first quartile or more than 1.5 x IQR above the third quartile.
The five-number summary (minimum, first quartile, median, third quartile and maximum) is the basis of a graphical display called the box plot. The box of a box plot contains the central 50% of the distribution, from the first quartile to the third quartile.
A box plot does not portray certain features of a distribution, such as distinct mounds and possible gaps, as clearly as a histogram does. Box plots are useful for identifying potential outliers. Side-by-side box plots are useful for comparing groups, as they show differences in centre, variability and potential outliers.
The z-score of an observation is the number of standard deviations that it falls from the mean.
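The 1.5 x IQR rule can be sketched as follows. Software packages compute quartiles in slightly different ways. This example assumes the convention of taking the median of the lower and upper halves of the ordered data (leaving out the middle value when the number of observations is odd). The function names and the data are hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the 'median of each half' convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # observations below the median position
    upper = ordered[half + n % 2:]  # skips the middle value when n is odd
    return median(lower), median(ordered), median(upper)


def potential_outliers(values):
    """Return the observations lying more than 1.5 x IQR outside the quartiles."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]


data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical data
q1, q2, q3 = quartiles(data)
print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", q3 - q1)
print("Potential outliers:", potential_outliers(data))
```

With these values the quartiles are 2.5 and 7.5, so the IQR is 5 and only the observation 30 lies beyond the upper fence of 15.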
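In symbols, the z-score is (observation − mean) / standard deviation. A minimal Python sketch, using a hypothetical sample and the sample standard deviation from the `statistics` module:

```python
from statistics import mean, stdev


def z_scores(values):
    """Return the z-score of each observation: (x - mean) / standard deviation."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(x - x_bar) / s for x in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
z = z_scores(data)
for x, score in zip(data, z):
    print(f"x = {x}: z = {score:+.2f}")
```

A positive z-score means the observation lies above the mean, a negative z-score means it lies below the mean.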
RECOGNIZING AND AVOIDING MISUSES OF GRAPHICAL SUMMARIES
The following things are useful when constructing a graph: