Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 2 summary

DIFFERENT TYPES OF DATA
A variable is any characteristic observed in a study. The data values that we observe for a variable are called observations. A variable can be categorical or quantitative.

  • Categorical variables are variables whose values belong to a distinct set of categories. A categorical variable can have numerical labels, but the numbers do not represent quantities. (e.g: religion, favourite sport, bank account number, area code)
  • Quantitative variables are variables that take numerical values that represent different magnitudes. (e.g: weight, height, hours spent watching TV every day)

Key features for describing a quantitative variable are the centre and the variability (spread) of the data (e.g: the average number of hours spent watching TV every day). The key feature for describing a categorical variable is the relative number of observations in the various categories. (e.g: the percentage of days in a year that were sunny)

Quantitative variables can be discrete or continuous. A quantitative variable is discrete if its possible values form a set of separate numbers, such as 0, 1, 2, 3 (e.g: the number of pets in a household). A quantitative variable is continuous if its possible values form an interval, such as 0.16, 0.13, 2.32 (e.g: weight: 68.3 kg).

The distribution of a variable describes how the observations fall (are distributed) across the range of possible values. The modal category is the category with the largest frequency.

A frequency table is a listing of possible values for a variable, together with the number of observations for each value.

Category      A        B        C
Frequency     17       23       9
Proportion    0.347    0.469    0.184
Percentage    34.7%    46.9%    18.4%

*an example of a frequency table*

The proportion of observations falling in a certain category is the number of observations in that category divided by the total number of observations. The percentage is the proportion multiplied by 100. Proportions and percentages are also called relative frequencies.

Proportion = number of observations in the category / total number of observations
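
For illustration, the proportions and percentages in the example frequency table above can be reproduced with a few lines of Python. This is a minimal sketch (not from the book); the category labels and counts are taken from the table:

    # Frequencies from the example table above
    frequencies = {"A": 17, "B": 23, "C": 9}
    total = sum(frequencies.values())      # 49 observations in total

    for category, count in frequencies.items():
        proportion = count / total         # relative frequency
        percentage = 100 * proportion
        print(f"{category}: proportion = {proportion:.3f}, percentage = {percentage:.1f}%")
        # prints 0.347 / 34.7%, 0.469 / 46.9%, 0.184 / 18.4%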

GRAPHICAL SUMMARIES OF DATA
The two primary graphical displays for summarizing a categorical variable are the pie chart and the bar graph. A bar graph with categories ordered by their frequency is called a Pareto chart. The Pareto Principle states that a small subset of categories often contains most of the observations.
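
A bar graph (or a Pareto chart, with the bars sorted by frequency) for the example frequency table could be drawn as in the sketch below. The code is not from the book; it assumes matplotlib is installed and reuses the hypothetical categories A, B and C:

    import matplotlib.pyplot as plt

    frequencies = {"A": 17, "B": 23, "C": 9}

    # Sort categories by descending frequency to obtain a Pareto chart
    ordered = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    labels = [category for category, _ in ordered]
    counts = [count for _, count in ordered]

    plt.bar(labels, counts)
    plt.xlabel("Category")
    plt.ylabel("Frequency")
    plt.title("Pareto chart of the example frequency table")
    plt.show()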

 

 

There are three common ways of summarizing quantitative variables and visualizing their distribution.

  1. Dot plot
    A dot plot shows a dot for each observation, placed just above the value on the number line for that observation.
  2. Stem-and-leaf plot
    A stem-and-leaf plot represents each observation by a stem and a leaf. The stem usually consists of all the digits except for the final one, which is the leaf. It is possible to truncate the data values: cut off the final digit instead of rounding it.
  3. Histogram
    A histogram is a graph that uses bars to portray the frequencies or the relative frequencies of the possible outcomes for a quantitative variable. A histogram can be unimodal or bimodal: if the distribution has a single mound or peak it is called unimodal, if it has two distinct mounds or peaks it is called bimodal.

It is wise to always plot a histogram when summarizing the data. If the number of observations is small (fewer than about 50), the histogram should be supplemented with a stem-and-leaf plot or a dot plot to show the numerical values of the observations. A unimodal distribution can be symmetric or skewed. If it is skewed, it can be skewed either to the right or to the left: the distribution is skewed if one side of the distribution stretches out longer than the other side. If the peak is at the left side (so the longer tail is on the right), the distribution is skewed to the right.
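
As an illustration of the histogram (the third option above), the sketch below draws one for a small, made-up sample of TV-watching hours. It is not from the book and assumes matplotlib is installed:

    import matplotlib.pyplot as plt

    # Hypothetical sample: hours of TV watched per day by 20 people
    hours = [0, 0.5, 1, 1, 1.5, 2, 2, 2, 2.5, 3,
             3, 3.5, 4, 4, 4.5, 5, 6, 7, 8, 10]

    # With fewer than about 50 observations, also consider a dot plot or stem-and-leaf plot
    plt.hist(hours, bins=6, edgecolor="black")
    plt.xlabel("Hours of TV per day")
    plt.ylabel("Frequency")
    plt.title("Histogram of a small, right-skewed sample")
    plt.show()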

A data set collected over time is called a time series. A common pattern to look for is a trend over time, indicating a tendency of the data to either rise or fall. Time series can be displayed in either a time plot or a bar graph.

The mean is the sum of observations divided by the number of observations. It is interpreted as the balance point of the distribution. The median is the middle value of the observations when observations are ordered from smallest to largest. Here are some basic properties of the mean:

  • The mean is the balance point of data.
  • The mean is often not equal to any value that was observed in the sample.
  • For a skewed distribution, the mean is pulled in the direction of the longer tail, relative to the median.
  • The mean can be highly influenced by an outlier: an unusually small or unusually large observation.

The mean and the median can be compared. The shape of a distribution influences whether the mean is larger or smaller than the median.

  • If the distribution is perfectly symmetric, the mean equals the median.
  • If the distribution is skewed to the left, the mean is smaller than the median.
  • If the distribution is skewed to the right, the mean is larger than the median.

A numerical summary of the observations is called resistant if extreme observations have little, if any, influence on its value. The median is resistant, the mean is not. If a distribution is highly skewed, the median is usually preferred over the mean. If the distribution is close to symmetric or only mildly skewed, the mean is usually preferred over the median.
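
The effect of an outlier on the mean and the median can be checked directly. The sketch below (not from the book) uses Python's statistics module and made-up values:

    from statistics import mean, median

    incomes = [20, 22, 25, 27, 30]          # hypothetical values (e.g. in thousands)
    with_outlier = incomes + [200]          # add one unusually large observation

    print(mean(incomes), median(incomes))             # mean 24.8, median 25
    print(mean(with_outlier), median(with_outlier))   # mean jumps to 54, median only to 26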

The mode is the value that occurs most frequently. The mode is most often used with categorical variables. With continuous observations there may be no mode, because no value need occur more than once.

MEASURING THE VARIABILITY OF QUANTITATIVE DATA
The deviation of an observation x from the mean is the difference between the observation and the sample mean, x − x̄. The sum of the deviations always equals zero. The (approximate) average of the squared deviations is called the variance: the sum of squared deviations is divided by n − 1 rather than by the sample size n. The square root of the variance is called the standard deviation. It represents a typical distance, a type of average distance, of an observation from the mean. The greater the standard deviation ‘s’, the greater the variability in the data. ‘s’ can only be 0 when all the observations take the same value.

The standard deviation: s = √( Σ(x − x̄)² / (n − 1) )

This means: the square root of (the sum of squared deviations divided by the sample size minus 1).
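
The same calculation written out step by step in Python (a sketch with made-up observations; statistics.stdev gives the identical result):

    from math import sqrt
    from statistics import stdev

    x = [2, 4, 4, 4, 5, 5, 7, 9]                 # hypothetical observations
    n = len(x)
    x_bar = sum(x) / n                           # sample mean (here 5.0)

    deviations = [value - x_bar for value in x]  # these sum to zero
    sum_sq = sum(d ** 2 for d in deviations)     # sum of squared deviations
    s = sqrt(sum_sq / (n - 1))                   # sample standard deviation

    print(round(s, 3), round(stdev(x), 3))       # both print 2.138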

The mean and median describe the centre of the distribution. The standard deviation and the range describe the variability of the distribution.

USING MEASURES OF POSITION TO DESCRIBE VARIABILITY
The median is a special case of a more general set of measures of position called percentiles. The pth percentile is a value such that p percent of the observations fall below or at that value. Three useful percentiles are the quartiles. (1st quartile: p = 25, 2nd quartile: p = 50, the median, 3rd quartile: p = 75)

The quartiles are also used to define a measure of variability that is more resistant than the range and the standard deviation. The distance from Q1 to Q3 is called the interquartile range. It is possible to identify possible outliers using the interquartile range. An observation is a potential outlier if the observation falls more than 1.5 x IQR below the first quartile or more than 1.5 x IQR above the third quartile.
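
A sketch of the 1.5 × IQR rule in Python (the data are made up, numpy is assumed to be available, and note that software packages may use slightly different conventions for computing quartiles):

    import numpy as np

    x = np.array([3, 5, 6, 7, 8, 9, 10, 11, 12, 30])   # hypothetical data with one large value

    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr          # below this: potential outlier
    upper = q3 + 1.5 * iqr          # above this: potential outlier

    outliers = x[(x < lower) | (x > upper)]
    print(q1, q3, iqr, outliers)    # flags 30 as a potential outlier here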

The five-number summary (minimum, first quartile, median, third quartile, maximum) is the basis of a graphical display called the box plot. The box of a box plot contains the central 50% of the distribution, from the first quartile to the third quartile.

A box plot does not portray certain features of a distribution, such as distinct mounds and possible gaps, as clearly as a histogram does. Box plots are useful for identifying potential outliers. Side-by-side box plots are useful in comparing data, as it shows differences in centres, potential outliers and the variability.
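
Side-by-side box plots can be produced with matplotlib's boxplot function. A minimal sketch (not from the book) with two hypothetical groups:

    import matplotlib.pyplot as plt

    group1 = [4, 5, 5, 6, 7, 7, 8, 9]         # hypothetical data
    group2 = [2, 3, 5, 6, 8, 10, 12, 25]      # more spread out, with one large value

    plt.boxplot([group1, group2])
    plt.xticks([1, 2], ["Group 1", "Group 2"])
    plt.ylabel("Observed value")
    plt.title("Side-by-side box plots")
    plt.show()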

The z-score of an observation is the number of standard deviations that the observation falls from the mean: z = (x − x̄) / s.
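
For example, with a (hypothetical) mean of 70 and standard deviation of 10, an observation of 85 has z = (85 − 70) / 10 = 1.5: it lies 1.5 standard deviations above the mean.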

RECOGNIZING AND AVOIDING MISUSES OF GRAPHICAL SUMMARIES
The following things are useful when constructing a graph:

  • Label both axes and provide a heading to make clear what the graph is intended to portray
  • The vertical axis usually starts at 0
  • Make sure the relative sizes in the graph correctly reflect the percentages or frequencies they represent
  • Sometimes it is useful to use more than one graph when the quantities being compared differ greatly in relative size

 
