
      Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)

      What are statistical methods? – Chapter 1


      1.1 What is statistics and how can you learn it?

      Statistics is used more and more to study the behavior of people, not only in the social sciences but also by companies. Everyone can learn to use statistics, even without much mathematical background and even with a fear of statistics. What matters most are logical thinking and perseverance.

      The first step in using statistical methods is collecting data. Data are collected observations of characteristics of interest, for instance the opinions of 1,000 people on whether marijuana should be legal. Data can be obtained through questionnaires, experiments, observations or existing databases.

      But statistics is more than the numbers obtained from data. A broader definition of statistics encompasses all methods for obtaining and analyzing data.

      1.2 What is the difference between descriptive and inferential statistics?

      Before data can be analyzed, a design is made for how to obtain it. There are two sorts of statistical analysis: descriptive statistics and inferential statistics. Descriptive statistics summarizes the information in a collection of data, so the data is easier to interpret. Inferential statistics makes predictions with the help of data. Which kind is used depends on the goal of the research (to summarize or to predict).

      To understand the difference better, a number of basic terms are important. The subjects are the entities observed in a research study, most often people but sometimes families, schools, cities, etc. The population is the entire set of subjects that you want to study (for instance, foreign students). The sample is a limited number of selected subjects on which you collect data (for instance, 100 foreign students from several universities). The ultimate goal is to learn about the population, but because it is usually impossible to study the entire population, a sample is drawn.

      Descriptive statistics can be used both when data is available for the entire population and when it is available only for a sample. Inferential statistics applies only to samples, because it draws conclusions about something not directly observed. Hence the definition of inferential statistics: making predictions about a population based on data gathered from a sample.

      The goal of statistics is to learn about a parameter. A parameter is a numerical summary of the population: an unknown value that says something about the population as a whole. So it concerns the population, not the sample. This is why an important part of

      ...

      Which kinds of samples and variables are possible? – Chapter 2


      2.1 Which kinds of variables can be measured?

      All characteristics of a subject that can be measured are variables. These characteristics can vary between subjects within a sample or a population (like income, sex or opinion). A variable captures the variability of a characteristic; an example is the number of beers consumed per week by students. The values a variable can take constitute its measurement scale. Several measurement scales, or ways to distinguish variables, are possible.

      The most important divide is between quantitative and categorical variables. Quantitative variables are measured in numerical values, such as age, number of brothers and sisters, or income. Categorical variables (also called qualitative variables) are measured in categories, such as sex, marital status or religion. The measurement scales are tied to statistical analyses: for quantitative variables it is possible to calculate a mean (e.g. the average age), but for categorical variables it is not (there is no average sex).

      There are also four measurement scales: nominal, ordinal, interval and ratio. Categorical variables have nominal or ordinal scales.

      The nominal scale is purely descriptive. For instance with sex as a variable, the possible values are man and woman. There is no order or hierarchy, one value isn't higher than the other.

      The ordinal scale, on the other hand, assumes a certain order. Take happiness: if the possible values are unhappy, considerably unhappy, neutral, considerably happy and ecstatic, then there is a clear order. A respondent who indicates being neutral is happier than one who is considerably unhappy, who in turn is happier than one who is unhappy. Importantly, the distances between the values cannot be measured; this is the difference between ordinal and interval scales.

      Quantitative variables have an interval or ratio scale. Interval means that there are measurable differences between the values, for instance temperature in Celsius. There is an order (30 degrees is more than 20) and the difference is clearly measurable and consistent.

      The difference between interval and ratio is that a ratio scale has a true zero point: zero means the absence of the quantity, so ratios of values are meaningful. An interval scale, like temperature in Celsius, lacks such a true zero. The ratio scale thus has numerical values, with a certain order, with measurable differences and with a meaningful zero. Examples are percentage and income.

      Furthermore, there are discrete and continuous variables. A variable is discrete when its possible values form a limited set of separate numbers. A variable is continuous when it can take any value in an interval. For instance, the number of brothers and sisters is discrete, because it is not possible to have 2.43 brothers or sisters. And for instance

      ...

      What are the main measures and graphs of descriptive statistics? - Chapter 3


      3.1 Which tables and graphs display data?

      Descriptive statistics serves to create an overview or summary of data. There are two kinds of data, quantitative and categorical, and each has its own descriptive statistics.

      To create an overview of categorical data, it is easiest to list the categories together with the frequency of each category. To compare the categories, the relative frequencies are listed too. The relative frequency of a category shows what share of the sample falls in that category. It can be expressed as a proportion or a percentage. The percentage is the number of observations in a category, divided by the total number of observations, multiplied by 100. A proportion is calculated the same way, except the result is not multiplied by 100. The sum of all proportions should be 1.00; the sum of all percentages should be 100.

      Frequencies can be shown using a frequency distribution: a list of all possible values of a variable and the number of observations for each value. A relative frequency distribution also lists the proportion or percentage for each value.

      Example (relative) frequency distribution:

      Gender     Frequency   Proportion   Percentage
      Male       150         0.43         43%
      Female     200         0.57         57%
      Total      350 (= n)   1.00         100%
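      The calculations above can be sketched in a few lines of Python, using the (hypothetical) gender counts from the example table:

```python
# Relative frequency distribution from raw counts (made-up numbers
# matching the example table above).
counts = {"Male": 150, "Female": 200}
n = sum(counts.values())  # total sample size, here 350

proportions = {cat: freq / n for cat, freq in counts.items()}
percentages = {cat: p * 100 for cat, p in proportions.items()}

for cat in counts:
    print(f"{cat}: {counts[cat]} ({proportions[cat]:.2f}, {percentages[cat]:.0f}%)")

# Proportions sum to 1.00 and percentages to 100, as they should.
assert abs(sum(proportions.values()) - 1.0) < 1e-9
assert abs(sum(percentages.values()) - 100.0) < 1e-9
```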

      Aside from tables, other visual displays are used as well, such as bar graphs, pie charts, histograms and stem-and-leaf plots.

      A bar graph is used for categorical variables and uses a bar for each category. The bars are separated to indicate that the graph doesn't display quantitative variables but categorical variables.

      A pie chart is also used for categorical variables. Each slice represents a category. When the values are close together, bar graphs show the differences more clearly than pie charts.

      Frequency distributions and other visual displays are also used for quantitative variables. In that case, the categories are replaced by intervals. Each interval has a frequency, a proportion and a percentage.

      A histogram is a graph of the frequency distribution for a quantitative variable. Each value is represented by a bar, except when there are many values, then

      ...

      What role do probability distributions play in statistical inference? – Chapter 4


      4.1 What are the basic rules of probability?

      Randomization is important for collecting data: the possible observations are known, but which one will occur is not. What happens depends on probability. The probability of an outcome is the proportion of times that the outcome occurs in a long sequence of similar observations. That the sequence is long is important: the longer the sequence, the more accurate the probability, and the more the sample proportion resembles the population proportion. Probabilities can also be expressed as percentages (such as 70%) instead of proportions (such as 0.7). A specific branch of statistics, called Bayesian statistics, deals with subjective probabilities; most of statistics, however, is about regular probabilities.

      A probability is written as P(A), where P stands for probability and A is an outcome. If only two outcomes A and B are possible and they exclude each other, then the chance that B happens is P(B) = 1 − P(A).

      Imagine research into people's favorite colors, say red and blue. Again the assumption is that the possibilities exclude each other, without overlap. The chance that someone's favorite color is red (A) or blue (B) is P(A or B) = P(A) + P(B).

      Next, imagine research with multiple questions, for example how many married people have kids. Then you multiply the chance that someone is married (A) with the chance that someone has kids (B) given that they are married. The formula is: P(A and B) = P(A) × P(B | A). Because B depends on A, P(B | A) is called a conditional probability.

      Now, imagine researching multiple possibilities that are not connected. The chance that one random person likes to wear sweaters (A) and that another random person likes to wear sweaters (B) is P(A and B) = P(A) × P(B). These are independent probabilities.
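      A quick numerical sketch of the three rules above (all probability values are made up for illustration):

```python
# Addition rule for mutually exclusive outcomes: P(A or B) = P(A) + P(B).
p_red, p_blue = 0.30, 0.25
p_red_or_blue = p_red + p_blue            # 0.55

# Conditional probability: P(A and B) = P(A) * P(B | A).
p_married = 0.60
p_kids_given_married = 0.80
p_married_and_kids = p_married * p_kids_given_married   # 0.48

# Independent events: P(A and B) = P(A) * P(B).
p_sweater_1, p_sweater_2 = 0.40, 0.40
p_both_sweaters = p_sweater_1 * p_sweater_2             # 0.16

print(p_red_or_blue, p_married_and_kids, p_both_sweaters)
```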

      4.2 What is the difference in probability distributions for discrete and continuous variables?

      A random variable means that the outcome differs for each observation, but mostly this is just referred to as a variable. While a discrete variable has set possible values,

      ...

      How can you make estimates for statistical inference? – Chapter 5


      5.1 How do you make point estimates and interval estimates?

      Sample data is used for estimating parameters that give information about the population, such as proportions and means. For quantitative variables the population mean is estimated (like how much money on average is spent on medicine in a certain year). For categorical variables the population proportions are estimated for the categories (like how many people do and don't have medical insurance in a certain year).

      Two kinds of parameter estimates exist:

      • A point estimate is a number that is the best prediction.

      • An interval estimate is an interval surrounding a point estimate, which you think contains the population parameter.

      There is a difference between an estimator (the method by which estimates are made) and a point estimate (the estimated number itself). For instance, the sample proportion is an estimator of the population proportion, and 0.73 is a point estimate of the population proportion that believes in love at first sight.

      A good estimator has a sampling distribution that is centered around the parameter and that has a standard error as small as possible.

      An estimator is unbiased when its sampling distribution is centered around the parameter. This is the case for the sample mean: the mean of its sampling distribution equals the population mean, so ȳ (the sample mean) is regarded a good estimator of µ (the population mean).

      When an estimator is biased, it does not estimate the parameter well. The sample variability, for example, tends to underestimate the population variability: a sample rarely contains the most extreme values of the population, so its spread is usually smaller.

      An estimator should also have a small standard error. An estimator is called efficient when its standard error is smaller than that of other estimators. Take a normal distribution: the standard error of the sample median is about 25% bigger than that of the sample mean, so the sample mean tends to be closer to the population mean than the sample median is. The sample mean is then the more efficient estimator.
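      This efficiency claim can be checked with a small simulation. The sketch below assumes a standard normal population; the sample size and number of repetitions are arbitrary choices:

```python
import random
import statistics

# Draw many samples from a standard normal population and compare how
# much the sample mean and the sample median vary across samples.
random.seed(42)
means, medians = [], []
for _ in range(2000):
    sample = [random.gauss(0, 1) for _ in range(100)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

se_mean = statistics.stdev(means)      # spread of the mean across samples
se_median = statistics.stdev(medians)  # spread of the median across samples

# For normal data the ratio is roughly 1.25: the median's standard
# error is about 25% larger, so the mean is the more efficient estimator.
print(se_median / se_mean)
```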

      A good estimator is unbiased (meaning the sampling distribution is centered around the parameter) and efficient (meaning it has the smallest standard error).

      Usually the sample mean serves as an estimator for the population mean, the sample standard deviation

      ...

      How do you perform significance tests? – Chapter 6


      6.1 What are the five components of a significance test?

      A hypothesis is a prediction that a parameter in the population has a certain value or falls within a certain interval. Two kinds of hypotheses are distinguished. The null hypothesis (H0) is the assumption that the parameter takes a certain value. Opposite it stands the alternative hypothesis (Ha), the assumption that the parameter falls in a range outside that value. Usually the null hypothesis represents no effect. A significance test (also called a hypothesis test, or simply a test) determines whether enough evidence exists to support the alternative hypothesis. It does so by comparing point estimates of parameters with the values expected under the null hypothesis.

      Significance tests consist of five parts:

      • Assumptions. Each test makes assumptions about the type of data (quantitative/categorical), the required level of randomization, the population distribution (for instance the normal distribution) and the sample size.

      • Hypotheses. Each test has a null hypothesis and an alternative hypothesis.

      • Test statistic. This indicates how far the estimate lies from the parameter value of H0. Often, this is shown by the number of standard errors between the estimate and the value of H0.

      • P-value. This gives the weight of evidence against H0. The smaller the P-value is, the more evidence that H0 is incorrect and that Ha is correct.

      • Conclusion. This is an interpretation of the P-value and a decision on whether H0 should be rejected or not.
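      As a sketch, the five components for a test about a mean might look as follows. The data, the null value µ0 and the z-based p-value approximation are illustrative assumptions, not taken from the text (for small samples a t distribution would be more appropriate):

```python
import math
import statistics

# Hypothetical sample and hypotheses H0: mu = mu0 vs Ha: mu != mu0.
data = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7, 5.4, 5.0]
mu0 = 5.0

n = len(data)
ybar = statistics.mean(data)                     # point estimate of mu
se = statistics.stdev(data) / math.sqrt(n)       # estimated standard error

# Test statistic: number of standard errors between estimate and mu0.
z = (ybar - mu0) / se

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(z, p_value)  # a small p-value would be evidence against H0
```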

      6.2 How do you perform a significance test for a mean?

      Significance tests for quantitative variables usually concern the population mean µ. The five parts of a significance test come into play here.

      It is assumed that the data come from a random sample and that the population distribution is normal.

      The test is two-sided, meaning that the alternative hypothesis contains values on both sides of the null value. Usually the null hypothesis is H0: µ = µ0, in which µ0 is a particular value of the population mean. This hypothesis says that there is no effect. The alternative hypothesis then contains all other values and looks

      ...

      How do you compare two groups in statistics? - Chapter 7


      7.1 What are the basic rules for comparing two groups?

      In social science two groups are often compared: means for quantitative variables, proportions for categorical variables. When comparing two groups, a binary variable is used: a variable with two categories (also called dichotomous), for instance sex with the categories men and women. This is an example of bivariate statistics.

      Two groups can be dependent or independent. They are dependent when the respondents naturally match with each other. An example is longitudinal research, where the same group is measured at two moments in time. For an independent sample the groups don't match, for instance in cross-sectional research, where people are randomly selected from the population.

      Imagine comparing two independent groups, men and women, on the time they spend sleeping. Men and women are two different groups, with two population means, two estimates and two standard errors. The standard error indicates how much a mean varies from sample to sample. Because we want to investigate the difference, this difference has a standard error of its own. The population difference µ₂ – µ₁ is estimated by the sample difference ȳ₂ – ȳ₁. This can be shown in a sampling distribution: the standard error of ȳ₂ – ȳ₁ indicates how much the difference varies between samples. The formula is:

      Estimated standard error = √(se₁² + se₂²)

      In this case se₁ is the standard error of group 1 (men) and se₂ the standard error of group 2 (women).
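      In code, with made-up standard errors for the two groups:

```python
import math

# Estimated standard error of the difference between two sample means:
# se = sqrt(se1**2 + se2**2). The values below are invented.
se1 = 0.3  # standard error of group 1 (men)
se2 = 0.4  # standard error of group 2 (women)

se_diff = math.sqrt(se1**2 + se2**2)
print(se_diff)  # about 0.5: larger than either group's own standard error
```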

      Instead of the difference, the ratio can also be reported. This is especially useful for very small proportions.

      7.2 How do you compare two proportions of categorical data?

      The difference between the proportions of two populations (π₂ – π₁) is estimated by the difference between the sample proportions. When the samples are large, the standard error of this estimate is small.

      The confidence interval is the point estimate of the difference ± a z-score multiplied by the standard error. The formula for the group difference is:

      (π̂₂ – π̂₁) ± z · se, in which se = √( π̂₁(1 – π̂₁)/n₁ + π̂₂(1 – π̂₂)/n₂ )

      When

      ...

      How do you analyze the association between categorical variables? – Chapter 8


      8.1 How do you create and interpret a contingency table?

      A contingency table contains the outcomes of all possible combinations of categorical data. A 4x5 contingency table has 4 rows and 5 columns. It often shows percentages; this is called relative data.

      A conditional distribution shows the data as percentages of a subtotal, conditional on a certain value, like the percentage of women that have a cold. A marginal distribution contains the row or column totals separately. A joint (simultaneous) distribution shows the percentages with respect to the entire sample.

      Two categorical variables are statistically independent when the probability that one occurs is unrelated to the probability that the other occurs. So this is when the probability distribution of one variable is not influenced by the outcome of the other variable. If this does happen, they are statistically dependent.

      8.2 What is a chi-squared test?

      Independence and dependence are statements about the variables in the population. The sample will probably be distributed similarly, but not necessarily; the variability can be high. A significance test tells whether it is plausible that the variables really are independent in the population. The hypotheses for this test are:

      H0: the variables are statistically independent

      Ha: the variables are statistically dependent

      A cell in a contingency table shows the observed frequency (fo), the number of times an observation is made. The expected frequency (fe) is the count expected if the null hypothesis is true, i.e. if the variables are independent. The expected frequency of a cell is calculated by multiplying its row total by its column total and dividing by the sample size.

      A significance test for independence uses a special test statistic, X², which says how close the expected frequencies are to the observed frequencies. The test is called the chi-squared test (of independence). The formula is:

      X² = Σ (fo – fe)² / fe

      This method was developed by Karl Pearson. When X² is small, the expected and observed frequencies are close together; the bigger X² is, the further they are apart. The test statistic thus indicates how much the data deviate from what independence would predict.
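      A sketch of the calculation for a hypothetical 2x2 table (the counts are invented; rows and columns could be, say, gender by opinion):

```python
# Observed frequencies for a made-up 2x2 contingency table.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]        # [50, 50]
col_totals = [sum(col) for col in zip(*observed)]  # [50, 50]
n = sum(row_totals)                                # 100

# X^2 = sum over cells of (fo - fe)^2 / fe, where
# fe = row total * column total / n.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / n     # expected frequency, 25 here
        chi2 += (fo - fe) ** 2 / fe

print(chi2)  # 4.0 for this table
```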

      A binomial distribution shows the probabilities of outcomes of a small sample with categorical discrete variables, like tossing a coin. This is not a distribution of observations or a

      ...

      How do linear regression and correlation work? – Chapter 9


      9.1 What are linear associations?

      Regression analysis is the process of researching associations between quantitative response variables and explanatory variables. It has three aspects: 1) investigating whether an association exists, 2) determining the strength of the association and 3) making a regression equation to predict the value of the response variable using the explanatory variable.

      The response variable is denoted y and the explanatory variable x. A linear function means that a straight line runs through the data points in a graph: y = α + βx, in which alpha (α) is the y-intercept and beta (β) is the slope.

      The x-axis is the horizontal axis and the y-axis is the vertical axis. The origin is the point where x and y are both 0.

      The y-intercept is the value of y when x = 0. In that case βx equals 0 and only y = α remains. The y-intercept is where the line crosses the y-axis.

      The slope (β) indicates the change in y for an increase of 1 in x, so the slope shows how steep the line is: the larger the absolute value of β, the steeper the line.

      When β is positive, then y increases when x increases (a positive relationship). When β is negative, then y decreases when x increases (a negative relationship). When β = 0, the value of y is constant and doesn't change when x changes. This results in a horizontal line and means that the variables are independent.
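      A tiny sketch of how the intercept and slope behave, with made-up values for α and β:

```python
# Linear function y = alpha + beta * x with hypothetical parameters.
alpha, beta = 2.0, 0.5

def predict(x):
    return alpha + beta * x

print(predict(0))               # 2.0: the y-intercept (value of y at x = 0)
print(predict(1) - predict(0))  # 0.5: the slope, change in y per unit of x

# beta > 0 here, so y increases as x increases (a positive relationship).
```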

      A linear function is an example of a model; a simplified approximation of the association between variables in the population. A model can be good or bad. A regression model usually means a model more complex than a linear function.

      9.2 What is the least squares prediction equation?

      In regression analysis α and β are regarded as unknown parameters that can be estimated using the available data. Each value of y is a point in a graph and can be written with its coordinates (x, y). A graph is used as a visual check whether it makes sense to make a linear function. If the data is U-shaped, a straight line doesn't make sense.

      The variable y is estimated by ŷ. The equation is estimated

      ...

      Which types of multivariate relationships exist? – Chapter 10


      10.1 How does causality relate to associations?

      Many scientific studies research more than two variables, requiring multivariate methods. A lot of research is focused on causal relationships between variables, but finding proof of causality is difficult: a relationship that appears causal may be caused by another variable. Statistical control is the method of checking whether an association between variables changes or disappears when the influence of other variables is removed. In a causal relationship, x → y, the explanatory variable x causes the response variable y. This is asymmetrical, because y need not cause x.

      There are three criteria for a causal relationship:

      1. Association between the variables

      2. Appropriate time order

      3. Elimination of alternative explanations

      An association is required for a causal relationship, but an association alone does not establish one. Usually the logical time order is immediately clear, such as an explanatory variable preceding a response variable. Apart from x and y, extra variables may provide an alternative explanation. In observational studies it can almost never be proved that one variable causes another. Sometimes there are outliers or anecdotes that contradict causality, but a single anecdote usually isn't enough to disprove it. It is easier to establish causality with randomized experiments than with observational studies, because randomization assigns subjects to groups at random and fixes the time order before the experiment starts.

      10.2 How do you control whether other variables influence a causal relationship?

      Eliminating alternative explanations is often tricky. A method of testing the influence of other variables is controlling for them: eliminating them or holding them at a constant value. Controlling means making sure the control variables (the other variables) no longer influence the association between x and y. A randomized experiment in a way also controls variables: the subjects are assigned randomly and the other variables manifest themselves randomly across the groups.

      Statistical control is different from experimental control. In statistical control, subjects with certain characteristics are grouped together. Observational studies in social science often form groups based on socio-economic status, education or income.

      The association between two quantitative variables is shown in a scatter plot. Controlling this association for a categorical variable is done by comparing the means.

      The association between two categorical variables is shown in a contingency table. Controlling this association

      What is multiple regression? – Chapter 11


      11.1 What does a multiple regression model look like?

      A multiple regression model has more than one explanatory variable and sometimes also one or more control variables: E(y) = α + β1x1 + β2x2. The explanatory variables are numbered: x1, x2, etc. When an explanatory variable is added, the equation is extended with β2x2. The parameters are α, β1 and β2. The y-axis is vertical, x1 is horizontal and x2 is perpendicular to x1. In this three-dimensional graph the multiple regression equation describes a flat surface, called a plane.
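      The plane E(y) = α + β1x1 + β2x2 can be estimated from data with least squares. A minimal sketch in Python with made-up data (the numbers are only for illustration; the data are generated exactly on a known plane, so least squares should recover its coefficients):

```python
import numpy as np

# Made-up data generated exactly on the plane y = 2 + 0.5*x1 + 1.5*x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 2.0 + 0.5 * x1 + 1.5 * x2

# Design matrix: a column of ones for the intercept alpha, then x1 and x2
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates a, b1, b2 of alpha, beta1, beta2
a, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
```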

      A partial regression equation describes only part of the possible observations, only those with a certain value.

      In multiple regression a coefficient indicates the effect of an explanatory variable on a response variable, while controlling for the other variables. Bivariate regression completely ignores the other variables; multiple regression holds them constant. This is the basic difference between bivariate and multiple regression. The coefficient (like β1) of a predictor (like x1) tells what the change in the mean of y is when the predictor is raised by one unit, controlling for the other variables (like x2). In that case, β1 is a partial regression coefficient. The parameter α is the mean of y when all explanatory variables are 0.

      The multiple regression model has its limitations. An association doesn't automatically mean that there is a causal relationship; there may be other factors. Some researchers are more careful and call statistical control 'adjustment'. The regular multiple regression model assumes that there is no statistical interaction: the slope β of an explanatory variable doesn't depend on the values of the other explanatory variables.

      The multiple regression model that exists in the population is estimated by the prediction equation: ŷ = a + b1x1 + b2x2 + … + bpxp, in which p is the number of explanatory variables.

      Just like the bivariate model, the multiple regression model uses residuals to measure prediction errors. For a predicted response ŷ and a measured response y, the residual is the difference between them: y – ŷ. The SSE (Sum of Squared Errors/Residual Sum of Squares) is the same as for bivariate models: SSE = Σ(y – ŷ)², the only difference is the fact that the estimate ŷ is shaped
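      The residual and SSE computation described above is straightforward once predictions exist; a sketch with made-up observed and predicted values:

```python
# Made-up observed and predicted responses
y     = [3.0, 5.0, 7.0, 9.0]
y_hat = [3.2, 4.7, 7.1, 9.0]

# Residual per observation: y - y_hat
residuals = [obs - pred for obs, pred in zip(y, y_hat)]

# Sum of Squared Errors
sse = sum(r ** 2 for r in residuals)
```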

      What is ANOVA? – Chapter 12


      12.1 How do dummy variables replace categories?

      For analyzing categorical variables without assigning a ranking, dummy variables are an option. This means that artificial variables are created to code the categories:

      z1 = 1 and z2 = 0 : observations of category 1 (men)

      z1 = 0 and z2 = 1 : observations of category 2 (women)

      z1 = 0 and z2 = 0 : observations of category 3 (transgender and other identities)

      The model is: E(y) = α + β1z1 + β2z2. The means are derived from the model: μ1 = α + β1, μ2 = α + β2 and μ3 = α. Three categories only require two dummy variables, because the remaining observations fall in category 3.
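      Dummy coding as in the scheme above can be sketched as follows (the observations are made up; the third category is the reference with z1 = z2 = 0):

```python
# Made-up observations of a three-category variable
categories = ["man", "woman", "other", "man", "other"]

# Two dummy variables suffice for three categories
z1 = [1 if c == "man" else 0 for c in categories]
z2 = [1 if c == "woman" else 0 for c in categories]
# "other" is the reference category: z1 = 0 and z2 = 0
```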

      A significance test using the F-distribution tests whether the means are the same. The null hypothesis H0 : μ1 = μ2 = μ3 is the same as H0 : β1 = β2 = 0. A large F means a small P and much evidence against the null hypothesis.

      The F-test is robust against small violations of normality and differences in the standard deviations. However, it can't handle very skewed data. This is why randomization is important.

      12.2 How do you make multiple comparisons of means?

      A small P doesn't say which means differ or by how much. Confidence intervals give more information. For every mean a confidence interval can be constructed, or for the difference between two means. An estimate of the difference in population means μi – μj is (ȳi – ȳj) ± t · s · √(1/ni + 1/nj), where s is the standard deviation estimate pooled across the groups.

      The degrees of freedom of the t-score are df = N – g, in which g is the number of categories and N is the combined sample size (n1 + n2 + … + ng). When the confidence interval doesn't contain 0, this is proof of difference between the means.

      In case of lots of groups with equal population means, it might happen that a confidence interval finds a difference anyway, due to the increase in errors that comes with the increase in the number of comparisons. Multiple comparison methods control the probability that all intervals of a lot of comparisons contain the real differences. The probability that at least one of the comparisons contains an error is the multiple comparison error rate. One such method is the Bonferroni method, which divides the
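      The Bonferroni idea of dividing the overall error rate over the comparisons can be sketched as follows (g is a hypothetical number of groups):

```python
from math import comb

g = 4                        # hypothetical number of groups
n_comparisons = comb(g, 2)   # number of pairwise comparisons between g means
overall_error_rate = 0.05    # desired multiple comparison error rate

# Bonferroni: each individual interval uses this smaller error rate
per_comparison_error_rate = overall_error_rate / n_comparisons
```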

      How does multiple regression with both quantitative and categorical predictors work? – Chapter 13


      13.1 What do models with both quantitative and categorical predictors look like?

      Multiple regression is also feasible for a combination of quantitative and categorical predictors. In a lot of research it makes sense to control for a quantitative variable. A quantitative control variable is called a covariate and it is studied using analysis of covariance (ANCOVA).

      A graph helps to research the effect of quantitative predictor x on the response y, while controlling for the categorical predictor z. For two categories, a single dummy variable z suffices; for more categories, more dummy variables are required (like z1 and z2). The values of z can be 1 ('agree') or 0 ('don't agree'). If there is no interaction, the lines that fit the data best are parallel and the slopes are the same. It's even possible that the regression lines are exactly the same. But if they aren't parallel, there is interaction.

      The predictor can be quantitative and the control variable can be categorical, but it can also be the other way around. Software compares the means. A regression model with three categories is: E(y) = α + βx + β1z1 + β2z2, in which β is the effect of x on y for all groups z. For every additional quantitative variable a βx term is added. For every additional categorical variable a dummy variable is added (or several, depending on the number of categories). Cross-product terms are added in case of interaction.
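      The no-interaction model above produces parallel regression lines: the difference between two groups is the same at every value of x. A sketch with made-up coefficients:

```python
# Hypothetical coefficients for E(y) = alpha + beta*x + beta1*z1 + beta2*z2
alpha, beta, beta1, beta2 = 2.0, 0.5, 1.0, -0.5

def mean_response(x, z1, z2):
    # same slope beta for every group: the no-interaction assumption
    return alpha + beta * x + beta1 * z1 + beta2 * z2

# Group 1 (z1=1) vs reference group (z1=z2=0): the gap equals beta1 at any x
gap_at_0 = mean_response(0, 1, 0) - mean_response(0, 0, 0)
gap_at_10 = mean_response(10, 1, 0) - mean_response(10, 0, 0)
```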

      13.2 Which inferential methods are available for regression with quantitative and categorical predictors?

      The first step to making predictions is testing whether a model needs to include interaction. An F-test compares a model with cross-product terms to a model without. For this the F-test uses the partial sum of squares: the variability in y that is explained by a certain variable when the other variables are already accounted for. The null hypothesis says that the slopes of the cross-product terms are 0; the alternative hypothesis says that there is interaction. In a graph, interaction shows as regression lines that aren't parallel.

      Another F-test checks whether a complete or a reduced model is better. To compare a complete model (E(y) = α + βx + β1z1 + β2z2) with a reduced model (E(y) =

      How do you make a multiple regression model for extreme or strongly correlating data? – Chapter 14


      14.1 What strategies are available for selecting a model?

      Three basic rules for selecting variables to add to a model are:

      1. Select variables that can answer the theoretical purpose (accepting/rejecting the null hypothesis), with sensible control variables and mediating variables

      2. Add enough variables for a good predictive power

      3. Keep the model simple

      The explanatory variables should be highly correlated with the response variable but not with each other. Software can test and select explanatory variables. Possible strategies are backward elimination, forward selection and stepwise regression. In backward elimination all candidate variables are added, tested for their P-value, and the least significant variables are removed one by one until only significant variables remain. Forward selection starts from scratch, at each step adding the variable with the lowest P-value. Another version of this is stepwise regression; this method removes variables that have become redundant when new variables are added.
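      Backward elimination can be sketched on a toy set of P-values. This shows only the selection logic; in a real analysis the model is refitted and the P-values recomputed after every removal:

```python
# Hypothetical P-values per explanatory variable
p_values = {"x1": 0.01, "x2": 0.47, "x3": 0.03, "x4": 0.21}
threshold = 0.05

selected = dict(p_values)
while selected:
    worst = max(selected, key=selected.get)  # least significant variable
    if selected[worst] <= threshold:
        break                                # everything left is significant
    del selected[worst]                      # drop it and continue
```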

      Software helps but it's up to the researcher to think and make choices. It also matters whether research is explanatory, starting with a theoretical model with known variables, or whether research is exploratory, openly looking for explanations of a phenomenon.

      Several criteria are indications of a good model. To find a model with high predictive power but without an overabundance of variables, the adjusted R² is used: R²adj = 1 – (1 – R²)(n – 1)/(n – p – 1), where n is the sample size and p the number of explanatory variables.

      The adjusted R2 decreases when an unnecessary variable is added.

      Cross-validation continuously checks whether the predicted values are as close as possible to the observed values. The result is the predicted residual sum of squares (PRESS): PRESS = Σ(yi – ŷ(i))², where ŷ(i) is the prediction for observation i from the model fitted without observation i.

      If PRESS decreases, the predictions get better. However, this test assumes a normal distribution. A method that can handle other distributions is the Akaike information criterion (AIC), which selects the model in which ŷi is as close as possible to E(yi). If AIC decreases, the predictions get better.
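      The adjusted R² penalty for extra variables can be illustrated with the standard formula (the numbers are made up):

```python
def adjusted_r2(r2, n, p):
    # n = sample size, p = number of explanatory variables
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R-squared, but more variables lowers the adjusted value
few_vars = adjusted_r2(0.80, n=50, p=3)
many_vars = adjusted_r2(0.80, n=50, p=10)
```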

      14.2 How can you tell when a statistical model doesn't fit?

      Inference of parameters in a regression model has the following assumptions:

      • The model fits the shape of the data

      • The conditional distribution of y is normal

      • The standard deviation is constant in the range of values of the explanatory variables (this is called homoscedasticity)

      What is logistic regression? – Chapter 15


      15.1 What are the basics of logistic regression?

      A logistic regression model is a model with a binary response variable (like 'agree' or 'don't agree'). It's also possible for logistic regression models to have ordinal or nominal response variables. The mean is the proportion of responses that are 1. The linear probability model is P(y=1) = α + βx. This model often is too simple; a more extended version lets the probability follow an S-shaped curve instead of a straight line.

      The logarithm can be calculated using software. The odds are: P(y=1)/[1 – P(y=1)]. The log of the odds, or logistic transformation (abbreviated as logit), is the logistic regression model: logit[P(y=1)] = α + βx.

      To find the outcome for a certain value of a predictor, the logit is converted back to a probability: P(y=1) = e^(α + βx) / (1 + e^(α + βx)). Raising e to a certain power gives the antilog of that number.
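      The logit and its inverse can be sketched with made-up coefficients a and b:

```python
from math import exp, log

a, b = -2.0, 0.8  # hypothetical estimates of alpha and beta

def prob(x):
    # inverse of the logit: P(y=1) = e^(a+bx) / (1 + e^(a+bx))
    return exp(a + b * x) / (1 + exp(a + b * x))

def logit(p):
    # log odds
    return log(p / (1 - p))

# logit(prob(x)) recovers the linear predictor a + b*x
```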

      A straight line is drawn next to the curve of a logistic graph to analyze it. The curve is steepest where P(y=1) = ½. For logistic regression the maximum likelihood method is used instead of the least squares method. The model expressed in odds is: P(y=1)/[1 – P(y=1)] = e^(α + βx) = e^α (e^β)^x. The estimate replaces α and β with a and b. With this the odds ratio can be calculated: e^b is the multiplicative change in the odds when x increases by one unit.

      There are two ways to present the data. For ungrouped data each row contains the data of a single subject. For grouped data a row contains the count for a cell, for instance one row with the number of subjects that agreed, followed by the total number of subjects.

      An alternative to the logit is the probit. This link assumes a hidden, underlying continuous variable y*: the observed response is 1 when y* exceeds a certain threshold T and 0 when it stays below T. Because y* is hidden, it's called a latent variable. It can be used to make a probit model: probit[P(y=1)] = α + βx.

      Logistic regression with repeated measures and random effects is analyzed with a linear mixed model: logit[P(yij = 1)] = α + βxij + si.

      15.2 What does multiple logistic regression look like?

      The multiple logistic regression model is: logit[P(y = 1)] = α + β1x1 + … + βpxp. The further βi is from 0, the stronger

      Selected contributions for Introduction to Statistics

      What are statistical methods? – Chapter 1


      1.1 What is statistics and how can you learn it?

      Statistics is used more and more often to study the behavior of people, not only by the social sciences but also by companies. Everyone can learn how to use statistics, even without much knowledge of mathematics and even with fear of statistics. Most important are logical thinking and perseverance.

      The first step to using statistical methods is collecting data. Data are collected observations of characteristics of interest, for instance the opinion of 1000 people on whether marijuana should be allowed. Data can be obtained through questionnaires, experiments, observations or existing databases.

      But statistics aren't only numbers obtained from data. A broader definition of statistics entails all methods to obtain and analyze data.

      1.2 What is the difference between descriptive and inferential statistics?

      Before being able to analyze data, a design is made for how to obtain the data. Next there are two sorts of statistical analyses: descriptive statistics and inferential statistics. Descriptive statistics summarizes the information obtained from a collection of data, so the data is easier to interpret. Inferential statistics makes predictions with the help of data. Which kind of statistics is used depends on the goal of the research (summarizing or predicting).

      To understand the differences better, a number of basic terms are important. The subjects are the entities that are observed in a research study, most often people but sometimes families, schools, cities, etc. The population is the whole set of subjects that you want to study (for instance foreign students). The sample is a limited number of selected subjects on which you will collect data (for instance 100 foreign students from several universities). The ultimate goal is to learn about the population, but because it's often impossible to research the entire population, a sample is drawn.

      Descriptive statistics can be used both when data is available for the entire population and when it's only available for a sample. Inferential statistics is only applicable to samples, because it generalizes beyond the observed data. Hence the definition of inferential statistics: making predictions about a population, based on data gathered from a sample.

      The goal of statistics is to learn more about the parameter. A parameter is a numerical summary of the population: an unknown value that says something about the population as a whole. So it's not about the sample but about the population. This is why an important part of

      What are the main measures and graphs of descriptive statistics? - Chapter 3


      3.1 Which tables and graphs display data?

      Descriptive statistics serves to create an overview or summary of data. There are two kinds of data, quantitative and categorical; each has its own descriptive statistics.

      To create an overview of categorical data, it's easiest if the categories are in a list including the frequency of each category. To compare the categories, the relative frequencies are listed too. The relative frequency of a category shows how often a subject falls within this category relative to the whole sample. It can be calculated as a percentage or a proportion. The percentage is the number of observations within a certain category, divided by the total number of observations, multiplied by 100. Calculating a proportion works the same way, except that the number isn't multiplied by 100. The sum of all proportions should be 1.00; the sum of all percentages should be 100.

      Frequencies can be shown using a frequency distribution: a list of all possible values of a variable and the number of observations for each value. A relative frequency distribution also shows each value's share of the sample.

      Example (relative) frequency distribution:

      Gender    Frequency    Proportion    Percentage
      Male      150          0.43          43%
      Female    200          0.57          57%
      Total     350 (= n)    1.00          100%
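      The proportions and percentages in this example can be reproduced directly from the counts:

```python
# Counts from the example frequency distribution
counts = {"Male": 150, "Female": 200}
n = sum(counts.values())  # sample size

# Proportion: count / n; percentage: proportion * 100
proportions = {k: round(v / n, 2) for k, v in counts.items()}
percentages = {k: round(100 * v / n) for k, v in counts.items()}
```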

      Aside from tables, other visual displays are used as well, such as bar graphs, pie charts, histograms and stem-and-leaf plots.

      A bar graph is used for categorical variables and uses a bar for each category. The bars are separated to indicate that the graph doesn't display quantitative variables but categorical variables.

      A pie chart is also used for categorical variables. Each slice represents a category. When the values are close together, bar graphs show the differences more clearly than pie charts.

      Frequency distributions and other visual displays are also used for quantitative variables. In that case, the categories are replaced by intervals. Each interval has a frequency, a proportion and a percentage.

      A histogram is a graph of the frequency distribution for a quantitative variable. Each value is represented by a bar, except when there are many values, then


      Selected contributions for Data: distributions, connections and gatherings

      Which kinds of samples and variables are possible? – Chapter 2


      2.1 Which kinds of variables can be measured?

      All characteristics of a subject that can be measured are variables. These characteristics can vary between different subjects within a sample or within a population (like income, sex, opinion). A variable indicates the variability of a value, for example the number of beers consumed per week by students. The values of a variable constitute the measurement scale. Several measurement scales, or ways to classify variables, are possible.

      The most important divide is that between quantitative and categorical variables. Quantitative variables are measured in numerical values, such as age, number of brothers and sisters, or income. Categorical variables (also called qualitative variables) are measured in categories, such as sex, marital status or religion. The measurement scales are tied to statistical analyses: for quantitative variables it is possible to calculate the mean (i.e. the average age), but for categorical variables this isn't possible (i.e. there is no average sex).

      In addition, there are four measurement scales: nominal, ordinal, interval and ratio. Categorical variables have nominal or ordinal scales.

      The nominal scale is purely descriptive. For instance with sex as a variable, the possible values are man and woman. There is no order or hierarchy, one value isn't higher than the other.

      The ordinal scale on the other hand assumes a certain order. For instance happiness. If the possible values are unhappy, considerably unhappy, neutral, considerably happy and ecstatic, then there is a certain order. If a respondent indicates to be neutral, this is happier than considerably unhappy, which in turn is happier than unhappy. Important is that the distances between the values cannot be measured, this is the difference between ordinal and interval.

      Quantitative variables have an interval or ratio scale. Interval means that there are measurable differences between the values, for instance temperature in Celsius. There is an order (30 degrees is more than 20) and the difference is clearly measurable and consistent.

      The difference between interval and ratio is that an interval scale has no true zero point (zero is just another value, as in degrees Celsius), while for a ratio scale zero really means 'nothing'. So the ratio scale has numerical values, with a certain order, with measurable differences and with a meaningful zero. Examples are percentage or income.

      Furthermore there are discrete and continuous variables. A variable is discrete when the possible values are a limited set of separate numbers. A variable is continuous when any value within a range is possible. For instance, the number of brothers and sisters is discrete, because it's not possible to have 2.43 brothers/sisters. And for instance

      What role do probability distributions play in statistical inference? – Chapter 4


      4.1 What are the basic rules of probability?

      Randomization is important for collecting data: the possible observations are known, but it's yet unknown which possibility will prevail. What will happen depends on probability. The probability is the proportion of times that a certain observation occurs in a long sequence of similar observations. The fact that the sequence is long is important, because the longer the sequence, the more accurate the probability; the sample proportion then approaches the population proportion. Probabilities can also be measured in percentages (such as 70%) instead of proportions (such as 0.7). A specific branch of statistics, called Bayesian statistics, deals with subjective probabilities. However, most of statistics is about regular probabilities.

      A probability is written like P(A), where P stands for probability and A is an outcome. If only two outcomes A and B are possible and they exclude each other, then the probability that B happens is 1 – P(A).

      Imagine research about people's favorite colors, for instance red and blue. Again the assumption is made that the possibilities exclude each other without overlapping. The probability that someone's favorite color is red (A) or blue (B) is P(A or B) = P(A) + P(B).

      Next, imagine research that encompasses multiple questions. The research seeks to investigate how many married people have kids. Then you can multiply the probability that someone is married (A) with the probability that someone has kids (B) given that they are married. The formula for this is: P(A and B) = P(A) × P(B given A). Because the probability of B depends on A, P(B given A) is called a conditional probability.

      Now, imagine researching multiple possibilities that are not connected. The probability that one random person likes to wear sweaters (A) and another random person likes to wear sweaters (B) is P(A and B) = P(A) × P(B). These are independent probabilities.
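      The three rules above can be checked with made-up probabilities:

```python
# Addition rule for mutually exclusive outcomes
p_red, p_blue = 0.30, 0.25
p_red_or_blue = p_red + p_blue            # P(A or B) = P(A) + P(B)

# Multiplication rule with a conditional probability
p_married = 0.60
p_kids_given_married = 0.70
p_married_and_kids = p_married * p_kids_given_married

# Multiplication rule for independent events
p_a, p_b = 0.40, 0.40
p_both = p_a * p_b                        # P(A and B) = P(A) * P(B)
```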

      4.2 What is the difference in probability distributions for discrete and continuous variables?

      A random variable means that the outcome differs for each observation, but mostly it's just referred to as a variable. While a discrete variable has set possible values,

      Follow the author: Annemarie JoHo