
      Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)

      What are statistical methods? – Chapter 1


      1.1 What is statistics and how can you learn it?

      Statistics is used more and more to study the behavior of people, not only in the social sciences but also by companies. Everyone can learn to use statistics, even without much mathematical background and even with a fear of statistics. What matters most are logical thinking and perseverance.

      The first step in using statistical methods is collecting data. Data are collected observations of characteristics of interest, for instance the opinions of 1,000 people on whether marijuana should be legal. Data can be obtained through questionnaires, experiments, observations or existing databases.

      But statistics is more than the numbers obtained from data. A broader definition of statistics encompasses all methods for obtaining and analyzing data.

      1.2 What is the difference between descriptive and inferential statistics?

      Before data can be analyzed, a design is made for how to obtain it. There are two sorts of statistical analysis: descriptive statistics and inferential statistics. Descriptive statistics summarizes the information in a collection of data, so the data is easier to interpret. Inferential statistics makes predictions with the help of data. Which kind is used depends on the goal of the research (to summarize or to predict).

      To understand the difference better, a number of basic terms are important. The subjects are the entities observed in a research study, most often people but sometimes families, schools, cities, etc. The population is the entire set of subjects that you want to study (for instance, foreign students). The sample is a limited number of selected subjects on which you collect data (for instance, 100 foreign students from several universities). The ultimate goal is to learn about the population, but because it is usually impossible to study the entire population, a sample is drawn.

      Descriptive statistics can be used both when data is available for the entire population and when it is available only for a sample. Inferential statistics applies only to samples, because it draws conclusions about something not directly observed. Hence the definition of inferential statistics: making predictions about a population based on data gathered from a sample.

      The goal of statistics is to learn about a parameter. A parameter is a numerical summary of the population: an unknown value that says something about the population as a whole. So it concerns the population, not the sample. This is why an important part of

      ...

      Which kinds of samples and variables are possible? – Chapter 2


      2.1 Which kinds of variables can be measured?

      All characteristics of a subject that can be measured are variables. These characteristics can vary between subjects within a sample or a population (like income, sex or opinion). A variable captures the variability of a characteristic; an example is the number of beers consumed per week by students. The values a variable can take constitute its measurement scale. Several measurement scales, or ways to distinguish variables, are possible.

      The most important divide is between quantitative and categorical variables. Quantitative variables are measured in numerical values, such as age, number of brothers and sisters, or income. Categorical variables (also called qualitative variables) are measured in categories, such as sex, marital status or religion. The measurement scales are tied to statistical analyses: for quantitative variables it is possible to calculate a mean (e.g. the average age), but for categorical variables it is not (there is no average sex).

      There are also four measurement scales: nominal, ordinal, interval and ratio. Categorical variables have nominal or ordinal scales.

      The nominal scale is purely descriptive. For instance with sex as a variable, the possible values are man and woman. There is no order or hierarchy, one value isn't higher than the other.

      The ordinal scale, on the other hand, assumes a certain order. Take happiness: if the possible values are unhappy, considerably unhappy, neutral, considerably happy and ecstatic, then there is a clear order. A respondent who indicates being neutral is happier than one who is considerably unhappy, who in turn is happier than one who is unhappy. Importantly, the distances between the values cannot be measured; this is the difference between ordinal and interval scales.

      Quantitative variables have an interval or ratio scale. Interval means that there are measurable differences between the values, for instance temperature in Celsius. There is an order (30 degrees is more than 20) and the difference is clearly measurable and consistent.

      The difference between interval and ratio is that a ratio scale has a true zero point: zero means the absence of the quantity, so ratios of values are meaningful. An interval scale, like temperature in Celsius, lacks such a true zero. The ratio scale thus has numerical values, with a certain order, with measurable differences and with a meaningful zero. Examples are percentage and income.

      Furthermore, there are discrete and continuous variables. A variable is discrete when its possible values form a limited set of separate numbers. A variable is continuous when it can take any value in an interval. For instance, the number of brothers and sisters is discrete, because it is not possible to have 2.43 brothers or sisters. And for instance

      ...

      What are the main measures and graphs of descriptive statistics? - Chapter 3


      3.1 Which tables and graphs display data?

      Descriptive statistics serves to create an overview or summary of data. There are two kinds of data, quantitative and categorical, and each has its own descriptive statistics.

      To create an overview of categorical data, it is easiest to list the categories together with the frequency of each category. To compare the categories, the relative frequencies are listed too. The relative frequency of a category shows what share of the sample falls in that category. It can be expressed as a proportion or a percentage. The percentage is the number of observations in a category, divided by the total number of observations, multiplied by 100. A proportion is calculated the same way, except the result is not multiplied by 100. The sum of all proportions should be 1.00; the sum of all percentages should be 100.

      Frequencies can be shown using a frequency distribution: a list of all possible values of a variable and the number of observations for each value. A relative frequency distribution also lists the proportion or percentage for each value.

      Example (relative) frequency distribution:

      Gender     Frequency   Proportion   Percentage
      Male       150         0.43         43%
      Female     200         0.57         57%
      Total      350 (= n)   1.00         100%
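      The calculations above can be sketched in a few lines of Python, using the (hypothetical) gender counts from the example table:

```python
# Relative frequency distribution from raw counts (made-up numbers
# matching the example table above).
counts = {"Male": 150, "Female": 200}
n = sum(counts.values())  # total sample size, here 350

proportions = {cat: freq / n for cat, freq in counts.items()}
percentages = {cat: p * 100 for cat, p in proportions.items()}

for cat in counts:
    print(f"{cat}: {counts[cat]} ({proportions[cat]:.2f}, {percentages[cat]:.0f}%)")

# Proportions sum to 1.00 and percentages to 100, as they should.
assert abs(sum(proportions.values()) - 1.0) < 1e-9
assert abs(sum(percentages.values()) - 100.0) < 1e-9
```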

      Aside from tables, other visual displays are used as well, such as bar graphs, pie charts, histograms and stem-and-leaf plots.

      A bar graph is used for categorical variables and uses a bar for each category. The bars are separated to indicate that the graph doesn't display quantitative variables but categorical variables.

      A pie chart is also used for categorical variables. Each slice represents a category. When the values are close together, bar graphs show the differences more clearly than pie charts.

      Frequency distributions and other visual displays are also used for quantitative variables. In that case, the categories are replaced by intervals. Each interval has a frequency, a proportion and a percentage.

      A histogram is a graph of the frequency distribution for a quantitative variable. Each value is represented by a bar, except when there are many values, then

      ...

      What role do probability distributions play in statistical inference? – Chapter 4


      4.1 What are the basic rules of probability?

      Randomization is important for collecting data: the possible observations are known, but which one will occur is not. What happens depends on probability. The probability of an outcome is the proportion of times that the outcome occurs in a long sequence of similar observations. That the sequence is long is important: the longer the sequence, the more accurate the probability, and the more the sample proportion resembles the population proportion. Probabilities can also be expressed as percentages (such as 70%) instead of proportions (such as 0.7). A specific branch of statistics, called Bayesian statistics, deals with subjective probabilities; most of statistics, however, is about regular probabilities.

      A probability is written as P(A), where P stands for probability and A is an outcome. If only two outcomes A and B are possible and they exclude each other, then the chance that B happens is P(B) = 1 − P(A).

      Imagine research into people's favorite colors, say red and blue. Again the assumption is that the possibilities exclude each other, without overlap. The chance that someone's favorite color is red (A) or blue (B) is P(A or B) = P(A) + P(B).

      Next, imagine research with multiple questions, for example how many married people have kids. Then you multiply the chance that someone is married (A) with the chance that someone has kids (B) given that they are married. The formula is: P(A and B) = P(A) × P(B | A). Because B depends on A, P(B | A) is called a conditional probability.

      Now, imagine researching multiple possibilities that are not connected. The chance that one random person likes to wear sweaters (A) and that another random person likes to wear sweaters (B) is P(A and B) = P(A) × P(B). These are independent probabilities.
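      A quick numerical sketch of the three rules above (all probability values are made up for illustration):

```python
# Addition rule for mutually exclusive outcomes: P(A or B) = P(A) + P(B).
p_red, p_blue = 0.30, 0.25
p_red_or_blue = p_red + p_blue            # 0.55

# Conditional probability: P(A and B) = P(A) * P(B | A).
p_married = 0.60
p_kids_given_married = 0.80
p_married_and_kids = p_married * p_kids_given_married   # 0.48

# Independent events: P(A and B) = P(A) * P(B).
p_sweater_1, p_sweater_2 = 0.40, 0.40
p_both_sweaters = p_sweater_1 * p_sweater_2             # 0.16

print(p_red_or_blue, p_married_and_kids, p_both_sweaters)
```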

      4.2 What is the difference in probability distributions for discrete and continuous variables?

      A random variable means that the outcome differs for each observation, but mostly this is just referred to as a variable. While a discrete variable has set possible values,

      ...

      How can you make estimates for statistical inference? – Chapter 5


      5.1 How do you make point estimates and interval estimates?

      Sample data is used for estimating parameters that give information about the population, such as proportions and means. For quantitative variables the population mean is estimated (like how much money on average is spent on medicine in a certain year). For categorical variables the population proportions are estimated for the categories (like how many people do and don't have medical insurance in a certain year).

      Two kinds of parameter estimates exist:

      • A point estimate is a number that is the best prediction.

      • An interval estimate is an interval surrounding a point estimate, which you think contains the population parameter.

      There is a difference between an estimator (the method by which estimates are made) and a point estimate (the estimated number itself). For instance, the sample proportion is an estimator of the population proportion, and 0.73 is a point estimate of the population proportion that believes in love at first sight.

      A good estimator has a sampling distribution that is centered around the parameter and that has a standard error as small as possible.

      An estimator is unbiased when its sampling distribution is centered around the parameter. This is the case for the sample mean: the mean of its sampling distribution equals the population mean, so ȳ (the sample mean) is regarded a good estimator of µ (the population mean).

      When an estimator is biased, it does not estimate the parameter well. The sample variability, for example, tends to underestimate the population variability: a sample rarely contains the most extreme values of the population, so its spread is usually smaller.

      An estimator should also have a small standard error. An estimator is called efficient when its standard error is smaller than that of other estimators. Take a normal distribution: the standard error of the sample median is about 25% bigger than that of the sample mean, so the sample mean tends to be closer to the population mean than the sample median is. The sample mean is then the more efficient estimator.
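      This efficiency claim can be checked with a small simulation. The sketch below assumes a standard normal population; the sample size and number of repetitions are arbitrary choices:

```python
import random
import statistics

# Draw many samples from a standard normal population and compare how
# much the sample mean and the sample median vary across samples.
random.seed(42)
means, medians = [], []
for _ in range(2000):
    sample = [random.gauss(0, 1) for _ in range(100)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

se_mean = statistics.stdev(means)      # spread of the mean across samples
se_median = statistics.stdev(medians)  # spread of the median across samples

# For normal data the ratio is roughly 1.25: the median's standard
# error is about 25% larger, so the mean is the more efficient estimator.
print(se_median / se_mean)
```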

      A good estimator is unbiased (meaning the sampling distribution is centered around the parameter) and efficient (meaning it has the smallest standard error).

      Usually the sample mean serves as an estimator for the population mean, the sample standard deviation

      ...

      How do you perform significance tests? – Chapter 6


      6.1 What are the five components of a significance test?

      A hypothesis is a prediction that a parameter in the population has a certain value or falls within a certain interval. Two kinds of hypotheses are distinguished. The null hypothesis (H0) is the assumption that the parameter takes a certain value. Opposite it stands the alternative hypothesis (Ha), the assumption that the parameter falls in a range outside that value. Usually the null hypothesis represents no effect. A significance test (also called a hypothesis test, or simply a test) determines whether enough evidence exists to support the alternative hypothesis. It does so by comparing point estimates of parameters with the values expected under the null hypothesis.

      Significance tests consist of five parts:

      • Assumptions. Each test makes assumptions about the type of data (quantitative/categorical), the required level of randomization, the population distribution (for instance the normal distribution) and the sample size.

      • Hypotheses. Each test has a null hypothesis and an alternative hypothesis.

      • Test statistic. This indicates how far the estimate lies from the parameter value of H0. Often, this is shown by the number of standard errors between the estimate and the value of H0.

      • P-value. This gives the weight of evidence against H0. The smaller the P-value is, the more evidence that H0 is incorrect and that Ha is correct.

      • Conclusion. This is an interpretation of the P-value and a decision on whether H0 should be rejected or not.
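      As a sketch, the five components for a test about a mean might look as follows. The data, the null value µ0 and the z-based p-value approximation are illustrative assumptions, not taken from the text (for small samples a t distribution would be more appropriate):

```python
import math
import statistics

# Hypothetical sample and hypotheses H0: mu = mu0 vs Ha: mu != mu0.
data = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7, 5.4, 5.0]
mu0 = 5.0

n = len(data)
ybar = statistics.mean(data)                     # point estimate of mu
se = statistics.stdev(data) / math.sqrt(n)       # estimated standard error

# Test statistic: number of standard errors between estimate and mu0.
z = (ybar - mu0) / se

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(z, p_value)  # a small p-value would be evidence against H0
```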

      6.2 How do you perform a significance test for a mean?

      Significance tests for quantitative variables usually concern the population mean µ. The five parts of a significance test come into play here.

      It is assumed that the data come from a random sample and that the population distribution is normal.

      The test is two-sided, meaning that the alternative hypothesis contains values on both sides of the null value. Usually the null hypothesis is H0: µ = µ0, in which µ0 is a particular value of the population mean. This hypothesis says that there is no effect. The alternative hypothesis then contains all other values and looks

      ...

      How do you compare two groups in statistics? - Chapter 7


      7.1 What are the basic rules for comparing two groups?

      In social science two groups are often compared: means for quantitative variables, proportions for categorical variables. When comparing two groups, a binary variable is used: a variable with two categories (also called dichotomous), for instance sex with the categories men and women. This is an example of bivariate statistics.

      Two groups can be dependent or independent. They are dependent when the respondents naturally match with each other. An example is longitudinal research, where the same group is measured at two moments in time. For an independent sample the groups don't match, for instance in cross-sectional research, where people are randomly selected from the population.

      Imagine comparing two independent groups, men and women, on the time they spend sleeping. Men and women are two different groups, with two population means, two estimates and two standard errors. The standard error indicates how much a mean varies from sample to sample. Because we want to investigate the difference, this difference has a standard error of its own. The population difference µ₂ – µ₁ is estimated by the sample difference ȳ₂ – ȳ₁. This can be shown in a sampling distribution: the standard error of ȳ₂ – ȳ₁ indicates how much the difference varies between samples. The formula is:

      Estimated standard error = √(se₁² + se₂²)

      In this case se₁ is the standard error of group 1 (men) and se₂ the standard error of group 2 (women).
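      In code, with made-up standard errors for the two groups:

```python
import math

# Estimated standard error of the difference between two sample means:
# se = sqrt(se1**2 + se2**2). The values below are invented.
se1 = 0.3  # standard error of group 1 (men)
se2 = 0.4  # standard error of group 2 (women)

se_diff = math.sqrt(se1**2 + se2**2)
print(se_diff)  # about 0.5: larger than either group's own standard error
```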

      Instead of the difference, the ratio can also be reported. This is especially useful for very small proportions.

      7.2 How do you compare two proportions of categorical data?

      The difference between the proportions of two populations (π₂ – π₁) is estimated by the difference between the sample proportions. When the samples are large, the standard error of this estimate is small.

      The confidence interval is the point estimate of the difference ± a z-score multiplied by the standard error. The formula for the group difference is:

      (π̂₂ – π̂₁) ± z · se, in which se = √( π̂₁(1 – π̂₁)/n₁ + π̂₂(1 – π̂₂)/n₂ )

      When

      ...

      How do you analyze the association between categorical variables? – Chapter 8


      8.1 How do you create and interpret a contingency table?

      A contingency table contains the outcomes of all possible combinations of categorical data. A 4x5 contingency table has 4 rows and 5 columns. It often shows percentages; this is called relative data.

      A conditional distribution shows the data as percentages of a subtotal, conditional on a certain value, like the percentage of women that have a cold. A marginal distribution contains the row or column totals separately. A joint (simultaneous) distribution shows the percentages with respect to the entire sample.

      Two categorical variables are statistically independent when the probability that one occurs is unrelated to the probability that the other occurs. So this is when the probability distribution of one variable is not influenced by the outcome of the other variable. If this does happen, they are statistically dependent.

      8.2 What is a chi-squared test?

      Independence and dependence are statements about the variables in the population. The sample will probably be distributed similarly, but not necessarily; the variability can be high. A significance test tells whether it is plausible that the variables really are independent in the population. The hypotheses for this test are:

      H0: the variables are statistically independent

      Ha: the variables are statistically dependent

      A cell in a contingency table shows the observed frequency (fo), the number of times an observation is made. The expected frequency (fe) is the count expected if the null hypothesis is true, i.e. if the variables are independent. The expected frequency of a cell is calculated by multiplying its row total by its column total and dividing by the sample size.

      A significance test for independence uses a special test statistic, X², which says how close the expected frequencies are to the observed frequencies. The test is called the chi-squared test (of independence). The formula is:

      X² = Σ (fo – fe)² / fe

      This method was developed by Karl Pearson. When X² is small, the expected and observed frequencies are close together; the bigger X² is, the further they are apart. The test statistic thus indicates how much the data deviate from what independence would predict.
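      A sketch of the calculation for a hypothetical 2x2 table (the counts are invented; rows and columns could be, say, gender by opinion):

```python
# Observed frequencies for a made-up 2x2 contingency table.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]        # [50, 50]
col_totals = [sum(col) for col in zip(*observed)]  # [50, 50]
n = sum(row_totals)                                # 100

# X^2 = sum over cells of (fo - fe)^2 / fe, where
# fe = row total * column total / n.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / n     # expected frequency, 25 here
        chi2 += (fo - fe) ** 2 / fe

print(chi2)  # 4.0 for this table
```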

      A binomial distribution shows the probabilities of outcomes of a small sample with categorical discrete variables, like tossing a coin. This is not a distribution of observations or a

      ...

      How do linear regression and correlation work? – Chapter 9


      9.1 What are linear associations?

      Regression analysis is the process of researching associations between quantitative response variables and explanatory variables. It has three aspects: 1) investigating whether an association exists, 2) determining the strength of the association and 3) making a regression equation to predict the value of the response variable using the explanatory variable.

      The response variable is denoted y and the explanatory variable x. A linear function means that a straight line runs through the data points in a graph: y = α + βx, in which alpha (α) is the y-intercept and beta (β) is the slope.

      The x-axis is the horizontal axis and the y-axis is the vertical axis. The origin is the point where x and y are both 0.

      The y-intercept is the value of y when x = 0. In that case βx equals 0 and only y = α remains. The y-intercept is where the line crosses the y-axis.

      The slope (β) indicates the change in y for an increase of 1 in x, so the slope shows how steep the line is: the larger the absolute value of β, the steeper the line.

      When β is positive, then y increases when x increases (a positive relationship). When β is negative, then y decreases when x increases (a negative relationship). When β = 0, the value of y is constant and doesn't change when x changes. This results in a horizontal line and means that the variables are independent.
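      A tiny sketch of how the intercept and slope behave, with made-up values for α and β:

```python
# Linear function y = alpha + beta * x with hypothetical parameters.
alpha, beta = 2.0, 0.5

def predict(x):
    return alpha + beta * x

print(predict(0))               # 2.0: the y-intercept (value of y at x = 0)
print(predict(1) - predict(0))  # 0.5: the slope, change in y per unit of x

# beta > 0 here, so y increases as x increases (a positive relationship).
```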

      A linear function is an example of a model; a simplified approximation of the association between variables in the population. A model can be good or bad. A regression model usually means a model more complex than a linear function.

      9.2 What is the least squares prediction equation?

      In regression analysis α and β are regarded as unknown parameters that can be estimated using the available data. Each value of y is a point in a graph and can be written with its coordinates (x, y). A graph is used as a visual check whether it makes sense to make a linear function. If the data is U-shaped, a straight line doesn't make sense.

      The variable y is estimated by ŷ. The equation is estimated

      ...

      Which types of multivariate relationships exist? – Chapter 10


      10.1 How does causality relate to associations?

      Many scientific studies research more than two variables, requiring multivariate methods. A lot of research is focused on causal relationships between variables, but finding proof of causality is difficult: a relationship that appears causal may be caused by another variable. Statistical control is the method of checking whether an association between variables changes or disappears when the influence of other variables is removed. In a causal relationship, x → y, the explanatory variable x causes the response variable y. This is asymmetrical, because y need not cause x.

      There are three criteria for a causal relationship:

      1. Association between the variables

      2. Appropriate time order

      3. Elimination of alternative explanations

      An association is required for a causal relationship, but an association alone does not establish one. Usually the logical time order is immediately clear, such as an explanatory variable preceding a response variable. Apart from x and y, extra variables may provide an alternative explanation. In observational studies it can almost never be proved that one variable causes another. Sometimes there are outliers or anecdotes that contradict causality, but a single anecdote usually isn't enough to disprove it. It is easier to establish causality with randomized experiments than with observational studies, because randomization assigns subjects to groups at random and fixes the time order before the experiment starts.

      10.2 How do you control whether other variables influence a causal relationship?

      Eliminating alternative explanations is often tricky. A method of testing the influence of other variables is controlling for them: eliminating them or holding them at a constant value. Controlling means making sure the control variables (the other variables) no longer influence the association between x and y. A randomized experiment in a way also controls variables: the subjects are assigned randomly and the other variables manifest themselves randomly across the groups.

      Statistical control is different from experimental control. In statistical control, subjects with certain characteristics are grouped together. Observational studies in social science often form groups based on socio-economic status, education or income.

      The association between two quantitative variables is shown in a scatter plot. Controlling this association for a categorical variable is done by comparing the means.

      The association between two categorical variables is shown in a contingency table. Controlling this association

      What is multiple regression? – Chapter 11


      11.1 What does a multiple regression model look like?

      A multiple regression model has more than one explanatory variable and sometimes also one or more control variables: E(y) = α + β1x1 + β2x2. The explanatory variables are numbered: x1, x2, etc. When an explanatory variable is added, the equation is extended with β2x2. The parameters are α, β1 and β2. The y-axis is vertical, x1 is horizontal and x2 is perpendicular to x1. In this three-dimensional graph the multiple regression equation describes a flat surface, called a plane.
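      The plane E(y) = α + β1x1 + β2x2 can be estimated from data with least squares. A minimal sketch in Python with made-up data (the numbers are only for illustration; the data are generated exactly on a known plane, so least squares should recover its coefficients):

```python
import numpy as np

# Made-up data generated exactly on the plane y = 2 + 0.5*x1 + 1.5*x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 2.0 + 0.5 * x1 + 1.5 * x2

# Design matrix: a column of ones for the intercept alpha, then x1 and x2
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates a, b1, b2 of alpha, beta1, beta2
a, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
```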

      A partial regression equation describes only part of the possible observations, only those with a certain value.

      In multiple regression a coefficient indicates the effect of an explanatory variable on a response variable, while controlling for the other variables. Bivariate regression completely ignores the other variables; multiple regression holds them constant. This is the basic difference between bivariate and multiple regression. The coefficient (like β1) of a predictor (like x1) tells what the change in the mean of y is when the predictor is raised by one unit, controlling for the other variables (like x2). In that case, β1 is a partial regression coefficient. The parameter α is the mean of y when all explanatory variables are 0.

      The multiple regression model has its limitations. An association doesn't automatically mean that there is a causal relationship; there may be other factors. Some researchers are more careful and call statistical control 'adjustment'. The regular multiple regression model assumes that there is no statistical interaction: the slope β of an explanatory variable doesn't depend on the values of the other explanatory variables.

      The multiple regression model that exists in the population is estimated by the prediction equation: ŷ = a + b1x1 + b2x2 + … + bpxp, in which p is the number of explanatory variables.

      Just like the bivariate model, the multiple regression model uses residuals to measure prediction errors. For a predicted response ŷ and a measured response y, the residual is the difference between them: y – ŷ. The SSE (Sum of Squared Errors/Residual Sum of Squares) is the same as for bivariate models: SSE = Σ(y – ŷ)², the only difference is the fact that the estimate ŷ is shaped
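      The residual and SSE computation described above is straightforward once predictions exist; a sketch with made-up observed and predicted values:

```python
# Made-up observed and predicted responses
y     = [3.0, 5.0, 7.0, 9.0]
y_hat = [3.2, 4.7, 7.1, 9.0]

# Residual per observation: y - y_hat
residuals = [obs - pred for obs, pred in zip(y, y_hat)]

# Sum of Squared Errors
sse = sum(r ** 2 for r in residuals)
```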

      What is ANOVA? – Chapter 12


      12.1 How do dummy variables replace categories?

      For analyzing categorical variables without assigning a ranking, dummy variables are an option. This means that artificial variables are created to code the categories:

      z1 = 1 and z2 = 0 : observations of category 1 (men)

      z1 = 0 and z2 = 1 : observations of category 2 (women)

      z1 = 0 and z2 = 0 : observations of category 3 (transgender and other identities)

      The model is: E(y) = α + β1z1 + β2z2. The means are derived from the model: μ1 = α + β1, μ2 = α + β2 and μ3 = α. Three categories only require two dummy variables, because the remaining observations fall in category 3.
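      Dummy coding as in the scheme above can be sketched as follows (the observations are made up; the third category is the reference with z1 = z2 = 0):

```python
# Made-up observations of a three-category variable
categories = ["man", "woman", "other", "man", "other"]

# Two dummy variables suffice for three categories
z1 = [1 if c == "man" else 0 for c in categories]
z2 = [1 if c == "woman" else 0 for c in categories]
# "other" is the reference category: z1 = 0 and z2 = 0
```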

      A significance test using the F-distribution tests whether the means are the same. The null hypothesis H0 : μ1 = μ2 = μ3 is the same as H0 : β1 = β2 = 0. A large F means a small P and much evidence against the null hypothesis.

      The F-test is robust against small violations of normality and differences in the standard deviations. However, it can't handle very skewed data. This is why randomization is important.

      12.2 How do you make multiple comparisons of means?

      A small P doesn't say which means differ or by how much. Confidence intervals give more information. For every mean a confidence interval can be constructed, or for the difference between two means. An estimate of the difference in population means μi – μj is (ȳi – ȳj) ± t · s · √(1/ni + 1/nj), where s is the standard deviation estimate pooled across the groups.

      The degrees of freedom of the t-score are df = N – g, in which g is the number of categories and N is the combined sample size (n1 + n2 + … + ng). When the confidence interval doesn't contain 0, this is proof of difference between the means.

      In case of lots of groups with equal population means, it might happen that a confidence interval finds a difference anyway, due to the increase in errors that comes with the increase in the number of comparisons. Multiple comparison methods control the probability that all intervals of a lot of comparisons contain the real differences. The probability that at least one of the comparisons contains an error is the multiple comparison error rate. One such method is the Bonferroni method, which divides the
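      The Bonferroni idea of dividing the overall error rate over the comparisons can be sketched as follows (g is a hypothetical number of groups):

```python
from math import comb

g = 4                        # hypothetical number of groups
n_comparisons = comb(g, 2)   # number of pairwise comparisons between g means
overall_error_rate = 0.05    # desired multiple comparison error rate

# Bonferroni: each individual interval uses this smaller error rate
per_comparison_error_rate = overall_error_rate / n_comparisons
```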

      How does multiple regression with both quantitative and categorical predictors work? – Chapter 13


      13.1 What do models with both quantitative and categorical predictors look like?

      Multiple regression is also feasible for a combination of quantitative and categorical predictors. In a lot of research it makes sense to control for a quantitative variable. A quantitative control variable is called a covariate and it is studied using analysis of covariance (ANCOVA).

      A graph helps to research the effect of quantitative predictor x on the response y, while controlling for the categorical predictor z. For two categories, a single dummy variable z suffices; for more categories, more dummy variables are required (like z1 and z2). The values of z can be 1 ('agree') or 0 ('don't agree'). If there is no interaction, the lines that fit the data best are parallel and the slopes are the same. It's even possible that the regression lines are exactly the same. But if they aren't parallel, there is interaction.

      The predictor can be quantitative and the control variable can be categorical, but it can also be the other way around. Software compares the means. A regression model with three categories is: E(y) = α + βx + β1z1 + β2z2, in which β is the effect of x on y for all groups z. For every additional quantitative variable a βx term is added. For every additional categorical variable a dummy variable is added (or several, depending on the number of categories). Cross-product terms are added in case of interaction.
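      The no-interaction model above produces parallel regression lines: the difference between two groups is the same at every value of x. A sketch with made-up coefficients:

```python
# Hypothetical coefficients for E(y) = alpha + beta*x + beta1*z1 + beta2*z2
alpha, beta, beta1, beta2 = 2.0, 0.5, 1.0, -0.5

def mean_response(x, z1, z2):
    # same slope beta for every group: the no-interaction assumption
    return alpha + beta * x + beta1 * z1 + beta2 * z2

# Group 1 (z1=1) vs reference group (z1=z2=0): the gap equals beta1 at any x
gap_at_0 = mean_response(0, 1, 0) - mean_response(0, 0, 0)
gap_at_10 = mean_response(10, 1, 0) - mean_response(10, 0, 0)
```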

      13.2 Which inferential methods are available for regression with quantitative and categorical predictors?

      The first step to making predictions is testing whether a model needs to include interaction. An F-test compares a model with cross-product terms to a model without. For this the F-test uses the partial sum of squares: the variability in y that is explained by a certain variable when the other variables are already accounted for. The null hypothesis says that the slopes of the cross-product terms are 0; the alternative hypothesis says that there is interaction. In a graph, interaction shows as regression lines that aren't parallel.

      Another F-test checks whether a complete or a reduced model is better. To compare a complete model (E(y) = α + βx + β1z1 + β2z2) with a reduced model (E(y) =

      How do you make a multiple regression model for extreme or strongly correlating data? – Chapter 14


      14.1 What strategies are available for selecting a model?

      Three basic rules for selecting variables to add to a model are:

      1. Select variables that can answer the theoretical purpose (accepting/rejecting the null hypothesis), with sensible control variables and mediating variables

      2. Add enough variables for a good predictive power

      3. Keep the model simple

      The explanatory variables should be highly correlated with the response variable but not with each other. Software can test and select explanatory variables. Possible strategies are backward elimination, forward selection and stepwise regression. In backward elimination all candidate variables are added, tested for their P-value, and the least significant variables are removed one by one until only significant variables remain. Forward selection starts from scratch, at each step adding the variable with the lowest P-value. Another version of this is stepwise regression; this method removes variables that have become redundant when new variables are added.
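      Backward elimination can be sketched on a toy set of P-values. This shows only the selection logic; in a real analysis the model is refitted and the P-values recomputed after every removal:

```python
# Hypothetical P-values per explanatory variable
p_values = {"x1": 0.01, "x2": 0.47, "x3": 0.03, "x4": 0.21}
threshold = 0.05

selected = dict(p_values)
while selected:
    worst = max(selected, key=selected.get)  # least significant variable
    if selected[worst] <= threshold:
        break                                # everything left is significant
    del selected[worst]                      # drop it and continue
```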

      Software helps but it's up to the researcher to think and make choices. It also matters whether research is explanatory, starting with a theoretical model with known variables, or whether research is exploratory, openly looking for explanations of a phenomenon.

      Several criteria are indications of a good model. To find a model with high predictive power but without an overabundance of variables, the adjusted R² is used: R²adj = 1 – (1 – R²)(n – 1)/(n – p – 1), where n is the sample size and p the number of explanatory variables.

      The adjusted R2 decreases when an unnecessary variable is added.

      Cross-validation continuously checks whether the predicted values are as close as possible to the observed values. The result is the predicted residual sum of squares (PRESS): PRESS = Σ(yi – ŷ(i))², where ŷ(i) is the prediction for observation i from the model fitted without observation i.

      If PRESS decreases, the predictions get better. However, this test assumes a normal distribution. A method that can handle other distributions is the Akaike information criterion (AIC), which selects the model in which ŷi is as close as possible to E(yi). If AIC decreases, the predictions get better.
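      The adjusted R² penalty for extra variables can be illustrated with the standard formula (the numbers are made up):

```python
def adjusted_r2(r2, n, p):
    # n = sample size, p = number of explanatory variables
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R-squared, but more variables lowers the adjusted value
few_vars = adjusted_r2(0.80, n=50, p=3)
many_vars = adjusted_r2(0.80, n=50, p=10)
```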

      14.2 How can you tell when a statistical model doesn't fit?

      Inference of parameters in a regression model has the following assumptions:

      • The model fits the shape of the data

      • The conditional distribution of y is normal

      • The standard deviation is constant in the range of values of the explanatory variables (this is called homoscedasticity)

      What is logistic regression? – Chapter 15


      15.1 What are the basics of logistic regression?

      A logistic regression model is a model with a binary response variable (like 'agree' or 'don't agree'). It's also possible for logistic regression models to have ordinal or nominal response variables. The mean is the proportion of responses that are 1. The linear probability model is P(y=1) = α + βx. This model often is too simple; a more extended version lets the probability follow an S-shaped curve instead of a straight line.

      The logarithm can be calculated using software. The odds are: P(y=1)/[1 – P(y=1)]. The log of the odds, or logistic transformation (abbreviated as logit), is the logistic regression model: logit[P(y=1)] = α + βx.

      To find the outcome for a certain value of a predictor, the logit is converted back to a probability: P(y=1) = e^(α + βx) / (1 + e^(α + βx)). Raising e to a certain power gives the antilog of that number.
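      The logit and its inverse can be sketched with made-up coefficients a and b:

```python
from math import exp, log

a, b = -2.0, 0.8  # hypothetical estimates of alpha and beta

def prob(x):
    # inverse of the logit: P(y=1) = e^(a+bx) / (1 + e^(a+bx))
    return exp(a + b * x) / (1 + exp(a + b * x))

def logit(p):
    # log odds
    return log(p / (1 - p))

# logit(prob(x)) recovers the linear predictor a + b*x
```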

      A straight line is drawn next to the curve of a logistic graph to analyze it. The curve is steepest where P(y=1) = ½. For logistic regression the maximum likelihood method is used instead of the least squares method. The model expressed in odds is: P(y=1)/[1 – P(y=1)] = e^(α + βx) = e^α (e^β)^x. The estimate replaces α and β with a and b. With this the odds ratio can be calculated: e^b is the multiplicative change in the odds when x increases by one unit.

      There are two ways to present the data. For ungrouped data each row contains the data of a single subject. For grouped data a row contains the count for a cell, for instance one row with the number of subjects that agreed, followed by the total number of subjects.

      An alternative to the logit is the probit. This link assumes a hidden, underlying continuous variable y*: the observed response is 1 when y* exceeds a certain threshold T and 0 when it stays below T. Because y* is hidden, it's called a latent variable. It can be used to make a probit model: probit[P(y=1)] = α + βx.

      Logistic regression with repeated measures and random effects is analyzed with a linear mixed model: logit[P(yij = 1)] = α + βxij + si.

      15.2 What does multiple logistic regression look like?

      The multiple logistic regression model is: logit[P(y = 1)] = α + β1x1 + … + βpxp. The further βi is from 0, the stronger

      Selected contributions for Introduction to Statistics

      What are statistical methods? – Chapter 1


      1.1 What is statistics and how can you learn it?

      Statistics is used more and more often to study the behavior of people, not only by the social sciences but also by companies. Everyone can learn how to use statistics, even without much knowledge of mathematics and even with fear of statistics. Most important are logical thinking and perseverance.

      The first step to using statistical methods is collecting data. Data are collected observations of characteristics of interest, for instance the opinion of 1000 people on whether marijuana should be allowed. Data can be obtained through questionnaires, experiments, observations or existing databases.

      But statistics aren't only numbers obtained from data. A broader definition of statistics entails all methods to obtain and analyze data.

      1.2 What is the difference between descriptive and inferential statistics?

      Before being able to analyze data, a design is made for how to obtain the data. Next there are two sorts of statistical analyses: descriptive statistics and inferential statistics. Descriptive statistics summarizes the information obtained from a collection of data, so the data is easier to interpret. Inferential statistics makes predictions with the help of data. Which kind of statistics is used depends on the goal of the research (summarizing or predicting).

      To understand the differences better, a number of basic terms are important. The subjects are the entities that are observed in a research study, most often people but sometimes families, schools, cities, etc. The population is the whole set of subjects that you want to study (for instance foreign students). The sample is a limited number of selected subjects on which you will collect data (for instance 100 foreign students from several universities). The ultimate goal is to learn about the population, but because it's often impossible to research the entire population, a sample is drawn.

      Descriptive statistics can be used both when data is available for the entire population and when it's only available for a sample. Inferential statistics is only applicable to samples, because it generalizes beyond the observed data. Hence the definition of inferential statistics: making predictions about a population, based on data gathered from a sample.

      The goal of statistics is to learn more about the parameter. A parameter is a numerical summary of the population: an unknown value that says something about the population as a whole. So it's not about the sample but about the population. This is why an important part of

      What are the main measures and graphs of descriptive statistics? - Chapter 3


      3.1 Which tables and graphs display data?

      Descriptive statistics serves to create an overview or summary of data. There are two kinds of data, quantitative and categorical; each has its own descriptive statistics.

      To create an overview of categorical data, it's easiest if the categories are in a list including the frequency of each category. To compare the categories, the relative frequencies are listed too. The relative frequency of a category shows how often a subject falls within this category relative to the whole sample. It can be calculated as a percentage or a proportion. The percentage is the number of observations within a certain category, divided by the total number of observations, multiplied by 100. Calculating a proportion works the same way, except that the number isn't multiplied by 100. The sum of all proportions should be 1.00; the sum of all percentages should be 100.

      Frequencies can be shown using a frequency distribution: a list of all possible values of a variable and the number of observations for each value. A relative frequency distribution also shows each value's share of the sample.

      Example (relative) frequency distribution:

      Gender    Frequency    Proportion    Percentage
      Male      150          0.43          43%
      Female    200          0.57          57%
      Total     350 (= n)    1.00          100%
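      The proportions and percentages in this example can be reproduced directly from the counts:

```python
# Counts from the example frequency distribution
counts = {"Male": 150, "Female": 200}
n = sum(counts.values())  # sample size

# Proportion: count / n; percentage: proportion * 100
proportions = {k: round(v / n, 2) for k, v in counts.items()}
percentages = {k: round(100 * v / n) for k, v in counts.items()}
```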

      Aside from tables, other visual displays are used as well, such as bar graphs, pie charts, histograms and stem-and-leaf plots.

      A bar graph is used for categorical variables and uses a bar for each category. The bars are separated to indicate that the graph doesn't display quantitative variables but categorical variables.

      A pie chart is also used for categorical variables. Each slice represents a category. When the values are close together, bar graphs show the differences more clearly than pie charts.

      Frequency distributions and other visual displays are also used for quantitative variables. In that case, the categories are replaced by intervals. Each interval has a frequency, a proportion and a percentage.

      A histogram is a graph of the frequency distribution for a quantitative variable. Each value is represented by a bar, except when there are many values, then


      Selected contributions for Data: distributions, connections and gatherings

      Which kinds of samples and variables are possible? – Chapter 2


      2.1 Which kinds of variables can be measured?

      All characteristics of a subject that can be measured are variables. These characteristics can vary between different subjects within a sample or within a population (like income, sex, opinion). A variable indicates the variability of a value, for example the number of beers consumed per week by students. The values of a variable constitute the measurement scale. Several measurement scales, or ways to classify variables, are possible.

      The most important divide is that between quantitative and categorical variables. Quantitative variables are measured in numerical values, such as age, number of brothers and sisters, or income. Categorical variables (also called qualitative variables) are measured in categories, such as sex, marital status or religion. The measurement scales are tied to statistical analyses: for quantitative variables it is possible to calculate the mean (i.e. the average age), but for categorical variables this isn't possible (i.e. there is no average sex).

      In addition, there are four measurement scales: nominal, ordinal, interval and ratio. Categorical variables have nominal or ordinal scales.

      The nominal scale is purely descriptive. For instance with sex as a variable, the possible values are man and woman. There is no order or hierarchy, one value isn't higher than the other.

      The ordinal scale on the other hand assumes a certain order. For instance happiness. If the possible values are unhappy, considerably unhappy, neutral, considerably happy and ecstatic, then there is a certain order. If a respondent indicates to be neutral, this is happier than considerably unhappy, which in turn is happier than unhappy. Important is that the distances between the values cannot be measured, this is the difference between ordinal and interval.

      Quantitative variables have an interval or ratio scale. Interval means that there are measurable differences between the values, for instance temperature in Celsius. There is an order (30 degrees is more than 20) and the difference is clearly measurable and consistent.

      The difference between interval and ratio is that an interval scale has no true zero point (zero is just another value, as in degrees Celsius), while for a ratio scale zero really means 'nothing'. So the ratio scale has numerical values, with a certain order, with measurable differences and with a meaningful zero. Examples are percentage or income.

      Furthermore there are discrete and continuous variables. A variable is discrete when the possible values are a limited set of separate numbers. A variable is continuous when any value within a range is possible. For instance, the number of brothers and sisters is discrete, because it's not possible to have 2.43 brothers/sisters. And for instance

      What role do probability distributions play in statistical inference? – Chapter 4


      4.1 What are the basic rules of probability?

      Randomization is important for collecting data: the possible observations are known, but it's yet unknown which possibility will prevail. What will happen depends on probability. The probability is the proportion of times that a certain observation occurs in a long sequence of similar observations. The fact that the sequence is long is important, because the longer the sequence, the more accurate the probability; the sample proportion then approaches the population proportion. Probabilities can also be measured in percentages (such as 70%) instead of proportions (such as 0.7). A specific branch of statistics, called Bayesian statistics, deals with subjective probabilities. However, most of statistics is about regular probabilities.

      A probability is written like P(A), where P stands for probability and A is an outcome. If only two outcomes A and B are possible and they exclude each other, then the probability that B happens is 1 – P(A).

      Imagine research about people's favorite colors, for instance red and blue. Again the assumption is made that the possibilities exclude each other without overlapping. The probability that someone's favorite color is red (A) or blue (B) is P(A or B) = P(A) + P(B).

      Next, imagine research that encompasses multiple questions. The research seeks to investigate how many married people have kids. Then you can multiply the probability that someone is married (A) with the probability that someone has kids (B) given that they are married. The formula for this is: P(A and B) = P(A) × P(B given A). Because the probability of B depends on A, P(B given A) is called a conditional probability.

      Now, imagine researching multiple possibilities that are not connected. The probability that one random person likes to wear sweaters (A) and another random person likes to wear sweaters (B) is P(A and B) = P(A) × P(B). These are independent probabilities.
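      The three rules above can be checked with made-up probabilities:

```python
# Addition rule for mutually exclusive outcomes
p_red, p_blue = 0.30, 0.25
p_red_or_blue = p_red + p_blue            # P(A or B) = P(A) + P(B)

# Multiplication rule with a conditional probability
p_married = 0.60
p_kids_given_married = 0.70
p_married_and_kids = p_married * p_kids_given_married

# Multiplication rule for independent events
p_a, p_b = 0.40, 0.40
p_both = p_a * p_b                        # P(A and B) = P(A) * P(B)
```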

      4.2 What is the difference in probability distributions for discrete and continuous variables?

      A random variable means that the outcome differs for each observation, but mostly it's just referred to as a variable. While a discrete variable has set possible values,

      Follow the author: Annemarie JoHo