Analysis of Variance (15)

 

15. Analysis of Variance

There are situations and experiments that require processes to be compared at more than two levels. Data from such experiments can be analysed using analysis of variance or ANOVA.

15.1. Comparing Population Means

There are other ways to compare population means than ANOVA, but these assume either paired observations or independent random samples, and can only be used to compare two population means. ANOVA can be used to compare more than two populations. It does so through assessments of variation, which pose a large problem for the other methods.

15.2. One-Way ANOVA

The procedure for testing the equality of population means is called a one-way ANOVA. This procedure is based on the assumption that all included populations have a common variance.

The total sum of squares (SST) in this procedure is made up of a within-groups sum of squares (SSW) and a between-groups sum of squares (SSG): SST = SSW + SSG.

This division of the SST forms the basis of the one-way ANOVA, as it expresses the total variability around the mean for the sample observations.

If the null hypothesis is true (all population means are the same) then both SSW and SSG can be used to estimate the common population variance. This is done by dividing by the appropriate number of degrees of freedom.

Because SSW and SSG both provide an unbiased estimate of the common population variance if the null hypothesis is true, a difference between the two values indicates that the null hypothesis is false. The test of the null hypothesis is thus based on the ratio of mean squares:

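$$F = \frac{MSG}{MSW}$$

where $MSG = SSG/(K-1)$ and $MSW = SSW/(n-K)$, with K the number of groups and n the total number of observations. The test assumes that the population variances are equal and that the population distributions are normal.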

The closer the ratio is to 1, the less indication there is that the null hypothesis is false.
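The decomposition and the F-ratio can be illustrated with a short sketch. It is a minimal example in plain Python with made-up data; the function name `one_way_anova` and the sample values are assumptions for illustration only.

```python
def one_way_anova(groups):
    """Return (SSG, SSW, SST, F) for a list of samples, one list per group."""
    n = sum(len(g) for g in groups)           # total number of observations
    k = len(groups)                           # number of groups, K
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Between groups: distance of each group mean from the overall mean,
    # weighted by the group size.
    ssg = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within groups: distance of each observation from its own group mean.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    # Total: distance of each observation from the overall mean.
    sst = sum((x - grand_mean) ** 2 for g in groups for x in g)

    msg = ssg / (k - 1)
    msw = ssw / (n - k)
    return ssg, ssw, sst, msg / msw


# Hypothetical data: three groups of three observations each.
samples = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]
ssg, ssw, sst, f_ratio = one_way_anova(samples)
print(ssg, ssw, sst, f_ratio)
```

For these values SSG = 38, SSW = 6 and SST = 44, so the F-ratio is 19. That is far from 1, which points against equal population means. The formal decision still requires comparing the ratio with the F distribution with K – 1 and n – K degrees of freedom.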

 

These results are also summarized in a one-way ANOVA table, which has the following format:

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F-ratio |
|---|---|---|---|---|
| Between groups | SSG | K – 1 | MSG | MSG/MSW |
| Within groups | SSW | n – K | MSW | |
| Total | SST | n – 1 | | |

 

 

It is also possible to calculate a minimum significant difference (MSD) between two sample means, as evidence for concluding whether the population means are different:

$$MSD(K) = Q\,\frac{s_p}{\sqrt{n}}$$

with $s_p = \sqrt{MSW}$ being the estimate of the common variability, n the number of observations, K the number of populations, and Q a factor from Table 13 of the Appendix.

15.3. Kruskal-Wallis test

The Kruskal-Wallis test is a nonparametric alternative to the one-way ANOVA and is used when there is a strong indication that the parent population distributions are markedly different from the normal. Like the majority of nonparametric tests, this test is based on the ranks of the sample observations. The test of the null hypothesis is based on the statistic:

$$W = \frac{12}{n(n+1)} \sum_{i=1}^{K} \frac{R_i^2}{n_i} - 3(n+1)$$

where $R_i$ is the sum of the ranks of the observations in sample i, $n_i$ is the size of sample i, and n is the total number of observations. The null hypothesis is rejected if W is larger than $\chi^2_{K-1,\alpha}$, the number that is exceeded with probability α by a random χ2 variable with (K – 1) degrees of freedom.
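The rank-based calculation can be sketched as follows. This is a minimal illustration in plain Python; the helper names and the data are assumptions, tied values simply receive the average of their ranks, and no further tie correction is applied.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1   # positions are 0-based, ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return the Kruskal-Wallis statistic W for a list of samples."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        r_i = sum(ranks[start:start + len(g)])   # rank sum R_i of this sample
        total += r_i ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical data: three samples that do not overlap at all.
w = kruskal_wallis([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
print(w)
```

Here the rank sums are 6, 15 and 24, which gives W = 7.2. This value is then compared with the χ2 cut-off for K – 1 = 2 degrees of freedom.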

15.4. Two-way ANOVA

If there is a situation where a second factor also influences the outcome, it is best to design the experiment in such a manner that the influence of this factor can also be taken into account. This additional variable is then called a blocking variable and this design is called a randomized block design, the outcomes of which can be analysed using a two-way ANOVA.

The design is called randomized because the categories of the two independent variables are combined at random.

Using the observation for the ith group and the jth block, the population model can be written as follows: Xij = μ + Gi + Bj + εij.

Here Xij is the random variable, μ is the overall mean, the parameter Gi measures the discrepancy between the mean of group i and μ, the parameter Bj measures the discrepancy between the mean of block j and μ, and εij represents the experimental error.

In a two-way ANOVA the SST is split up in the between-blocks sum of squares (SSB) and the between-groups sum of squares (SSG), and also contains the error sum of squares (SSE). It is thus split up as: SST = SSB + SSG + SSE.

The null hypothesis of the population group means being equal is then tested through the ratio of the mean square for groups to the mean square error: F = MSG/MSE.

The results of a two-way ANOVA are also best summarized in a two-way ANOVA table. This has the same set-up as a one-way ANOVA table, except for the sources of variation (between groups, between blocks, error, and total).
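The split of the SST in a randomized block design can be sketched in a few lines. This is an illustrative example in plain Python with invented data; the symbol H for the number of blocks, the function name and the dictionary keys are assumptions that do not come from the text above.

```python
def two_way_anova(table):
    """table[i][j] is the single observation for group i and block j."""
    k, h = len(table), len(table[0])            # K groups, H blocks
    grand = sum(sum(row) for row in table) / (k * h)
    group_means = [sum(row) / h for row in table]
    block_means = [sum(table[i][j] for i in range(k)) / k for j in range(h)]

    ssg = h * sum((m - grand) ** 2 for m in group_means)
    ssb = k * sum((m - grand) ** 2 for m in block_means)
    sst = sum((x - grand) ** 2 for row in table for x in row)
    sse = sst - ssg - ssb                       # what is left is the error

    msg = ssg / (k - 1)
    msb = ssb / (h - 1)
    mse = sse / ((k - 1) * (h - 1))
    return {"SSG": ssg, "SSB": ssb, "SSE": sse, "SST": sst,
            "F_groups": msg / mse, "F_blocks": msb / mse}


# Hypothetical data: rows are groups, columns are blocks.
data = [[10.0, 12.0, 14.0],
        [11.0, 14.0, 17.0],
        [15.0, 16.0, 20.0]]
result = two_way_anova(data)
print(result)
```

In this example SSG = 38, SSB = 38 and SSE = 2, so most of the total variability of 78 is attributed to the groups and the blocks rather than to error.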

15.5. Two-Way ANOVA with multiple observations per cell

It is also possible to have more than one observation per cell. This has two advantages:

  1. More sample data leads to more precise estimates meaning that the differences among the population means can be distinguished better.
  2. The interaction between groups and blocks, as a source of variability, can be isolated.

This model thus has three null hypotheses: no difference between group means, no difference between block means, and no group-block interaction.

In this model the SST contains one more term, the interaction sum of squares (SSI), corresponding to the extra source of variation (interaction): SST = SSG + SSB + SSI + SSE.
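How the interaction term is isolated can be illustrated with a small sketch. It assumes a balanced design (the same number of observations in every cell); the function name and the data are made up for the example.

```python
def two_way_anova_interaction(cells):
    """cells[i][j] is the list of observations for group i and block j."""
    k, h, m = len(cells), len(cells[0]), len(cells[0][0])
    n = k * h * m
    grand = sum(x for row in cells for cell in row for x in cell) / n
    cell_means = [[sum(c) / m for c in row] for row in cells]
    group_means = [sum(row) / h for row in cell_means]
    block_means = [sum(cell_means[i][j] for i in range(k)) / k for j in range(h)]

    ssg = h * m * sum((g - grand) ** 2 for g in group_means)
    ssb = k * m * sum((b - grand) ** 2 for b in block_means)
    # Interaction: what the cell mean shows beyond the group and block effects.
    ssi = m * sum(
        (cell_means[i][j] - group_means[i] - block_means[j] + grand) ** 2
        for i in range(k) for j in range(h))
    # Error: spread of the observations around their own cell mean.
    sse = sum((x - cell_means[i][j]) ** 2
              for i in range(k) for j in range(h) for x in cells[i][j])
    sst = sum((x - grand) ** 2 for row in cells for cell in row for x in cell)
    return ssg, ssb, ssi, sse, sst


# Hypothetical data: 2 groups x 2 blocks, 2 observations per cell.
cells = [[[8.0, 10.0], [12.0, 14.0]],
         [[11.0, 13.0], [21.0, 23.0]]]
ssg, ssb, ssi, sse, sst = two_way_anova_interaction(cells)
print(ssg, ssb, ssi, sse, sst)
```

With these numbers SSG = 72, SSB = 98, SSI = 18 and SSE = 8, which add up to SST = 196. With a single observation per cell, SSE would be zero and the interaction could not be separated from the error.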


Summary: Statistics for Business and Economics

Multiple Regression (12)

12. Multiple Regression

Simple regression (see chapter 11) can predict a dependent variable as a function of a single independent variable. But often there are multiple variables at play. In order to determine the simultaneous effect of multiple independent variables on a dependent variable, multiple regression is used. The model is fitted using the least squares principle.

12.1. The model

As with simple regression, the first step in the model development is model specification, the selection of the model variables and functional form of the model. This is influenced by the model objectives, namely: (1) predicting the dependent variable, and/or (2) estimating the marginal effect of each independent variable. The second objective is hard to achieve, however, in a model with multiple independent variables, because these variables are not only related to the dependent variable but also to each other. This leaves a web of effects that is not easily untangled.

To make multiple regression models more accurate an error term, ε, is added, as a way to recognize that none of the described relationships in the model will hold exactly and that there are likely to be variables that affect the dependent variable but are not included in the model.

12.2. Estimating Coefficients

Multiple regression coefficients are calculated with the least squares procedure. However, again this is more complicated than with simple regression, as the independent variables not only affect the dependent variable but also each other. It is not possible to identify the unique effect of each independent variable on the dependent variable. This means that the higher the correlations between two or more of the independent variables in a model are, the less reliable the estimated regression coefficients are.

There are 5 assumptions to standard multiple regression. The first 4 are the same as those made for simple regression (see chapter 11). The 5th states that it is not possible to find a set of numbers $c_0, c_1, \ldots, c_K$, not all zero, such that $c_0 + c_1 x_{1i} + c_2 x_{2i} + \cdots + c_K x_{Ki} = 0$ for every observation i. This assumption excludes the cases in which there is an exact linear relationship among the independent variables. In most cases this assumption will not be violated if the model is properly specified.

Whereas in simple regression the least squares procedure finds a line that best represents the set of points in space, multiple regression finds a plane that best represents these points (as each variable is represented with its own dimension).
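Fitting such a plane can be sketched without any libraries by solving the least squares normal equations. This is a minimal illustration, not a production routine; the helper names and the data are assumptions, and the data are generated from an exact plane so the expected coefficients are known.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                    # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def least_squares(xs, y):
    """xs: one row of independent variables per observation.

    Returns the coefficients [b0, b1, ..., bK] that minimise the sum of
    squared errors, found from the normal equations (X'X) b = X'y.
    """
    rows = [[1.0] + list(row) for row in xs]          # add the intercept column
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data lying exactly on the plane y = 2 + 3*x1 - 1*x2.
xs = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
y = [2.0 + 3.0 * x1 - 1.0 * x2 for x1, x2 in xs]
b0, b1, b2 = least_squares(xs, y)
print(b0, b1, b2)
```

If one independent variable were an exact linear function of the others, the normal equations would have no unique solution and the division by the pivot would fail. That is the situation the fifth assumption rules out.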

It is important to be aware of the fact that in a multiple regression it is not possible to know which independent variable predicts which change in the dependent variable. After all, the estimated slope coefficient is affected by the correlations between all independent and dependent variables. This also means that any multiple regression coefficient is dependent on all independent variables in the model. These coefficients are thus referred to as conditional coefficients. This is the case in all multiple regression models unless there are two independent variables with a sample correlation of zero (but this is very unlikely). Because of this ...

Several Aspects of Regression Analysis (13)

13. Several Aspects of Regression Analysis

This chapter focusses on topics that add to the understanding of regression analysis. This includes alternative specifications for these models, and what happens in the situations where basic regression assumptions are violated.

13.1. Developing models

The goal when developing a model is to approximate the complex reality as closely as possible with a relatively simple model, which can then be used to provide insight into reality. It is impossible to represent all of the influences in the real situation in a model; instead, only the most influential variables are selected.

Building a statistical model has 4 stages:

  1. Model Specification: This step involves the selection of the variables (dependent and independent), the algebraic form of the model, and the required data. In order to do this correctly it is important to understand the underlying theory and context for the model. This stage may require serious study and analysis, and it is crucial to the integrity of the model.
  2. Coefficient Estimation: This step involves using the available data to estimate the coefficients and/or parameters in the model. The desired values are dependent on the objective of the model. Roughly there are two goals:
    1. Predicting the mean of the dependent variable: In this case it is desirable to have a small standard error of the estimate, se. The correlations between independent variables need to be steady, and there needs to be a wide spread for these independent variables (as this means that the prediction variance is small).
    2. Estimating one or more coefficients: In this case a number of problems arise, as there is always a trade-off between estimator bias and variance, within which a proper balance must be found. Including an independent variable that is highly correlated with other independent variables decreases bias but increases variance. Excluding the variable decreases variance, but increases bias. This is the case because both these correlations and the spread of the independent variables influence the standard deviation of the slope coefficients, sb.
  3. Model Verification: This step involves checking whether the model is still accurate in its portrayal of reality. This is important because simplifications and assumptions are often made while constructing the model, which can lead to the model becoming (too) inaccurate. It is important to examine the regression assumptions, the model specification, and the selected data. If something is wrong here, we return to step 1.
  4. Interpretation and Inference: This step involves drawing conclusions from the outcomes of the model. Here it is important to remain critical. Inferences drawn from these outcomes can only be accurate if the previous 3 steps have been completed properly. If these outcomes differ from expectations or previous findings you must be critical about whether this is due to the model or whether you really have found something new.

13.2. Further Application of Dummy Variables

Dummy variables were introduced in chapter 12 as a way to include categorical variables in regression analysis. Further uses for these variables will be ...

Predictions with Time-Series Data (16)

16. Predictions with Time-Series Data

Time series data involves measurements that are ordered over time, in which the sequence of observations is important. Most procedures for data analysis cannot be used for this data, as these procedures are based on the assumption that the errors are independent. Thus, different forms of analysis are needed.

The main goal of analysing time-series data is to make predictions. An important assumption here is that the relations between variables remain constant.

16.1. Time-Series Components

Most time-series have the following four components:

  1. Trend component: Values grow or decrease steadily over long periods of time.
  2. Seasonality component: An oscillatory pattern that is specific to each season (quarter of the year) and repeats itself.
  3. Cyclical component: An oscillatory or cyclical pattern that is not related to seasonal behaviour.
  4. Irregular component: No pattern is regular enough to only exist through these predictable trends; each series of data will also have irregular components (similar to the random error term).

Analysis of time-series data involves constructing a formal model in which most of these components are explicitly or implicitly present, in order to describe the behaviour of the data series. In building this model the series components can either be regarded as being fixed over time, or as steadily evolving over time.

16.2. Moving Averages

Moving averages are the basis for many practical adjustment procedures. They can be used to remove the irregular component or to smooth the seasonal component:

  • Removing the irregular component: This is done by replacing each observation with the average of itself and its neighbours. The theory is that this will decrease the effect of the irregular component on each data point.
  • Smoothing the seasonal component: This is done by producing four-period moving averages in such a manner that the seasonal values become one single seasonal moving average. This does mean that the values have shifted in time (in comparison to the original series), but this can be corrected by centring the averages. The specific procedure always depends on the amount of stability the pattern is assumed to have, and whether seasonality is thought to be additive or multiplicative (in the latter case: use logarithms).
    If there is an assumption of a stable seasonal pattern a further seasonal-adjustment approach can be used: the seasonal index method. Here the original series is expressed as a percentage of the centred 4-point moving average series.
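The centring step and the seasonal index can be sketched as follows. This is a minimal illustration for quarterly data in plain Python; the function name and the series are assumptions chosen for the example.

```python
def centred_moving_average(series, period=4):
    """Centred moving average for an even period (4 for quarterly data)."""
    # Step 1: plain moving averages over `period` consecutive values.
    # Each of these sits halfway between two observations in time.
    plain = [sum(series[i:i + period]) / period
             for i in range(len(series) - period + 1)]
    # Step 2: average adjacent pairs so that each result lines up with an
    # actual observation again; centred[0] belongs to series[period // 2].
    return [(a + b) / 2 for a, b in zip(plain, plain[1:])]


# Hypothetical quarterly series covering two years.
quarterly = [10.0, 20.0, 30.0, 40.0, 14.0, 24.0, 34.0, 44.0]
cma = centred_moving_average(quarterly)

# Seasonal index: the original value as a percentage of the centred average.
# The first centred value belongs to the third observation, hence the offset.
seasonal_index = [100 * x / c for x, c in zip(quarterly[2:], cma)]
print(cma)
print(seasonal_index)
```

The centred series here is 25.5, 26.5, 27.5, 28.5: the strong quarterly swings are gone and only the slow upward movement remains. Two observations are lost at each end of the series, which is the price of centring.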

Additionally, moving averages are very suitable for detecting cyclical components and/or trends.

16.3. Predictions using smoothing

There are various prediction methods, and the choice you make should always depend on the resources, the objectives, and the available data.

Simple exponential smoothing is a more basic prediction method that is appropriate when the series is non-seasonal and has no consistent trends. It predicts future values on the basis of an estimate of the current level of the time series. This estimate is comprised of a weighted average of current and past values, where most weight is given to the most recent observations (with decreasing weight ...
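A minimal sketch of simple exponential smoothing, assuming the usual recursion in which the new level is alpha times the latest observation plus (1 – alpha) times the previous level. The smoothing constant `alpha`, the choice to start the level at the first observation, and the data are assumptions made for illustration.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the estimated level after each observation (0 < alpha < 1)."""
    level = series[0]            # assumption: start at the first observation
    levels = [level]
    for x in series[1:]:
        # Weighted average of the newest value and the previous level, so
        # older observations receive geometrically decreasing weight.
        level = alpha * x + (1 - alpha) * level
        levels.append(level)
    return levels


# Hypothetical series without trend or seasonality.
levels = simple_exponential_smoothing([10.0, 12.0, 11.0, 13.0], alpha=0.5)
forecast = levels[-1]            # the last level is the forecast for the future
print(levels, forecast)
```

A larger alpha makes the forecast follow recent observations more closely, while a smaller alpha gives a smoother, slower-moving estimate.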

Sampling (17)

17. Sampling

There are various ways of sampling a population, according to research and analysis goals.

17.1. Stratified Sampling

Stratified sampling involves breaking the population into strata (a.k.a. subgroups) according to a specific identifiable characteristic in such a way that each member of the population belongs to only one stratum. Stratified random sampling is the process of selecting independent simple random samples from each stratum. A question that arises here is how to allocate the sampling effort among the strata. There are various possibilities:

  • Proportional allocation: The proportion of the sample from a stratum is the same as the proportion of that stratum to the population. This is used if there is little to nothing known about the population and there are no strong requirements for the production of information.
  • Optimal allocation: More sample effort is allocated to strata with a higher population variance. This is used if the objective is to estimate an overall population parameter (such as mean, total, or proportion) as precisely as possible. This method is only optimal with this goal in mind.

Analysing the results of stratified random samples is relatively straightforward: any stratum sample mean (mj) can be used as an unbiased estimator of the corresponding stratum population mean (μj). Weighting the stratum sample means by the stratum sizes gives an estimate of the overall population mean. This can also be used to estimate the population total, as this is the product of the population mean and the number of population members.
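The two allocation rules and the weighted estimate can be sketched as follows. This is an illustration in plain Python with invented numbers. The optimal allocation shown is the common form that weights each stratum by its size times its standard deviation, which is an assumption about the exact formula; all function names are hypothetical.

```python
def proportional_allocation(stratum_sizes, n):
    """Sample size per stratum, proportional to the stratum's population share."""
    total = sum(stratum_sizes)
    return [n * size / total for size in stratum_sizes]


def optimal_allocation(stratum_sizes, stratum_sds, n):
    """More sampling effort for strata that are larger or more variable."""
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    return [n * w / sum(weights) for w in weights]


def stratified_mean(stratum_sizes, sample_means):
    """Estimate of the overall mean: stratum means weighted by stratum size."""
    total = sum(stratum_sizes)
    return sum(size * m for size, m in zip(stratum_sizes, sample_means)) / total


# Hypothetical population of 1000 members in three strata.
sizes = [600, 300, 100]
prop = proportional_allocation(sizes, 100)
opt = optimal_allocation(sizes, [1.0, 2.0, 6.0], 100)
mean = stratified_mean(sizes, [10.0, 20.0, 30.0])
total_estimate = mean * sum(sizes)
print(prop, opt, mean, total_estimate)
```

Proportional allocation gives 60, 30 and 10 observations. Under the optimal rule the small but highly variable third stratum receives as much effort as the large first one. In practice the allocations are rounded to whole numbers.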

17.2. Other Ways to Sample

Various other sampling methods are:

  • Cluster Sampling: This method can be used when a population can be subdivided into small geographical units, or clusters. A simple random sample of clusters is then selected, and each member of these clusters is contacted for data. Using this method very little prior information of the population is needed.
  • Two-Phase Sampling: In this method the regular data collection is preceded by a smaller pilot study, in which a smaller sample is used. This costs more time but allows methods and procedures to be improved, and can provide some estimates for the true study.
  • Non-random sampling: There are two main methods:
    • Non-probabilistic sampling: Sample members are selected by convenience. This often means that the sample is not representative of the population and lacks proper statistical validity.
    • Quota sampling: There are specified numbers of people of certain characteristics (race, age, gender etc.) that are contacted. This usually produces quite accurate estimates of population parameters, but it is not possible to determine the reliability of these estimates, because the sample was not randomly chosen.
Bullets Statistics for Business and Economics

12. Multiple Regression

  • Regression objectives are either to predict the value of the dependent variable, or to estimate the marginal effect of each independent variable.
  • A population multiple regression model is a model that includes multiple independent variables.
  • Standard multiple regression assumptions include the four standard simple regression assumptions, plus a fifth one: it is not possible to find a set of numbers, not all zero, such that the corresponding linear combination of the independent variables equals zero for every observation.
  • Multiple regression models include an error term, ε, that represents variability caused by variables not included in the model.
  • In multiple regression coefficients are estimated using least squares, but these estimates become less reliable the higher the correlations between independent variables are.
  • Any regression coefficient in a multiple regression model is dependent on all independent variables; such coefficients are thus referred to as conditional coefficients.
  • Mean square regression (MSR) is the sum of squares regression divided by the number of independent variables. It measures the variability in the dependent variable that is explained by the regression model, per degree of freedom.
  • In a multiple regression model the sum-of-squares (SST; or sample variability) can be split into the sum of squares regression (SSR; or explained variability) and the sum of squares error (SSE; or unexplained variability). This is referred to as sum-of-squares decomposition.
  • The coefficient of determination, R2, describes the strength of the linear relationship between the independent variables and the dependent variable, and is calculated as 1 – SSE/SST.
  • Adding more independent variables leads to a misleading increase in R2, which can be avoided by calculating the adjusted coefficient of determination.
  • The coefficient variance estimator, s2b, is calculated as:
    The square root of s2b is the coefficient standard error.
  • Multiple regression models can be transformed into non-linear models, namely quadratic models and logarithmic models.
  • Dummy variables can be used to represent categorical data in a regression model, and have a value of either 0 or 1.
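The coefficient of determination and its adjusted version from the list above can be illustrated with a short sketch. The adjusted form used here, 1 – [SSE/(n – K – 1)] / [SST/(n – 1)], is the standard definition and is an addition to the bullets; the function names and the data are assumptions.

```python
def sums_of_squares(y, fitted):
    """Return (SST, SSE) for observed values y and fitted values."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return sst, sse


def r_squared(y, fitted):
    sst, sse = sums_of_squares(y, fitted)
    return 1 - sse / sst


def adjusted_r_squared(y, fitted, k):
    """k is the number of independent variables in the model."""
    n = len(y)
    sst, sse = sums_of_squares(y, fitted)
    # Dividing by the degrees of freedom penalises extra variables.
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


# Hypothetical observed and fitted values from a model with one variable.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.2, 1.9, 3.1, 3.8, 5.0]
r2 = r_squared(y, fitted)
adj_r2 = adjusted_r_squared(y, fitted, k=1)
print(r2, adj_r2)
```

Here SST = 10 and SSE = 0.1, so R2 is 0.99 and the adjusted value is slightly lower at about 0.987. The gap widens as more independent variables are added without a matching drop in SSE.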

 

13. Additional Topics in Regression Analysis

  • Models are developed through four steps: model specification (selecting the variables, the algebraic form, and the data), coefficient estimation, model verification (checking whether the model is still accurate), and interpretation and inference.
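  • Dummy variables can be used to represent more than two categories by using multiple dummy variables. The rule is: number of dummy variables = number of categories – 1.
  • In time series data the value of the dependent variable in one period is often related to its value in earlier periods. This can be modelled by including the value from the previous period as an independent variable, which is then referred to as a lagged dependent variable.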
  • Not including important independent variables in a model can make any conclusions drawn from this model faulty.
  • Multicollinearity is the phenomenon of two or more highly correlated independent variables. This leads to misleading estimated coefficients.
  • Correlations between error terms are called auto-correlated errors. This leads to the estimated standard errors for the coefficients being biased, the null hypotheses falsely being rejected, and confidence intervals being too narrow. Autocorrelation can be formally tested with the Durbin-Watson test.
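The rule for dummy variables can be sketched in a few lines. This is a small illustration; the function name, the category labels and the choice of baseline category are assumptions.

```python
def dummy_encode(values, baseline):
    """Encode a categorical variable as (number of categories - 1) 0/1 columns.

    The baseline category gets no column of its own: it is represented by
    all dummy variables being 0.
    """
    categories = sorted(set(values) - {baseline})
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical variable with three categories, so two dummy variables.
columns, rows = dummy_encode(["north", "south", "west", "south"],
                             baseline="north")
print(columns)
print(rows)
```

Three regions produce two dummy columns, `south` and `west`. Adding a third column for `north` would make the columns sum to the intercept column, which creates an exact linear relationship among the independent variables.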

15. Analysis of Variance

  • An Analysis of Variance (ANOVA) can be used to analyze data at more
Practice questions: Statistics for Business and Economics


12. Multiple Regression

1. A study was conducted to assess the influence of various factors on the start of new firms in the agricultural industry. For a sample of 70 countries the following model was estimated:

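$$\hat{y} = -59.31 + \underset{(1.156)}{4.983}\,x_1 + \underset{(0.210)}{2.198}\,x_2 + \underset{(2.063)}{3.816}\,x_3 - \underset{(0.330)}{0.310}\,x_4 - \underset{(3.055)}{0.886}\,x_5 + \underset{(1.568)}{3.215}\,x_6 + \underset{(0.354)}{0.85}\,x_7$$

$R^2 = 0.766$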

where:

ŷ = new business starts in the industry

x1 = population in millions

x2 = industry size

x3 = measure of economic quality of life

x4 = measure of political quality of life

x5 = measure of environmental quality of life

x6 = measure of health and educational quality of life

x7 = measure of social quality of life

The numbers in parentheses under the coefficients are the estimated coefficient standard errors.

a. Interpret the estimated regression coefficients.

b. Interpret the coefficient of determination.

c. Find a 90% confidence interval for the increase in new business starts resulting from a one-unit increase in the economic quality of life, with all other variables unchanged.

d. Test, against a two-sided alternative at the 5% level, the null hypothesis that, all else remaining equal, the environmental quality of life does not influence new business starts.

e. Test, against a two-sided alternative at the 5% level, the null hypothesis that, all else remaining equal, the health and educational quality of life does not influence new business starts.

f. Test the null hypothesis that, taken together, these seven independent variables do not influence new business starts.

2. Based on 25 years of annual data, an attempt was made to explain savings in Japan. The model fitted was as follows:

y = b0 + b1x1 + b2x2 + e

where

y = change in real deposit rate

x1 = change in real per capita income

x2 = change in real interest rate

The least squares parameter estimates (with standard errors in parentheses) were (Ghatak and Deadman 1989) as follows:

b1 = 0.0974 (0.0215)    b2 = 0.374 (0.209)

The adjusted coefficient of determination was as follows:

$\bar{R}^2 = .91$

a. Find and interpret a 99% confidence interval for b1.

b. Test, against the alternative that it is positive, the null hypothesis that b2 is 0.

c. Find the coefficient of determination.

d. Test the null hypothesis that b1 = b2 = 0.

e. Find and interpret the coefficient of multiple correlation.

3. Based on data from 63 countries, the following model was estimated by least squares:

$$\hat{y} = 0.58 - \underset{(.019)}{.052}\,x_1 - \underset{(.042)}{.005}\,x_2 \qquad R^2 = .17$$

where:

ŷ = growth rate in real gross domestic product

x1 = real income per capita ...

Follow the author: Dara Yapp