There are situations and experiments that require processes to be compared at more than two levels. Data from such experiments can be analysed using analysis of variance or ANOVA.
There are other ways to compare population means than ANOVA, but these rest on the assumption of either paired observations or independent random samples, and they can only compare two population means. ANOVA can compare more than two populations, and it explicitly uses assessments of variation, something other methods struggle to handle.
The procedure for testing the equality of population means is called a one-way ANOVA. This procedure is based on the assumption that all included populations have a common variance.
The total sum of squares (SST) in this procedure is made up of a within-group sum of squares (SSW) and a between groups sum of squares (SSG): SST = SSW + SSG
This division of the SST forms the basis of the one-way ANOVA, as it expresses the total variability around the mean for the sample observations.
If the null hypothesis is true (all population means are the same) then both SSW and SSG can be used to estimate the common population variance. This is done by dividing by the appropriate number of degrees of freedom.
Because SSW and SSG both provide an unbiased estimate of the common population variance if the null hypothesis is true, a difference between the two values indicates that the null hypothesis is false. The test of the null hypothesis is thus based on the ratio of mean squares:
F = MSG / MSW, where MSG = SSG / (K – 1) is the between-groups mean square and MSW = SSW / (n – K) is the within-groups mean square, with K the number of groups and n the total number of observations. This test assumes that the population variances are equal and that the population distributions are normal.
The closer the ratio is to 1, the less indication there is that the null hypothesis is false.
These results are also summarized in a one-way ANOVA table, which has the following format:
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F-ratio |
Between groups | SSG | K – 1 | MSG | MSG/MSW |
Within groups | SSW | n – K | MSW | |
Total | SST | n – 1 | | |
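The computations behind this table can be illustrated with a short sketch; the three groups of observations below are made up for the example.

```python
# One-way ANOVA computed by hand (the three groups below are made up).
groups = [
    [6.0, 7.0, 8.0, 9.0],
    [4.0, 5.0, 6.0, 5.0],
    [8.0, 9.0, 10.0, 9.0],
]
K = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total number of observations
grand_mean = sum(x for g in groups for x in g) / n

# Between-groups sum of squares: variation of the group means around the grand mean.
SSG = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-groups sum of squares: variation of observations around their own group mean.
SSW = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
SST = SSG + SSW                      # SST = SSW + SSG

MSG = SSG / (K - 1)                  # between-groups mean square
MSW = SSW / (n - K)                  # within-groups mean square
F = MSG / MSW                        # compare against F(K - 1, n - K)
print(round(F, 3))
```

A ratio far above 1 (here about 16.3) is evidence against the null hypothesis that all population means are equal.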
Simple regression (see chapter 11) can predict a dependent variable as a function of a single independent variable. Often, however, multiple variables are at play. To determine the simultaneous effect of multiple independent variables on a dependent variable, multiple regression is used. The least squares principle is used to fit the model.
As with simple regression, the first step in the model development is model specification, the selection of the model variables and functional form of the model. This is influenced by the model objectives, namely: (1) predicting the dependent variable, and/or (2) estimating the marginal effect of each independent variable. The second objective is hard to achieve, however, in a model with multiple independent variables, because these variables are not only related to the dependent variable but also to each other. This leaves a web of effects that is not easily untangled.
To make multiple regression models more realistic, an error term ε is added, as a way to recognize that none of the described relationships in the model will hold exactly and that there are likely to be variables that affect the dependent variable but are not included in the model.
Multiple regression coefficients are calculated with the least squares procedure. However, again this is more complicated than with simple regression, as the independent variables not only affect the dependent variable but also each other. It is not possible to identify the unique effect of each independent variable on the dependent variable. This means that the higher the correlations between two or more of the independent variables in a model are, the less reliable the estimated regression coefficients are.
There are 5 assumptions to standard multiple regression. The first 4 are the same as those made for simple regression (see chapter 11). The 5th states that it is not possible to find a set of nonzero numbers c1, …, cK such that c1x1i + c2x2i + … + cKxKi = 0 for every observation; in other words, no independent variable may be an exact linear function of the other independent variables. This assumption excludes the cases in which there is an exact linear relationship among the independent variables. In most cases this assumption will not be violated if the model is properly specified.
Whereas in simple regression the least squares procedure finds a line that best represents the set of points in space, multiple regression finds a plane (or, with more than two independent variables, a hyperplane) that best represents these points, as each variable is represented by its own dimension.
It is important to be aware that in a multiple regression it is not possible to know which independent variable predicts which change in the dependent variable. After all, each estimated slope coefficient is affected by the correlations between all independent and dependent variables. This also means that any multiple regression coefficient depends on all the independent variables in the model; these coefficients are therefore referred to as conditional coefficients. This is the case in all multiple regression models unless two independent variables have a sample correlation of exactly zero (which is very unlikely).
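A small simulation can make this conditional behaviour concrete; the data-generating process below (the coefficients, the correlation between the regressors, and the noise levels) is made up for the illustration.

```python
# Illustration of conditional coefficients (all numbers below are made up).
# When x1 and x2 are correlated, the estimated coefficient on x1 changes
# once x2 enters the model.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)

# Simple regression: y on x1 only.
X_simple = np.column_stack([np.ones(n), x1])
b_simple, *_ = np.linalg.lstsq(X_simple, y, rcond=None)

# Multiple regression: y on x1 and x2 together.
X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

print(b_simple[1])   # absorbs part of x2's effect (roughly 2 + 3 * 0.8 = 4.4)
print(b_full[1])     # close to the true partial effect of 2.0
```

The coefficient on x1 is markedly larger in the simple regression, because there it also picks up the effect of the omitted, correlated variable x2.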
This chapter focuses on topics that add to the understanding of regression analysis, including alternative specifications for these models and what happens when the basic regression assumptions are violated.
The goal when developing a model is to approximate complex reality as closely as possible with a relatively simple model, which can then be used to provide insight into reality. It is impossible to represent all of the influences of the real situation in a model; instead, only the most influential variables are selected.
Building a statistical model has 4 stages:
Dummy variables were introduced in chapter 12 as a way to include categorical variables in regression analysis. Further uses for these variables will also be discussed.
It is also possible to calculate a minimum significant difference (MSD) between two sample means, as evidence for concluding whether the population means differ: MSD = Q · sp / √n, where sp = √MSW is the pooled estimate of the common standard deviation, n is the number of observations per group, and Q is a factor obtained from the studentized range distribution.
Time series data involve measurements that are ordered over time, in which the sequence of observations is important. Most procedures for data analysis cannot be used for such data, as they are based on the assumption that the errors are independent. Thus, different forms of analysis are needed.
The main goal of analysing time-series data is to make predictions. An important assumption here is that the relations between variables remain constant.
Most time series have the following four components: a trend component, a seasonal component, a cyclical component, and an irregular (random) component.
Analysis of time-series data involves constructing a formal model in which most of these components are explicitly or implicitly present, in order to describe the behaviour of the data series. In building this model the series components can either be regarded as being fixed over time, or as steadily evolving over time.
Moving averages are the basis for many practical adjustment procedures. They can be used to remove the irregular component or to smooth the seasonal component: a simple centred moving average of order 2m + 1 replaces each observation x(t) by the average of itself and its m neighbours on either side, x*(t) = (x(t−m) + … + x(t+m)) / (2m + 1).
Additionally, moving averages are well suited to detecting cyclical components and/or trends.
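A minimal moving-average smoother as a sketch; the window length of 3 and the series below are made up for the example.

```python
# A simple moving average (the window length and series are made up).
def moving_average(series, window):
    """Average each run of `window` consecutive observations."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

sales = [10, 12, 11, 15, 14, 16, 15]
smoothed = moving_average(sales, 3)
print(smoothed)   # shorter than the original series by window - 1 values
```

The smoothed series fluctuates less than the original, which makes an underlying trend easier to see.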
There are various prediction methods, and the choice should always depend on the resources, the objectives, and the available data.
Simple exponential smoothing is a basic prediction method that is appropriate when the series is non-seasonal and has no consistent trend. It predicts future values on the basis of an estimate of the current level of the time series. This estimate is a weighted average of current and past values, where the most weight is given to the most recent observations and the weights decrease exponentially for older observations.
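A minimal sketch of simple exponential smoothing; the series and the smoothing constant α = 0.3 are made up for the example.

```python
# Simple exponential smoothing (the series and alpha below are made up).
def smooth(series, alpha):
    """Return the final level estimate: a weighted average of current and
    past values with exponentially decreasing weights on older observations.
    The forecast for every future period is this latest level."""
    level = series[0]                    # initialise with the first observation
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

data = [12.0, 13.0, 12.5, 14.0, 13.5, 15.0]
forecast = smooth(data, alpha=0.3)
print(round(forecast, 3))
```

A larger α makes the forecast track recent observations more closely; a smaller α gives a smoother, more slowly adapting level.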
There are various ways of sampling a population, according to research and analysis goals.
Stratified sampling involves breaking the population into strata (subgroups) according to a specific identifiable characteristic, in such a way that each member of the population belongs to exactly one stratum. Stratified random sampling is the process of selecting independent simple random samples from each stratum. A question that arises here is how to allocate the sampling effort among the strata. There are various possibilities:
Analysing the results of stratified random samples is relatively straightforward: any stratum sample mean (mj) can be used as an unbiased estimator of the corresponding stratum population mean (μj). It can also be used to estimate the population total, as this is the product of the population mean and the number of population members.
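A short sketch of these estimators; the stratum sizes and samples below are made up.

```python
# Stratified estimate of a population mean (stratum sizes and samples made up).
strata = {
    "A": {"N": 100, "sample": [4.0, 5.0, 6.0]},
    "B": {"N": 300, "sample": [10.0, 12.0]},
}
N_total = sum(s["N"] for s in strata.values())

# Weight each stratum sample mean by the stratum's share of the population.
est_mean = sum(
    (s["N"] / N_total) * (sum(s["sample"]) / len(s["sample"]))
    for s in strata.values()
)
est_total = N_total * est_mean   # population total = population size x mean
print(est_mean, est_total)
```

Because the larger stratum B receives three quarters of the weight, the estimate lies much closer to B's sample mean than to A's.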
Various other sampling methods are:
13. Additional Topics in Regression Analysis
ŷ = -59.31 + 4.983x1 + 2.198x2 + 3.816x3 - 0.310x4 - 0.886x5 + 3.215x6 + 0.85x7
        (1.156)    (0.210)    (2.063)    (0.330)    (3.055)    (1.568)    (0.354)

R² = 0.766
where:
y = new business starts in the industry
x1 = population in millions
x2 = industry size
x3 = measure of economic quality of life
x4 = measure of political quality of life
x5 = measure of environmental quality of life
x6 = measure of health and educational quality of life
x7 = measure of social quality of life
The numbers in parentheses under the coefficients are the estimated coefficient standard errors.
a. Interpret the estimated regression coefficients.
b. Interpret the coefficient of determination.
c. Find a 90% confidence interval for the increase in new business starts resulting from a one-unit increase in the economic quality of life, with all other variables unchanged.
d. Test, against a two-sided alternative at the 5% level, the null hypothesis that, all else remaining equal, the environmental quality of life does not influence new business starts.
e. Test, against a two-sided alternative at the 5% level, the null hypothesis that, all else remaining equal, the health and educational quality of life does not influence new business starts.
f. Test the null hypothesis that, taken together, these seven independent variables do not influence new business starts.
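As a sketch of part (c): the interval is built from the estimated coefficient on x3 and its standard error, taken here as 3.816 and 2.063. The sample size is not given in this excerpt, so a normal approximation replaces the exact t value (the t interval would be slightly wider).

```python
# Part (c) sketch: 90% confidence interval for the coefficient on x3.
# 3.816 and 2.063 are the coefficient and standard error assumed above;
# a normal approximation stands in for the t value since n is unknown here.
from statistics import NormalDist

b3, se3 = 3.816, 2.063
z = NormalDist().inv_cdf(0.95)     # two-sided 90% leaves 5% in each tail
lower, upper = b3 - z * se3, b3 + z * se3
print(round(lower, 3), round(upper, 3))
```

The interval runs from roughly 0.42 to 7.21 additional business starts per one-unit increase in economic quality of life, holding the other variables fixed.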
y = b0 + b1x1 + b2x2 + e
where
y = change in real deposit rate
x1 = change in real per capita income
x2 = change in real interest rate
The least squares parameter estimates (with standard errors in parentheses) were (Ghatak and Deadman 1989) as follows:
b1 = 0.0974 (0.0215)    b2 = 0.374 (0.209)
The adjusted coefficient of determination was as follows:
adjusted R² = 0.91
a. Find and interpret a 99% confidence interval for b1.
b. Test, against the alternative that it is positive, the null hypothesis that b2 is 0.
c. Find the coefficient of determination.
d. Test the null hypothesis that b1 = b2 = 0.
e. Find and interpret the coefficient of multiple correlation.
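As a sketch of part (b): the test statistic is the estimate divided by its standard error, taken here as 0.374 and 0.209. With the sample size not given in this excerpt, a normal approximation supplies the one-sided 5% critical value (with small n, use the t distribution instead).

```python
# Part (b) sketch: one-sided test of H0: b2 = 0 against H1: b2 > 0.
# 0.374 and 0.209 are the estimate and standard error assumed above.
from statistics import NormalDist

b2, se2 = 0.374, 0.209
t_stat = b2 / se2                     # roughly 1.79
crit = NormalDist().inv_cdf(0.95)     # one-sided 5% critical value, about 1.645
print(t_stat > crit)                  # True: reject H0 at the 5% level
```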
ŷ = 0.58 - 0.052x1 - 0.005x2        R² = 0.17
        (0.019)    (0.042)
where:
y = growth rate in real gross domestic product
x1 = real income per capita