Summary of Multivariate Data Analysis (Hair)

This summary is based on the 2013-2014 academic year.

Chapter 1

The data analyst is referred to as the researcher. Multivariate analysis refers to all statistical techniques that simultaneously analyze multiple measurements on individual objects under investigation.

  • factor analysis: identifies structure underlying a set of variables

  • discriminant analysis: differentiates among groups based on a set of variables

    To be considered multivariate: all variables must be random and interrelated in such ways that their different effects cannot meaningfully be interpreted separately. The multivariate character lies in the multiple variates and not only in the number of variables observed.

Basic concepts

  • The variate: a linear combination of variables with empirically determined weights. Variables are determined by the researcher, the weights by the multivariate technique. The result is a single value representing a combination of the entire set of variables that best achieves the objective of the specific multivariate analysis.

  • Measurement scales: a researcher cannot identify variation unless it can be measured. Data can be divided into 2 types: metric and nonmetric.
    + Nonmetric data: these are qualitative measures. They describe differences in kind by indicating the presence or absence of a characteristic or property. In this category we have:

  1. Nominal scales: these provide the number of occurrences in each class

  2. Ordinal scales: Here we see an order and we can rank the classes, but the distance between the classes is unknown.

+ Metric Data: these are quantitative measures and describe differences in the amount/degree in a particular attribute.
In this category we have interval and ratio scales. These scales are very similar because there are constant units of measurement. The only difference is that interval data have an arbitrary zero point, where ratio scales have an absolute zero point.
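As a small illustration of the variate described under Basic concepts, the sketch below computes the variate value of one observation as a weighted sum of its variables. The weights and the observation are made-up numbers; in a real analysis the weights are estimated by the multivariate technique, not chosen by hand.

```python
def variate_value(weights, values):
    """Return w1*X1 + w2*X2 + ... + wn*Xn for one observation."""
    if len(weights) != len(values):
        raise ValueError("need exactly one weight per variable")
    return sum(w * x for w, x in zip(weights, values))


# Hypothetical weights and one hypothetical observation on three variables.
weights = [0.5, 0.25, -1.0]
observation = [4.0, 8.0, 1.0]
score = variate_value(weights, observation)
print(score)
```

The single number that results is what the technique works with, instead of the three separate variables.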

The impact of choice of measurement scales

Understanding the difference in the different types of measurement scales is important for 2 reasons:

  1. The researcher must identify the measurement scale of each variable used.

  2. It is important for determining which multivariate techniques should be used.

Measurement error is the degree to which the observed values do not represent the true values. There are 2 important characteristics of a measure:

  • validity: Does the measure represent what it is supposed to?

  • reliability: the degree to which the observed variable measures the true value and is free of error

To reduce measurement error, the researcher can use multivariate measurements (summated scales), in which several variables are joined in a composite measure to represent a concept.

 

                            Reality
Statistical decision        No difference           Difference
H0: no difference           1-α                     β (type II error)
H1: difference              α (type I error)        1-β (power)

Power is determined by 3 factors:

  1. The effect size

  2. Alpha (α); the more restrictive alpha, the less power

  3. Sample size; the larger the sample size, the greater the power

The relationship between these variables is shown in table 1-1.
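To make the interplay of the three factors concrete, here is a minimal sketch that approximates power for a two-sided comparison of two group means. It assumes a z test with equal group sizes and a standardized effect size (Cohen's d); these choices and the function name are illustrative assumptions, not the procedure behind table 1-1.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample_z(effect_size, alpha, n_per_group):
    """Approximate power of a two-sided, two-sample z test.

    effect_size is the standardized mean difference (an assumption of
    this sketch); n_per_group is the size of each of the two groups.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region when H1 is true.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


for n in (20, 50, 100):
    print(n, round(power_two_sample_z(0.5, 0.05, n), 3))
```

Varying one argument at a time shows the three statements above: a larger effect size or a larger sample raises power, while a more restrictive alpha lowers it.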

A classification of multivariate techniques, based on 3 judgments

  1. Can the variables be divided into independent/ dependent classifications based on some theory?

  2. If they can, how many variables are treated as dependent in a single analysis?

  3. How are the variables measured?

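Dependence technique: the dependent variable is explained by other variables. The appropriate technique depends on the number of dependent variables and on their type of measurement scale (IV = independent variable):

Type of measurement scale | Single dependent variable | Multiple dependent variables
Metric | Multiple regression, conjoint analysis | Multivariate analysis of variance (if IV is nonmetric), canonical correlation (if IV is metric)
Nonmetric | Multiple discriminant analysis, linear probability | Dummy variable coding -> canonical correlation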

 

Interdependence techniques: variables cannot be classified as either dependent or independent, but all variables are analyzed simultaneously in order to find an underlying structure to the entire set of variables/ subjects.

The structure of the variables should be analyzed -> factor analysis and confirmatory factor analysis
The variables should be grouped to represent structure -> cluster analysis

Types of multivariate techniques:

  1. Principal components & common factor analysis
    To analyze interrelationships among a large number of variables and to explain these variables in terms of their common underlying dimensions. The objective is to find a way of condensing the information contained in a number of original variables into a smaller set of variates with minimum information loss.

  2. Multiple regression & multiple correlation
    The objective is to predict changes in the dependent variable in response to changes in the independent variables.

  3. Multiple discriminant analysis & logistic regression
    Applicable in situations in which the total sample can be divided into groups based on a nonmetric dependent variable characterizing several known classes. The main objective is to understand group differences and predict the likelihood that an entity belongs to a particular group. It might be used to distinguish innovators from non-innovators according to demographic and psychographic profiles. Logistic regression models are a combination of multiple regression and multiple discriminant analysis. These are similar to MRA, but here the dependent variable is non-metric.

  4. Canonical correlation analysis
    The objective is to correlate simultaneously several metric dependent variables with several metric independent variables.

  5. Multivariate analysis of variance & covariance
    This method can be used to explore the relationship between several categorical independent variables and 2 or more metric dependent variables. MANCOVA can be used in conjunction with MANOVA to remove the effect of any uncontrolled metric independent variables.

  6. Conjoint analysis
    This method allows for the evaluation of complex products while maintaining a realistic decision context for the respondent.

  7. Cluster analysis
    This method develops meaningful subgroups of individuals or objects. The objective is to classify a sample of entities into a smaller number of mutually exclusive groups based on similarities among the entities. The groups are not predefined. There are 3 steps:
    (1) Measurement of some form of similarity/association among entities to determine the number of groups
    (2) The actual clustering process
    (3) Profiling the variables to determine the composition of the groups

  8. Perceptual mapping, multidimensional scaling
    This method is used to transform consumer judgments of similarity/preference into distances represented in a multidimensional space.

  9. Correspondence analysis
    This method facilitates perceptual mapping. It involves:

  • contingency tables

  • nonmetric data transformed to metric

  • dimensional reduction

  • perceptual mapping

  10. Structural equation modelling and confirmatory factor analysis

 

SEM allows separate relationships for each set of dependent variables. It has 2 components:

  1. Structural model: the path model, which relates the independent variables to the dependent variables.

  2. Measurement model: this enables the researcher to use several variables for a single independent or dependent variable.

In a confirmatory factor analysis the researcher can assess the contribution of each scale item and also incorporate how well the scale measures the concept.

Guidelines for multivariate analysis and interpretation.

  • Establish practical significance as well as statistical significance. There should be a focus on the practical side: what are the implications?

  • Recognize that sample size affects all results. With a small sample, multivariate analysis may have too little statistical power to identify significant results, or may too easily overfit the data. Very large samples also affect the results: the statistical tests become overly sensitive, so that almost any effect is statistically significant.

  • Know your data. There is a tendency to accept multivariate results without the typical examination one undertakes with univariate analysis.

  • Strive for model parsimony. The researcher must avoid inserting variables indiscriminately and letting the multivariate technique sort out the relevant ones, for 2 reasons:

  1. Irrelevant variables usually increase the ability to fit sample data, but at the expense of overfitting the sample data and making results less generalizable.

  2. Irrelevant variables mask true effects due to multicollinearity. This is the degree to which any variable’s effect can be predicted by the other variables

  • Look at your errors

  • Validate your results

 

The researcher must make sure that there are sufficient observations per estimated parameter to avoid overfitting. Efforts to validate:

  1. Splitting the sample, using one subsample to estimate the model and the other to validate it

  2. Gathering a separate sample

  3. Employing a bootstrapping technique
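The bootstrapping idea from the list above can be sketched as follows: resample the observed data with replacement many times and look at how much an estimate varies across the resamples. The data values, the choice of the mean as the estimate, and the simple percentile interval are all illustrative assumptions.

```python
import random
from statistics import mean


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Means of resamples drawn with replacement from the sample."""
    rng = random.Random(seed)
    n = len(sample)
    return [mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


def percentile_interval(values, lower=0.025, upper=0.975):
    """Crude percentile interval taken from the sorted bootstrap estimates."""
    ordered = sorted(values)
    lo_idx = int(lower * (len(ordered) - 1))
    hi_idx = int(upper * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Hypothetical sample of ten observations.
sample = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 10.9, 11.7, 10.4, 12.3]
boot = bootstrap_means(sample)
low, high = percentile_interval(boot)
print(low, high)
```

A wide interval signals that the estimate depends heavily on which cases happened to be sampled, which is the generalizability concern that validation is meant to address.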

A structured approach to multivariate model building

A six-step approach is presented

  1. Define the research problem, objectives and multivariate technique

  2. Develop an analysis plan

  3. Evaluate the assumptions underlying the multivariate technique

  4. Estimate the multivariate model and assess overall model fit. In estimating the model, the researcher chooses among options either to meet specific characteristics of the data or to maximize the fit to the data

  5. Interpret the variates

  6. Validate the multivariate model

 

These steps can be displayed in a flowchart that consists of 2 sections:

Stages 1-3 -> section 1 -> issues addressed when preparing for actual model estimation

Stages 4-6 -> section 2 -> model estimation, interpretation and validation

 

Chapter 2

Examining your data

Key terms are stated on pages 32-34. The more complex an analysis becomes, the greater the need for understanding the data and the deeper that understanding must be. The starting point for understanding the nature of any variable is to characterize the shape of its distribution. The researcher can often get a perspective on the variable by creating a histogram: a graphical representation of a single variable that shows the frequency of occurrences within data categories. If the histogram is bell shaped, the variable approximates a normal distribution. The histogram can be used to examine any kind of metric variable. If the researcher is interested in the relationship between two variables, a scatterplot may be useful. Each point in the graph represents the joint values of the two variables for one case. The pattern of the points indicates the relationship between the variables. If the points lie close to a straight line, there is a linear relationship or correlation. A curved pattern may indicate a nonlinear relationship, and a seemingly random pattern may indicate no relationship.
The researcher also has to understand the extent and character of differences between groups. Group differences are assessed through univariate analyses such as t-tests and analysis of variance, and through the multivariate techniques of discriminant analysis and multivariate analysis of variance. Another important aspect is identifying outliers, which can be done with a boxplot. The lower and upper quartiles of the data distribution (the 25th and 75th percentiles) form the lower and upper boundaries of the box, so the box contains the middle 50 percent of the data values. The larger the box, the greater the spread. The median is depicted by a solid line within the box. Outliers (observations between 1.0 and 1.5 quartiles away from the box) and extreme values (observations more than 1.5 quartiles away from the box) are displayed outside of the whiskers. Sometimes the researcher needs to compare observations characterized by a multivariate profile. In that case, a number of multivariate displays are available, centering on one of three types of graphs:

  1. Direct portrayal of data values either by (a) glyphs or metroglyphs, which are some form of circle with radii that correspond to a data value, or (b) multivariate profiles, which portray barlike profile for each observation.

  2. Mathematical transformation of the original data into a mathematical relationship, which can be portrayed graphically. Andrews' Fourier transformation is the most common.

  3. Graphical displays with iconic representativeness, the most popular being a face (Chernoff faces).
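The boxplot rule for outliers and extreme values described earlier in this section can be sketched in code. The sketch assumes that "quartiles away from the box" is measured in units of the box length (the interquartile range); that reading, the cut-offs passed as parameters, and the data are assumptions for illustration, and statistical packages may use different cut-offs.

```python
from statistics import quantiles


def classify_boxplot_points(data, inner=1.0, outer=1.5):
    """Label each value as regular, outlier or extreme relative to the box."""
    q1, _median, q3 = quantiles(data, n=4)
    box_length = q3 - q1  # interquartile range
    labelled = []
    for x in data:
        # Distance outside the box; zero for values inside it.
        distance = max(q1 - x, x - q3, 0.0)
        if distance > outer * box_length:
            label = "extreme"
        elif distance >= inner * box_length:
            label = "outlier"
        else:
            label = "regular"
        labelled.append((x, label))
    return labelled


# Hypothetical data with one value far from the rest.
data = [10, 11, 12, 13, 14, 15, 16, 17, 18, 40]
labels = dict(classify_boxplot_points(data))
print(labels[40])
```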

Missing data

Missing data, where valid values on one or more variables are not available for analysis, are a fact of life in multivariate analysis. To identify patterns in the missing data that would characterize the missing data process, the researcher asks questions like:

  1. Are the missing data scattered randomly throughout the observations or are distinct patterns identifiable?

  2. How prevalent are the missing data?
    Both substantive and practical considerations necessitate an examination of the missing data process:
    - the practical impact of missing data is the reduction of the sample size available for analysis.
    - from the substantive perspective any statistical result based on data with a nonrandom missing data process could be biased.
    The concern for missing data processes is similar to the need to understand the causes of non-response in the data collection process.
    A four step process for identifying missing data and applying remedies:

    1. Determine the type of missing data
    Are the missing data part of the research design, or are the causes truly unknown?
    - Ignorable missing data: remedies are not needed, because the missing data process is operating at random. There are three instances in which a researcher most often encounters ignorable missing data:
    (1) Resulting from taking a sample from the population instead of using the entire population.
    (2) Due to a specific design of the data collection process, for example skipping questions that are not applicable.
    (3) When data are censored. Censored data are observations that are not complete because of their stage in the missing data process. A typical example is an analysis of causes of death: people who are still living cannot provide complete information.
    - Non-ignorable missing data: these fall into two categories based on their source, known versus unknown processes:
    (1) Many missing data processes are known to the researcher in that they can be identified due to procedural factors, such as errors in data entry that create invalid codes, disclosure restrictions, failure to complete the entire questionnaire, or morbidity of the respondent. The researcher has little control over these factors.
    (2) Unknown missing data processes are less easily identified and accommodated. These instances are often related directly to the respondent.
    2. Determine the extent of missing data
    The researcher must examine the pattern of missing data and determine the extent of missing data for individual variables, individual cases, and overall. The primary issue is whether the missing data affect the outcomes. If the effect is small, any of the approaches can be chosen. If the effect is larger, we first have to determine the randomness of the missing data process before selecting a remedy.
    The most direct means of assessing the extent of missing data is to tabulate the percentage of variables with missing data for each case and the number of cases with missing data for each variable. The researcher should look for any nonrandom patterns in the data. Finally, the researcher should determine the number of cases with no missing data on any of the variables, which is the sample size available for analysis if remedies are not applied. If the extent is acceptably low and no specific nonrandom patterns appear, the researcher can apply any of the imputation techniques. Before proceeding to the formalized methods, the researcher should consider the simple remedy of deleting offending cases with excessive levels of missing data. This should always be based on both empirical and theoretical considerations (Rules of thumb 2-2).
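The tabulation described in step 2 can be sketched as follows. Missing values are represented by None, and the variable names and data are hypothetical; a statistical package would report the same three summaries.

```python
def missing_data_summary(rows, variables):
    """Tabulate the extent of missing data per case, per variable and overall."""
    pct_missing_per_case = [
        100.0 * sum(row[v] is None for v in variables) / len(variables)
        for row in rows
    ]
    n_missing_per_variable = {
        v: sum(row[v] is None for row in rows) for v in variables
    }
    # Cases with no missing data: the sample size if no remedy is applied.
    n_complete_cases = sum(
        all(row[v] is not None for v in variables) for row in rows
    )
    return pct_missing_per_case, n_missing_per_variable, n_complete_cases


# Hypothetical data set with four cases and three variables.
rows = [
    {"age": 34, "income": 52.0, "satisfaction": 7},
    {"age": None, "income": 48.5, "satisfaction": 6},
    {"age": 29, "income": None, "satisfaction": None},
    {"age": 41, "income": 61.0, "satisfaction": 8},
]
variables = ["age", "income", "satisfaction"]
per_case, per_variable, complete = missing_data_summary(rows, variables)
print(per_case, per_variable, complete)
```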
    3. Diagnose the randomness of the missing data process
    Diagnosing the randomness is necessary to determine the appropriate remedies. There are two levels of randomness. The first, Missing at Random (MAR), requires special methods to accommodate a nonrandom component. The second, Missing Completely at Random (MCAR), is sufficiently random that any type of missing data remedy can be used.

The distinction between these two levels is in the generalizability to the population, as described here:

  • MAR: the missing values of Y depend on X but not on Y. (The observed Y values represent a random sample of the actual Y values for each value of X, but not necessarily a truly random sample of all Y values.)
  • MCAR: the observed values of Y are a truly random sample of all Y values, with no underlying process that causes bias in the observed data.
    As the sample size and the number of variables increase, so does the need for empirical diagnostic tests. These can be run in SPSS (Missing Value Analysis) and generally include one or both of the following:
  • The first assesses the missing data process of a single variable Y by forming two groups: observations with missing data for Y and observations with valid data for Y. Statistical tests are then performed to determine whether the two groups differ significantly on other variables. Significant differences indicate the possibility of a nonrandom missing data process. If the variable being compared is metric, t-tests are performed.
  • The second is an overall test of randomness that determines whether the missing data can be classified as MCAR. The pattern of missing data on all variables is analyzed and compared with the pattern expected for a random missing data process. If no significant difference is found, the data can be classified as MCAR.
  • As a result of these tests, the missing data process is classified as either MAR or MCAR.
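The first diagnostic test (comparing cases with missing and valid data on Y) can be sketched for one metric comparison variable X. The sketch computes a Welch-type t statistic by hand; the variable names and data are hypothetical, and looking up the p-value in the t distribution is left out.

```python
from math import sqrt
from statistics import mean, variance


def missingness_t_statistic(rows, y, x):
    """t statistic comparing x between cases with missing y and valid y."""
    x_missing = [r[x] for r in rows if r[y] is None and r[x] is not None]
    x_valid = [r[x] for r in rows if r[y] is not None and r[x] is not None]
    standard_error = sqrt(
        variance(x_missing) / len(x_missing) + variance(x_valid) / len(x_valid)
    )
    return (mean(x_missing) - mean(x_valid)) / standard_error


# Hypothetical data: income (Y) is missing mainly for older respondents (X = age).
rows = [
    {"age": 30, "income": 41.0},
    {"age": 32, "income": 45.5},
    {"age": 34, "income": 39.0},
    {"age": 36, "income": 50.0},
    {"age": 60, "income": None},
    {"age": 62, "income": None},
    {"age": 64, "income": None},
]
t_value = missingness_t_statistic(rows, y="income", x="age")
print(round(t_value, 2))
```

A large absolute t value, as in this constructed example, suggests that missingness on Y is related to X, so the process is not MCAR.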
     

4. Select the imputation method
Imputation is the process of estimating the missing value based on valid values of other variables and/or cases in the sample. The objective is to employ known relationships that can be identified in the valid values of the sample to assist in estimating the missing values. This should be done carefully because of the potential impact on the outcomes. All of the imputation methods are used for metric variables; nonmetric missing values are usually left missing unless a specific modelling approach is employed.
Imputation of a MAR missing data process. The researcher should apply only one remedy: the specifically designed modeling approach. This set of procedures explicitly incorporates the missing data into the analysis, either through a process specifically designed for missing data estimation or as an integral portion of the standard multivariate analysis. The first approach involves maximum likelihood estimation techniques that attempt to model the process underlying the missing data and to make the most accurate and reasonable estimates possible. One example is the EM approach, an iterative two-stage method: the E stage makes the best possible estimates of the missing data, and the M stage then estimates the parameters assuming the missing data were replaced. Comparable procedures employ structural equation modeling to estimate the missing data. The second approach involves the inclusion of missing data directly into the analysis, defining observations with missing data as a select subset of the sample. This is most appropriate for missing data in the independent variables of a dependence relationship.
When the missing data occur on a nonmetric variable, the researcher can define those observations as a separate group and then include them in any analysis. When the missing data are present on a metric independent variable in a dependence relationship, the observations are incorporated into the analysis while maintaining the relationships among the valid values. The first step is to code all the observations that have missing values with a dummy variable.

Then the missing values are imputed by the mean substitution method. Finally, the relationship is estimated by normal means. The dummy variable represents the difference in the dependent variable between observations with missing data and observations with valid data. The coefficient of the original variable represents the relationship for all cases with non-missing data.

Imputation of a MCAR missing data process. There are two basic approaches. The first approach is imputation using only valid data. This can be done in two ways:
- Complete case approach: include only observations with complete data (LISTWISE method in SPSS). The approach has two disadvantages:
1. It is most affected by any nonrandom missing data processes, because the cases with any missing data are deleted from the analysis.
2. It results in the greatest reduction in sample size, because missing data on any variable eliminates the entire case.
This approach is best suited for instances in which the extent of missing data is small, the sample is sufficiently large to allow for deletion, and the relationships in the data are strong.
- Using all-available data: this method imputes the distribution characteristics or relationships from every valid value (PAIRWISE method in SPSS). It is primarily used to estimate correlations and maximizes the pairwise information available in the sample. Its distinguishing feature is that each characteristic of a variable is based on a potentially unique set of observations. Missing data are not replaced; instead, the correlations obtained from just the valid cases are used as representative of the entire sample. Several problems can still arise:
1. Correlations may be calculated that are out of range and inconsistent with other correlations. Any correlation between X and Y is constrained by their correlations with a third variable Z:
range of $r_{XY} = r_{XZ} r_{YZ} \pm \sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}$
The correlation between X and Y can vary between -1 and 1 only if X and Y have zero correlation with all other variables in the correlation matrix. As the correlations with other variables increase, the range of the correlation between X and Y decreases, which increases the potential for the correlation in a unique set of cases to be inconsistent with correlations derived from other sets of cases. An associated problem is that the eigenvalues of the correlation matrix can become negative, thus altering the variance properties of the correlation matrix.
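A short sketch of how the correlations with a third variable Z limit the possible correlation between X and Y. It uses the standard bound r_XZ * r_YZ plus or minus the square root of (1 - r_XZ^2)(1 - r_YZ^2); the function name and the input correlations are illustrative.

```python
from math import sqrt


def correlation_range(r_xz, r_yz):
    """Lower and upper limit of r_xy given both correlations with Z."""
    center = r_xz * r_yz
    half_width = sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return center - half_width, center + half_width


print(correlation_range(0.0, 0.0))  # no constraint from Z
print(correlation_range(0.6, 0.8))  # much narrower range
```

A pairwise correlation that falls outside this range cannot come from a single consistent set of cases, which is how the inconsistency described above shows up.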

Imputation using replacement values
* Using known replacement values:
The common characteristic of these methods is to identify a known value, most often from a single observation, that is used to replace the missing data.
- Hot or cold deck imputation: in this approach the researcher substitutes a value from another source for the missing values. In the hot deck method the value comes from another observation in the sample that is deemed similar. Cold deck imputation derives the replacement value from an external source. Here the researcher must be sure that the replacement value from an external source is more valid than an internally generated value.
- Case substitution: in this method, entire observations with missing data are replaced by choosing another nonsampled observation.
 

* Calculating replacement values:
The second basic approach involves calculating a replacement value from a set of observations with valid data in the sample.
- Mean substitution: one of the most widely used methods, mean substitution replaces missing values with the mean value for that variable. This approach has several disadvantages. First, it understates variance estimates by using the mean for all missing data. Second, the distribution of the data is distorted. Third, this method depresses the observed correlations because all missing data will have a single constant value. However, this method is easily implemented.
- Regression imputation: in this method regression analysis is used to predict the missing values of a variable based on its relationship with other variables. First, a predictive equation is formed for each variable with missing data. Then replacement values for each missing value are calculated from that observation's values on the variables in the predictive equation. This method also has several disadvantages. First, it reinforces the relationships already in the data. Second, unless stochastic terms are added to the estimated values, the variance is understated. Third, this method assumes that the variable with missing data has substantial correlations with the other variables. Fourth, the sample size must be large enough to allow for a sufficient number of observations to be used in each prediction. Finally, the regression equation is not constrained in the estimates it makes, so predicted values may fall outside the valid range of the variable.
The imputation methods are summarized on page 53.
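The first disadvantage of mean substitution, the understated variance, is easy to see in a small sketch. Missing values are represented by None and the data are hypothetical.

```python
from statistics import mean, variance


def mean_substitute(values):
    """Replace every missing value (None) with the mean of the valid values."""
    valid = [v for v in values if v is not None]
    fill = mean(valid)
    return [fill if v is None else v for v in values]


# Hypothetical variable with two missing values.
observed = [4.0, None, 6.0, 8.0, None, 10.0]
imputed = mean_substitute(observed)

var_valid = variance([v for v in observed if v is not None])
var_imputed = variance(imputed)
print(imputed, var_valid, var_imputed)
```

The imputed cases sit exactly at the mean and add nothing to the sum of squared deviations, while the number of cases grows, so the variance estimate shrinks.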

A recap of the missing value analysis: we can summarize 4 conclusions.

  1. The missing data process is MCAR. Such a finding provides 2 advantages to the researcher. First, it should not involve any hidden impact on the results that need to be considered when interpreting the results. Second, any of the imputation methods can be applied as remedies for the missing data.
  2. Imputation is the most logical course of action. Some form of imputation is needed in order to keep a sufficient sample size for any multivariate analysis.
  3. Imputed correlations differ across techniques. When estimating correlations among the variables in the presence of missing data, the researcher can choose between four different techniques: the complete case method, the all-available information method, the mean substitution method and the EM method. The researcher will, however, obtain different results with different methods.
  4. Multiple methods for replacing the missing data are available and appropriate. The presence of several acceptable methods enables the researcher to combine estimates into a single composite, hopefully mitigating any effects strictly due to one of the methods.

Outliers
Outliers are observations with a unique combination of characteristics identifiable as distinctly different from the other observations. In assessing the impact of outliers, we must consider practical and substantive considerations:

  • From a practical standpoint, outliers can have a marked effect on any type of empirical analysis.
  • In substantive terms, the outlier must be viewed in light of how representative it is of the population

Why do outliers occur? Outliers can be classified into 1 of 4 categories based on the source of their uniqueness:

  1. The first class arises from procedural error, such as a mistake in data entry.
  2. The second class is the result of an extraordinary event, which accounts for the uniqueness of the observation.
  3. The third class of outliers contains extraordinary observations for which the researcher has no explanation.
  4. The last class contains observations that are unique in their combination of values across the variables.

Detecting and handling outliers
Methods of detecting outliers

  • Univariate detection: examines the distribution of observations for each variable in the analysis and selects the outliers as those cases falling at the outer ranges of the distribution.
  • Bivariate detection: Pairs of variables can be assessed jointly through a scatterplot. Cases that fall markedly outside the range of the other observations will be seen as isolated points in the scatterplot. To assist in determining this two-dimensional portrayal, an ellipse representing a bivariate normal distribution’s confidence interval is superimposed over the scatterplot. This ellipse provides a graphical portrayal of the confidence limits and facilitates identification of the outliers. A variant of the scatterplot is the influence plot, with each point varying in size in relation to its influence on the relationship. A drawback of the bivariate method is the potentially large number of scatterplots that can arise.
  • Multivariate detection: when more than 2 variables are considered, the researcher needs a means to objectively measure the multidimensional position of each observation relative to some common point. This issue is addressed by the Mahalanobis D² measure. Higher D² values represent observations located farther away from the general distribution of observations in the multidimensional space. However, this method only provides an overall assessment; it does not show which variables make the observation an outlier.

D²/df, where df is the number of variables involved, is approximately distributed as a t value. Observations with a D²/df value larger than 2.5 in small samples, or exceeding 3 or 4 in large samples, can be designated as possible outliers.
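A minimal sketch of the Mahalanobis measure, restricted to two variables so that the inverse of the covariance matrix can be written out by hand. The data are hypothetical, with the last case deliberately placed away from the pattern of the others; with more variables a matrix library would be the practical choice.

```python
from statistics import mean


def mahalanobis_d2_two_vars(xs, ys):
    """Squared Mahalanobis distance of each case from the centroid (2 variables)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    d2 = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # (dx, dy) S^-1 (dx, dy)^T with the 2x2 inverse written out.
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2


# Hypothetical data: the last case breaks the otherwise linear pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.0, 8.1, 9.0, 2.0]
d2 = mahalanobis_d2_two_vars(xs, ys)
df = 2  # number of variables
flagged = [i for i, value in enumerate(d2) if value / df > 2.5]
print(flagged)
```

Note that the last case is not extreme on either variable by itself; it is only its combination of values that stands out, which is what the multivariate measure picks up.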

Outlier designation. The researcher must select only observations that demonstrate real uniqueness in comparison with the remainder of the population across as many perspectives as possible. The researcher must refrain from designating too many observations as outliers.

Outlier description and profiling. Once the outliers are identified, the researcher should generate a profile of each outlier observation and identify the variables responsible for its being an outlier. Discriminant analysis and multiple regression analysis can be used for this. If possible, the researcher should assign the outlier to one of the 4 categories described before.

Retention or deletion of the outlier. After these steps, the researcher should decide on the retention or deletion of every outlier. They should be retained unless there is proof that they are truly aberrant and not representative of any of the observations in the population.

Testing the assumptions of the multivariate analysis

Some techniques are less affected by violating certain assumptions, which is termed robustness, but in all cases meeting some of the assumptions will be critical. The need to check assumptions is greater in multivariate analysis because of two of its characteristics. First, the complexity of the relationships makes potential distortions and biases more potent. Second, the complexity of the analyses and results may mask the indicators of assumption violations that are apparent in the simpler univariate analyses.

Assessing individual variables versus the variate. Multivariate analysis requires that the assumptions underlying the statistical techniques be tested twice: for the separate variables and for the model variate.

Four important statistical assumptions
Four of the assumptions potentially affect every univariate and multivariate statistical technique.

1. Normality. The most fundamental assumption is normality, referring to the shape of data distribution for an individual metric variable and its correspondence to the normal distribution. If there is no normality, all statistical tests are invalid. Multivariate normality means that all individual variables are normally distributes and that the combinations are also normal. So if a variable is multivariate normal, it is also univariate normal. The reverse is not true.
- Assessing the impact of violating the normality distribution. The severity of non-normality is based on two dimensions; the shape of the offending distribution and the sample size.
* Impacts due to the shape of the distribution. There are two dimensions : Kurtosis, which refers to the peakedness or flatness of the distribution. If there are a lot of peaks this is called leptokurtic, when the distribution is flat this is called patykurtic. The second dimension is Skewness. This is concerning the balance of the distribution. Is it shifted to one side or centered and symmetrical?
* Impacts due to the sample size large samples sizes reduce the detrimental effects of non-normality.
- Graphical analysis of normality. The simplest check for normality is creating a histogram that compares the observed values with a distribution approximating the normal distribution., but this method is problematic for small sample sizes. A more reliable approach is the normal probability plot, which compares the cumulative distribution of actual data values with the cumulative distribution of a normal distribution. In figure 2.6, different normal probability plots are shown.
- Statistical tests of normality. An easy test is a rule of thumb based on the skewness and kurtosis values. The z values for skewness and kurtosis are calculated as:

$$z_{\text{skewness}} = \frac{\text{skewness}}{\sqrt{6/N}}, \qquad z_{\text{kurtosis}} = \frac{\text{kurtosis}}{\sqrt{24/N}}$$

where N is the sample size.

If either calculated z value exceeds the specified critical value (for example ±1.96 at the .05 level or ±2.58 at the .01 level), then the distribution is non-normal in terms of that characteristic. Specific statistical tests for normality are also available in SPSS. These are the Shapiro-Wilk test and a modification of the Kolmogorov-Smirnov test.
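The rule of thumb can be sketched in a few lines of Python. This is only an illustration: the two samples are made up, and skewness and kurtosis are computed from simple moment estimates, which is an assumption (statistical packages such as SPSS apply small-sample corrections, so their values may differ slightly).

```python
import math

def skewness_kurtosis_z(values):
    """Return (z_skewness, z_kurtosis) for the rule-of-thumb normality check.

    Skewness and excess kurtosis use plain moment estimates (an assumption;
    packages may use bias-corrected versions).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3.0  # excess kurtosis: 0 for a normal distribution
    z_skew = skewness / math.sqrt(6.0 / n)
    z_kurt = kurtosis / math.sqrt(24.0 / n)
    return z_skew, z_kurt

# Hypothetical data: a symmetric sample and a strongly right-skewed sample.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9] * 5
skewed = [1] * 40 + [2] * 5 + [10] * 3 + [50] * 2

CRITICAL = 1.96  # two-tailed critical value at the .05 level

for name, sample in [("symmetric", symmetric), ("skewed", skewed)]:
    z_skew, z_kurt = skewness_kurtosis_z(sample)
    violated = abs(z_skew) > CRITICAL or abs(z_kurt) > CRITICAL
    print(name, round(z_skew, 2), round(z_kurt, 2),
          "non-normal" if violated else "no violation detected")
```

Note that a z value inside the critical range does not prove normality; it only means that this rule of thumb did not detect a departure.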
 

2. Homoscedasticity.
This refers to the assumption that dependent variables exhibit equal levels of variance across the range of predictor variables. The variance of the dependent variable values must be relatively equal at each value of the predictor variable. If this is not the case, the relationship is heteroscedastic. The dependent variables should be metric, but the independent variables can be either metric or nonmetric. The two most common sources of heteroscedasticity are:

1. Variable type. Many variables have a natural tendency toward differences in dispersion.
2. Skewed distribution of one or both variables.
* Graphical tests of equal variance dispersion. Homoscedasticity is best examined graphically (figure 2.7). The most common application is multiple regression. Boxplots also work well to show the degree of variation between groups formed by a categorical variable.
* Statistical tests for homoscedasticity. The most common, the Levene test, is used to assess whether the variances of a single metric variable are equal across any number of groups. If more than one metric variable is being tested, Box's M test can be used.
* Remedies for heteroscedasticity. Heteroscedastic variables can be remedied through data transformations similar to those used to achieve normality.
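To make the Levene test more concrete, the sketch below computes the Levene W statistic by hand. It uses the mean-centred version of the test (an assumption; software often offers a median-centred variant as well) and hypothetical groups. W is compared with an F distribution with k-1 and N-k degrees of freedom, where k is the number of groups and N the total sample size.

```python
def levene_statistic(groups):
    """Levene W statistic (mean-centred version) for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each observation from its own group mean.
    abs_dev = []
    for g in groups:
        mean_g = sum(g) / len(g)
        abs_dev.append([abs(v - mean_g) for v in g])
    group_means = [sum(z) / len(z) for z in abs_dev]
    grand_mean = sum(sum(z) for z in abs_dev) / n_total
    between = sum(len(z) * (m - grand_mean) ** 2
                  for z, m in zip(abs_dev, group_means))
    within = sum((v - m) ** 2
                 for z, m in zip(abs_dev, group_means) for v in z)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical groups: a and b have the same spread, c is far more dispersed.
group_a = [10, 11, 12, 13, 14, 15]
group_b = [20, 21, 22, 23, 24, 25]
group_c = [2, 10, 18, 26, 34, 42]

w_equal = levene_statistic([group_a, group_b])    # equal dispersion
w_unequal = levene_statistic([group_a, group_c])  # unequal dispersion
print(w_equal, round(w_unequal, 2))
```

A large W relative to the F critical value suggests that the group variances differ, that is, heteroscedasticity.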

3. Linearity.
* Identifying nonlinear relationships. The most common way is to examine scatterplots of the variables and to identify any nonlinear patterns in the data. An alternative approach is to run a simple regression and to examine the residuals. A third approach is to explicitly model a nonlinear relationship by testing alternative model specifications.
* Remedies for nonlinearity. The most direct approach is to transform one or both variables to achieve linearity. An alternative is the creation of new variables to represent the nonlinear portion of the relationship.

4. Absence of correlated errors
* Identifying correlated errors. Correlated errors arise when an unmeasured factor affects some observations but not others; for example, factors that affect one group may not affect another. If groups are analyzed separately, the effects are constant within each group. But if observations from both groups are combined, this can lead to biased results because an unspecified cause is affecting the estimation of the relationship. Another common source of correlated errors is time series data. To identify correlated errors, the researcher must first identify potential causes. Values for a variable should be grouped or ordered on the suspected variable and then examined for any patterns.
* Remedies for correlated errors. Correlated errors must be corrected by including the omitted causal factor in the multivariate analysis. The most common remedy is the addition of a variable that represents the omitted factor.

Overview of testing for statistical assumptions

  • Data transformations
    Data transformations provide a means of modifying variables for one of two reasons:
    (1) To correct violations of the statistical assumptions underlying the multivariate techniques
    (2) To improve the relationship between variables

  • Transformations to achieve normality and homoscedasticity

For nonnormal distributions, the most common patterns are flat distributions and skewed distributions. For the flat distribution, the most common transformation is the inverse (1/X). Skewed distributions can be transformed by taking the squared or cubed transformation for negative skewness, and the logarithm or square root for positive skewness. For heteroscedasticity: if the cone opens to the right, taking the inverse is the best transformation; if the cone opens to the left, take the square root.

  • Transformation to achieve linearity

Numerous procedures are available for achieving linearity between two variables, but most simple nonlinear relationships can be placed in one of four categories (figure 2.8).
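As a small illustration of the transformations for positive skewness, the sketch below compares the skewness of a made-up, positively skewed variable before and after a square root and a logarithmic transformation. The data and the simple moment-based skewness estimate are assumptions chosen for the example.

```python
import math

def skewness(values):
    """Moment-based skewness (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical positively skewed variable (e.g. income in thousands).
income = [12, 15, 18, 20, 22, 25, 30, 40, 65, 150]

raw_skew = skewness(income)
sqrt_skew = skewness([math.sqrt(v) for v in income])
log_skew = skewness([math.log(v) for v in income])

print(round(raw_skew, 2), round(sqrt_skew, 2), round(log_skew, 2))
```

For these data both transformations pull in the long right tail, and the logarithm does so more strongly than the square root. In practice the researcher would try several transformations and keep the one that works best for the variable at hand.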

Incorporating nonmetric data with dummy variables

In many instances nonmetric data must be used as independent variables. The researcher has available a method for using dichotomous variables, known as dummy variables, which act as replacement variables for the nonmetric variable. Any nonmetric variable with k categories can be represented as k-1 dummy variables. In constructing dummy variables, two approaches can be used to represent the categories and, more importantly, the category that is omitted, known as the reference category or comparison group.

  • The first approach is known as indicator coding.

An important consideration is the reference category, the category that receives all zeros for the dummy variables. The regression coefficients of the dummy variables represent deviations from the comparison group: the differences between the dependent variable mean score of each group and that of the comparison group. This form is most appropriate when there is a logical comparison group.

  • An alternative method is effects coding. It is the same as indicator coding except that the comparison group is given a score of -1 instead of 0 for the dummy variables. The coefficients then represent deviations of each group from the mean of all groups rather than from the comparison group.
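The two coding schemes can be sketched as follows. The variable `regions` and its categories are hypothetical; the functions simply build the k-1 dummy values for each category.

```python
def indicator_coding(categories, reference):
    """Map each category to k-1 dummy values; the reference gets all zeros."""
    others = [c for c in categories if c != reference]
    return {c: [1 if c == o else 0 for o in others] for c in categories}

def effects_coding(categories, reference):
    """Same as indicator coding, but the reference gets -1 on every dummy."""
    others = [c for c in categories if c != reference]
    codes = indicator_coding(categories, reference)
    codes[reference] = [-1] * len(others)
    return codes

# Hypothetical nonmetric variable with k = 3 categories -> 2 dummy variables.
regions = ["north", "south", "west"]
indicator = indicator_coding(regions, reference="west")
effects = effects_coding(regions, reference="west")

print(indicator)
print(effects)
```

Only the row of the reference category differs between the two schemes, which is what changes the interpretation of the coefficients.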

Chapter 3

The key terms are listed on pages 90-92.

What is factor analysis?
Factor analysis is an interdependence technique whose main goal is to determine the underlying structure among the variables in the analysis. Variables are the building blocks of relationships. As we add more variables, more and more overlap will exist between those variables. As the variables become more correlated, the researcher needs ways to manage them. Factor analysis provides the tools for analyzing the structure of the interrelationships among a large number of variables by defining sets of variables that are highly interrelated, called factors. If we have a conceptual basis for understanding the relationships between variables, these dimensions (factors) may have meaning for what they collectively represent.

Many researchers call it only exploratory, useful for searching structure among a set of variables or as a variable reduction method. However, sometimes it can also be used to test hypotheses, as a confirmatory method.

Factor analysis decision process

Stage 1- objectives
The general purpose is to find a way to condense the information contained in a number of original variables into a smaller set of new, composite dimensions or variates with a minimum loss of information. In meeting its objectives there are 4 issues:

  • Specifying the unit of analysis. If the objective is to summarize the characteristics (variables), factor analysis is applied to a correlation matrix of the variables. This is called R factor analysis. Factor analysis may also be applied to a correlation matrix of the individual respondents based on their characteristics. This is called Q factor analysis and combines a large number of people into distinctly different groups based on their characteristics. Cluster analysis can also be used to group people.
  • Achieving data summarization and/or data reduction. The fundamental concept in data summarization is structure. Through structure the researcher can view the set of variables at different levels of generalization, up to the level where groups of variables are viewed for what they represent collectively. In factor analysis all variables are considered simultaneously, regardless of whether they are dependent or independent. It still employs the concept of the variate, not to predict dependent variables but to explain the entire variable set. Structure is defined by the interrelatedness among variables, allowing for the specification of a smaller number of dimensions (factors) representing the original set of variables. Factor analysis can also be used to achieve data reduction, either by identifying representative variables from a much larger set of variables for use in subsequent multivariate analysis or by creating an entirely new set of variables to replace the original set. For data summarization, estimates of the factors and the contributions of each variable to the factors (loadings) are all that is required. Data reduction also relies on the loadings, but uses them either to identify variables for subsequent analysis with other techniques or to make estimates of the factors themselves.
  • Variable selection. The researcher implicitly specifies the dimensions that can be identified through the character and nature of the variables submitted to factor analysis. The researcher also must remember that factor analysis always creates factors. It is a potential candidate for ‘garbage in, garbage out’.
  • Using factor analysis results with other multivariate techniques. Variables determined to be highly correlated and members of the same factor would be expected to have similar profiles across groups in multivariate analysis. Knowledge of the structure of the variables itself would give the researcher a better understanding of reasoning behind the entry of variables in this technique.
     

Stage 2- Designing a factor analysis
The design of a factor analysis involves 3 basic decisions:

  1. Calculation of the input data to meet the specified objectives of grouping variables or respondents. With R type analysis, the researcher uses the traditional correlation matrix as input (correlations among variables). In the Q type analysis this correlation matrix would be based on the correlation between individual respondents.
    What is the difference between Q type analysis and cluster analysis? Q type analysis is based on the intercorrelations among respondents, whereas cluster analysis forms groupings based on a distance-based similarity measure between respondents.
  2. Design of the study in terms of number of variables, measurement properties of variables and the types of allowable variables. Two specific questions must be answered at this point.
    1. What type of variables can be used in factor analysis? Metric variables are the best option, nonmetric variables can cause problems. In case of nonmetric variables, dummy variables should be created.
    2. How many variables should be included? The researcher should attempt to minimize the number of variables included, but still maintain a reasonable number of variables per factor.
  3. The sample size necessary. The researcher would generally not factor analyze a sample of fewer than 50 observations, and preferably the sample size should be 100 or larger. The researcher should try to obtain the highest cases-per-variable ratio to minimize the chances of overfitting.
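The difference between Q type analysis and cluster analysis mentioned under the first decision can be illustrated with a small sketch. The three respondents and their scores are hypothetical. Respondents A and B have the same response pattern at very different levels, whereas A and C have similar levels but opposite patterns.

```python
import math

def pearson(x, y):
    """Pearson correlation between two respondents' score profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Euclidean distance between two respondents' score profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical scores of three respondents on four variables.
resp_a = [7, 6, 7, 6]
resp_b = [2, 1, 2, 1]  # same pattern as A, much lower level
resp_c = [6, 7, 6, 7]  # similar level to A, opposite pattern

print("corr A-B:", pearson(resp_a, resp_b), "dist A-B:", euclidean(resp_a, resp_b))
print("corr A-C:", pearson(resp_a, resp_c), "dist A-C:", euclidean(resp_a, resp_c))
```

A correlation-based approach (Q type) would group A with B, while a distance-based approach (cluster analysis) would group A with C.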
     

Stage 3- assumptions in factor analysis
The assumptions underlying factor analysis are more conceptual than statistical.
- Conceptual issues. A basic assumption is that some structure does exist in the set of selected variables. The presence of correlated variables and the subsequent definition of factors do not guarantee relevance, even if they meet the statistical requirements. The researcher must also make sure that the sample is homogeneous with regard to the underlying factor structure.
- Statistical issues. Some degree of multicollinearity is desirable, because the objective is to identify interrelated sets of variables. The next step is measuring the degree of intercorrelatedness and to check if this is sufficient to produce factors.
- Overall measures of correlation. The researcher must ensure that the data matrix has sufficient correlations to justify the application of factor analysis. If the correlations are low, or if all correlations are equal (which implies no structure exists) the researcher should question the application of factor analysis. Several approaches are available.
* If visual inspection reveals no substantial number of correlations larger than .3, factor analysis is probably inappropriate. The correlations can also be analyzed by computing partial correlations. A partial correlation is the correlation that remains unexplained when the effects of other variables are taken into account. If partial correlations are high, factor analysis is inappropriate. As a rule of thumb, partial correlations above 0.7 are high.
* Another method is the Bartlett test of sphericity, a statistical test for the presence of correlations among the variables.
* A third measure is the measure of sampling adequacy (MSA). It ranges from 0 to 1, reaching 1 when each variable is perfectly predicted without error by the other variables.
- Variable-specific measures of intercorrelation. The researcher should examine the MSA values for each variable and exclude those falling in the unacceptable range (below .50). In deleting variables, the researcher should first delete the variable with the lowest MSA value and then recalculate the factor analysis.
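The idea behind the partial correlation check can be shown for the simplest case of three variables, using the standard first-order partial correlation formula. The correlation values below are hypothetical; with more variables a package would compute the partial correlations from the full correlation matrix.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Case 1 (hypothetical): x and y correlate mainly because both correlate with z.
shared = partial_correlation(r_xy=0.60, r_xz=0.80, r_yz=0.75)

# Case 2 (hypothetical): x and y correlate with each other but hardly with z.
unique = partial_correlation(r_xy=0.75, r_xz=0.10, r_yz=0.10)

print(round(shared, 3), round(unique, 3))
```

In the first case the partial correlation is close to zero, so the variables can be explained by what they share with z, which is the situation factor analysis needs. In the second case the partial correlation stays above .7, which signals that the correlation is not accounted for by the other variable.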
 

Stage 4- deriving factors and assessing overall fit
Once the variables are specified, the researcher is ready to apply factor analysis to identify the underlying structure. In doing so, decisions should be made concerning the method of extracting the factors and the number of factors selected to represent the underlying structure in the data.
- Selecting the factor extraction method.
* Partitioning the variance of a variable. In order to select a method, the researcher should first understand the variance of a variable and how it is divided. When a variable is correlated with another variable, we say it shares variance with the other variable. The amount of sharing is the squared correlation. The total variance of any variable can be divided into 3 types:
1. Common variance: the variance that is shared with all other variables in the analysis. The communality is the estimate of the shared variance.
2. Specific variance : The variance only associated with a specific variable.
3. Error variance: variance due to unreliability in the data-collection process, measurement error, or a random component in the measured phenomenon.
*Common factor analysis versus component analysis
The selection of the method is based on two criteria: The objectives of the factor analysis and the amount of prior knowledge about the variance in the variables.
1. Component analysis is used when the objective is to summarize most of the original information in a minimum number of factors for prediction purposes. It considers the total variance and derives factors that contain small portions of unique variance and error variance. This method is the most appropriate when data reduction is a primary concern and when prior knowledge suggests that specific and error variance represent a relatively small proportion of the total variance.
2. Common factor analysis is used primarily to identify underlying factors or dimensions that reflect what the variables share in common. It considers only the common variance, assuming that both the unique and error variance are not of interest in defining the structure of the variables. This method is most appropriate when the primary objective is to identify the latent dimensions or constructs represented in the original variables and the researcher has little knowledge about the amount of specific and error variance and therefore wishes to eliminate this variance. Common factor analysis is often viewed as more theoretically based. However, there are a few problems.
- Common factor analysis suffers from factor indeterminacy: for any individual respondent, several different factor scores can be calculated from a single factor model result.
- The communalities are sometimes not estimable or may be invalid

Although considerable debate remains over which factor model is the most appropriate, empirical research demonstrates similar results in many cases.
How do we decide on the number of factors to extract? The first factor may be viewed as the single best summary of the linear relationships exhibited in the data. The second factor is defined as the second-best linear combination of the variables, subject to the constraint that it is orthogonal to the first factor. To be orthogonal to the first factor, the second factor must be derived from the variance remaining after the first factor has been extracted. The process continues extracting factors accounting for smaller and smaller amounts of variance until all of the variance is explained. In deciding when to stop factoring, the researcher must combine a conceptual foundation with some empirical evidence.

The researcher generally begins with some predetermined criteria, such as the general number of factors plus some general thresholds of practical relevance. These criteria are combined with empirical measures of the factor structure. The following stopping criteria for the number of factors have been developed:
- Latent root criterion. Any individual factor should account for the variance of at least a single variable if it is to be retained for interpretation. Only factors having latent roots or eigenvalues of more than 1 are considered significant. This method is most reliable with a number of variables between 20 and 50.
- A priori criterion. When applying this method, the researcher already knows how many factors to extract before the factor analysis. The researcher instructs the computer to stop the analysis when the desired number of factors has been extracted.
- Percentage of variance criterion. This approach is based on achieving a specified cumulative percentage of total variance extracted by successive factors.
- Scree test criterion. The scree test is used to identify the optimum number of factors that can be extracted before the amount of unique variance begins to dominate the common variance structure.
- Heterogeneity of the respondents. Shared variance among variables is the basis for both common and component analysis. If the sample is heterogeneous with regard to at least one subset of the variables, then the first factors will represent those variables that are more homogeneous across the entire sample.

Researchers usually use more than one method in determining how many factors to extract. After the factors are interpreted, the practicality is assessed. Negative consequences can arise from selecting either too many, or too few factors to represent the data. With the use of too few factors, the structure is not revealed. With the use of too many factors, the interpretation becomes too complex when the results are rotated.
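Two of the stopping criteria, the latent root criterion and the percentage of variance criterion, can be applied mechanically once the eigenvalues are known. The sketch below uses hypothetical eigenvalues for 8 standardized variables; the 60% and 75% targets are assumptions chosen for illustration, not fixed rules.

```python
# Hypothetical eigenvalues from a component analysis of 8 standardized variables.
# Each standardized variable contributes 1 to the total variance, so the
# eigenvalues sum to the number of variables.
eigenvalues = [3.2, 1.9, 1.1, 0.7, 0.5, 0.3, 0.2, 0.1]

def latent_root_count(eigs, cutoff=1.0):
    """Number of factors with an eigenvalue above the cutoff."""
    return sum(1 for e in eigs if e > cutoff)

def percentage_of_variance_count(eigs, target):
    """Smallest number of factors whose cumulative share reaches the target."""
    total = sum(eigs)
    cumulative = 0.0
    for count, e in enumerate(eigs, start=1):
        cumulative += e
        if cumulative / total >= target:
            return count
    return len(eigs)

print("latent root:", latent_root_count(eigenvalues))
print("60% of variance:", percentage_of_variance_count(eigenvalues, 0.60))
print("75% of variance:", percentage_of_variance_count(eigenvalues, 0.75))
```

The criteria do not have to agree, which is one reason why more than one method is normally consulted before settling on a solution.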

Stage 5- interpreting the factors
A strong conceptual foundation is of great importance in interpreting the factors. To assist in the process of interpreting a factor structure and selecting a final factor solution, three fundamental processes are defined.

  1. Estimate the factor matrix. First, the unrotated factor matrix is computed, containing the factor loadings for each variable on each factor. Factor loadings are the correlations between each variable and the factor.
  2. Factor rotation. Rotation is applied to achieve simpler and theoretically more meaningful factor solutions. Ambiguities are reduced.
  3. Factor interpretation and respecification. As a final process the researcher evaluates the factor loadings for each variable in order to determine that variable's role and contribution in determining the factor structure. The need may arise to respecify the factor model owing to the deletion of a variable from the analysis, the desire to employ a different rotational method, the need to extract a different number of factors or the desire to change from one extraction method to another.

Perhaps the most important tool in interpreting is factor rotation. The reference axes of the factors are turned about the origin until some other position has been reached. The ultimate effect of rotating the factor matrix is to redistribute the variance from earlier factors to later ones to achieve a simple, theoretically more meaningful factor pattern.

The simplest case of rotation is an orthogonal factor rotation, in which the axes are maintained at 90 degrees. When a rotation is not orthogonal, it is called oblique factor rotation. Those different types of rotation are shown in figure 3.7 and 3.8. Oblique rotation represents the clustering of variables more accurately and the oblique solution provides information about the extent to which the factors are actually correlated with each other.
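The effect of an orthogonal rotation can be checked numerically for two factors. In the sketch below the loadings and the 45-degree angle are hypothetical. The rotation turns the reference axes about the origin, so the communality of each variable stays the same while the variance is redistributed between the factors.

```python
import math

def rotate(loadings, degrees):
    """Orthogonally rotate a two-factor loading matrix by the given angle."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [[a * c + b * s, -a * s + b * c] for a, b in loadings]

def communalities(loadings):
    """Sum of squared loadings per variable (row)."""
    return [sum(l ** 2 for l in row) for row in loadings]

def variance_per_factor(loadings):
    """Sum of squared loadings per factor (column)."""
    return [sum(row[j] ** 2 for row in loadings) for j in range(len(loadings[0]))]

# Hypothetical unrotated loadings of four variables on two factors.
unrotated = [[0.70, 0.50], [0.75, 0.45], [0.65, -0.55], [0.60, -0.60]]
rotated = rotate(unrotated, 45)

print([round(h, 3) for h in communalities(unrotated)])
print([round(h, 3) for h in communalities(rotated)])
print([round(v, 3) for v in variance_per_factor(unrotated)])
print([round(v, 3) for v in variance_per_factor(rotated)])
```

In the unrotated solution every variable loads substantially on the first factor; after rotation each variable loads mainly on one factor and the first factor accounts for less variance than before.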
 

Orthogonal Rotation Methods
By simplifying the rows, we mean bringing as many values as possible as close to 0 as possible. Three major approaches have been identified:

  1. Quartimax rotation is used to simplify the rows of a factor matrix. It focuses on rotating the initial factors so that a variable loads high on one factor and as low as possible on all other factors. It has not proved especially successful in producing simpler structures. It tends to produce a general factor as the first factor, on which most of the variables have high loadings.
  2. The Varimax criterion centers on simplifying the columns of the factor matrix. The maximum simplification is reached when there are only 0s and 1s in a column. This method maximizes the sum of the variances of the squared loadings of the factor matrix. Varimax seems to give a clearer separation of the factors.
  3. Equimax approach is a compromise between the Quartimax and Varimax approaches. It tries to accomplish both simplification of rows and simplification of columns. It is, however, used infrequently.
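For two factors the Varimax idea can be sketched with a brute-force search: rotate the loading matrix over a grid of angles and keep the angle with the largest sum of variances of squared loadings. This is only an illustration under stated assumptions: the loadings are hypothetical, the raw criterion is used without Kaiser normalization, and real software uses an iterative algorithm rather than a grid search.

```python
import math

def rotate(loadings, degrees):
    """Orthogonally rotate a two-factor loading matrix by the given angle."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [[a * c + b * s, -a * s + b * c] for a, b in loadings]

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings in each column."""
    p = len(loadings)
    total = 0.0
    for j in range(len(loadings[0])):
        squared = [row[j] ** 2 for row in loadings]
        mean_sq = sum(squared) / p
        total += sum((v - mean_sq) ** 2 for v in squared) / p
    return total

# Hypothetical unrotated loadings of four variables on two factors.
unrotated = [[0.70, 0.50], [0.75, 0.45], [0.65, -0.55], [0.60, -0.60]]

# Grid search over whole degrees; the criterion repeats every 90 degrees.
best_angle = max(range(0, 90), key=lambda d: varimax_criterion(rotate(unrotated, d)))
best = rotate(unrotated, best_angle)

print(best_angle)
for row in best:
    print([round(l, 2) for l in row])
```

After the search each column contains loadings close to either 0 or a large absolute value, which is the column simplification the Varimax criterion aims for.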
     

Oblique Rotation Methods
These are similar to orthogonal methods, except that they allow correlated factors instead of maintaining independence between the rotated factors. Most statistical programs offer only limited options for oblique methods (SPSS: OBLIMIN).
No specific rule has been developed for choosing the right method. In most cases the researcher uses what the computer program offers.

Judging the significance of factor loadings

  1. Ensuring practical significance. A factor loading is the correlation of the variable and the factor, so the squared loading is the amount of the variable's total variance accounted for by the factor. The larger the absolute size of the factor loading, the more important the loading in interpreting the factor matrix. We can assess the loadings as follows:
    - +/- .30 to +/- .40 meet the minimal level for interpretation of the structure.
    - +/- .50 or greater are practically significant
    - exceeding .70 are indicative of well-defined structure and are the goal of any factor analysis.
  2. Assessing statistical significance. Research has demonstrated that factor loadings have larger standard errors than typical correlations. The researcher can use the concept of statistical power in assessing significance for different sample sizes.
  3. Adjustments based on the number of variables. A disadvantage of both approaches is that they do not consider the number of variables being analyzed or the specific factor being examined. As the researcher moves from the first factor to later factors, the acceptable level for a loading to be judged significant should increase.
  4. Interpreting a factor matrix. The researcher must sort through all the factor loadings to identify those most indicative of the underlying structure. Interpreting the complex interrelationships represented in a factor matrix requires a combination of applying objective criteria with managerial judgment.
    The process can be simplified by following a 5-step process:

Step 1: examine the factor matrix of loadings
Typically, the factors are arranged as columns; thus, each column of numbers represents the loadings of a single factor. If an oblique method is used, two matrices of factor loadings are provided. The first is the factor pattern matrix, which has loadings that represent the unique contribution of each variable to the factor. The second is the factor structure matrix, which has simple correlations between variables and factors, but these loadings contain both the unique variance between variables and factors and the correlation among factors.
Step 2: identify the significant loadings for each variable
The interpretation should start with the first variable and move horizontally from left to right, looking for the highest loading for that variable on any factor. When the highest loading is found, it should be underlined if significant. Ideally each variable has only one significant loading (a simple structure), but most factor solutions do not result in a simple structure solution. When a variable has more than one significant loading, it is termed a cross-loading.
Step 3: Assess the communalities of the variables.
Once the significant loadings are identified, the researcher should look for any variables that are not adequately accounted for by the factor solution. The researcher should view the communalities to assess whether the variables meet acceptable levels of explanation.
Step 4: respecify the factor model if needed.
The researcher may find one of the following problems: (a) a variable has no significant loadings, (b) even with a significant loading, the communality is deemed too low, or (c) a variable has a cross-loading. The researcher can apply the following remedies:

  • Ignore those problematic variables and interpret the solution as it is.
  • Evaluate each of these variables for possible deletion.
  • Employ an alternative rotation method.
  • Decrease/ increase the number of factors retained.
  • Modify the type of factor model used.

Step 5: Label the factors
The researcher examines all the significant variables for a particular factor and attempts to assign a name or label that accurately reflects the variables loading on that factor. On each factor, like signs mean the variables are positively related and opposite signs mean they are negatively related.
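Steps 2 to 4 of this process can be sketched as a simple screening of a loading matrix. The variables and loadings below are hypothetical. The .50 loading threshold follows the practical significance guideline given earlier, while the .50 communality cut-off is an assumption used only for this example.

```python
# Hypothetical rotated loading matrix: variable -> loadings on factors 1 and 2.
loadings = {
    "price":    [0.82, 0.10],
    "discount": [0.76, 0.21],
    "service":  [0.15, 0.88],
    "delivery": [0.55, 0.58],  # loads on both factors
    "colour":   [0.20, 0.25],  # loads on neither factor
}

SIGNIFICANT = 0.50      # practical-significance threshold for a loading
MIN_COMMUNALITY = 0.50  # assumed cut-off for an acceptable communality

def diagnose(loads):
    """Return {variable: (significant factors, communality, status)}."""
    report = {}
    for var, row in loads.items():
        significant = [j + 1 for j, l in enumerate(row) if abs(l) >= SIGNIFICANT]
        communality = sum(l ** 2 for l in row)  # sum of squared loadings
        if not significant:
            status = "no significant loading"
        elif len(significant) > 1:
            status = "cross-loading"
        elif communality < MIN_COMMUNALITY:
            status = "low communality"
        else:
            status = "ok"
        report[var] = (significant, round(communality, 3), status)
    return report

report = diagnose(loadings)
for var, result in report.items():
    print(var, result)
```

Variables flagged here are candidates for the remedies listed under step 4; whether to delete them remains a judgment call for the researcher.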

Stage 6- validation of factor analysis

In this stage the degree of generalizability is tested. It is especially relevant for the interdependence methods, because they describe a data structure that should be representative of the population as well.
The most direct method of validating the results is to move to a confirmatory perspective and assess the replicability of the results, either with a split sample in the original data set or with a separate sample. The comparison between two or more factor model results has always been problematic, but several methods exist to make a comparison. Confirmatory factor analysis is one option, but several other options have also been proposed, ranging from a simple matching index to programs designed specifically to assess the correspondence between factor matrices.
Another aspect of generalizability is the stability of the factor model results. Factor stability depends on the sample size and the number of cases per variable. If the sample size permits, the researcher may wish to randomly split the sample into two subsets and estimate the factor model for each subset. Comparison of the two resulting factor matrices will provide an assessment of the robustness of the solution across the sample.
In addition to generalizability, another issue is important to the validation of factor analysis: The detection of influential observations. The researcher is encouraged to estimate the model with and without the observations identified as outliers to assess their impact on the results.

Stage 7- additional uses of factor analysis results
If the objective is to identify appropriate variables for subsequent application with other statistical techniques, some form of data reduction will be employed. The two options include the following.

  • selecting the variable with the highest factor loading as a surrogate representative for a particular factor dimension.
  • replacing the original set of variables with an entirely new, smaller set of variables created either from summated scales or factor scores.

Selecting surrogate variables for subsequent analysis
If the objective is to identify appropriate variables for subsequent analysis with other statistical techniques, the researcher can choose to examine the factor matrix and select the variable with the highest factor loading on each factor to act as a surrogate variable that is representative of that factor. This method has several disadvantages:

  • It does not address the issue of measurement error encountered when using single measures.
  • It also runs the risk of potentially misleading results by selecting only a single variable to represent a perhaps more complex result.

Creating a summated scale
A summated scale creates two benefits:

  1. It provides a means of overcoming to some extent the measurement error. It reduces measurement error by using multiple indicators to reduce the reliance on a single response.
  2. It has the ability to represent the multiple aspects of a concept in a single measure. The summated scale combines the multiple indicators into a single measure representing what is held in common across the set of measures.

Four issues basic to the construction of the summated scale:

  1. Conceptual definition. This specifies the theoretical basis for the summated scale by defining the concept being represented in terms applicable to the research context. Content validity is the assessment of the correspondence between the variables to be included in the summated scale and its conceptual definition. This form of validity, also known as face validity, subjectively assesses the correspondence between the individual items and the concept through ratings by expert judges.
  2. Dimensionality. The items should be unidimensional, meaning that they are strongly associated with each other and represent a single concept. The researcher can assess unidimensionality with either exploratory factor analysis or confirmatory factor analysis.
  3. Reliability. Reliability is an assessment of the degree of consistency between multiple measurements of a variable. One form of reliability is test-retest, by which consistency is measured between the responses for an individual at two points in time. A second measure is internal consistency, which applies to the consistency among the variables in a summated scale. Because no single item is a perfect measure of a concept, we must rely on a series of diagnostic measures to assess internal consistency.
    • The first measures relate to each separate item, such as the item-to-total correlation and the inter-item correlation.
    • The second type of diagnostic measure is the reliability coefficient, Cronbach's alpha.
    • Also available are reliability measures derived from confirmatory factor analysis.
  4. Validity. The extent to which the scale or set of measures accurately represents the subject of interest. The three most widely used forms of validity are:
    • Convergent validity assesses the degree to which two measures of the same concept are correlated.
    • Discriminant validity is the degree to which two conceptually similar concepts are distinct.
    • Nomological validity refers to the degree to which the summated scale makes accurate predictions of other concepts in a theoretically based model.
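The reliability coefficient mentioned under the third issue, Cronbach's alpha, can be computed directly from the item scores. The sketch below uses the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of the total score), on hypothetical responses of six respondents to three items.

```python
def variance(values):
    """Sample variance with n - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cronbach_alpha(items):
    """items: list of item columns, each holding the scores of the same respondents."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_variance = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))

# Hypothetical responses of six respondents to three items of one scale.
item1 = [4, 5, 3, 2, 5, 1]
item2 = [4, 4, 3, 2, 5, 2]
item3 = [5, 5, 2, 3, 4, 1]

alpha = cronbach_alpha([item1, item2, item3])
print(round(alpha, 3))
```

A commonly cited lower limit for an acceptable alpha is .70, which may be relaxed to .60 in exploratory research.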

Calculating summated scales.
The most common approach is to take the average of the items in the scale. Reverse scoring is the process whereby the data values of a negatively worded or negatively loading variable are reversed so that its correlations with the other variables in the factor become positive.
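A minimal sketch of this calculation is given below. The item names, the 1 to 7 response scale and the respondent's answers are hypothetical; the item `complaints` is assumed to be negatively worded and is therefore reverse scored before averaging.

```python
def reverse_score(value, scale_min=1, scale_max=7):
    """Reverse a response on a scale running from scale_min to scale_max."""
    return scale_max + scale_min - value

def summated_scale(responses, reversed_items=(), scale_min=1, scale_max=7):
    """Average the item scores of one respondent after reverse scoring."""
    scores = [
        reverse_score(v, scale_min, scale_max) if name in reversed_items else v
        for name, v in responses.items()
    ]
    return sum(scores) / len(scores)

# Hypothetical respondent; 'complaints' is negatively worded.
respondent = {"quality": 6, "service": 7, "complaints": 2}
score = summated_scale(respondent, reversed_items={"complaints"})
print(round(score, 2))
```

On a 1 to 7 scale a response of 2 becomes 6 after reversal, so all three items point in the same direction before they are averaged.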

Computing factor scores
The third option for creating a smaller set of variables to replace the original set of variables is the computation of factor scores. Factor scores are also composite measures of each factor computed for each subject. The factor score is computed based on the factor loadings of all variables on the factor, whereas the summated scale is calculated by combining only selected variables. The only disadvantage is that they are not easily replicated across studies because they are based on the factor matrix.

Selecting among the three methods.

  • If data are used only in the original sample or orthogonality must be maintained, factor scores are suitable.
  • If generalizability is desired, summated scale or surrogate variables are more appropriate.
  • If a summated scale is untested and exploratory, surrogate variables should be considered.

Chapter 4

The key terms are listed on pages 152-157.
What is multiple regression analysis?

MRA is a statistical technique used to analyze the relationship between a single dependent variable and several independent variables. The objective is to use the independent variables to predict the single dependent value selected by the researcher. The set of weighted independent variables forms the regression variate, a linear combination of the independent variables that best predicts the dependent variable. The regression equation is the most widely known example of a variate among the multivariate techniques. It is required that both dependent and independent variables are metric. Sometimes it is possible to use nonmetric variables, by replacing them with dummy variables.
When there is a single independent variable, the statistical technique is called simple regression. When more than one independent variable is involved, it is called multiple regression.
Simple regression
For researchers, identifying the single independent variable that is the best predictor is the starting point. We can select this variable based on the correlation coefficient. The higher the absolute value of this coefficient, the stronger the relationship. In the regression equation, we represent the intercept as b0 and the regression coefficient as b1. With a method called least squares we can estimate the values of b0 and b1 such that the sum of squared errors of prediction is minimized. The prediction error is called the residual, e.
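
The least squares estimates can be sketched as follows. This is a minimal illustration with made-up data; the function name is hypothetical, and the closed-form expressions (b1 as the ratio of the cross-product sum to the squared deviations of X, b0 from the means) are the standard ones for simple regression.

```python
from statistics import mean


def simple_regression(x, y):
    """Least squares estimates (b0, b1) for Y = b0 + b1*X + e."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # regression coefficient (slope)
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


# Made-up observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = simple_regression(x, y)

# Residual e = observed value minus predicted value.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(b0, b1, residuals)
```

With an intercept in the model, the residuals sum to zero, which is a quick sanity check on any fit.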

Interpretation of the simple regression model:

  • Regression coefficient. If it is found significant, the value of the regression coefficient indicates the extent to which the independent variable is associated with the dependent variable.
  • Intercept. The intercept has explanatory value only within the range of values for the independent variable. Its interpretation is based on the characteristics of the independent variable. The intercept only has interpretive value if zero is a conceptually valid value for the independent variable. If the independent variable represents a measure that can never have a true zero value, the intercept aids in improving the prediction process, but has no explanatory value.

The most commonly used measure of predictive accuracy is the coefficient of determination (R2). It is calculated as the squared correlation between the actual and predicted values of the dependent variable and represents the combined effects of the entire variate in predicting the dependent variable. It ranges from 0 to 1. Another measure is the expected variation in the predicted values, termed the standard error of the estimate (SEe). It allows the researcher to understand the confidence interval that can be expected for any prediction from the regression model. Smaller confidence intervals denote greater predictive accuracy.
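
Both accuracy measures can be sketched in a few lines. The data are made up, the function names are hypothetical, and the standard error of the estimate is computed here with the common formula sqrt(SSE / (n - k - 1)), which is an assumption about the exact definition used.

```python
from math import sqrt


def squared_correlation(actual, predicted):
    """R2 as the squared correlation between actual and predicted values."""
    a_bar = sum(actual) / len(actual)
    p_bar = sum(predicted) / len(predicted)
    cross = sum((a - a_bar) * (p - p_bar) for a, p in zip(actual, predicted))
    ss_a = sum((a - a_bar) ** 2 for a in actual)
    ss_p = sum((p - p_bar) ** 2 for p in predicted)
    return cross ** 2 / (ss_a * ss_p)


def standard_error_of_estimate(actual, predicted, n_independent):
    """SEe = sqrt(sum of squared errors / (n - k - 1))."""
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return sqrt(sse / (len(actual) - n_independent - 1))


# Made-up actual values and the predictions of a simple regression on them.
actual = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = squared_correlation(actual, predicted)
see = standard_error_of_estimate(actual, predicted, n_independent=1)
print(round(r2, 3), round(see, 3))
```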
Multiple regression equation
The impact of multicollinearity. Collinearity is the association, measured as the correlation, between two independent variables. Multicollinearity refers to the correlation among three or more independent variables. Multicollinearity reduces the single independent variable’s predictive power by the extent to which it is associated with the other independent variables.

To maximize prediction from a given number of variables, the researcher should try to look for independent variables with low multicollinearity with the other independent variables. The task for the researcher is to expand upon the simple regression model by adding independent variables that have the greatest additional predictive power. The addition of extra independent variables is based on trade-offs between increased predictive power versus overly complex and even misleading regression models.

There is a six stage decision process for multiple regression analysis.
 

Stage 1- objectives of multiple regression.
The necessary starting point is the research problem. In selecting suitable applications for multiple regression, the researcher must consider three primary issues:
1. The appropriateness of the research problem
The ever-widening applications of MRA fall into two broad categories: prediction and explanation. Prediction involves the extent to which the regression equation can predict the dependent variable. This type of multiple regression fulfils one of two objectives. The first objective is to maximize the overall predictive power of the independent variables as represented in the variate. Multiple regression can also achieve a second objective of comparing two or more sets of independent variables to ascertain the predictive power of each variate. This use of MRA is concerned with the comparison of results across two or more alternative or competing models. Explanation examines the regression coefficients for each individual independent variable and attempts to develop a substantive or theoretical reason for the effects of the independent variables. Interpretation of the variate may rely on any of three perspectives. The most direct interpretation is a determination of the relative importance of each independent variable in the prediction of the dependent measure, because MRA simultaneously assesses the relationships between each independent variable and the dependent measure. In addition, MRA affords the researcher a means of assessing the nature of the relationships between the independent variables and the dependent variable. Finally, MRA provides insights into the relationships among independent variables in their prediction of the dependent measure. These relationships are important because correlation between these variables may make some variables redundant in their predictive efforts.
2. Specification of a statistical relationship
MRA is useful when the researcher wants to obtain a statistical relationship. A statistical relationship is characterized by two elements. First, when multiple observations are collected, more than one value of the dependent variable will usually be observed for any value of an independent variable. Second, based on the use of a random sample, the error in predicting the dependent variable is also assumed to be random. A statistical relationship estimates an average value, whereas a functional relationship calculates an exact value.
3. Selection of the dependent and independent variables
The success of any multivariate technique starts with the selection of variables. The researcher must specify which variables are dependent and which are independent. The researcher should always consider three issues:
- Strong theory: the selection of variables should always be based on theory.
- Measurement error: this refers to the degree to which the variable is an accurate and consistent measure of the concept being studied. Problematic measurement error may be addressed by using summated scales or structural equation modeling.
- Specification error: this is probably the most problematic issue and concerns the inclusion of irrelevant variables or the omission of relevant variables from the set of independent variables. Both types of specification error can have substantial impacts on any regression analysis. The first type can reduce model parsimony, the additional variables may mask the effects of more useful variables, and they may make the testing of statistical significance less precise and reduce the statistical significance. The second type can bias the results and negatively affect any interpretation of the variables.

Stage 2- research design of a Multiple Regression Analysis
MRA can represent a wide range of dependence relationships, in which the researcher incorporates three features.
- Sample size: MRA maintains necessary levels of statistical power and practical/ statistical significance across a broad range of sample sizes.
Power levels in various regression models. In multiple regression, power refers to the probability of detecting as statistically significant a specific level of R2. Sample size plays a role not only in assessing the power of a completed analysis, but also in anticipating the statistical power of a proposed analysis. The researcher can therefore consider the role of sample size in significance testing before collecting the data. Sample size also affects the generalizability of the results through the ratio of observations to independent variables. The ratio should never fall below 5:1. When the ratio drops below this level, there is a risk of overfitting.
Degrees of freedom (df) = Sample size - Number of estimated parameters
= Sample size - (Number of independent variables + 1)

The larger the degrees of freedom, the more generalizable the results. If the number of independent variables is reduced, the degrees of freedom increase.
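
A small sketch of these two checks, with hypothetical function names and an invented study of 50 observations and 5 independent variables:

```python
def regression_df(sample_size, n_independent):
    """Degrees of freedom = sample size - (number of independent variables + 1)."""
    return sample_size - (n_independent + 1)


def observations_per_variable(sample_size, n_independent):
    """Ratio of observations to independent variables."""
    return sample_size / n_independent


def meets_minimum_ratio(sample_size, n_independent, minimum=5):
    """True when the ratio does not fall below the 5:1 minimum from the text."""
    return observations_per_variable(sample_size, n_independent) >= minimum


# Invented study: 50 observations, 5 independent variables.
df = regression_df(50, 5)
ratio = observations_per_variable(50, 5)
print(df, ratio, meets_minimum_ratio(50, 5))
```

Dropping one independent variable from the invented study raises the degrees of freedom from 44 to 45, which mirrors the statement above.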
- Unique elements of the dependence relationship: Independent variables are assumed to be metric and to have a linear relationship with the dependent variable. However, these two assumptions can be relaxed by creating additional variables to represent these special aspects of the relationship.
Metric variables are required. However, sometimes nonmetric data should be incorporated in the analysis. In these cases we can use variable transformations. The purpose here is to provide the researcher with a means to modify the independent or the dependent variable for one of two reasons. Firstly, to improve or modify the relationship between independent and dependent variables. Secondly, to enable the use of nonmetric variables in the regression variate. Data transformations may be based on reasons that are either theoretical or data derived. In either case the researcher must often proceed by trial and error.
If we want to incorporate nonmetric data we can do this by using dummy variables. Each dummy variable represents one category of a nonmetric independent variable, and each nonmetric variable with K categories can be represented by K-1 dummy variables. One of the two forms of dummy variable coding is indicator coding. In this type of coding each category is represented by a 0 or a 1. The regression coefficients for the dummy variables represent differences on the dependent variable for each group of respondents from the reference category. These group differences can be assessed directly, because the coefficients are in the same units as the dependent variable. This form of coding is most appropriate when a logical reference group is present, such as in an experiment.

The second type of coding is effects coding. It is the same as indicator coding, except that the comparison or omitted group is now given a value of -1 instead of 0 for the dummy variables. Now the coefficients represent differences for any group from the mean of all groups rather than from the omitted group.
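
The two coding schemes can be sketched as follows. The function names, the group labels and the choice of "C" as the reference (omitted) group are hypothetical.

```python
def dummy_levels(categories, reference):
    """The K-1 categories that receive their own dummy variable."""
    return [level for level in sorted(set(categories)) if level != reference]


def indicator_coding(categories, reference):
    """Reference group is coded 0 on every dummy variable."""
    levels = dummy_levels(categories, reference)
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, rows


def effects_coding(categories, reference):
    """Same as indicator coding, but the reference group is coded -1."""
    levels = dummy_levels(categories, reference)
    rows = []
    for c in categories:
        if c == reference:
            rows.append([-1] * len(levels))
        else:
            rows.append([1 if c == level else 0 for level in levels])
    return levels, rows


# Hypothetical nonmetric variable with K = 3 categories, so K-1 = 2 dummies.
groups = ["A", "B", "C", "A"]
indicator_levels, indicator_rows = indicator_coding(groups, reference="C")
effects_levels, effects_rows = effects_coding(groups, reference="C")
print(indicator_levels, indicator_rows)
print(effects_levels, effects_rows)
```

Only the rows of the reference group differ between the two schemes, which is what changes the coefficients from "difference from the omitted group" to "difference from the mean of all groups".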
Several types of data transformations are appropriate for linearizing a curvilinear relationship. Direct approaches involve modifying the values through some arithmetic transformation. These methods have several limitations. Firstly, these are applicable only in a simple curvilinear relationship. Secondly, they do not provide any statistical means for assessing whether the curvilinear or linear model is more appropriate. Finally, they accommodate only univariate relationships and not the interaction between variables when more than one independent variable is involved.
Now a means of creating new variables to explicitly model the curvilinear components is discussed. First, the curvilinear effect should be specified. Power transformations of an independent variable that add a nonlinear component for each additional power of the independent variable are known as polynomials. The power of 1 represents a linear relationship, the power of 2 represents a quadratic relationship and this is the first inflection point of a curvilinear relationship. The power of 3 (cubic relationship) adds a second inflection point. The cubic term is usually the highest power used. Multivariate polynomials are created when the regression equation contains two or more independent variables. We must here also add an interaction term (X1X2), which is needed for each variable combination to represent fully the multivariate effects. Now, the curvilinear effect should be interpreted. Multicollinearity can cause problems in assessing the statistical significance of the individual coefficients to the extent that the researcher should assess incremental effects as a measure of any polynomial terms in a three step process:

  1. Estimate the original regression equation
  2. Estimate the curvilinear relationship (original equation plus polynomial term)
  3. Assess the change in R2. If this is a significant change, a curvilinear relationship exists.

Common practice is to start with the linear component and then sequentially add higher order polynomials until the added term is no longer significant. Polynomials also have their limitations. The first problem concerns degrees of freedom: each additional term requires a degree of freedom, which may be restrictive. Also, multicollinearity is introduced by adding extra terms.
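
The three-step process can be sketched with a small ordinary least squares routine. Everything here is illustrative: the data are made up, the helper names are hypothetical, and the F statistic for the change in R2 uses the common incremental form (change in R2 per added term) / ((1 - R2 of the larger model) / (n - k - 1)), which is an assumption about how the change is tested. The resulting value still has to be compared with a critical F value from a table or statistical package.

```python
def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(columns, y):
    """R2 of a least squares fit of y on an intercept plus the given columns."""
    design = [[1.0] + list(row) for row in zip(*columns)]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    coef = solve(xtx, xty)
    y_bar = sum(y) / len(y)
    sse = sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, yi in zip(design, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


# Made-up data with a clearly curved pattern.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 3.9, 9.1, 15.8, 25.3, 35.7, 49.2, 64.1]
x_squared = [xi ** 2 for xi in x]

r2_linear = r_squared([x], y)                # step 1: original equation
r2_quadratic = r_squared([x, x_squared], y)  # step 2: plus polynomial term

# Step 3: F statistic for the change in R2 (one added term, two predictors).
n, k_full = len(y), 2
f_change = (r2_quadratic - r2_linear) / ((1 - r2_quadratic) / (n - k_full - 1))
print(round(r2_linear, 4), round(r2_quadratic, 4), round(f_change, 1))
```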
Now we can represent interaction or moderator effects. The nonlinear relationships discussed before require the creation of an extra variable to represent the changing slope of the relationship over the range of the independent variable. This variable focuses on the relationship between a single independent variable and the dependent variable. If an independent-dependent relationship is affected by another independent variable, this is called a moderator effect. It changes the form of the relationship between an independent variable and the dependent variable, and it is also known as an interaction effect. The moderator term is a compound variable formed by multiplying X1 by the moderator X2, which is entered in the regression equation.
Y= b0+ b1X1 + b2X2+ b3X1X2
Because of multicollinearity effects, an approach similar to testing the significance of polynomial effects is employed. The researcher follows a three-step process:
1. Estimate the original equation
2. Estimate the moderated equation (original plus the moderator variable)
3. Assess the change in R2. If it is significant, then a moderator effect is present.

Now, we can interpret the moderating effects. The interpretation of the regression coefficients changes slightly in the moderated relationship. The moderator effect (b3 coefficient) indicates the unit change in the effect of X1 as X2 changes. The b1 and b2 coefficients now represent the effects of X1 and X2, respectively, when the other independent variable is zero. In the unmoderated regression, the regression coefficients b1 and b2 are averaged across all levels of the other independent variables, whereas in a moderated relationship, they are separate from the other independent variables. To determine the total effect of an independent variable, the separate and moderated effects have to be combined.
btotal=b1+b3X2
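
A minimal sketch of the total-effect formula, using invented coefficient values (b0 = 1.0, b1 = 2.0, b2 = 0.5, b3 = 0.5) purely for illustration:

```python
def total_effect_x1(b1, b3, x2):
    """Total effect of X1 at a given level of the moderator: b1 + b3*X2."""
    return b1 + b3 * x2


def predict(b0, b1, b2, b3, x1, x2):
    """Moderated regression equation Y = b0 + b1X1 + b2X2 + b3X1X2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


# Invented coefficients, not estimates from any real data.
b0, b1, b2, b3 = 1.0, 2.0, 0.5, 0.5

effects = {x2: total_effect_x1(b1, b3, x2) for x2 in (0, 2, 4)}
print(effects)

# A one-unit increase in X1 changes Y by exactly the total effect at that X2.
change = predict(b0, b1, b2, b3, x1=3, x2=4) - predict(b0, b1, b2, b3, x1=2, x2=4)
print(change)
```

At X2 = 0 the total effect equals b1, which matches the interpretation of b1 given above.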
- Nature of the independent variables. MRA accommodates metric independent variables that are assumed to be fixed in nature as well as those with a random component.

Stage 3- Assumptions in Multiple Regression Analysis
The assumptions to be examined are:
1. Linearity of the phenomenon measured
2. Constant variance of the error terms
3. Independence of the error terms
4. Normality of the error term distribution
These assumptions apply both to the individual variables and to the relationship as a whole. This section focuses on examining the variate and its relationship with the dependent variable for meeting the assumptions of multiple regression. Testing the variate alone is not enough; the individual variables should be tested as well, because two questions cannot be answered by testing only the variate:
- Have assumption violations for individual variables caused their relationships to be misrepresented?
- What are the sources and remedies of any assumption violations for the variate?
Methods of diagnosis
The principal measure of prediction error for the variate is the residual, the difference between the observed and predicted values of the dependent variable. Some form of standardization is recommended to make the residuals comparable. The studentized residual is the most widely used; its values correspond to t values. Plotting the residuals versus the predicted values is a basic method of identifying assumption violations. However, there are some considerations:
- The most common plot is residuals (ri) versus the predicted dependent values (Yi). In MRA only the predicted dependent values represent the total effect of the regression variate, so the predicted dependent variable is used.
- Violations of each assumption can be identified by specific patterns of the residuals (see figure 4-5). An important plot is the null plot, the plot of residuals where all assumptions are met. The residuals are falling randomly, with relative equal dispersion about zero and no strong tendency to be either greater or less than zero.
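
As a sketch of how residuals can be standardized, the block below computes internally studentized residuals for a simple regression, dividing each residual by s * sqrt(1 - h), where h is the leverage of the observation. Whether a given statistical package reports this version or the deleted (externally studentized) version is not stated in the source, so treat the choice as an assumption; the data and function name are made up.

```python
from math import sqrt


def studentized_residuals(x, y):
    """Internally studentized residuals for a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    # Leverage of each observation in a simple regression.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e / (s * sqrt(1 - h)) for e, h in zip(residuals, leverage)]


# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
student = studentized_residuals(x, y)
print([round(r, 3) for r in student])
```

Plotting these values against the predicted values gives the residual plot discussed above; in a null plot they scatter randomly around zero.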
1. Linearity of the phenomenon. The linearity of the relationship represents the degree to which the change in the dependent variable is associated with the independent variable. The regression coefficient is constant across the range of values for the independent variable. The concept of correlation is based on a linear relationship, so it is crucial. Corrective action can be taken by one of three options:
- transforming data values
- directly including the nonlinear relationships in the regression model, such as through the creation of polynomial terms
- using specialized methods such as nonlinear regression
How do we know which variables we have to select for corrective action? We use partial regression plots, which show the relationship of a single independent variable to the dependent variable controlling for the effects of all other independent variables. Now the line running through the points in the plots will slope upward or downward, instead of being horizontal.
2. Constant variance of the error term. The presence of unequal variances (heteroscedasticity) is one of the most common assumption violations. Diagnosis is made with residual plots or simple statistical tests. The most common pattern is triangle-shaped in either direction. If more variation is expected in the midrange than in the tails, a diamond shape is expected. SPSS provides the Levene test for testing heteroscedasticity. If heteroscedasticity is present, there are two remedies:
- the procedure of weighted least squares can be employed when the violation can be attributed to a single independent variable through the analysis of residual plots
- easier are a number of variance-stabilizing transformations
3. Independence of the error terms. We can best identify independence by plotting residuals against any possible sequencing variable. The pattern should appear random and similar to the null plot of residuals. Violations are identified by consistent patterns in the plots. (Examples are shown in figure 4-5e and f)
4. Normality of the error term distribution. Perhaps the most frequently encountered assumption violation is nonnormality. The simplest test is creating a histogram of the residuals, but this is more difficult to judge in smaller samples. A better test is a normal probability plot. These differ from residual plots in that the standardized residuals are compared with the normal distribution. The normal distribution makes a straight diagonal line.

Stage 4- Estimating the regression model and assessing the overall model fit
Having specified the objectives of the regression analysis, selected the independent and dependent variables, addressed the issues of research design and assessed the variables for meeting the assumptions of the regression, the researcher now is ready to estimate the regression model and assess the overall predictive accuracy of the independent variables. We must accomplish three basic tasks:
1. Select a method for specifying the regression model to be estimated
There are three approaches to specifying the regression model
- confirmatory specification
The researcher specifies the exact set of variables to be included. The researcher has total control over the variable selection. It is simple in concept, but the researcher is responsible for all trade-offs between more independent variables and greater predictive accuracy versus model parsimony and concise explanation.
- sequential search methods
The general approach is to estimate the regression equation by considering a set of variables and then selectively adding or deleting among them until some overall criterion measure is achieved. There are two types of sequential search methods: (1) stepwise estimation and (2) forward addition and backward elimination.
# Stepwise estimation
This approach enables the researcher to examine the contribution of each independent variable to the regression model. The independent variable with the greatest contribution is added first.

Independent variables are then selected for inclusion based on their incremental contribution over the variables already in the equation. The specific issues at each stage are as follows.
* Start with a simple regression model by selecting the one independent variable that is most highly correlated with the dependent variable (equation Y=b0+b1X1).
* Examine the partial correlation coefficients to find an additional independent variable that explains the largest statistically significant portion of the unexplained variance.
* Recompute the regression equation using the two independent variables, and examine the partial F value for the original variable in the model to see whether it still makes a significant contribution given the presence of the new variable. If it does not, eliminate the original variable.
* Continue this procedure until none of the remaining candidates would contribute significantly.
# Forward addition and backward elimination
These are mainly trial and error processes for finding the best regression estimates. The forward addition model is similar to the stepwise method in that it builds the regression equation starting with a single independent variable, whereas the backward elimination method starts with a regression equation including all the independent variables and then deletes independent variables that do not contribute significantly. The primary distinction is that the stepwise method can delete or add variables in every stage. In the forward addition and backward elimination methods adding and deleting cannot be reversed.
To many researchers, sequential methods seem to be a perfect solution to the dilemma faced in the confirmatory approach by achieving the maximum predictive power with only those variables that contribute in a statistically significant way. Three critical caveats markedly affect the resulting regression equation:
* the multicollinearity among independent variables has substantial impact on the final model specification
* all sequential search methods create a loss of control on the part of the researcher
* Especially in the step-wise procedure, the researcher should employ more conservative thresholds to ensure that the overall error rate across all significance tests is reasonable.
- Combinatorial approach
This is a generalized search process across all possible combinations of independent variables. The best known procedure is all-possible-subsets regression, which is exactly as the name suggests. All possible combinations of independent variables are examined, and the best-fitting set is identified. Usage of this approach has decreased because of criticism of its atheoretical nature and its lack of consideration of such factors as multicollinearity, the identification of outliers and influential observations, and the interpretability of the results.
The most important criterion in selecting an approach, is the researcher’s substantive knowledge of the research context and any theoretical foundation that allows for an objective and informed perspective as to the variables to be included as well as the expected signs and the magnitude of their coefficients. With the independent variables selected and the regression coefficient estimated, the researcher must now assess the estimated model for meeting the assumptions underlying multiple regression.
2. Assess the statistical significance of the overall model in predicting the dependent variable
 

We always expect some variation if we took different random samples.
To test the hypothesis that the amount of variation explained by the regression model is more than the baseline prediction, the F ratio is calculated as:
F ratio = (SSregression / dfregression) / (SSresidual / dfresidual)

dfregression = number of estimated coefficients - 1
dfresidual = sample size - number of estimated coefficients

Three important features of this ratio should be noted:
- Dividing each sum of squares by its degrees of freedom results in an estimate of the variance
- If the ratio of the explained variance to the unexplained is high, the regression variate must be of significant value in explaining the dependent variable. If the F value is statistically significant, we can state that the regression model is not just specific for this sample, but for multiple samples from the population.
- Although larger R2 values result in higher F values, practical significance must be assessed separately from statistical significance.
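
A minimal sketch of the F ratio, built from the degrees of freedom defined above. The sums of squares are made-up values for a simple regression with five observations; judging significance still requires a critical F value from a table or statistical package.

```python
def f_ratio(ss_regression, ss_residual, sample_size, n_coefficients):
    """F = (SSregression / dfregression) / (SSresidual / dfresidual).

    n_coefficients counts all estimated coefficients, including the intercept.
    """
    df_regression = n_coefficients - 1
    df_residual = sample_size - n_coefficients
    mean_square_regression = ss_regression / df_regression
    mean_square_residual = ss_residual / df_residual
    return mean_square_regression / mean_square_residual


# Made-up sums of squares: 5 observations, intercept plus one slope.
f_value = f_ratio(ss_regression=3.6, ss_residual=2.4,
                  sample_size=5, n_coefficients=2)
print(round(f_value, 2))
```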
The addition of an extra variable will always increase the R2 value, or at least never lower it. This creates concerns with generalizability. The impact is most noticeable if the sample size is close to the number of predictor variables. What is needed is a more accurate measure relating the level of overfitting to the R2 achieved by the model. This measure involves an adjustment based on the number of independent variables relative to the sample size. In this way, adding nonsignificant independent variables just to increase R2 can be discounted. This adjusted R2 is part of the output of statistical programs. It is useful in comparing across regression equations involving different numbers of independent variables or different sample sizes because it makes allowances for the degrees of freedom.
Statistical significance testing for the estimated coefficients in regression analysis is appropriate when the analysis is based on a sample. Establishing the significance level denotes the chance the researcher is willing to take of being wrong about whether the estimated coefficient is different from zero. Sampling error is the cause of variation in the estimated regression coefficients for each sample drawn from a population. For small sample sizes, sampling error is larger and the estimated coefficients will most likely vary widely from sample to sample. The standard error is the expected variation of the estimated coefficients due to sampling error. With the significance level and the standard error, we can compute the confidence interval for a regression coefficient. With the confidence interval in hand, the researcher must ask three questions about the statistical significance of any regression coefficient.
- was statistical significance established?
- how does the sample size come into play?
- does it provide practical significance in addition to statistical significance?
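
The adjusted R2 mentioned above can be sketched with the commonly used formula 1 - (1 - R2)(n - 1)/(n - k - 1). That this is the exact formula a given statistical program applies is an assumption, and the R2 values below are invented.

```python
def adjusted_r2(r2, sample_size, n_independent):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (sample_size - 1) / (sample_size - n_independent - 1)


# Invented example: 5 observations, R2 = 0.60 with one independent variable.
one_variable = adjusted_r2(0.60, sample_size=5, n_independent=1)

# Adding a second variable that raises R2 only to 0.61 lowers the adjusted R2.
two_variables = adjusted_r2(0.61, sample_size=5, n_independent=2)
print(round(one_variable, 3), round(two_variables, 3))
```

The comparison shows why the adjusted value is preferred when models with different numbers of independent variables are compared on a small sample.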
3. Determine whether any of the observations exert an undue influence on the results.
We shift our attention here to individual observations that lie outside the general patterns of data and strongly influence the regression results. There are different types of influential observations, these are the three basic types:
- outliers: observations that have large residual values and can be identified only with respect to a specific regression model.
- leverage points: observations that are distinct from the remaining observations based on their values for one or more independent variables.
- influential observations: all observations that have a disproportionate effect on the regression results.
Many times, influential observations are difficult to identify through the traditional analysis of residuals, because their residual patterns do not necessarily stand out the way those of outliers do. General forms of influential observations (figure 4-8):
- reinforcing the general pattern and lowering standard error
- conflicting, contrary to the general pattern
- multiple influential points may work towards the same results
- shifting, influential observations may affect all of the results in a similar manner
Influentials, outliers and leverage points are all based on one of four conditions, each of which has a specific course of corrective action:
- an error in observation or data entry: remedy by correcting the data or deleting the case
- a valid but exceptional observation that is explainable by an extraordinary situation: remedy by deletion of the case unless the variables reflecting the extraordinary situation are included in the regression equation
- an exceptional observation with no likely explanation: remedy by analyzing with and without the observation
- an ordinary observation in its individual characteristics but exceptional in its combination of characteristics: remedy by modifying the conceptual basis

Stage 5- interpreting the regression variate
The next task is to interpret the regression variate by evaluating the estimated regression coefficients for their explanation of the dependent variable. Not only the regression model should be evaluated, but also the potential independent variables that were omitted if a sequential search or combinatorial method was employed.
The estimated regression coefficients, termed the b coefficients, represent both the type of relationship and the strength of the relationship between independent and dependent variables. Prediction is an integral element in the regression analysis, both in the estimation process and in forecasting situations.
First, in the ordinary least squares estimation procedure, used to derive the regression variate, a prediction of the dependent variable is made for each observation in the data set. A single predicted value is computed. As such, the predicted value represents the total of all effects of the regression model and allows the residual to be used extensively as a diagnostic measure for the overall regression model.
The real benefit of prediction comes in forecasting applications.
Many times the researcher is interested in more than just prediction. It is important for a regression model to have accurate predictions to support its validity, but many research questions are more focused on assessing the nature and impact of each independent variable in making the prediction of the dependent variable. For explanatory purposes, the regression coefficients become indicators of the relative impact and importance of the independent variables in their relationship with the dependent variable. In many instances, the coefficients do not give us this information directly. We must ensure that all of the independent variables are on comparable scales. Differences in variability can also distort the analysis. What can help us is the beta coefficient. Standardization converts variables to a common scale and variability. Multiple regression gives us not only the regression coefficients, but also the coefficients resulting from the analysis of standardized data, called beta (β) coefficients. They eliminate the problem of dealing with different units of measurement. Now we can determine which variable has the most impact.

Two cautions must be taken into consideration when dealing with beta coefficients:
- They should be used as a guide to the relative importance of individual independent variables only when collinearity is minimal.
- The beta values can be interpreted only in the context of the other variables in the equation.
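
A beta coefficient can be obtained from an unstandardized b coefficient by rescaling it with the standard deviations of the two variables, beta = b * (sX / sY). The sketch below uses made-up data and a hypothetical function name; the value b = 0.6 is the least squares slope for these particular data.

```python
from statistics import stdev


def beta_coefficient(b, x, y):
    """Standardized coefficient: beta = b * (std. dev. of X / std. dev. of Y)."""
    return b * stdev(x) / stdev(y)


# Made-up data; 0.6 is the least squares slope of y on x for these values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
beta = beta_coefficient(0.6, x, y)
print(round(beta, 4))
```

In a simple regression the beta coefficient equals the correlation between X and Y, which is a convenient check on the calculation.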
A key issue in interpreting the regression variate is the correlation among independent variables. Some degree of multicollinearity is unavoidable. The researcher’s task is the following:
- assess the degree of multicollinearity
The easiest way is an examination of the correlation matrix for the independent variables. We need a measure expressing the degree to which each independent variable is explained by other independent variables. The 2 most common measures for assessing both pairwise and multiple variable collinearity are tolerance and the variance inflation factor.
Tolerance is defined as the amount of variability of the selected independent variable not explained by the other independent variables. Tolerance can be defined simply in two steps:
1. Take each independent variable and calculate R2* (the amount of the independent variable that is explained by other independent variables)
2. Tolerance is calculated as 1-R2*
If tolerance is high, multicollinearity is low.
The variance inflation factor (VIF) is calculated as the inverse of the tolerance value. High VIF values mean high multicollinearity.
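The two steps for tolerance, and the VIF derived from it, can be sketched as below. This is an illustrative sketch assuming NumPy; the function name `tolerance_and_vif` and the simulated variables are hypothetical.

```python
import numpy as np

def tolerance_and_vif(X):
    """Return (tolerance, vif) arrays, one entry per column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    tolerance = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        # Step 1: R2* from regressing variable j on the other independent variables.
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2_star = 1.0 - ss_res / ss_tot
        # Step 2: tolerance = 1 - R2*.
        tolerance[j] = 1.0 - r2_star
    # VIF is the inverse of tolerance.
    return tolerance, 1.0 / tolerance

# Hypothetical data: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)
x3 = rng.normal(size=300)

tol, vif = tolerance_and_vif(np.column_stack([x1, x2, x3]))
print("tolerance:", tol)
print("VIF:      ", vif)
```

The nearly duplicated pair `x1` and `x2` should show low tolerance and high VIF, while `x3` should have a tolerance close to 1.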
- determine its impact on the results
The impact of multicollinearity can be categorized in terms of estimation or explanation. Multicollinearity always creates shared variance between the independent variables and thus decreases the ability to predict the dependent variable and to ascertain the relative role of each independent variable.
Estimation. Firstly, in the extreme case of multicollinearity, singularity, the estimation of any coefficients is prevented. Secondly, as multicollinearity increases, the ability to demonstrate that the estimated regression coefficients are significantly different from zero becomes markedly impaired, because the standard errors increase, as reflected in the VIF value. This is most problematic with small sample sizes. Finally, high degrees of multicollinearity can also result in regression coefficients being incorrectly estimated and even having the wrong signs. In some instances, this reversal of signs is desired; that is called a suppression effect. In other instances, signs are reversed because of multicollinearity. In these cases, the researcher may need to revert to using bivariate correlations to describe the relationship, rather than the estimated coefficients that are affected by multicollinearity.
Explanation. As multicollinearity occurs, identifying the effects of each independent variable on the dependent variable becomes increasingly difficult.
Each researcher must determine what degree of multicollinearity is too much. Some suggested guidelines have been developed and can be found on page 200.
- apply remedies if needed
The researcher can apply a number of remedies:
1. Omit one or more highly correlated independent variables and identify other independent variables to help the prediction.
2. Use the model with the highly correlated independent variables for prediction only, while acknowledging the lowered level of overall predictive accuracy.

3. Use the simple correlations between each independent variable and the dependent variable to understand the independent-dependent variable relationship.
4. Use a more sophisticated method of analysis, such as Bayesian regression.

Stage 6- Validation of the results
The final step is to ensure that the model represents the general population and is appropriate for the situations in which it will be used.
The most appropriate validation approach is to test the regression model on a new sample drawn from the general population. A new sample ensures representativeness and can be used in several ways. The original model can predict values in the new sample, and the predictive fit can then be calculated. Alternatively, a separate model can be estimated with the new sample and compared with the original equation. Many times, however, the ability to collect new data is limited by factors such as cost, time or availability of respondents. Then the researcher can use a split sample. Another alternative is calculating the PRESS statistic, a measure similar to R2 used to assess predictive accuracy. One observation is omitted, the regression model is estimated on the remaining N-1 observations, and the omitted observation is then predicted with the estimated model. The procedure is repeated for every observation, so that N models are estimated, and the squared prediction residuals are summed to provide an overall measure of predictive fit.
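The leave-one-out procedure behind the PRESS statistic can be sketched as follows. This is a minimal illustration assuming NumPy and ordinary least squares; the function name `press_statistic` and the simulated data are hypothetical. Expressing PRESS on an R2-like scale as 1 - PRESS / SS_total is a common convention, not something stated in this summary.

```python
import numpy as np

def press_statistic(X, y):
    """Sum of squared leave-one-out prediction errors for an OLS model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    press = 0.0
    for i in range(len(y)):
        # Estimate the model without observation i ...
        keep = np.arange(len(y)) != i
        coefs = np.linalg.lstsq(design[keep], y[keep], rcond=None)[0]
        # ... then predict observation i and accumulate its squared error.
        press += (y[i] - design[i] @ coefs) ** 2
    return press

# Hypothetical data with two independent variables.
rng = np.random.default_rng(2)
x = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(scale=0.5, size=50)

press = press_statistic(x, y)
ss_total = np.sum((y - y.mean()) ** 2)
print("PRESS:", press)
print("1 - PRESS / SS_total:", 1.0 - press / ss_total)
```

Because each observation is predicted by a model that never saw it, PRESS is larger than the ordinary residual sum of squares from the full-sample fit.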
When comparing regression models, the most common standard used is overall predictive fit. R2 gives this information, but has one disadvantage: as more variables are added, R2 never decreases. Therefore, we use the adjusted R2, which compensates for differing numbers of independent variables and differing sample sizes.
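The usual form of the adjustment is 1 - (1 - R2)(n - 1)/(n - k - 1), with n observations and k independent variables. The sketch below applies it to two made-up models; the function name `adjusted_r2` and the example numbers are hypothetical.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R2 for a model with an intercept and n_predictors variables."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof

# Hypothetical comparison on the same 50 observations:
# a 3-variable model with R2 = 0.50 versus a 10-variable model with R2 = 0.51.
small_model = adjusted_r2(0.50, n_obs=50, n_predictors=3)
large_model = adjusted_r2(0.51, n_obs=50, n_predictors=10)
print("3 predictors: ", small_model)
print("10 predictors:", large_model)
```

Although the larger model has the slightly higher R2, its adjusted R2 is lower, because the seven extra variables add almost no explanatory power.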
In forecasting, we must consider several factors that can have a serious impact on the quality of the new predictions:
1. When applying the model to a new sample, we must remember that the predictions now carry not only the sampling variation from the original sample, but also that of the newly drawn sample. So we should calculate confidence intervals for our predictions in addition to the point estimates.
2. We must make sure that the conditions have not changed after our original measurements.
3. We must not use the model beyond the range of independent variables found in the sample. 
