Article summary of Principal component analysis by De Heus & Van der Leeden - Chapter
- What is principal component analysis (PCA) and what is its purpose?
- What does PCA do?
- How can we explain variance with PCA?
- What do we do before we do PCA?
- How many components does PCA use and how can we decide this?
- How does rotation and interpretation work?
- What are the most important SPSS settings needed?
- What are the assumptions of PCA?
- What is the difference between PCA and factor analysis (FA)?
What is principal component analysis (PCA) and what is its purpose?
Principal component analysis (PCA) is a multivariate data analysis technique used to analyze a large number of variables simultaneously. The analysis is concerned with the structure of the relationships between the variables. The main goal is data reduction. PCA is often confused with factor analysis (FA); the distinction is discussed later.
The purpose of PCA
The aim is to get an idea of the most important associations between the items of, for example, a questionnaire. In this way, subgroups of items are obtained whose members correlate strongly with each other. These subgroups are known as components or factors. PCA is also important in constructing questionnaires. If the items in a subgroup correspond with each other, we can combine them into a scale or questionnaire.
What does PCA do?
In particular, PCA is a tool for data reduction in which researchers try to lose as little information about the variables as possible. The information used in PCA is the association between the observed variables, expressed in correlations or in variances and covariances. After these correlations are computed, the most important associations are combined into components. We can show this in different ways. The first way is geometrical. Here, two correlated variables are represented by two components that are not correlated. Each component is thus a new axis, and a person's position on it is built up from that person's two observed scores. In practice, PCA is much more interesting with more variables, but then we can no longer show it geometrically. This is why we also describe PCA algebraically. In diagrams of this kind, squares represent the observed variables (Xi) and circles represent the components (Fj). The weights that link variables and components are written aij; they are also known as component loadings. For example, with two components and five observed variables, the equations for the two components (F1 and F2) look like this:
F1 = a11X1 + a21X2 + a31X3 + a41X4 + a51X5
F2 = a12X1 + a22X2 + a32X3 + a42X4 + a52X5
These equations show that the components Fj are calculated as linear combinations of the observed variables Xi. In the weights aij, the i refers to the observed variable and the j to the component. The weights are chosen so that F1 resembles all X variables together as closely as possible. Components can themselves be treated as variables and used in other data analysis techniques. Component scores can thus be used to examine individual differences.
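As a minimal sketch of these equations, the Python snippet below computes component scores as linear combinations of standardized variables. It assumes NumPy and made-up random data, and it takes the weights from the eigenvectors of the correlation matrix. These unit-length weights are proportional to the aij above; the exact scaling differs between texts and software. The function name pca_components is hypothetical.

```python
import numpy as np

def pca_components(X):
    """Return eigenvalues, unit-length weights and component scores."""
    # Standardize each observed variable (mean 0, variance 1).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the observed variables.
    R = np.corrcoef(Z, rowvar=False)
    # Eigen-decomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Each column of scores is one component F_j = sum_i w_ij * Z_i.
    scores = Z @ eigvecs
    return eigvals, eigvecs, scores

# Made-up data: five variables that share one common source of variance.
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
X = common + 0.5 * rng.normal(size=(200, 5))

eigvals, weights, scores = pca_components(X)
print(np.round(eigvals, 2))
```

The first column of scores belongs to the component that explains the most variance. The columns are uncorrelated with each other and can be passed to other analyses as ordinary variables.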
How can we explain variance with PCA?
In PCA, the correlation between the variables and components is maximized. Variables and components share common variance, and because the correlation is maximized, this shared variance is maximized as well. As a result, we can determine how much of the variance of a given variable is explained by a component, and how much variance all components together explain. In PCA the observed variables are usually standardized.
The component loadings aij are equal to the Pearson correlations between Xi and Fj. Squaring a loading gives the proportion of variance of Xi that is explained by Fj. The amount of variance in the whole set of observed variables that a component explains is known as its eigenvalue. The eigenvalue is equal to the sum of the squared loadings of all p X variables on that component: $\lambda_j = \sum_{i=1}^{p} a_{ij}^2$. The eigenvalue of the first component is always the largest, followed by that of the second component, and so on. Dividing the eigenvalue by the number of variables gives the proportion of explained variance.
The amount of variance of variable Xi that is explained by the retained components together is known as the communality of that variable, written h². Summing the squared loadings of variable Xi over the k components gives its communality: $h_i^2 = \sum_{j=1}^{k} a_{ij}^2$. Communality is also used as a measure of fit: how well a variable fits into a factor solution.
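The relations between loadings, eigenvalues and communalities can be checked numerically. The sketch below assumes NumPy and made-up data with two underlying dimensions. The loadings aij are computed as eigenvectors of the correlation matrix scaled by the square root of the eigenvalue, which is one common convention and an assumption here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 300, 5, 2          # persons, observed variables, retained components

# Made-up data: variables 1-3 share one dimension, variables 4-5 another.
g = rng.normal(size=(n, 2))
X = np.column_stack([g[:, 0], g[:, 0], g[:, 0], g[:, 1], g[:, 1]])
X = X + 0.7 * rng.normal(size=(n, p))

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables
R = np.corrcoef(Z, rowvar=False)
lam, V = np.linalg.eigh(R)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

A = V[:, :k] * np.sqrt(lam[:k])    # loadings a_ij (p rows, k columns)
F = Z @ V[:, :k]                   # component scores

# Loadings equal the Pearson correlations between X_i and F_j.
corr_XF = np.corrcoef(np.hstack([Z, F]), rowvar=False)[:p, p:]

eig_from_loadings = (A ** 2).sum(axis=0)   # column sums -> eigenvalues
communalities = (A ** 2).sum(axis=1)       # row sums -> h_i^2
explained = lam[:k] / p                    # proportion of explained variance

print(np.round(A, 2))
print(np.round(communalities, 2), np.round(explained, 2))
```

Column sums of the squared loadings reproduce the eigenvalues, and row sums give the communality of each variable.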
If there are as many components as there are variables, 100% of the variance is explained by the components. This is known as full dimensionality. Full dimensionality means, however, that there is no data reduction. We also have a rotation problem: the specific coordinate system does not matter for the solution, meaning there are an infinite number of ways to describe the same solution.
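Both points can be illustrated with a small sketch. It assumes NumPy and a hypothetical 3 x 3 correlation matrix: with three components every communality equals 1, and rotating the loadings with any orthogonal matrix T reproduces exactly the same correlations.

```python
import numpy as np

# Hypothetical correlation matrix for three observed variables.
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.4],
              [0.3, 0.4, 1.0]])

lam, V = np.linalg.eigh(R)
A = V * np.sqrt(lam)                 # loadings on all three components

communalities = (A ** 2).sum(axis=1)          # each equals 1 (100% explained)
total_explained = lam.sum() / R.shape[0]      # equals 1

# An arbitrary orthogonal rotation of the first two axes by 30 degrees.
t = np.deg2rad(30)
T = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0,        0.0,       1.0]])
A_rot = A @ T

# Different loadings, identical reproduced correlations.
print(np.allclose(A @ A.T, R), np.allclose(A_rot @ A_rot.T, R))
```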
What do we do before we do PCA?
PCA does not always provide a useful solution. The main threats are that PCA solutions may be unstable (varying widely from sample to sample) or merely display randomness. Therefore, preliminary checks must be made.
First, we need to protect against randomness or arbitrariness. We can do this with Bartlett's test, which tests the hypothesis that all correlations between the variables in the analysis are zero. We want this test to be statistically significant. However, with real psychological data it practically always is, so it tells us little. A better indication of usability is therefore the Kaiser-Meyer-Olkin (KMO) measure. If the data have a clear factor structure, the partial correlations between pairs of variables are very close to zero. KMO is calculated as follows: sum of squared correlations / (sum of squared correlations + sum of squared partial correlations). KMO usually has a value between 0.5 (worst case) and 1.00 (best case, when all partial correlations are zero). Values higher than 0.7 are good enough.
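The KMO formula above can be sketched in a few lines of Python. This sketch assumes NumPy, and it assumes that each partial correlation between a pair of variables controls for all other variables in the analysis, obtained from the inverse of the correlation matrix. The function name kmo and the example matrices are hypothetical.

```python
import numpy as np

def kmo(R):
    """Overall Kaiser-Meyer-Olkin measure for a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv))
    # Partial correlation of each pair, controlling for all other variables.
    partial = -inv / np.outer(d, d)
    off = ~np.eye(R.shape[0], dtype=bool)      # off-diagonal elements only
    r2 = (R[off] ** 2).sum()                   # sum of squared correlations
    q2 = (partial[off] ** 2).sum()             # sum of squared partial correlations
    return r2 / (r2 + q2)

def equicorrelation(p, rho):
    """Hypothetical p x p matrix in which every pair correlates rho."""
    return (1 - rho) * np.eye(p) + rho * np.ones((p, p))

kmo_strong = kmo(equicorrelation(5, 0.6))
kmo_weak = kmo(equicorrelation(5, 0.2))
print(round(kmo_strong, 3), round(kmo_weak, 3))
```

Stronger common structure gives smaller partial correlations relative to the ordinary correlations, and therefore a KMO closer to 1.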
Second, we need to get stable results that do not vary from sample to sample. We can do this by getting a big enough sample. Rules of thumb:
- Sample size: a sample equal to or bigger than 300 is almost always big enough.
- Factor loadings: if a factor has four or more loadings with an absolute value greater than 0.6, the sample size doesn't matter. Factors with at least ten loadings greater than 0.4 are stable when N is greater than 150.
- Communalities: if (almost) all communalities are greater than 0.6, a sample size of 100 is good, and above 0.5 a sample size between 100 and 200 is good enough.
How many components does PCA use and how can we decide this?
Finding the factors or components is called factor extraction. With more factors, more variance is explained, but less data reduction is achieved, so the solution is less useful. There is no universal criterion to solve this problem, but four criteria can be used:
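- The eigenvalue-greater-than-one criterion: a component is retained only if its eigenvalue is greater than one. Because each standardized variable has a variance of one, such a component explains more variance than a single observed variable.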
- Interpretability: components must be meaningful. We determine this on the basis of general knowledge, theory or knowledge of previous research.
- The elbow criterion: according to this criterion, you must stop extracting factors at the point in the eigenvalue graph where the curve resembles an elbow. Sometimes no clear 'elbow' can be found, or several can be seen. If the elbow suggests j components, the advice is to also try j-1 and j+1 components and to choose the solution with the clearest interpretation.
- The scree criterion (literally 'crushed stone'): sometimes there are points in the eigenvalue graph through which a clear straight line can be drawn. The criterion says that the solution must contain only those components whose eigenvalues lie above this straight line. The problem is that there is not always a clear straight line.
It is advised to always plot the eigenvalues in SPSS. If an 'elbow' is visible, the elbow criterion is the best method (plus or minus one component).
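The eigenvalue-greater-than-one criterion and the explained variance are easy to compute once the eigenvalues are known. The sketch below assumes NumPy and uses hypothetical eigenvalues for five standardized variables; the text-based bars are only a rough stand-in for the eigenvalue graph.

```python
import numpy as np

# Hypothetical eigenvalues of five standardized variables (they sum to 5).
eigenvalues = np.array([2.9, 1.4, 0.3, 0.25, 0.15])
p = eigenvalues.size

# Eigenvalue-greater-than-one criterion.
n_kaiser = int((eigenvalues > 1.0).sum())

# Proportion of explained variance per component and cumulatively.
proportion = eigenvalues / p
cumulative = np.cumsum(proportion)

# Crude text version of the eigenvalue graph, to look for an elbow by eye.
for j, (ev, cum) in enumerate(zip(eigenvalues, cumulative), start=1):
    bar = "#" * int(round(ev * 10))
    print(f"component {j}: eigenvalue {ev:4.2f}  cumulative {cum:5.1%}  {bar}")

print("components with eigenvalue > 1:", n_kaiser)
```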
How does rotation and interpretation work?
A PCA solution can be described in an infinite number of ways, which are mathematically equivalent but can lead to very different interpretations. This is called the rotation problem. The rotation problem arises from changing the perspective from which we look at the solution. For example, if we rotate F1 and F2 30 degrees to the right, new X and Y axes (F1' and F2') are created. With this, new coordinates for each vector (or variable) can be calculated. This leads to different interpretations.
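A 30-degree rotation of two components can be sketched as follows. The snippet assumes NumPy and a hypothetical loading matrix for four variables; the sign convention used for the direction of rotation is also an assumption. The loadings on F1' and F2' change, while the communality of every variable stays the same.

```python
import numpy as np

def rotate(A, degrees):
    """Coordinates of the loadings after an orthogonal rotation of two axes."""
    t = np.deg2rad(degrees)
    T = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    return A @ T

# Hypothetical unrotated loadings of four variables on F1 and F2.
A = np.array([[0.8,  0.3],
              [0.7,  0.4],
              [0.6, -0.5],
              [0.5, -0.6]])

A_rot = rotate(A, 30)                      # loadings on F1' and F2'

h2_before = (A ** 2).sum(axis=1)           # communalities before rotation
h2_after = (A_rot ** 2).sum(axis=1)        # communalities after rotation

print(np.round(A_rot, 2))
print(np.round(h2_before, 2), np.round(h2_after, 2))
```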
The aim of such a rotation is to find an understandable and therefore simple solution. A simple solution has as few components as possible, and each component is related to only a small number of observed variables. The unrotated solution is rarely this simple, because the components are chosen so that the first component explains as much variance as possible, the second component explains as much of the remaining variance as possible, and so on. This is good for data reduction, but it does not give a simple solution. PCA is biased towards finding a common first component, which highlights what all observed variables have in common. All subsequent components are contrast components, because they are orthogonal to the first component. Especially with many items and components, this solution is complex. Therefore, a rotation that finds a simple structure in the data is desirable. This can be done in various ways.
The first rotation method is orthogonal rotation or VARIMAX rotation. The VARIMAX rotation attempts to maximize the variance of the loadings for each component. If there is a simple structure in the data (homogeneous subsets of items that are not strongly correlated with items of other subsets), VARIMAX finds this with more ease than the other rotations. After VARIMAX, the communalities remain the same. The total explained variance also remains the same. The new coordinate system remains orthogonal (at an angle of 90 degrees). The second rotation method is known as non-orthogonal rotation, oblique rotation or OBLIMIN rotation. This is a rotation method where the angles of the axes change. This provides a more realistic interpretation, but it does get more complicated. Therefore, and for historical reasons, VARIMAX is the most widely used. Always do a VARIMAX rotation, but if this solution is incomprehensible, the unrotated solution should be looked at.
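The idea behind VARIMAX can be illustrated for the two-component case with a brute-force search. This is a sketch under stated assumptions, not the algorithm SPSS uses. It assumes NumPy, it uses the raw criterion (the sum over components of the variance of the squared loadings) without any normalization of the rows, and it simply tries every whole-degree angle. The function names and the loading matrix are hypothetical.

```python
import numpy as np

def rotation(degrees):
    """2 x 2 orthogonal rotation matrix."""
    t = np.deg2rad(degrees)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

def varimax_criterion(A):
    """Sum over components of the variance of the squared loadings."""
    return np.var(A ** 2, axis=0).sum()

def best_orthogonal_rotation(A, step=1.0):
    """Try all angles in [0, 90) and keep the one with the highest criterion."""
    angles = np.arange(0.0, 90.0, step)
    values = [varimax_criterion(A @ rotation(a)) for a in angles]
    best = angles[int(np.argmax(values))]
    return best, A @ rotation(best)

# Hypothetical simple structure: items 1-2 load on one component, 3-4 on the other.
simple = np.array([[0.8, 0.0],
                   [0.7, 0.0],
                   [0.0, 0.9],
                   [0.0, 0.6]])
A = simple @ rotation(30)            # hide the structure by rotating it away

angle, A_rot = best_orthogonal_rotation(A)
print(angle)
print(np.round(A_rot, 2))
```

Because the rotation is orthogonal, the communalities and the total explained variance are unchanged. The recovered loadings match the simple structure up to the order and sign of the components.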
What are the most important SPSS settings needed?
- Method of analysis: in the extraction screen within 'factor' we can choose to do a PCA.
- The number of components: can also be chosen in the extraction screen.
- The unrotated solution and eigenvalue plots: can be found by checking 'unrotated factor solution' and 'Scree plot'.
- Rotation method: in the rotation screen we can select 'VARIMAX'.
- The sorting of variables: we can choose 'Sorted by size' in the options screen to sort variables from high to low loadings on each component.
Where can we find eigenvalues, communalities and explained variance in SPSS?
Under the heading 'Extraction Sums of Squared Loadings' in SPSS are the eigenvalues and the explained variance for the unrotated components. If there is a strong data reduction, the explained variance is usually not too high. The explained variance and the eigenvalues for the rotated components can be found under the heading 'Rotation Sums of Squared Loadings'. The rotation does not change the total explained variance, but the variance is more evenly distributed among the components. If a variable has a low communality, it is a unique variable. This could mean that the variable is measuring something completely different from what we want, but it could also mean that the item is measuring something important that is not measured by the other items.
What are the assumptions of PCA?
PCA has three assumptions:
- The relationships between the observed variables are linear. We can check this by using scatterplots.
- The observed variables follow a multivariate normal distribution. This assumption is not the most important one to meet.
- The correlations between the observed variables must be reliable. To achieve this, the sample must be representative and big enough.
What is the difference between PCA and factor analysis (FA)?
PCA is an empirical summary of data. This is because we find components by adding observed variables with a certain weight. The components represent the observed variables, and as such PCA is a summary. It wants to figure out what part of the variance of the variables is important and which part is not.
FA, on the other hand, aims to discover what the variables have in common in their variance, as opposed to what is unique to individual variables. In FA a model is made for the observed associations within a set of variables. This model tries to reproduce the associations as closely as possible. The factors in this model, which are also known as latent variables, are hypothetical constructs that we can only measure indirectly, which is not the case for PCA. In FA, a unique factor is also created for each variable. This unique factor explains all the variance of that variable that could not be explained by the common factors. The unique factors thus represent error. FA is sometimes called a structural or causal model.
The factor analysis model specifies each observed variable as the weighted sum of the common factors and a unique factor specific to that variable. The first difference from PCA is that in FA the factors are regarded as the cause of the observed correlations. The second difference is the presence of unique factors (Uj). Unique factors are not correlated with each other or with the common factors. FA and PCA often come to the same conclusion because they use the same information; results may differ only for a small number of variables. FA is mainly a model for the observed correlations, while PCA is a technique that tries to explain as much variance as possible.
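The contrast can be written compactly with the symbols of this summary. The notation below is a sketch: p observed variables, k factors or components, and loadings a_ij as before. As an assumption of this sketch, the unique factor carries the index i of the variable it belongs to, whereas the text above writes Uj.

```latex
\begin{aligned}
\text{PCA:}\quad & F_j = \sum_{i=1}^{p} a_{ij} X_i
  && \text{(components are built from the observed variables)} \\
\text{FA:}\quad  & X_i = \sum_{j=1}^{k} a_{ij} F_j + U_i
  && \text{(observed variables are modelled by common and unique factors)}
\end{aligned}
```

In the FA line, the unique factors U_i are uncorrelated with each other and with the common factors F_j, so all correlation between two observed variables runs through the common factors.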
In summary, we can say that PCA aims to explain the variances, while FA aims to explain the covariances of the observed variables.