Article summaries of Scientific & Statistical Reasoning - UvA

Summaries of the mandatory articles for Scientific & Statistical Reasoning at the University of Amsterdam, 2020-2021

Understanding Psychology as a Science - Dienes - 2008 - Article


This chapter considers the standard logic of statistical inference, known as the Neyman-Pearson approach.

Before reading this chapter, make sure you understand the following concepts:

  • Standard deviation.

  • Standard error.

  • Null hypothesis.

  • Distribution.

  • Normal Distribution.

  • Population.

  • Sample.

  • Significance.

  • T-test.

Who are Fisher, Neyman and Pearson?

The British genius Sir Ronald Fisher thought of many of the techniques and concepts we use in statistics. He created much of what the users of statistics now recognize as statistical practice.

The Polish mathematician Jerzy Neyman and the British statistician Egon Pearson provided a firm, consistent logical basis for hypothesis testing and statistical inference. Fisher did not appreciate it, but it nevertheless transformed the field of mathematical statistics and defined the logic that journal editors came to demand of the papers they publish.

What is probability?

The meaning of probability we choose determines what we can do with statistics. The Neyman-Pearson approach follows from one particular interpretation of probability.

Interpretations often start with a set of axioms that probabilities must follow: anything that satisfies the axioms of probability counts as a probability.

According to the subjective interpretation of probability, a probability is a degree of conviction in a belief.

According to the objective interpretation of probability, a probability is set in the world instead of in the mind. Objective probabilities exist independently of our states of knowledge. They are to be discovered by examining the world.

The most influential objective interpretation of probability is the long-run relative frequency interpretation of von Mises. A probability is a relative frequency. The hypothetical infinite set of events is called the reference class or collective. Because the long-run relative frequency is a property of all the events in the collective, it follows that a probability applies to a collective, not to any single event. Objective probabilities do not apply to single cases. They also do not apply to the truth of hypotheses. A hypothesis is simply true or false, just as a single event either occurs or does not. A hypothesis is not a collective, it therefore does not have an objective probability.

What are hypotheses?

In this paragraph, some data will be symbolized by D and a hypothesis by H. The probability of obtaining some data given a hypothesis is then P(D|H). P(H|D) is the inverse of the conditional probability P(D|H). Inverting conditional probabilities makes a big difference: in general, P(A|B) can have a very different value from P(B|A).

If you know P(D|H), it does not mean you know what P(H|D) is. This is the case for two reasons:

  • Inverse conditional probabilities can have very different values.

  • It is meaningless to assign an objective probability to a hypothesis.
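
A worked toy example, with invented numbers for a hypothetical diagnostic test, shows how different two inverse conditional probabilities can be:

```python
# Toy example with invented counts for a hypothetical diagnostic test:
# the test detects most true cases, yet most positive results are still false alarms.
population = 100_000
diseased = 1_000                                  # assumed base rate of 1%
p_pos_given_disease = 0.90                        # P(positive | disease)
p_pos_given_healthy = 0.05                        # P(positive | no disease)

true_pos = diseased * p_pos_given_disease
false_pos = (population - diseased) * p_pos_given_healthy
p_disease_given_pos = true_pos / (true_pos + false_pos)

print(f"P(positive | disease) = {p_pos_given_disease:.2f}")
print(f"P(disease | positive) = {p_disease_given_pos:.2f}")   # about 0.15 here
```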

Statistics cannot tell us how much to believe a certain hypothesis. According to Neyman and Pearson, we can set up decision rules for accepting or rejecting hypotheses, such that in following those rules in the long run we will not often be wrong. We can work out what the error rates are for certain decision procedures and we can choose procedures that control the long-run error rates at acceptable levels.

These decision rules work by setting up two contrasting hypotheses: the null hypothesis (H0) and the alternative hypothesis (H1). Either hypothesis can specify a single value of the difference or a band of values. The only difference between the two is that the null hypothesis is the one most costly to reject falsely.

Parameters are properties of populations and are symbolized with Greek letters. Statistics are summaries of sample measurements and are symbolized with Roman letters. The null and alternative hypotheses are about population values. We try to use our samples to make inferences about the population.

For a given experiment we can calculate p = P('getting t as extreme or more extreme than obtained' | H0), which is a form of P(D|H). This p is the 'p-value', or simply the 'p' in statistical computer output. If p is less than α, the level of significance we have decided on in advance (say 0.05), we reject H0. By following this rule, we know that in the long run, when H0 is actually true, we will conclude it is false only 5% of the time. In this procedure the p-value has no meaning in itself; it is just a convenient mechanical device for accepting or rejecting a hypothesis, given the current widespread use of computer output, which produces p-values as a matter of course.
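
As a minimal illustration of this mechanical rule (the data, group labels and α level below are invented, not taken from the chapter):

```python
# Minimal sketch of the decision rule: compute p for a t-test and compare it with a preset alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=100, scale=15, size=30)   # hypothetical control scores
group_b = rng.normal(loc=108, scale=15, size=30)   # hypothetical treatment scores

alpha = 0.05                                       # significance level fixed in advance
t_value, p_value = stats.ttest_ind(group_a, group_b)

if p_value < alpha:
    decision = "reject H0"
else:
    decision = "accept H0 (in the Neyman-Pearson sense of acting as if it were true)"
print(f"t = {t_value:.2f}, p = {p_value:.3f} -> {decision}")
```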

α is an objective probability, a long-run relative frequency. It is the proportion of errors of a certain type we will make in the long run if we follow the above procedure and the null hypothesis is in fact true. Importantly, neither α nor our calculated p tells us how probable the null hypothesis is.

α is the long-term error rate for one type of error: saying the null is false when it is true (Type I error). But there are two ways of making an error with the decision procedure. The type II error happens when one accepts the H0 as true when in fact it is false (this is symbolized as β). When the null is true, we will make a Type I error in α proportion of our decisions in the long run. Strictly using a significance level of 5% does not guarantee that only 5% of all published significant results are in error.

Controlling α does not mean you have controlled β as well; they can be very different from one another. Power is defined as 1 − β: the probability of detecting an effect, given that the effect really exists in the population.

In the Neyman-Pearson approach, one decides on acceptable α and β levels before an experiment is run. In order to control β, you need to:

  1. Estimate the size of effect you think is interesting, given your theory is true.

  2. Estimate the amount of noise your data will have.

Having determined (1) and (2) above, you can use standard statistics textbooks to tell you how many participants you need to run to keep β at 0.05 (equivalently, to keep power at 0.95).
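
A rough sketch of such a calculation, using the common normal-approximation formula rather than textbook tables; the standardized effect size of 0.5 is an assumed 'minimally interesting' effect, not a value from the chapter:

```python
# Rough sketch using the normal-approximation formula for two groups:
# n per group ~ 2 * ((z_(1-alpha/2) + z_(1-beta)) / d)^2, with d the standardized effect size.
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, beta=0.05):
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value controlling the Type I error rate
    z_beta = norm.ppf(1 - beta)         # value needed to keep the Type II error rate at beta
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(d=0.5))               # roughly 104 participants per group for power = 0.95
```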

Strict application of the Neyman-Pearson logic means setting the risks of both Type I and II errors (α and β) in advance. Many researchers are extremely worried about Type I errors, but allow Type II errors to go uncontrolled. Ignoring the systematic control of Type II errors leads to inappropriate judgments about what results mean and what research should be done next.

The process of combining groups of studies together to obtain overall tests of significance (or to estimate values or calculate confidence intervals) is called meta-analysis.

A set of null results does not mean you should accept the null; they may indicate that you should reject the null.

To summarize, if your study has low power, getting a null result tells you nothing in itself: you would expect a null result whether or not the null hypothesis was true. In the Neyman-Pearson approach, you set power at a high level when designing the experiment, before you run it. Then you are entitled to accept the null hypothesis when you obtain a null result. In following this procedure you will make errors at a small controlled rate, a rate you have decided in advance is acceptable to you.

What is sensitivity?

Sensitivity can be determined in three ways:

  • Power.

  • Confidence intervals.

  • Finding an effect significantly different from another reference one.

Whenever you find a null result and it is interesting to you that the result is null, you should always indicate the sensitivity of your analysis. Before you could accept the null result in the condition that was null, you would need to show that you had appropriate power to pick up a minimally interesting effect.

What are stopping rules?

The stopping rule you use is defined by the conditions under which you will stop collecting data for a study. The standard Neyman-Pearson stopping rule is to use power calculations in advance of running the study to determine how many participants should be run, so that power is controlled at a predetermined level; both α and β can then be controlled at known, acceptable levels. The use of confidence intervals provides another good stopping rule.

What is multiple testing?

In the Neyman-Pearson approach it is essential to know the collective or reference class for which we are calculating our objective probabilities α and β. The relevant collective is defined by a testing procedure applied an indefinite number of times.

In the Neyman-Pearson approach, in order to control overall Type I error when we perform a number of tests, we need to conduct each test at a stricter level of significance so as to keep the overall α at 0.05. There are numerous corrections, but the easiest one to remember is Bonferroni: if you perform k tests, conduct each individual test at the 0.05/k level of significance and the overall α will be no higher than 0.05.
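
A minimal sketch of this correction (the p-values are invented):

```python
# Sketch of the Bonferroni rule: with k tests, test each one at alpha/k.
alpha = 0.05
p_values = [0.012, 0.030, 0.004, 0.200]          # results of k = 4 hypothetical tests
k = len(p_values)

for i, p in enumerate(p_values, start=1):
    significant = p < alpha / k                  # per-test criterion of 0.05 / 4 = 0.0125
    print(f"test {i}: p = {p:.3f} -> {'reject H0' if significant else 'retain H0'}")
```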

What are points concerning significance tests that are often misunderstood?

  • Significance is not a property of populations.

  • Decision rules are laid down before data are collected; we simply make black and white decisions with known risks of error.

  • A more significant result does not mean a more important result, or a larger effect size.

What are confidence intervals?

A confidence interval is the set of possible population values with which the data are consistent. The concept was developed by Neyman. To calculate the 95% confidence interval, find the set of all values of the dependent variable that are non-significantly different from your sample value at the 5% level.
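
For the mean of a sample, a minimal sketch of this calculation (with invented data) looks as follows:

```python
# Sketch of a 95% confidence interval for a sample mean: all population means that would
# not be rejected at the 5% level given this sample.
import numpy as np
from scipy import stats

sample = np.array([4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.2, 5.8])
mean = sample.mean()
se = stats.sem(sample)                            # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)   # two-sided 5% criterion

print(f"95% CI: [{mean - t_crit * se:.2f}, {mean + t_crit * se:.2f}]")
```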

Use of the confidence interval overcomes some of the problems people have when using Neyman-Pearson statistics otherwise:

  • It tells the sensitivity of your experiment directly; if the confidence interval includes the value of both the null hypothesis and the interesting values of the alternative hypothesis, the experiment was not sensitive enough to draw definitive conclusions.

  • It turns out you can use the confidence interval to determine a useful stopping rule: when the interval is of a certain predetermined width, stop collecting data.

  • Confidence intervals are a very useful way of summarizing what a set of studies as a whole is telling us.

  • Like all statistics in the Neyman-Pearson approach, the 95 % confidence interval is interpreted in terms of an objective probability.

What is the criticism of the Neyman-Pearson approach?

  • Simple acceptance or rejection is all that inference consists of. Arguably, what a scientist wants to know is either how likely certain hypotheses are in the light of the data, or how strongly the evidence supports one hypothesis rather than another.

  • Weak theorizing is encouraged by null hypothesis testing. A good theory should specify the size of effect not just that it is different from zero.

  • It is important to know the reference class in the Neyman-Pearson approach – we must know what endless series of trials might have happened but never did. This is important when considering both multiple testing and stopping rules. It strikes some as unreasonable that what never happened should determine what is concluded about what did happen.

How do you use the Neyman-Pearson approach to critically evaluate a research article?

If the article uses significance or hypothesis tests, then two hypotheses need to be specified for each test.

Note from the introduction section of the paper whether any specific comparisons were highlighted as the main point of the experiment. These comparisons, if few in number, can be treated as planned comparisons later. If a direction is strongly predicted at this point, one-tailed tests could be considered later.

The stopping rule should be specified. If very different numbers of subjects are used in different experiments in the paper for no apparent reason, it may be a sign that multiple significance tests were conducted, as each experiment progressed and stopping occurred when the required results were obtained.

Even if minimally interesting effect sizes were not stated in advance and if power were not stated in advance, a crucial point is how the authors dealt with interesting null results. Given a null result was obtained, did the authors give some measure of sensitivity of the test?

False-positive psychology: Undiscovered flexibility in data collection and analysis allows presenting anything as significant - Simmons et al. - 2011 - Article


Introduction

A false positive is likely the most costly error that can be made in science. A false positive is the incorrect rejection of a null hypothesis.

Despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not.

Many researchers often stop collecting data on the basis of interim data analysis. Many researchers seem to believe that this practice exerts no more than a trivial influence on the false-positive rates.
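
A small simulation sketch shows why that belief is mistaken (the sample sizes, peeking schedule and number of simulated studies are arbitrary choices):

```python
# H0 is true by construction, yet stopping as soon as p < .05 at repeated interim looks
# inflates the false-positive rate well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_start, n_max, step = 5000, 10, 50, 5
false_positives = 0

for _ in range(n_studies):
    a = list(rng.normal(size=n_start))
    b = list(rng.normal(size=n_start))            # no true difference between conditions
    while True:
        if stats.ttest_ind(a, b).pvalue < 0.05:   # researcher peeks and declares "an effect"
            false_positives += 1
            break
        if len(a) >= n_max:                       # give up once the maximum n is reached
            break
        a.extend(rng.normal(size=step))
        b.extend(rng.normal(size=step))

print(f"false-positive rate with optional stopping: {false_positives / n_studies:.3f}")
```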

Solutions for authors

The authors of this article offer six requirements for authors as a solution to the problem of false-positive publications:

  1. Before the collection of data begins, authors must decide the rule for terminating data collection and they should report this rule in the article.
  2. At least 20 observations per cell must be collected by the author or else the author should provide a compelling cost-of-data-collection justification.
  3. All variables collected in a study must be listed.
  4. All experimental conditions must be reported, including failed manipulations.
  5. If observations are eliminated, authors must also report what the statistical results are when those observations are included.
  6. Authors must report the statistical results of the analysis without the covariate, if an analysis includes a covariate.

Guidelines for reviewers

The authors of this article also offer four guidelines for reviewers:

  1. Reviewers must make sure that authors follow the requirements.
  2. Reviewers should be more tolerant of imperfections in results.
  3. Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.
  4. Reviewers should require the authors to conduct an exact replication, if justifications of data collection or analysis are not compelling.

Conclusion

The solution offered does not go far enough in the sense that it does not lead to the disclosure of all degrees of freedom. It cannot reveal those arising from reporting only experiments that ‘work’ (i.e., the file-drawer problem).

The solution offered goes too far in the sense that it might prevent researchers from conducting exploratory research. This does not have to be the case if researchers are required to report exploratory research as exploratory research. This also does not have to be the case if researchers are required to complement it with confirmatory research consisting of exact replications of the design and analysis that ‘worked’ in the exploratory phase.

The authors considered a number of alternative ways to address the problem of researcher degrees of freedom. The following are considered and rejected:

  • Correcting the alpha levels. A researcher could consider adjusting the critical alpha level as a function of the number of researcher degrees of freedom employed in each study.
  • Using Bayesian statistics. Although this approach has many virtues, it actually increases researcher degrees of freedom by offering a new set of analyses and by requiring researchers to make additional judgments on a case-by-case basis.
  • Conceptual replications. They are misleading as a solution to the problem at hand, because they do not bind researchers to make the same analytic decisions across studies.
  • Posting materials and data. This would impose too high a cost on readers and reviewers to examine the credibility of a particular claim.

The goal of researchers is to discover the truth, not to publish as many articles as they can. For various reasons, researchers can lose sight of this goal.

Causal inference and developmental psychology - Foster - 2010 - Article


Causality is central to developmental psychology: psychologists want not only to identify developmental risks but also to understand the mechanisms by which development can be fostered. But sometimes certain conditions or characteristics cannot be assigned randomly, so causal inference - inferring causal relationships - is difficult. An association alone does not reveal a causal relationship. The last 30 years have produced superior methods for moving from association to causation. This article reflects the current state of developmental psychology and is guided by four premises: 

  1. Causal inference is essential to accomplishing the goals of developmental psychologists. Causal inference should be the goal of developmental research in most circumstances.
  2. In many analyses, psychologists unfortunately are attempting causal inference but doing so poorly, that is, based on many implicit and implausible assumptions. 
  3. These assumptions should be made explicit and checked empirically and conceptually. 
  4. Developmental psychologists will recognize the central importance of causal inference and naturally embrace the methods available.

This article also wants to promote broader thinking about causal inference and the assumptions on which it rests. A characteristic of the broader literature, however, is that methodologists in different fields differ substantially in their views.

What is the confusion in current practice?

Current articles err in one of two directions, and both are dissatisfying and potentially misleading. One group of authors holds causal inference to be unattainable. A second group embraces causality; these researchers often rely on the longitudinal nature of their data to make the leap from associations to causality, but they often leave the necessary assumptions unstated or may be unaware of those assumptions themselves. Some authors straddle the two groups and stray into causal interpretations of what are merely associations. The situation creates a swamp of ambiguity in which confusion thrives. The notion of the 'counterfactual' lies at the heart of causal inference. 

Why causal inference?

Causal thinking, and therefore causal inference, is unavoidable. One can support this claim in three ways:

  1. A major goal of psychology is to improve the lives of humanity. Much of developmental science is devoted to understanding processes that might lead to interventions to foster positive development.
  2. Causal analysis is unavoidable, because causal thinking is unavoidable.
  3. If a researcher resists the urge to jump from association to causality, other researchers seem willing to do so on his or her behalf.

How is causal inference the goal of Developmental psychology?

It is not the case that causal relationships can never be established outside of random assignment, but they cannot be inferred from associations alone. Modern methods are used to make causal inference as plausible as possible. As part of the proper use of these tools, the researcher should identify the key assumptions on which they rest and their plausibility in any particular application. What counts as credible or plausible, however, is not without debate. This paper cannot resolve that issue, but its broader purpose is to establish plausible causal inference as the goal of empirical research in developmental psychology.

What are the two frameworks for causal inference?

Two frameworks are useful for conducting causal inference, and two conceptual tools are especially helpful in moving from associations to causal relationships. The first involves the directed acyclic graph (DAG). It assists researchers in identifying the implications of a set of associations for understanding causality, and the set of assumptions under which those associations imply causality. 

What is the DAG?

Computer scientists are also interested in causality, in particular in identifying the circumstances under which an association can be interpreted as causal. A DAG comprises variables and arrows linking them. It is directed in the sense that the arrows represent causal relationships. The model assumes a certain correspondence between the arrows in the graph and the relationships between the variables: if you cannot trace a path from one variable to another, then the variables are not associated. This is the Markov assumption: the absence of a path implies the absence of a relationship. A key feature of the DAG is structural stability: an intervention on one component of the model does not alter the broader structure. It also embodies a preference for simplicity and probabilistic stability. 

The DAG looks like a path diagram but has some distinguish features:

  • The DAG is not linear or parametric.
  • It contains no bidirectional arrows implying simultaneity
  • The essence of the DAG can be grasped by thinking about three variables X,Y and Z. You can think about Z as a common cause of X and Y, about Z as a common effect of X and Y, and about Z as a mediator of X on Y. 

The usefulness of the DAG is perhaps the most apparent when more than three variables are involved, especially when one is unmeasured (think about an unobserved determinant of the mediator). 
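
A brief simulation can make two of these structures concrete; the linear equations and coefficients below are assumptions chosen purely for illustration:

```python
# Fork (common cause) and collider (common effect) among three variables X, Y and Z.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing z out of both."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

# Fork: Z is a common cause of X and Y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)
print("fork:     r(X,Y) =", round(np.corrcoef(x, y)[0, 1], 2),
      "  r(X,Y | Z) =", round(partial_corr(x, y, z), 2))

# Collider: Z is a common effect of X and Y.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(size=n)
print("collider: r(X,Y) =", round(np.corrcoef(x, y)[0, 1], 2),
      "  r(X,Y | Z) =", round(partial_corr(x, y, z), 2))
```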

Confounding and deconfounding: or, slaying the lurking variable - Pearl - 2018 - Article


The biblical story of Daniel encapsulates in a profound way the conduct of experimental science today. When King Nebuchadnezzar brought back thousands of captives, he wanted his followers to pick out the children who were unblemished and skillful in all wisdom. But there was a problem: his favorite, the boy named Daniel, refused for religious reasons to touch the food the King gave them. The King's followers were terrified of how the King would react. Daniel, however, proposed an experiment: for ten days give him only vegetables, and take another group of children and feed them the King's meat and wine. After ten days the two groups were compared. Daniel prospered on the vegetable diet, and because of his healthy appearance he rose to a position of great importance in the kingdom. The followers had, in effect, posed a question about causation: will a vegetarian diet cause servants to be healthy? Daniel, in turn, proposed a methodology to deal with this question by comparing the two groups after ten days of experimenting; after a suitable amount of time, you can see a difference between the two groups. Nowadays this is called a controlled experiment. 

You cannot go back in time and see what would have happened to Daniel if he had eaten the meat and wine instead of the healthy diet. But because you can compare Daniel with a group of people who receive a different treatment, you can see what happens when you give people a different diet. The groups do, however, need to be representative of the population and comparable with each other. 

But Daniel did not think of one thing: confounding bias. Suppose that Daniel's group is healthier than the control group to start with; their robust appearance after the ten days of eating the healthy diet would then have nothing to do with the diet itself. Confounding bias occurs when a variable influences both who is selected for the treatment and the outcome of the experiment. Sometimes the confounder is known as the 'lurking third variable'. 

Statisticians both over- and underestimate the importance of adjusting for possible confounding variables. They overestimate it in the sense that they often control for many more variables than they need to, and even for variables that they should not control for. The idea is 'the more things you control for, the stronger your study seems', because it gives a feeling of specificity and precision. But sometimes you can control for too much. 

Statisticians also underestimate the importance of controlling for possible confounding variables in the sense that they are loath to talk about causality at all, even if the controlling has been done correctly.

In this chapter you will get to know why you can safely use RCTs (randomized controlled trials) to estimate the causal effect X -> Y without falling prey to confounding bias. 

What is meant by the 'chilling fear of confounding'? 

In 1998, an important study showed an association between regular walking and reduced death rates among retired men. The researcher wanted to know whether the men who exercised more lived longer. He found that the death rate over a twelve-year period was twice as high among 'casual walkers' (less than a mile a day) as among 'intense walkers' (more than two miles a day). But you have to keep in mind the influence a confounding variable might have. 

This classic causal diagram shows us that age is a confounder of walking and mortality. Perhaps physical condition could also be a confounder, and by that reasoning you can go on and on about possible confounders. Even when the researchers adjusted the death rate for age, however, they found that the difference between casual and intense walkers was still large.
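
To see what adjusting for a confounder amounts to, the sketch below uses purely made-up counts (not the walking data), constructed so that the entire crude difference is due to age:

```python
# Death rates are identical within each age band, yet the crude rates differ by a factor
# of two simply because, in these invented numbers, casual walkers are older on average.
data = {                                          # (deaths, total) per cell
    "under 75": {"casual": (5, 100),  "intense": (20, 400)},
    "75 plus":  {"casual": (60, 300), "intense": (20, 100)},
}

def rate(deaths, total):
    return deaths / total

for group in ("casual", "intense"):
    deaths = sum(data[age][group][0] for age in data)
    total = sum(data[age][group][1] for age in data)
    print(f"crude death rate, {group}: {rate(deaths, total):.1%}")

for age in data:
    c, i = data[age]["casual"], data[age]["intense"]
    print(f"{age}: casual {rate(*c):.1%} vs intense {rate(*i):.1%}")
```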

The skillful interrogation of nature: Why do RCT's work?

An RCT is often considered the gold standard of a clinical trial, and the person to thank for this is R.A. Fisher. The questions he asked were aimed at establishing causal relationships, and what gets in the way is confounding. Nature is like a genie that answers exactly the question we pose, not necessarily the one we intend to ask. Around 1923-1924 Fisher began to realize that the only experimental design the genie could not defeat was a random one. When you do an experiment multiple times, sometimes you may get lucky and apply the treatment to the most fertile subplots; but by generating a new random assignment each time you perform the experiment, you can guarantee that the great majority of the time you will be neither lucky nor unlucky. Randomized trials are now the gold standard, but in Fisher's time a deliberately randomized experiment horrified his statistical colleagues. Fisher realized that an uncertain answer to the right question is much better than a highly certain answer to the wrong question. 

When you ask the genie the wrong question, you will never find out what you want to know. If you ask the right question, getting an answer that is occasionally wrong is much less of a problem. So, randomization brings two benefits:

  1. It eliminates the confounder bias.
  2. It enables the researcher to quantify his uncertainty. 

In a nonrandomized study, the experimenter must rely on her knowledge of the subject matter. If she is confident that her causal model accounts for a sufficient number of deconfounders and she has gathered data on them, then she can estimate the effects in an unbiased way. The danger is that she might have missed a confounding factor, and her estimate may therefore be biased. 

RCTs are still preferred to observational studies, but they are not always feasible: intervention may be physically impossible or unethical, or you may have difficulty recruiting subjects for inconvenient experimental procedures and end up with only volunteers who do not quite represent the intended population.

What is the new paradigm of confounding?

While confounding is widely recognized as one of the central problems in research, a review of the literature reveals little consistency among definitions of confounding or of a confounder. Why has the treatment of confounding advanced so little since Fisher? Because, lacking a principled understanding of confounding, scientists could not say anything meaningful in observational studies, where physical control over treatments is infeasible. But how was confounding defined then, and how is it defined now? It is easier to answer the second question with the information we have now: confounding can simply be defined as anything that leads to a discrepancy between P(Y|X) (the conditional probability of the outcome given the treatment) and P(Y|do(X)) (the interventional probability). Why is this so difficult? The difficulty exists because confounding is not a statistical notion: it stands for the discrepancy between what we want to assess (the causal effect) and what we actually do assess using statistical methods. If you cannot mathematically articulate what you want to assess, you cannot expect to define what constitutes a discrepancy. 
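
As a toy illustration of this definition (all probabilities are invented), the simulation below computes both quantities for a binary treatment with a single confounder:

```python
# A confounder Z influences both treatment X and outcome Y, so P(Y|X) differs from P(Y|do(X)).
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
z = rng.random(n) < 0.5                               # confounder (e.g., good physical condition)

def outcome(x):
    return rng.random(n) < (0.1 + 0.2 * x + 0.4 * z)  # Y depends on both X and Z

x_obs = rng.random(n) < (0.2 + 0.6 * z)               # observational world: Z raises P(treatment)
y_obs = outcome(x_obs)
obs_diff = y_obs[x_obs].mean() - y_obs[~x_obs].mean()

x_do = rng.random(n) < 0.5                            # do(X): treatment assigned regardless of Z
y_do = outcome(x_do)
do_diff = y_do[x_do].mean() - y_do[~x_do].mean()

print(f"observed difference P(Y|X=1) - P(Y|X=0): {obs_diff:.2f}")
print(f"interventional difference under do(X) : {do_diff:.2f}")   # the true effect, 0.20
```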

The concept of 'confounding' has evolved around two related conceptions: incomparability and lurking third variables. Both of these concepts have resisted formalization. For how do we know what is interesting and relevant to consider and what is not? You can say it is common sense, but many scientists have struggled with finding the important things to consider. 

What are the two surrogate definitions of confounding? They fall into two main categories: declarative and procedural. An old procedural definition goes by the scary name of 'noncollapsibility': you compare the relative risk with the relative risk after adjusting for the potential confounder; a difference indicates confounding, and you should then use the adjusted risk estimate. 

The declarative definition is the classic epidemiological definition of confounding, and it consists of three parts: a confounder of X (treatment) and Y (outcome) is a variable Z that is (1) associated with X in the population at large and (2) associated with Y among people who have not been exposed to the treatment X. In recent years a third condition has been added: (3) Z should not be on the causal path between X and Y. This third condition, however, is somewhat confusing. 

Z cannot always be used as a perfect measure of the mediator M; when it is not, some of the influence of X on Y might 'leak through' even if you control for Z. Controlling for Z is still a mistake: the bias may be smaller than if you had controlled for M itself, but it is still there. That is why Cox (1958) warned that you should only control for Z if you have a 'strong prior reason' to believe that it is not affected by X. This is nothing more than a causal assumption. 

Later, Robins and Greenland set out to express their conception of confounding in terms of potential outcomes. Ideally, each person in the treatment group would be exchangeable with a person in the control group, so that confounding would be minimal: the outcome would be the same if you switched the treatments and controls. Using this idea, Robins and Greenland showed that both the declarative and the procedural definition give the wrong answer in some situations. 

What do the do-operator and the back-door criterion mean?

To understand the back-door criterion you first have to have an idea of how information flows in a causal diagram. The diagram works like a network of pipes that convey information from a starting point X to a finish Y. The do-operator erases all the arrows that come into X and in this way prevents any information about X from flowing in the noncausal direction. What if you have a longer pipe with more junctions, for example:

A -> ... -> F -> G -> ... -> I -> J?

The answer is very simple: if a single junction is blocked, then J cannot 'find out' anything about A through this path. So you have many options for blocking communication between A and J. A back-door path is any path from X to Y that starts with an arrow pointing into X; X and Y will be deconfounded if we block every back-door path. You can almost treat deconfounding as a game: the goal is to specify a set of variables that will deconfound X and Y. In other words, they should not be descendants of X, and they should block all the back-door paths.
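
The 'game' can be made concrete in a small sketch. The graph below is an assumed toy example: the M-shaped structure discussed in the next paragraph plus a causal arrow X -> Y, and the path-blocking rule is deliberately simplified:

```python
# Blocking rule used here: a path is blocked if it contains a non-collider that is in the
# adjustment set S, or a collider that is not in S (collider descendants are ignored,
# which happens to be exact for this particular graph).
edges = {("A", "X"), ("A", "B"), ("C", "B"), ("C", "Y"), ("X", "Y")}

def neighbors(node):
    return {b for a, b in edges if a == node} | {a for a, b in edges if b == node}

def simple_paths(start, end, visited=None):
    visited = visited or [start]
    if start == end:
        yield visited
        return
    for nxt in neighbors(start) - set(visited):
        yield from simple_paths(nxt, end, visited + [nxt])

def is_collider(path, i):
    return (path[i - 1], path[i]) in edges and (path[i + 1], path[i]) in edges

def blocked(path, s):
    for i in range(1, len(path) - 1):
        if is_collider(path, i):
            if path[i] not in s:               # an unconditioned collider blocks the path
                return True
        elif path[i] in s:                     # a conditioned non-collider blocks the path
            return True
    return False

back_door = [p for p in simple_paths("X", "Y") if (p[1], "X") in edges]
for s in [set(), {"B"}, {"A", "B"}]:
    ok = all(blocked(p, s) for p in back_door)
    print(f"adjusting for {sorted(s)}: all back-door paths blocked = {ok}")
```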

The M-shaped graph gives rise to a new kind of bias, called M-bias. There is only one back-door path, and it is already blocked by a collider at B, so you do not need to control for anything else. It is incorrect to call a variable like B a confounder merely because it is associated with both X and Y: B only becomes a confounder when you control for it! And when the variables involved are things such as smoking or miscarriage, this is obviously not a game but serious business. 

Critical Thinking in Quasi-Experimentation - Shadish - 2008 - Article


Introduction

In experiments we manipulate an assumed cause and then observe which effects follow. This is the logic of all modern scientific experiments: we try to discover the effects that a cause generates. It is also used in quasi-experiments, but in that context we have to think especially critically about causation. In a quasi-experiment there is no random assignment to conditions; instead, the conditions are carefully chosen.

An example of a quasi-experiment

For example, in a quasi-experiment children were chosen for the control group in such a way that the control group was, as much as possible, the same as the treatment group.

Causation

In our daily lives we mostly recognize causal relationships intuitively. Nevertheless, philosophers have for many years failed to give a precise definition of cause and effect. The definitions depend partly on each other.

What is a cause?

When we look at causes, such as lightning or a lighted match, we can see that none of them is sufficient to generate the effects. We also need multiple other conditions. For example, a lighted match alone is not enough to start a fire, we also need oxygen, combustible materials, it has to be dry, and so on. We can call this lighted match an ‘inus condition’. This means that it’s an insufficient but still needed part of an unnecessary but sufficient condition. We need many factors for an effect to occur, but most of the time we don’t know all of them or their relation.

Experimental causes

For experimental causes the critical feature is that they are manipulable; otherwise we cannot deliberately vary them to discover what happens. When we look at quasi-experiments, the cause is whatever was manipulated. It is possible that the researcher does not realize everything that was manipulated, and that there is much more to the manipulation than intended.

What is an effect?

Something that is contrary to fact is a counterfactual. With experiments, we observe what did happen, but for the counterfactual we ask what would have happened; we can never observe this counterfactual. We still try to create an approximation to this unobservable counterfactual in experiments, and then we have to understand how the given source differs from the initial condition. Often, the best approximation is a control group formed by random assignment. Even this control group is not perfect, because the persons in the control group are not identical to those in the treatment group.

Counterfactuals in quasi-experiments

In quasi-experiments the differences between treatment and control are normally not random but systematic, so these nonrandom controls may not tell us much about what would have happened. Quasi-experiments make use of two different tools to deal with this. The first is observing the same unit over time; the second is trying to make the nonrandom control group as identical as possible to the treatment group. But we do not know all the variables, and there are always unknown differences. This creates the problem that nonrandom controls are not as good an estimate of the counterfactual as random controls are. 

Causal relationship

According to John Stuart Mill, we have a causal relationship if the cause happened before the effect, the cause is related to the effect, and we cannot find a plausible alternative explanation for the effect. In many studies, however, it is impossible to know which of two variables came first. Quasi-experiments have two ways to improve on this. First, they force the cause to come before the effect by first manipulating the presumed cause and then observing an outcome afterwards. Second, they allow the researcher to control some of the third-variable alternative explanations. Nevertheless, the researcher almost never knows what all those third variables are.

Campbell’s threats to valid causal inference

Campbell (1957) provided a tool to identify differences between the control and treatment groups. He codified some of the most commonly encountered group differences that give reasons why researchers might make a mistake about causation. This list is very general, even though the threats are often context-specific. Part of the critical thinking used in quasi-experimentation is to identify the alternative explanations, to see whether they are plausible, and then to show whether or not these alternative explanations occurred and could explain the effect.

Critical thinking in quasi-experiments means showing alternative explanations are unlikely

Falsification, introduced by Popper, means deliberately trying to falsify the conclusions you wish to draw, rather than only seeking information that corroborates them. Conclusions remain plausible until shown otherwise. Quasi-experimentation follows this logic: experimenters have to identify a causal claim, and then they have to generate and examine plausible alternative explanations. But there are two problems. The first is that the causal claim is never completely clear and detailed, so the claim is usually just changed slightly when it is falsified. Second, our observations are never perfect; they always reflect our wishes to some degree, so they can never provide definitive results. Thus we neither definitively confirm nor disconfirm the causal claim.

The two disciplines of scientific psychology - Cronbach - 1957 - Article


Introduction

With so many different methods in psychology nowadays, it is not possible to be acquainted with all of psychology. Looking back at simpler times, we can identify two historic streams of method that have been widely used since the last century of our science: experimental psychology and correlational psychology. Each focuses on different things, and the convergence of the two streams is still in the making.

The separation of the disciplines

In experimental psychology the scientist makes changes to the conditions in order to observe their consequences; it is the more coherent of the two disciplines. Correlational psychology also qualifies as a discipline because it asks a distinctive type of question and has technical methods to examine the question and the data. The correlator focuses on already existing variation between individuals, social groups, and species, instead of the variation the experimental psychologist creates himself. With the experimental method it is a virtue that the variables are controlled, which permits rigorous tests, while the correlator observes and organizes the data that nature has created. 

Characterization of the disciplines

Experimental psychology

In the beginning, experimental psychology was a replacement for naturalistic observation. With standardization of tasks and conditions, reproducible descriptions could be obtained. Later the focus shifted to the single manipulated variable, and after that to multivariate manipulation. Another great development has been its concern with formal theory. A problem in these experiments, however, is individual variation: because of this variation we get 'error variance'. We can reduce this error variance by selecting participants with certain properties.

Correlational psychology

On the other hand, the correlational psychologist sees individual and group variation as important effects of biological and social causes. He wants to see what characteristics determine its mode and degree of adaptation, and his goal is to predict variation within a treatment.

The shape of a united discipline

It is not sufficient for each discipline to borrow from the other. A united discipline will study both experimental and correlational psychology, but will also concern itself with the interactions between organismic and treatment variables. We should invent constructs and form a network of laws that permits prediction. Investigators can use different methods but still test the same theoretical propositions.

Methodologies for a joint discipline have already been proposed; one of the many choices is analytic procedures. Eventually, these two disciplines will become one, with a common theory, a common method, and common recommendations for social betterment. A whole new dimension will be discovered, and we will come to realize that organism and treatment are inseparable.

Simpson’s Paradox in Psychological Science: A Practical Guide - Kievit - 2013 - Article


Simpson (1951) showed that a statistical relationship observed in a population could be reversed within all the subgroups that make up that population. This has significant implications for the medical and social sciences, because a treatment that seems effective at the population level may in fact have adverse consequences within each of the population's subgroups. Simpson's paradox (SP) has been formally analyzed by mathematicians and statisticians, but there has not been much work focused on the practical aspects of SP for empirical science: how might researchers prevent the paradox, recognize it, and deal with it upon detection?

In this paper they state that (a) SP occurs more frequently than commonly thought, and (b) inadequate attention to SP results in incorrect inferences that may compromise not only the quest for truth, but may also jeopardize public health and policy. 

What is the Simpson's paradox?

Strictly speaking, SP is not a paradox but a counterintuitive feature of aggregated data, which may arise when (causal) inferences are drawn across different explanatory levels: from populations to subgroups, or from subgroups to individuals, and so on. 

Pearl (1999) states that SP is unsurprising: 'seeing magnitudes change upon conditionalization is commonplace, and seeing such changes turn into sign reversal is also not uncommon'. Simpson's paradox is linked to many statistical challenges, and the underlying shared theme of these techniques is that they are concerned with the nature of causal inference. According to Pearl, it is the human tendency to automatically interpret observed associations causally that renders SP paradoxical. To be able to draw conclusions, you must know what the underlying causal mechanism of the observed patterns is, and which observed data are informative about these mechanisms. 

What is the role of the Simpson's paradox in individual differences?

The literature has documented inter-individual differences in, for example, personality. Cross-sectional patterns of inter-individual differences are often thought to be informative about psychological constructs. The idea that differences between people can be described using these constructs means, to some, that these dimensions play a causal role within individuals. But this kind of inference is not warranted: you can only be sure that a group-level finding generalizes to individuals when the data are ergodic. The dimensions that appear in a covariance structure analysis describe patterns of variation between people. 

A recent study showed that markers known to differentiate between cultures and social classes did not generalize to capture individual differences within any of the groups. So correlations at one level pose no constraint on correlations at another level. Similarly, two variables may correlate positively across a population of individuals, but negatively within each individual over time.

In cognitive psychology, a classic case in which the direction of a relationship is reversed within individuals is the speed-accuracy trade-off. The inter-individual correlation between speed and accuracy is generally positive, whereas within subjects there is an inverse relationship between speed and accuracy, reflecting differential emphasis in response-style strategies. 
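
A small simulation with invented numbers can make this reversal concrete; the assumption that skill drives both speed and accuracy between persons, while a trade-off operates within each person, is a modeling choice made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n_people, n_trials = 50, 200
within_rs, mean_speed, mean_acc = [], [], []

for _ in range(n_people):
    ability = rng.normal()                                  # more able: faster AND more accurate
    push = rng.normal(scale=0.8, size=n_trials)             # trial-to-trial emphasis on speed
    speed = ability + push
    accuracy = ability - 0.7 * push + rng.normal(scale=0.5, size=n_trials)
    within_rs.append(np.corrcoef(speed, accuracy)[0, 1])
    mean_speed.append(speed.mean())
    mean_acc.append(accuracy.mean())

print(f"between-person r = {np.corrcoef(mean_speed, mean_acc)[0, 1]:.2f}")  # strongly positive
print(f"mean within-person r = {np.mean(within_rs):.2f}")                   # clearly negative
```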

What is the survival guide to Simpson's paradox?

The Simpson's paradox occurs in a wide variety of research designs, methods and questions. So, it would be useful to develop means to 'control' or minimize the risk of SP occurring, much like we wish to control instances of other statistical problems such as confounding variables. 

What we can do is consider the instances of SP that we are most likely to encounter, and investigate them for characteristic warning signals. The most general 'danger' for psychology is therefore well defined: we might incorrectly infer that a finding at the level of the group generalizes to subgroups, or to individuals over time. There are strategies for three phases of the research process: prevention, diagnosis and treatment of SP. 

How do you prevent the Simpson's paradox?

The first step in addressing SP is to carefully consider when it may arise. The mechanistic inference we propose to explain the data may be incorrect, and this danger arises when we use data at one explanatory level to infer a cause at a different explanatory level. In the absence of top-down knowledge, we are far less well protected against making incorrect inferences: we have, in essence, a cognitive blind spot within which we are vulnerable to making incorrect inferences. 

When you want to be sure that the relationship between two variables at the group level reflects a causal pattern within individuals over time, the most informative strategy is to intervene experimentally within individuals. Because modeling the effect of a manipulation is what allows you to rule out SP at the level of the individual, the strongest approach is a study that can assess the effects of an intervention, preferably within individual subjects. 

Beyond the Null Ritual: Formal Modeling of Psychological Processes - Marewski & Olsson - 2009 - Article


One of the most entrenched rituals in science is that of the null hypothesis: testing a hypothesis against chance. Although it is known to be problematic, it is widely used in practice. One way to resist the temptation of the null ritual is to make theories more precise by transforming them into formal models. These can be tested against each other instead of against chance, which in turn enables the researcher to decide between competing theories on the basis of quantitative measures. 

The arbitrariness of the .05 alpha level gives the writer flexibility in interpreting a p-value as an indication of evidence against the null hypothesis. This article is about overcoming a ritual involved in testing hypotheses in psychology: the null ritual, or null hypothesis significance testing.  

This means that a non-specific hypothesis is tested against 'chance', that is, against the hypothesis that 'there is no difference between two population means'. Forty years ago, editors of major psychological journals required this ritual to be carried out in order for a paper to be published. Although methodological evidence nowadays speaks against it, the .05 alpha level is still used.

What lies beyond the null ritual?

Rituals have a number of attributes that all apply to null hypothesis testing: repetition of the same action, a focus on the 5% (or 1%) level, fear of sanctions from journal editors, and wishful thinking about the results. In its most extreme form, the null ritual reads as follows:

  1. Set up a statistical null hypothesis with "no mean difference" or "zero correlation". Do not specify the predictions of the research hypothesis or of alternative hypotheses.

  2. Use 5% as a convention for rejecting the null. If the result is significant, accept the research hypothesis.

  3. Always perform this procedure.

Since this ritual became institutionalized in psychology, several alternatives have been proposed to replace or supplement it. Most of these suggestions focus on the way the data are analyzed: think of effect-size measures, confidence intervals, meta-analysis and resampling methods. 

How is it possible that, despite attempts to introduce alternatives, the null ritual is still the most widely used approach? This may be due to the fact that most psychological theories are simply too weak to do more than predict the direction of an effect. The authors therefore do not offer yet another alternative to the null hypothesis test in this article, but a way to make theories more precise by turning them into formal models.

What is a model?

A model is a simplified representation of the world that is used to explain observed data. In a broad sense, countless verbal and informal explanations of psychological phenomena qualify as models; in a more limited sense, a model is a formal instantiation of a theory that specifies the theory's predictions.

What is the scope of modeling?

Modeling is not meant to always be applied in the same way; it must be seen as a tailor-made tool for specific problems. Modeling helps researchers to understand complex phenomena. Each method has its own specific advantages and disadvantages, just as null hypothesis testing does. Although modeling is also often used in other areas, in psychology it is often used for research on cognitive systems. Modeling is a complex undertaking that requires a lot of skill and knowledge.

What are the advantages of formally specifying theories?

There are four benefits of increasing precision of theories by casting them as models. 

1. A model provides a design that has strong theory tests

Models provide the bridge between theories and empirical evidence. They enable scientists to make competing quantitative predictions that allow strong comparative tests of theories. Making theories precise ultimately leads to systematic quantitative comparisons of the predictions of the theories tested. By comparing the quantitative predictions of different models, null hypothesis testing can become unnecessary.

2. A model can sharpen a research question

Null hypothesis tests are often used to test verbal, informal theories. But if such theories are not precisely specified, they can be used post hoc to "explain" every possible observed empirical pattern. Formal quantitative predictions are not easy to derive by intuitive reasoning alone; often the predictions that a model makes can only be understood by performing computer simulations. In summary, it is often only through doing the modeling oneself that one understands what a theory actually predicts and what it cannot account for. The goal of modeling is not only to find out which of several competing explanations of the data is preferable, but also to sharpen the questions being asked. 

3. A model can lead beyond theories that have arisen from the general linear model

Many null hypothesis significance tests only apply to simple hypotheses, for instance about linear additive effects. Scientists take available tools such as ANOVA and transform them into psychological explanations for certain data. A prominent example is attribution theory, which assumes that, just as experimenters use ANOVAs to infer causal relations between variables, people outside the lab infer causal relations by unconsciously doing the same calculation. But this might not be the best starting point for building a theory. Although the general linear model (of which ANOVA is one variant) is a precise methodological tool, it is not always the best basis for making theoretical statements or for building a theory.

4. A model helps to approach real-world problems

Just as the general linear model and the null hypothesis test are often inadequate for conceptualizing and evaluating a theory, factorial designs can lead to testing theories under conditions that have little to do with the real world, where the explanatory power of theories should ultimately prove itself. A lack of external validity can be one of the reasons why psychological findings contribute little outside the lab: in the real world, no person can randomly choose whom they are in contact with, and no organism can 'switch off' the correlations between life-relevant pieces of information. Modeling, on the other hand, gives researchers the freedom to deal with natural confounders without destroying them: they can be built into the models. Modeling provides ways to increase the precision of theories. It helps researchers to quantify explanatory power, and it ensures that they are not dependent on the null hypothesis. Formal statements can be linear or non-linear. By looking beyond factorial designs, the possibility is created to approach real-world problems.

What are more benefits of formal modeling: an example of a modeling framework?

ACT-R is a wide, quantitative theory of human behavior that covers almost the entire human cognitive field.

Meta-analysis can be used to show that relying on significance tests slows the growth of cumulative knowledge. ACT-R, on the other hand, is a good example of how knowledge can accumulate systematically over time. ACT-R has its roots in older psychological theories, but evolved over time into its current form. It has shown how cognitive systems can give rise to adaptive processes by being tuned to the statistical structure of the environment.

ACT-R models are specific enough to allow computer simulation of both outcomes and processes. For example, in a two-alternative situation, reading this article or reading another article, an ACT-R model would predict which alternative would be chosen and what considerations the model would go through before making this choice. Scientists can make the following predictions with ACT-R: (1) overt behavior, (2) temporal aspects of behavior, and (3) the associated patterns of brain activity as measured by fMRI.

In summary, modeling can promote the growth of cumulative knowledge, reveal how different behavioral activities are distributed and it can help to integrate psychological disciplines.

How do you select between competing formal models?

The comparison between alternative models is called model selection. There are a number of criteria for model selection: (1) psychological plausibility, (2) falsifiability, (3) the number of assumptions a model makes, (4) whether a model is consistent with overarching theories, and (5) practical contribution. In practice, the criterion of descriptive adequacy is often used. This means that if two or more models are compared, the model that shows the smallest discrepancy with the existing data, that is, the best fit, is chosen.

A null hypothesis test is not a good way to choose between two models: given enough power, the test will give a significant result. The biggest limitation of model selection procedures based on significance or goodness-of-fit (R²) is that, on their own, these procedures do not address the fundamental problem in choosing between two competing theories: overfitting.

What is the problem of overfitting?

Concluding that one model is better than another using goodness-of-fit would be reasonable if psychological measurements were noise-free. Unfortunately, noise-free data are practically impossible to obtain. As a result, a model can overfit the data: it captures not only the variance resulting from the cognitive process of interest but also that of random error. Increased complexity makes a model prone to overfitting, thereby reducing generalizability. Generalizability is the degree to which the model is capable of predicting all potential samples generated by the same cognitive process, rather than fitting only a particular sample of existing data. The degree to which a model is susceptible to overfitting is related to the model's complexity: the flexibility that enables it to fit particular patterns of data. 

At the same time, increasing the complexity of a model can improve its generalizability - but only up to the point where the model is just complex enough to capture the systematic variation in the data. Beyond that point, extra complexity reduces generalizability, because the model starts absorbing random variation in the data. A good fit therefore does not guarantee good generalizability to new data.
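To make the overfitting argument concrete, here is a minimal sketch (my illustration, not taken from the article) in which a flexible polynomial model fits a noisy sample better than a simple model, yet typically predicts a new sample from the same process worse:

```python
# Minimal overfitting sketch (hypothetical data, not from the article).
# A flexible model captures noise in the observed sample and therefore
# tends to generalize worse to new data from the same process.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
signal = 2 * x                                   # the "cognitive process"
y_train = signal + rng.normal(0, 0.3, x.size)    # noisy observed sample
y_new = signal + rng.normal(0, 0.3, x.size)      # new sample, same process

for degree in (1, 10):                           # simple vs. very flexible model
    coefs = np.polyfit(x, y_train, degree)
    fit_train = np.mean((np.polyval(coefs, x) - y_train) ** 2)
    fit_new = np.mean((np.polyval(coefs, x) - y_new) ** 2)
    print(f"degree {degree}: error on fitted sample {fit_train:.3f}, on new sample {fit_new:.3f}")
```

The flexible model usually shows the smaller error on the fitted sample but the larger error on the new sample, which is exactly the fit/generalizability trade-off described above.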

How do you select between models?

Practical

This approach relies on the intuition that, when comparing models, one should choose the model that best predicts data that have not yet been observed. This can be done by estimating predictive accuracy on held-out data, most often through cross-validation. A limitation of this approach is that it is not consistent. Another way to deal with the problem is to restrict the number of free parameters as much as possible, by fixing parameters or by building simple models with few or no free parameters.
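A minimal cross-validation sketch (an assumed setup, not the authors' procedure): each candidate model is repeatedly fit on part of the data and scored on the held-out part, and the model with the lowest held-out error is preferred.

```python
# Cross-validation sketch with hypothetical data: the candidate "models" are
# polynomials of different degrees; the one that predicts held-out data best
# is selected.
import numpy as np

def cv_error(degree, x, y, k=5):
    """Mean squared prediction error on held-out folds."""
    folds = np.array_split(np.arange(x.size), k)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(x.size), test_idx)
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        errors.append(np.mean((np.polyval(coefs, x[test_idx]) - y[test_idx]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 2 * x + rng.normal(0, 0.3, x.size)
for degree in (1, 3, 10):
    print(f"degree {degree}: cross-validated error {cv_error(degree, x, y):.3f}")
```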

Simulation

By simulating the predictions of a competing model, one can gain insight into a specific behavior of a model. The results can be used to design the task to maximize the discriminatory power between models.

Theoretical

In this approach, a goodness-of-fit measure is combined with a theoretical estimate of model complexity, which together yield an estimate of generalizability (generalizability = goodness-of-fit + a complexity penalty). The goodness-of-fit index is usually the maximum log likelihood. The complexity term takes different forms in different generalizability measures. The most commonly used criteria are the AIC and the BIC; these are sensitive to only one kind of complexity, namely the number of free parameters.
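As a hedged illustration of how such criteria work (the formulas are the standard AIC and BIC definitions; the numbers below are hypothetical):

```python
# AIC and BIC trade maximum log likelihood (goodness of fit) against a
# penalty for the number of free parameters k; BIC's penalty also grows
# with the number of observations n. Lower values are better.
import math

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical comparison: model B fits slightly better (higher log
# likelihood) but uses twice as many free parameters.
n = 100
print("Model A:", aic(-520.0, 3), bic(-520.0, 3, n))
print("Model B:", aic(-518.5, 6), bic(-518.5, 6, n))
```

In this made-up example, the small improvement in fit does not compensate for the extra parameters, so both criteria favor the simpler model.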

How do you choose between model selection approaches?

We cannot say in general which approach is better; the recommendation is to report the results of as many selection criteria as possible and to discuss the suitability of each criterion.

What are other pitfalls of model selection?

There are also other complications that can arise when designing and testing models. If specification is modeling's greatest virtue, it can also be its greatest curse. One must choose how to bridge the gap between an informal verbal description and a formal implementation, which can lead to unintended discrepancies between a theory and its various formal counterparts. This is known as the irrelevant specification problem.

A second problem that can arise with complex models is Bonini's paradox: as models become more complete and realistic, they also become less understandable and more opaque.

Third, there can also be an identification problem: for any behavior, there is a universe of different models that are all capable of explaining and reproducing that behavior. There are likewise an infinite number of vague and informal theories in circulation for which nobody will ever be able to decide whether one is better than another.

Conclusion

Although modeling is preferable from a scientific point of view, few researchers use this approach, because it requires considerable effort, time and knowledge. The acceptance of null hypothesis testing in laboratory settings also reduces the incentive for scientists to design models that explain issues in the "real" world. Yet there is often no better alternative than building and testing models, because informal theories are not specific enough and are only ever tested against chance (the null hypothesis). With some knowledge and training, modeling can be performed with relatively little effort.


Evaluating theories - Dennis & Kintsch - 2008 - Article


Introduction

According to Popper all theories are false, which might suggest that evaluating theories is very straightforward. But some theories are more false than others, and some theories help advance scientific knowledge because of their characteristics. A theory is a concise statement about how we believe the world to be. Theories are used to organize our observations of the world and give researchers the opportunity to make predictions about what will happen in the future. In recent years, interest has grown in how theories can be tested with formal models.

Criteria on which to evaluate theories

Descriptive adequacy

Judging a theory by the extent to which it accords with the data is probably the most important criterion. Different domains employ data in different ways, so it is an issue to understand what the "right" data are. But across all these domains data remain very important, and theories that are consistent with the data are preferred. In psychology, null hypothesis significance testing is a popular way to compare a theory against data. This involves setting up two competing hypotheses, one of which should hold if the theory is correct while the other should be rejected.

A difficulty with this kind of hypothesis testing is that it is not possible to conclude that there is no difference. Another difficulty is that researchers may generate an endless series of issues and achieve little cumulative progress. The advantage of formal models is that they can state precisely how closely they approximate the data and give us additional information about the nature of the relationship between variables.

Precision and interpretability

For a theory it is important that it is described in a precise fashion and it can be interpreted easily and unambiguously. Many theories are often described vaguely, which makes it unclear how data would invalidate such theories.

Coherence and consistency

Another criterion for a good theory is its coherence and consistency. There shouldn’t be any logical flaw or circularity. It is also important to ask how consistent a theory is, both with theories within psychology and outside psychology.

Prediction and falsifiability

The theory has to be formulated in such a way that critical tests can be conducted that could lead to its rejection. Even though falsification provides the most useful information for advancing scientific knowledge, confirmed predictions can also increase our confidence in a theory. Surprising predictions, which do not seem to fit our intuitions and yet turn out to be correct, provide more support for a theory than unsurprising predictions.

Postdiction and explanation

The theory has to provide a genuine explanation of existing results. Postdictive explanations are not as strong as predictive explanations, but they are still explanations. Our explanations of behavior are often postdictive. In psychology there is no reason to think we will ever have all the information needed to make a precise prediction of a person's future acts, so prediction can be our goal only in limited circumstances.

Parsimony (Occam’s Razor)

Theories should be as simple as possible. Only the things needed to explain a phenomenon should be included. It is also important to consider the range of data sets the theory can fit.

Breadth

Theories should try to be as broad as possible, while still maintaining the other criteria that were discussed, such as descriptive adequacy.

Originality

Even though theories may look completely different, and are different in their broader implications, it can be impossible to differentiate them when it comes to a specific set of data. So we have to be very careful when we are comparing theories against each other.

Usability

Good scientific theories should be useful in addressing societal problems. The best research contributes to scientific understanding, while also fulfilling a societal need.

Rationality

The claims that are made by the theory should seem reasonable. Over many years of evolution, the cognitive system has been adapted to the environment: it is tuned to the way information is distributed in that environment, so a theory's claims should seem reasonable in the light of this environment. It is not easy to obtain the relevant environmental statistics, but when it is possible, they provide convincing support for a theory.

Conclusions

We have to take multiple considerations into account in the evaluation of theories. Every theory is different and places different demands on the reader, so the weighting of the points discussed depends on many factors. In every case the factors have to be weighed carefully.


Karl Popper and Demarcation - Dienes - 2018 edition - Article


What are the degrees of falsifiability?

A potential falsifier of a theory is any potential observation statement that would contradict the theory; for instance, 'Peter the swan is black' is a falsifier of the hypothesis that 'all swans are white'. One theory is more falsifiable than another if its class of potential falsifiers is larger. Scientists therefore prefer simple theories, because they are better testable. Meehl criticized much of psychology for relying on theories that are barely falsifiable: a prediction such as 'Group A will score differently from Group B' rules out virtually nothing and reflects a very weak theory.

A theory can gain in falsifiability not only by being precise but also by being broad in the range of situations to which the theory applies. The greater the universality of a theory the more falsifiable it is, even if the predictions it makes are not very precise. 

Revisions to a theory may make it more falsifiable by specifying fine-grained causal mechanisms. As long as the steps in the proposed causal pathways are testable, specifying the pathway gives you more falsifiers.

Psychologists sometimes theorize and make predictions by constructing a computational model: a computer simulation of the subject. In order for the model to perform, its free parameters have to be set to particular values, but you cannot directly observe the values of these parameters.

With computational models it can be difficult to predict how the model will behave just by thinking about it. The model has to be run and its behaviour observed and analyzed. Often modellers just try to find any set of parameter values that fits the data. If the best-fitting version of each model fits about equally well, the modeller may conclude that there is no reason to prefer one model over another. Popper's ideas show the inadequacy of simply finding the best-fitting models: if a model has passed more severe tests, it was more falsifiable to begin with.
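As an illustrative sketch (not Dienes' example), this is the kind of parameter search modellers perform: the free parameters are adjusted until the discrepancy between the model's predictions and the data is as small as possible.

```python
# Hypothetical model fitting: a toy model predicts reaction time from set
# size with two free parameters (base time and slope), and we search for
# the parameter values that minimize the squared discrepancy with the data.
import numpy as np
from scipy.optimize import minimize

set_size = np.array([1, 2, 3, 4, 5])
observed_rt = np.array([0.52, 0.61, 0.70, 0.83, 0.95])   # made-up data (seconds)

def predict(params, set_size):
    base_time, slope = params            # the model's free parameters
    return base_time + slope * set_size

def misfit(params):
    return np.sum((predict(params, set_size) - observed_rt) ** 2)

best = minimize(misfit, x0=[0.4, 0.1])   # start from an arbitrary guess
print("best-fitting parameters:", best.x)
```

Finding such a best fit says nothing by itself about how falsifiable the model was; a second model with more free parameters could fit these five points just as well while forbidding far less.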

A theory that allows everything explains nothing: the more a theory forbids, the more it says about the world, and the 'empirical content' of a theory increases with its degree of falsifiability. At the same time, the more falsifiable a theory is, the more open it is to criticism. So the more falsifiable a theory is, the faster progress can be made, given that progress comes from criticism.

Popper stated that good science shows itself not just by the literal form of its theories, but also by the history of the theories leading up to the current proposals. Suppose you hold the hypothesis 'all swans are white' and then find one exception, 'Peter the swan is black'. Amending the theory merely to accommodate this exception is ad hoc: it decreases falsifiability and is unsatisfying. Popper proposed that revisions and amendments should always increase falsifiability. In psychology, attempts to save theories without adding new testable consequences are often called 'post hoc'.

Falsifiability: too strong a criterion or too weak?

There are two criticisms of Popper's approach:

  1. No theory is falsifiable at all 
  2. All theories are falsified anyway

Critics often focus on the fact that accepting an observation statement involves accepting various levels of auxiliary theory as well as the theory under test. There is no general method for determining which of the theories should be rejected when an apparent falsification occurs: when a falsification happens, how do we know which component of the system to reject? This widely recognized problem is called the Duhem-Quine problem.


Scaling - Furr & Bacharach - 2014 - Article


In psychological tests, numbers are assigned to traits to express the differences between the traits of different test subjects. Measurement is the assignment of numbers to objects, or to characteristics of individuals' behavior, according to a certain scale. Scaling is the way numbers are assigned to psychological traits.

What are the fundamental problems with numbers?

In psychological measurement, numbers are used to express the level of a psychological characteristic. The numbers can reflect different properties in different ways.

Identity

The most important thing when measuring a characteristic is looking at the differences and similarities between people. On the basis of the differences, the test subjects or objects can be divided into categories. The categories must satisfy a number of requirements. First, all test subjects within a category must be identical on the attribute that the category represents. Second, the categories must be mutually exclusive, which means that each test subject can be classified into only one category. Third, no person may fall outside the categories. Numbers are used here only as labels for the categories. They have no mathematical value, so no quantitative meaning can be attached to them.

Rank order

A rank order of numbers contains information about the relative amount of a property that people possess: whether you possess a trait to a greater or lesser extent compared to the other people in the category. Here too, the numbers are only labels. They convey the ranking within the category, but have no further mathematical meaning.

Quantity

Quantity provides the most information. With quantity, the numbers assigned to persons make it possible to look at the precise difference between two people. At this level the numbers also have mathematical meaning, and calculations can be performed with them. When psychological measurements are made, it is often assumed that the scores have the property of quantity, but, as will be discussed later, this is rarely a safe assumption.

The number zero

There are two potential meanings of zero. Zero can mean that the property is completely absent (absolute zero), as with reaction time. Zero can also represent an arbitrary amount of a property (arbitrary zero), as with a clock or a thermometer. It is important to determine whether the zero in a psychological test is arbitrary or absolute. It is possible that a test yields a score of zero while the person does possess the characteristic; the zero must then be treated as arbitrary, even if it was initially intended as absolute. Identity, rank order, quantity and the meaning of zero are important issues in understanding scores on psychological tests.

How can the measured variable be determined?

If the property of quantity is used, the unit of measurement must be clearly defined. An example is length: if you want to know the length of something, you can measure it with a ruler that is divided into centimeters, so you can measure the length in centimeters. In psychology, the unit of measurement is often much less clear or self-evident. There are three ways in which units of measurement can be arbitrary.

One way is that the size of the unit (its height, weight, etc.) is chosen arbitrarily; this choice is then fixed and treated as the standard.

The second way is that the units are not tied to one type of object. Units can be applied to many types and many different objects.

The third way is that units can serve different types of measurements. An example is a piece of rope with which you can measure length, but you can also use the piece of rope to measure the weight of something.

If the units have a physical form, standard measurements are based on the three points mentioned above and are therefore arbitrary on all three points. Measurements in psychology are generally arbitrary only in the first sense: you can choose what the unit means and what size it has, but the units are usually tied to a specific object or dimension. An important exception is that standard physical measurements are sometimes used to measure psychological characteristics, such as cognitive processes that are measured through a person's reaction time.

What role do adding and counting play in psychometrics?

Both in the physical and in the psychological world, counting is important in the measurements that we perform.

Adding

An important assumption is that the size of the unit does not change while the units are being counted: every unit is the same size, and with each additional unit exactly one unit is added. This remains constant even if the conditions of the measurement change (conjoint measurement). With a questionnaire, however, there are questions that are easy and questions that are difficult. As a result, for most questionnaires one cannot simply award one point per question; more points may be awarded for more difficult questions. But how many points do you assign to a question? This creates a paradox: we want to translate a psychological characteristic into numbers in order to examine quantity, but we cannot do so exactly, because we do not know precisely how large one unit of a psychological characteristic is.

Counting

A point of controversy about the relationship between counting and measuring arises when we start to count things instead of properties. Counting is only equal to measuring when the quantity of a characteristic or property of an object is reflected.

Which measuring scales are there?

Measurement is the assignment of numbers to observations of behavior in order to make the differences between psychological traits visible. There are four measurement levels, or four scales: nominal, ordinal, interval, and ratio.

The nominal scale

The nominal scale is the most fundamental level of measurement. At the nominal level, the test subjects are divided into groups: those who are equal to each other on the attribute are classified together, so the differences lie between the groups. You can assign numbers to the groups, but those numbers only identify the group; no arithmetic can be performed on them. In daily life numbers are also assigned to individual people, but that in itself is not nominal measurement. It is important to make clear what the numbers refer to: either to individuals or to a group (the nominal level of measurement).

The ordinal scale

On the ordinal scale, the numbers reflect the relative ordering of the observations of behavior. Numbers are assigned to individuals within a group, and from these numbers one can read off the ranking of the individuals. The numbers only indicate whether you possess a trait to a greater or lesser extent than the other people in the group; they say nothing about the absolute amount of the property that a person has.

The interval scale

The interval scale goes one step further than the ordinal scale. Here the numbers assigned also represent a certain amount: they reflect quantitative differences between people on the trait being measured. Furthermore, the interval scale has an arbitrary zero: a zero score does not mean that the attribute is absent. With the interval scale you can add and subtract quantities, but you cannot form meaningful ratios by multiplying or dividing them. Many psychological tests are used and interpreted as if they are based on an interval scale, but in fact the majority of psychological tests are not.

The ratio scale

Ratio scales have an absolute zero: a score of zero means that the attribute is absent. On a ratio scale it is also possible to multiply and divide scores meaningfully, which is not possible on an interval scale. According to most test experts, there are no psychological tests at the ratio level. When measuring reaction time one might think that a ratio scale is being used, but this is not the case, because no person can respond in zero milliseconds.
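As an informal illustration with hypothetical numbers (not from Furr and Bacharach), the scales differ in which arithmetic operations are meaningful:

```python
# Hypothetical scores at each level of measurement.
jersey_numbers = [7, 10, 23]        # nominal: numbers are only labels
finishing_ranks = [1, 2, 3]         # ordinal: order matters, distances do not
temperature_c = [10, 20, 30]        # interval: differences meaningful, zero arbitrary
weight_kg = [50, 100, 150]          # ratio: absolute zero, ratios meaningful

# Interval: the difference below is meaningful (10 degrees warmer) ...
print(temperature_c[1] - temperature_c[0])
# ... but "twice as warm" is not, because 0 degrees C is an arbitrary zero.

# Ratio: 100 kg really is twice as heavy as 50 kg, because 0 kg means absence.
print(weight_kg[1] / weight_kg[0])
```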

What implications does scaling have?

In theory, a score of zero on an interval scale can represent some quantity of a property; it does not mean that the property is completely absent.

For tests with dichotomous variables, the binary codes (0 and 1) can be used. Depending on the characteristic being measured, it can be interpreted as a nominal scale or as an interval scale.


Statistical treatment of football numbers - Lord - 1953 - Article


Professor X sold 'football numbers'. The audience had to have a way of telling which football player was which, so each player had to wear a number on his football uniform. It didn't matter which number, as long as it was no more than a two-digit number.

Professor X loved numbers. When tests were given, he couldn't wait to put the scores in his back pocket and hurry back to his office, where he would lock the door, add the scores up, and then calculate means and standard deviations for hours on end. But test scores are ordinal numbers, and ordinal numbers cannot be added. So the professor came to the conclusion that it was wrong to compute means and standard deviations of test scores. This eventually led to a nervous breakdown and Professor X retired. In appreciation of his work, the school gave the professor the 'football numbers' concession.

The professor made a list of all the numbers given to him, and he found out he had 100,000,000,000,000,000 two-digit numbers to start with. The numbers were ordinal numbers, and the professor was tempted to add them up, square them and compute means and standard deviations. But these numbers were only 'football numbers', like letters of the alphabet. The teams brought their numbers to the professor: first the sophomore team, a week later the freshman team. By the end of the week there was trouble. Information secretly reached the professor that the numbers in the machine had been tampered with. The freshman team appeared in person to complain: they had bought 1,600 numbers from the machine, the numbers were too low, and they were being laughed at because of them.

The professor persuaded the freshman team to wait while he consulted the statistician who lived across the street: maybe the freshman team had got low numbers by chance. The statistician took the professor's list, added all the numbers together and divided. 'The population mean', he said, 'is 54.3.' 'But you can't add them!' the professor expostulated. 'Oh, can't I? I just did,' said the statistician. 'But these are ordinal numbers, you can't add and divide them!' 'The numbers don't know that,' said the statistician. 'Since the numbers don't remember where they came from, they always behave just the same way, regardless. And if you doubt my conclusions, I suggest you try and see how often you can get a sample of 1,600 numbers from your machine with a mean below 50.3 or above 58.3.'

To date, the professor has drawn over 1,000,000,000 samples of 1,600 from his machine, and only two of those sample means fell below 50.3 or above 58.3. He is happy, because he is adding and dividing the football numbers that were given to him.
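A rough simulation of the statistician's argument (assuming, purely for illustration, that the machine holds a uniform population of two-digit numbers; the article does not specify the distribution):

```python
# Draw many samples of 1,600 "football numbers" and see how often the
# sample mean is as extreme as the thresholds in the story.
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(10, 100)                 # two-digit numbers 10..99
print("population mean:", population.mean())    # about 54.5 under this assumption

sample_means = rng.choice(population, size=(10_000, 1_600)).mean(axis=1)
extreme = np.mean((sample_means < 50.3) | (sample_means > 58.3))
print("share of samples with an extreme mean:", extreme)   # essentially zero
```

Whatever one thinks about the scale level of the numbers, the sampling distribution of the mean behaves the same way, which is the statistician's point: a freshman mean that low is extraordinarily unlikely to arise by chance.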


Fearing the future of empirical psychology: Bem's (2011) evidence of psi as a case study of deficiencies in modal research practice - LeBel & Peters - 2011 - Article


Introduction

Bem, a researcher, reported a series of nine experiments in the Journal of Personality and Social Psychology. He claimed these experiments to be evidence for the existence of psi, which he defined as the anomalous retroactive influence of future events on an individual's current behavior. Lebel, the author of the present article, uses Bem's article as a case study to discuss important problems with modal research practice (MRP): the methodology that empirical psychologists most commonly use in their research. Lebel also discusses how to improve this methodology.

Lebel states that Bem's report is of high quality, which makes it a good example for discussing the problems with MRP. He discusses the ways in which Bem's results may be due not to a real effect of psi but to the deficiencies of MRP. Lebel focuses on three methodological issues in Bem's article, which reflect general deficiencies in MRP: an overemphasis on conceptual replication; insufficient attention to verifying the integrity of measurement instruments and experimental procedures; and problems with the way null hypothesis significance testing (NHST) is used. These deficiencies lead to a bias in the interpretation of data in empirical psychology. According to Lebel, it was because of these features of MRP, rather than the actual results, that Bem was able to publish his article in a high-quality journal.

Lebel states that he uses Bem's article as an example, but that his criticism and recommendations for improved practices apply to all research that is conducted within the MRP tradition.

The interpretation bias

Lebel states that empirical data underdetermine the choice of theory: there are always alternative explanations of the data, both when the data support the researcher's hypothesis and when they do not. However, in MRP, deficiencies lead to biases in the interpretation of data, which is called interpretation bias. Interpretation bias is the tendency to interpret the data in a way that is favorable to the researcher, whether the hypothesis is supported or not. Regardless of the data, the theory that is set up in MRP is thus not falsifiable. This increases the risk of reporting false positives and disregarding true negatives, which eventually leads to wrong conclusions about human psychology.

Lebel states that this interpretation bias has nothing to do with bad intentions of the researcher. It is merely the methodology that is deficient. He states that it is a systematic bias of MRP.

Conservatism in theory choice

Lebel discusses the problem of 'theory choice' in science to explain how MRP has serious deficiencies. The knowledge system in sciences such as psychology consists of two types of beliefs: theory-relevant beliefs (about the theoretical mechanisms that produce behavior) and method-relevant beliefs (about the procedures through which data are produced, measured and analyzed). In any experiment that tests a hypothesis, the interpretation of the data relies on both types of beliefs: the data can be interpreted as theory-relevant or as method-relevant.

Lebel states that deficiencies in MRP systematically bias the interpretation of confirmatory data as theory-relevant and the interpretation of disconfirmatory data as method-relevant. This means that the researcher's hypothesis is protected from falsification, which is problematic. According to Lebel, the interpretation of data should not depend on whether beliefs are about theory or method, but rather on how central these beliefs are. Central beliefs are beliefs on which many other beliefs and theories are built; peripheral beliefs are beliefs on which few other beliefs are built.

Deficiencies in MRP

Overemphasis on conceptual replication

Bem's nine experiments are 'conceptual replications', which means that none of these experiments was replicated exactly. This is in line with MRP, in which a statistically significant result is followed by a conceptual replication with the goal of extending the underlying theory. However, when a conceptual replication fails, it is not clear whether this is because the theory is false or because of methodological flaws in the replication. Most researchers choose the latter interpretation when a conceptual replication fails, and then proceed with another conceptual replication until they find a satisfactory significant result.

So the researcher has too much freedom in interpreting his or her results during conceptual replications. Also, the choice of which studies count as replications is made after they have been performed, and this choice is influenced by the interpretation bias. In other words, successful replications are published, while many failed replications end up in the 'file drawer'.

Integrity of measurement instruments and experimental procedures

In his article, Bem did not report anything about the integrity of the measurement instruments and the experimental procedures, and he did not provide any reliability estimates for them. So it is not clear whether these instruments are fit to measure the dependent variables in his studies. This lack of verification of the integrity of measurement instruments and experimental procedures weakens method-relevant beliefs. Lebel acknowledges that it is difficult to determine whether a manipulation (treatment) or measurement is a good one. This reflects the construct validity problem in psychology, which arises because psychological processes are very context-specific; that makes it difficult to know whether a manipulation or measurement is indeed valid.

Problems with NHST

Bem uses null hypothesis significance as the only criterion for his conclusions. Lebel states that this is not a good practice, because the standard null hypothesis of no difference is almost always false and it can lead to bizarre conclusions about the data. NHST, null hypothesis significance testing, is general practice in MRP. In MRP, the null hypothesis is often formulated as a "nil hypothesis", which states that the means of different populations are identical. This is not a good hypothesis, because it is almost always false: differences between populations are inevitable. Lebel states that the use of NHST undermines the rigor of empirical psychology, because it increases the ambiguity of theory choice.
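A small simulation (my sketch, not from the article) shows why a nil hypothesis is almost guaranteed to be rejected once the sample is large enough, even when the true difference is practically meaningless:

```python
# With a tiny true difference between two populations, significance is
# mostly a function of sample size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (50, 500, 50_000):
    a = rng.normal(loc=0.00, scale=1.0, size=n)
    b = rng.normal(loc=0.05, scale=1.0, size=n)   # trivially small true difference
    t_stat, p_value = stats.ttest_ind(a, b)
    print(f"n = {n:>6}: p = {p_value:.4f}")
```

With small samples the difference is typically non-significant; with very large samples it is almost always 'significant', even though nothing of practical importance has changed.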

Strategies for improving MRP

Lebel recommends strategies to improve research practice in psychology. He states that the methodology must be made more rigorous by strengthening method-relevant beliefs.

Recommendations for strengthening method-relevant beliefs

Stronger emphasis on close replication

Lebel states that MRP would improve with a stronger emphasis on close replication rather than conceptual replication. Close replication is a core element of science: it determines whether an observed effect is real or merely due to sampling error. Close replication can only be achieved through "parallel experiments" and should lead to increased confidence after each successful replication.

Verifying integrity of methodological procedures

To make beliefs about methods stronger and thereby improve the validity of MRP findings, it is critical to verify the validity of the measurement instruments and procedures. Lebel recommends pilot studies that are explicitly designed to fine-tune manipulations and measurement instruments. It should also be standard practice to check the internal consistency of the scores of a measurement instrument and to confirm measurement invariance of instruments across conditions. Lebel also recommends developing an empirically supported account of how the context sensitivity of mental processes varies under different conditions, to deal with the construct validity problem in psychology.

Use stronger forms of NHST

According to Lebel, a stronger form of NHST should be used: as in astronomy and physics, significance tests should be treated as just one criterion among several. He argues for Bayesian analytic techniques, which incorporate base-rate information into hypothesis testing.
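A minimal sketch of the underlying idea (hypothetical numbers; this is the general Bayesian updating rule, not Lebel's own analysis): the same evidence moves an a priori implausible hypothesis far less than a plausible one.

```python
# Posterior probability of a hypothesis from its prior probability (base
# rate) and a Bayes factor summarizing the strength of the evidence.
def posterior_probability(prior, bayes_factor):
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

# The same modest evidence (Bayes factor of 3) applied to an implausible
# hypothesis such as psi (prior 0.01) versus an ordinary one (prior 0.50):
print(posterior_probability(prior=0.01, bayes_factor=3))   # roughly 0.03
print(posterior_probability(prior=0.50, bayes_factor=3))   # 0.75
```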

Recommendations for weakening theory-relevant beliefs

Finally, Lebel states that psychological hypotheses should be formulated in a way that makes disconfirmation possible.


Popularity as a poor proxy for utility - Mitchell & Tetlock - 2017 - Article


Introduction

The implicit prejudice construct has moved extremely fast in recent years from psychology journals into other academic disciplines, newspaper editorials, courtrooms, boardrooms, and popular consciousness. The term appeared in the PsycINFO database for the first time less than twenty years ago. A Google search for “implicit prejudice” restricted to the years 1800-1990 returns only six hits, but for the years 1991-2006 it returns over 8,400 hits. In Google Scholar, the term “implicit prejudice” returns 46 hits for articles published between 1800 and 1990, but around 3,700 hits for the years after 1990. In 1998, an article about the implicit association test (IAT) was published; the IAT is now the most popular method for studying implicit prejudice. The IAT article has been cited more than 3,700 times in PsycINFO, 3,400 times in the Web of Science database and 7,500 times in Google Scholar.

The authors of this article state that the psychological construct “implicit prejudice” is not a good construct: even though it is very popular, it is misunderstood and lacking in theory and in practical use. For example, scholars do not agree about how implicit prejudice is linked to other forms of prejudice, so it is unclear whether the different measures of the construct all measure the same construct. The meaning of “implicit” is also challenged by some researchers. As for practical use, implicit measures of prejudice do not predict behavior better than traditional explicit measures of prejudice. In this article, the authors discuss how the psychological research behind the construct is inadequate to support the widespread use of implicit prejudice. They do this by first discussing how the term was created and how psychologists marketed the ideas. Later, they discuss their view on how the construct should be renovated, and how such popular ideas within social psychology are hard to overcome.

The implicit prejudice

There are two eras in the history of the implicit prejudice construct. The first era is the pre-IAT era, in which psychologists mainly developed indirect measures of prejudice with the goal of overcoming response biases, and mainly looked at automatic processes that lead to prejudice. The second era is the post-IAT era, in which the term became popular in public and even academic discussions, where it was used for widespread unconscious prejudices. In 1998, an article was published in Psychology Today which claimed extraordinary results: by studying the unconscious, researchers had found that everyone uses stereotypes, all the time, without knowing it. The article also stated that it is more difficult to avoid the negative effects of implicit prejudice than those of explicit prejudice. The author of the 1998 Psychology Today article states: “Even though our internal censor successfully refrains from expressing overtly biased responses, there is still danger of leakage. This leakage shows itself in our non-verbal behavior, such as our expressions, our stance, how far we are from another person and how much eye contact we make.”

The IAT article introduced the idea that automatic stereotyping and unintentional prejudice operate in automatic and subconscious ways. The IAT is said to measure widespread implicit preferences that people may have for majority groups compared to minority groups. Even people who are from a minority group themselves, might show a preference for majority groups over minority groups. These preferences would be more predictive of behavior compared to explicit measures of prejudice.

The term “implicit prejudice” became very popular after 1998, because of the IAT article. The IAT itself also became popular in the media and on the radio. For example, Greenwald discussed implicit bias and the behavioral effects of implicit bias on the radio. He said: “People are not actually aware that they have this bias. But, it can still affect their behavior and act on them. It can produce unintended discrimination, because it can produce discomfort in interracial interactions.” The IAT was also discussed often on opinion websites, educational websites and in popular science writing.

The authors of this article state that even though the IAT became very popular, it is not theoretically well grounded. There have also been no demonstrated positive impacts of IAT research: the predictive validity of implicit prejudice measures has not been shown to be higher than that of explicit prejudice measures, and the IAT has not led to any solutions for discrimination.

What Is implicit prejudice?

Two themes recur in the articles on implicit prejudice. First, reaction-time based measures of prejudice, such as the lexical decision task (Wittenbrink, 1997) and the IAT, are meant to measure implicit prejudice while avoiding the influence of normative pressures, such as social desirability. The social desirability bias is the tendency of respondents to answer survey questions in a socially desirable way. Second, psychologists believe that with these measures they can tap into subconscious attitudes, decisions and behaviors. The implicit prejudice construct therefore also reflects views about attitudes and about the influence of automatic psychological processes on behavior. Because these themes are highly related to each other, the authors of this article ask: are the processes really implicit, or is it only the measure that is an implicit measure of prejudice? This distinction between the meaning of “implicit” as referring to the processes or to the measurement is a highly discussed topic. Those who view the implicitness as lying in the processes disagree about the nature of these processes. For example, IAT researchers state that implicit biases sometimes operate in subconscious ways, while at other times individuals are aware of their biases but simply cannot control them. The authors of this article state that it is also not clear whether the two most reliable measures of prejudice, the IAT and the Affect Misattribution Procedure (AMP), are implicit measures at all.

According to Greenwald & Nosek, the distinction between explicit and implicit prejudice is not empirically testable. Also, many automatic processes of prejudice and stereotyping, for example aversive racism, are included in the description of the implicit prejudice construct. However, the theory about aversive racism is different than the theory behind the implicit prejudice.

Another argument that the authors of this article give for their view that the implicit prejudice construct needs to be criticized is that different implicit measures of implicit prejudice produce different patterns of results. This is also true for similar measures. The measures also do not correlate highly with each other.

The predictive validity of implicit prejudice measures

The IAT has face validity: we think we know what it means when respondents indicate that they like one group more than another group. However, as a measure of intergroup prejudice, the implicit prejudice construct lacks face validity. The authors of this article state that for implicit measures of prejudice, predictive validity is therefore crucial: if what is measured by the IAT or another implicit measure reliably predicts behaviour that can lead to intergroup conflict, this would also lead to better theoretical clarity about the underlying processes.

The measures of implicit prejudice are of an indirect nature. Many are reaction-time-based: in these studies, millisecond differences in response times are seen as evidence for a prejudicial attitude. However, if these measures do not predict any judgments or behaviors, the concept is meaningless.

The defenders of the implicit prejudice construct argue that it predicts discriminatory behavior. One reviewer of Blindspot wrote, when discussing the IAT: “The best indicator of the test’s validity is its prediction of behavior”. However, the authors of this article question whether the IAT predicts discriminatory behavior better than explicit measures, and this does not seem to be the case. Greenwald and colleagues (2009) reported in a meta-analysis that explicit measures predicted judgments, decisions and behavior better in seven of nine criterion domains. The IAT outperformed explicit measures in one area, but according to Greenwald and his colleagues this was due to poor performance of the explicit measures. The authors of this article also conducted an updated meta-analysis, based on that of Greenwald and colleagues, and found even lower estimates of the predictive validity of the IAT. They also found explicit measures to be more predictive than implicit measures in studies that looked at response times, which contradicts the idea that the implicit construct relates to subtle, automatic behaviors rather than deliberate, controlled behaviors. The authors also found large differences in results across studies: the variance was much greater than the estimated effect size. This means that the IAT will be a poor predictor of whether someone will act fairly or unfairly toward a minority group member. So the popular idea that the IAT is a better predictor of discrimination than explicit measures is false.

The score interpretation problem

According to IAT results, 75% of the American population would be described as implicitly “racist”. According to the authors, there are no studies that provide criteria for classifying IAT scores as nonexistent, low, moderate or high. So the feedback given by the IAT is not based on external validation; it is based only on Cohen’s conventions for effect sizes. However, those conventions come with the rule of thumb that effect sizes should be judged primarily on their practical meaning and significance, so this is an incorrect use of Cohen’s effect size rule of thumb. According to the authors, if the IAT's creators looked at their results differently, they could give the participants very different feedback. This has been shown earlier: when the IAT researchers changed their criteria for the extremity of the scores, the percentage of people showing strong anti-black bias on the IAT decreased from 48% to 27%. This change was not due to societal shifts or to new studies; it was solely due to the researchers’ change in definitions.
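A hypothetical illustration of how sensitive such percentages are to the chosen cutoff (the score distribution and cutoffs below are invented and do not describe real IAT data):

```python
# The share of people labelled "strongly biased" depends entirely on where
# an arbitrary cutoff is placed on the score distribution.
import numpy as np

rng = np.random.default_rng(7)
scores = rng.normal(loc=0.3, scale=0.4, size=10_000)   # invented IAT-style scores

for cutoff in (0.35, 0.65):                            # two invented definitions of "strong bias"
    share = np.mean(scores > cutoff)
    print(f"cutoff {cutoff}: {share:.0%} labelled 'strong bias'")
```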

The authors state that this freedom of researchers to draw important conclusions can be used in a negative way and lead to mischief. Their conclusion is that the scores of the IAT have no clear meaning: when one person scores higher on the IAT than another person, this does not mean that this person is more likely to express bias outside the testing context.

The implicit sexism puzzle

Research has found that men usually do not show a pro-male implicit bias, while women do show pro-female implicit attitudes. This finding is not in line with the common IAT findings, in which the historically advantaged group, men, is favored both by the members of that group and by the disadvantaged group; the IAT is therefore often used in explanations of female victimization. Because of this, the new findings about sexism should be taken into account, but they have received little attention in the literature about the implicit prejudice construct and the IAT. The IAT researchers have responded to this finding by pointing to implicit gender stereotypes, such as associations of men with math and science and stereotypes about leadership qualities. So an IAT is often used to assess whether men or women are more strongly linked to science and math, but not to assess implicit attitudes toward men and women. According to the authors of this article, this focus on implicit gender stereotypes is problematic for practical and theoretical reasons. First, implicit measures of these stereotypes are not predictive of discriminatory behavior. Second, only a few implicit gender stereotypes have been examined; for example, research has not looked at whether traits of good managers such as cooperativeness, fairness and integrity are more strongly associated with women than with men. So the IAT does not measure all the stereotypes that exist, and there is no reason to believe that these implicit gender stereotypes hold for women in all positions.

Third, there also seems to be a problem concerning the relationship between implicit attitudes and automatic semantic associations. If prejudice and discrimination depend on how feelings and beliefs interact, then such an interaction would be needed at the implicit level too. The authors also state that a contextualized model is needed, one that makes clear when and where each component of implicit prejudice will be predominant and in what behavioral form.

Bias is everywhere

According to Banaji and Greenwald, implicit attitudes are just evaluative associations of varying strengths with attitude objects. It does not matter whether these objects are products, places or people, and it also does not matter whether the source of these associations is cultural information or personal experience. In this view, implicit prejudice is just evaluative knowledge about different groups, with some evaluations being more negative than others.

The subjective judgment problem

One solution that is proposed to prevent the influence of implicit bias, is to objectify judgment and decision-making processes. This means that there should be only objective measures of performance. However, the articles positing this solution do not take into account the findings that these subjective evaluation criteria are not associated with discrimination against women and minorities.

The contributions of the IAT

The only positive contribution of the IAT, according to the authors of this article, is the finding that humans have implicit biases at high prevalence rates. But the bias categories of the IAT have no external validity: IAT scores are only a reliable predictor of future IAT scores. Also, even though a lot of research has been conducted into implicit prejudice and the IAT, there is still a lot of confusion about the nature of implicit prejudice, and in practical use, claims and conclusions often depend on the interpreter of the results. The authors criticize the inventors of the IAT and state that they developed a test that reliably produces statistically significant results, which makes it easy for researchers to use the IAT for their own purposes. Also, the more researchers publish results based on the IAT, the higher the motivation becomes to justify the tool and its results.

Why does the implicit prejudice still persist?

The authors of this article state that the IAT is very popular among psychologists because it makes it easy to produce statistically significant effects. They state that even though there is a lot of criticism of the IAT, it is still very popular, and that this is due to ideological ideas, publication bias and the lack of clear, consensual score-keeping measures within social psychology. Publication bias means that significant experimental results are favored for publication; IAT studies are therefore easier to publish and popular among psychologists. According to the authors, the research domain of implicit prejudice is a good domain for experimenting with extraordinary science: the rules for ‘success’ should be set ex ante, rather than allowing any pattern of results to be reinterpreted post hoc to fit preferred theories.


Introduction to qualitative psychological research - Coyle - 2015 - Article


Introduction

In the last few decades, there has been a shift in the methodologies used in psychological research. The author of this article refers to his student days in the mid-1980s: he learned that acceptable psychological research involved the careful measurement of variables, the control of other variables and the appropriate statistical analysis of quantitative data. He was never told about conducting psychological research using qualitative research methods. Around 1990, however, qualitative research was acknowledged and recognized for its contribution to the discipline. This was seen in, for example, the growing number of qualitative articles in peer-reviewed psychology journals. Now, qualitative work has become widely accepted in many branches of psychology, especially social psychology, health psychology, feminist psychology, psychotherapeutic and counseling psychology, clinical psychology and educational psychology.

The author describes the development of this process and the benefits of this shift. He also describes important issues and developments in qualitative research. 

Epistemology and the 'scientific method'

At the basic level, qualitative psychological research aims to collect and analyse non-numerical data through a kind of lens. Willig (2013) stated that most qualitative researchers want to understand "what it is like" to experience a certain condition (for example, how does it feel to live with a chronic illness or how do people experience being unemployed?). They are also interested in how people manage things (for example, how do people manage a good work-family balance?).

There are certain assumptions about epistemology in qualitative research. Epistemology refers to the bases or possibilities to gain knowledge. It has its base in philosophy. To be more clear: it tries to answer how we can know things and what we can know. 

Ontology is another concept, which refers to the assumptions we make about the nature of being, existence and reality.

So, different research approaches and methods have different epistemologies. Qualitative research involves a variety of methods with a range of epistemologies. Therefore, there are a lot of differences and tensions in the field. When a researcher favours a certain epistemological outlook, then he or she may choose methods that fit this position. Whatever position the researcher picks, he or she should be consistent throughout his research, so that he or she can write a coherent research report. However, sometimes a more flexible epistemological position is required (such as when one uses different epistemologies within the same study).

Often, research designs contain no discussion of epistemology. The author explains this by saying that the epistemological positions adopted by most researchers are taken for granted, for example the positivist-empiricist and hypothetico-deductive epistemology. Positivism means that the relationship between the world and our sense perception of the world is straightforward: there is a direct relationship between things in the world and our perception of them (when our perception is not skewed by factors that might damage that correspondence, such as our specific interests). So positivism holds that it is possible to obtain accurate knowledge about the world around us when we are able to be impartial, unbiased and objective. Empiricism refers to the idea that our knowledge of the world arises from the collection and organization of our observations of the world. With the use of categorization, we can develop a complex knowledge of the world and develop theories to explain it.

Most researchers are now aware that positivism and empiricism are not always accurate: they recognize that our observations and perceptions of the world are not purely 'objective' and do not directly provide 'facts' about the world. However, the claim that we need to collect and analyse data in order to understand the world is still central in research, and qualitative researchers agree with this. They do, however, have different ideas about what counts as appropriate data and how these data should be generated and analysed.

A theory of knowledge that was developed in response to the shortcomings of positivism and empiricism is hypothetico-deductivism. Karl Popper (1969) believed that no scientific theory could be definitively verified. Instead, he argued, the aim should not be to obtain evidence to support a theory, but to falsify hypotheses. Research that adopts a hypothetico-deductive approach therefore seeks to develop hypotheses and test them. The underlying thought is that by identifying false claims, we can develop a clearer sense of the truth. Such research uses deductive reasoning: it starts with existing theories, which are refined into hypotheses, which are tested through observations, which eventually lead to confirmation or rejection of the hypotheses. This is also called a 'top-down' approach.

The 'scientific' method refers to identification with the assumptions of positivism, empiricism and hypothetico-deductivism. In this method, it is assumed that a reality exists that is independent of the observer and that we can access this reality through research (which is called the ontological assumption of 'realism'). Researchers thought that, for accurate and objective information, they needed to be detached from their research, so that they could not 'contaminate' the research process. Contact between researchers and participants was therefore minimized or standardized (all participants received the same instructions), and the researcher was 'erased' from the research process. This is also visible in the use of passive rather than personal language: for example, instead of writing "I developed a questionnaire", a researcher would write "A questionnaire was developed".

Qualitative research might also be conducted in light of this scientific method. For example, if there is an area that has never been researched before, then qualitative research might identify key elements in that area, which then can be tested with the use of measurement instruments such as questionnaires. There are also some qualitative research methods that fully adopt the scientific method. For example, Krippendorf (2013) came up with the structured form of content analysis. In content analysis, qualitative data is quantified very systematically. It is also concerned with reliability (which is not often the case in qualitative research). This is called the 'small q' qualitative research and is defined as research that uses qualitative tools and techniques within a hypothetico-deductive framework. 'Big Q' qualitative research, in contrast, is defined as the use of qualitative techniques within a qualitative paradigm. 'Big Q' qualitative research rejects the idea of objective reality or universal truth. It emphasizes understanding of the context. In this book, all the information refers to 'Big Q' qualitative research.

Understanding individuals in context

Wilhelm Dilthey (1894) stated that the human sciences should focus on establishing understanding rather than causal explanations. This idea was echoed in the nomothetic-idiographic debate of the 1950s and 1960s. Nomothetic research approaches want to uncover generalizable findings that explain objective phenomena. Idiographic research approaches want to look at individual cases in detail, to understand particular outcomes.

Allport (1962), an influential researcher, stated that we cannot capture the uniqueness of an individual's personality with the use of statistical scores. Harré and Secord (1972) also criticized the focus on the manipulation of variables and the focus on quantification in psychological research. In their classic text "Human Inquiry", Reason and Rowan (1981) advocated a new paradigm for psychology. Also, Lincoln and Guba (1985) advocated a 'naturalistic' paradigm, which means that one should search for detailed descriptions, so that reality is represented through the eyes of the research participants in their context.

These ideas were also adopted by second-wave feminism in the 1960s and 1970s. Psychological research that looked for 'sex differences' in various domains of life had routinely concluded that women 'fall short'. To get more insight into women and their experiences, feminist psychologists used qualitative methods that had a phenomenological approach. Phenomenological approaches focus on obtaining detailed descriptions of experience as understood by those who have that experience. One feminist qualitative method that was developed was the 'voice relational method', in which the aim is to 'hear the voices' of people who have often been suppressed (such as those of adolescent girls). These kinds of approaches are inductive: they start with data, and then patterns in the data are labelled. This is also referred to as a 'bottom-up' approach.

Critical stance on the construction of reality

From a social constructionist perspective, the ways in which we understand the world and ourselves are formed by social processes, such as linguistic interactions. There is therefore nothing fixed about these ways of understanding: they all depend on particular cultural and historical contexts. This is called a 'relativist' stance: reality is seen as dependent on the ways we come to know it. Relativism and social constructionism contrast with the ontology and epistemology of other approaches to qualitative research, which tend to assume that there is some relationship between the outcome of the analysis of research data and the actualities (truths) of which the analysis speaks. 

To elaborate: many qualitative researchers acknowledge that the relationship between the analysis and the experiences it describes is not direct. For example, when we want to qualitatively assess men's experiences of expressing emotions, we know that some men may have forgotten details or may present themselves in a more positive way. Researchers are also often aware of their own professional and personal influences on the research; this is called the interpretative framework. Still, these researchers assume that there is some relationship between the analysis and the truth or reality. From a social constructionist perspective, by contrast, data on emotions are not seen as reflecting a reality about emotions. Instead, they are seen as accounts that construct emotions in particular ways and that use 'emotion talk' to perform social functions.

Reflexivity in qualitative research

A key feature of qualitative research is reflexivity. Reflexivity means that the researcher acknowledges that his or her interpretative framework (or speaking position) has played a role in the research. In quantitative research this is often seen as a 'contaminating' factor. However, the author states that when reflexivity is done properly, it makes the research process more transparent and helps readers to understand the researcher's work.

A difficulty with reflexivity is that, when the rest of the report is written in a detached (non-personal) style, the use of 'I' in the reflections can be confusing. The author suggests being consistent throughout the research report. This might mean making the whole report more personal, or clearly separating the personal reflections from the rest of the text. 

Most academic journals do not publish articles with personal reflections, mostly because of the tight word limits for articles. The author states that, as a consequence, such articles lack an important contextual aspect, and readers are no longer in a good position to understand and evaluate the research.

Evaluative criteria for qualitative research

The author explains how readers of qualitative research can evaluate the worth of these studies. Quantitative studies are often evaluated against criteria such as reliability and internal and external validity. All of these measures rely on objectivity and on limiting researcher 'bias'. 

In qualitative research, researcher 'bias' cannot be limited in this way, because the researcher is always present in the research.

Elliott and colleagues (1999) developed seven evaluative criteria that apply to quantitative as well as qualitative research, and seven criteria specifically for qualitative research. Reicher (2000) favoured looser evaluative schemes, such as that of Yardley (2000). Yardley stated that good qualitative research shows 'sensitivity to context', 'commitment and rigour', 'transparency and coherence' and 'impact and importance'. 'Sensitivity to context' means that a researcher should make the context of the theory clear (the socio-cultural setting of the study). 'Commitment' is defined as demonstrating prolonged engagement with the research topic, and 'rigour' as the completeness of the data collection and analysis. 'Transparency' means that every aspect of the data collection process is described clearly and in detail. 'Coherence' is defined as the 'fit' between the research question and the philosophical perspective that is adopted. 'Impact and importance' refer to the theoretical, practical and socio-cultural impact of the study.

The researcher can use the criteria that are most appropriate to the study, justify that choice of criteria, and allow readers to understand the reasoning behind it. 

There is also a criterion concerning the practical utility of qualitative research: the 'so what?' question. A researcher should ask how his or her study contributes to science or society. 

Methodolatry and flexibility in qualitative research

Each of the sets of steps mentioned in this book is only a guide or useful 'road map'. It is important to remember that these maps describe only one route to an analysis. You should not become too fixated on these routes: view them as possible routes. With time and increasing experience, you can develop your own 'take' on conducting qualitative research and you might even develop methods for future research.

Combining research methods and approaches

In recent years, qualitative and quantitative methods have often been combined in the same research project. This is called a mixed-methods approach. The author states that this is a good thing, because it can provide richer research outcomes. The decision to use a mixed-methods approach should be based on the research questions. One could also use a 'pluralistic analysis', which refers to applying different qualitative methods, with different ontologies and epistemologies, to the same data. The aim of pluralistic analysis is to produce rich, multi-layered, multi-perspective readings of a data set by means of different 'ways of seeing'. 

Surrogate Science: The Idol of a Universal Method for Scientific Inference - Gigerenzer - 2015 - Article


The application of statistics to science is not a neutral act. Textbook writers in the social sciences have transformed rivaling statistical systems into an apparently monolithic method that could be used mechanically. As Fisher himself wrote, no scientific worker has a fixed level of significance at which, from year to year and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.

If statisticians agree on one thing, it is that scientific inference should not be made mechanically. Good science requires both statistical tools and informed judgment about what model to construct, what hypotheses to test, and what tools to use. Yet many social scientists vote with their feet against an informed use of inferential statistics: a majority still routinely computes p values or confidence intervals, and a few calculate Bayes factors. Determining significance has become a surrogate for good research. This article is about the idol of a universal method of statistical inference.

Mindless statistical inference

In an internet study, participants were asked whether they felt a difference between heroism and altruism. The vast majority did, and the authors computed a chi-squared test to find out whether the two numbers (yes versus no) differed significantly. This illustrates the automatic use of statistical procedures, even when the procedure does not fit the question. The idol of an automatic, universal method of inference, however, is not unique to p values or confidence intervals; it can also invade Bayesian statistics.
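
To make this concrete, the sketch below (assuming scipy is available) shows how mechanically such a test can be run. The response counts are invented, and the point is Gigerenzer's: the significant result adds nothing to the obvious observation that most respondents said 'yes'.

```python
# Minimal sketch of the mechanical test described above: a chi-squared
# goodness-of-fit test on two response counts (the counts here are invented).
from scipy.stats import chisquare

yes, no = 530, 70                       # hypothetical 'felt a difference?' answers
result = chisquare([yes, no])           # default null: both answers equally likely
print(result.statistic, result.pvalue)  # a large statistic and a tiny p value
```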

The Idol of a Universal Method of Inference

In this article they make three points:

  1. There is no universal method of scientific inference, but rather a toolbox of useful statistical methods. In the absence of a universal method, its followers worship surrogate idols, such as significant p values. The gap between the ideal and its surrogate rests on mistaken beliefs about statistical inference, for instance that a p value of 1% indicates a 99% chance of replication (a fallacy illustrated by the simulation sketch after this list).
  2. If the proclaimed 'Bayesian revolution' were to take place, the danger is that the idol of a universal method might survive in a new guise, proclaiming that all uncertainty can be reduced to subjective probabilities.
  3. Statistical methods are not simply applied to a discipline; they change the discipline itself and vice versa.
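
The following simulation sketch (with an assumed effect size, sample size and significance thresholds chosen purely for illustration, and assuming numpy and scipy are available) shows why the replication belief in point 1 is a fallacy: even when an original study reaches p < .01, the probability that an exact replication is again significant is governed by statistical power, not by 1 minus the p value.

```python
# Simulation sketch of the replication fallacy: among simulated studies that
# reach p < .01, how often is an exact replication significant at p < .05?
# Effect size (0.5 SD) and group size (20) are invented for illustration.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
effect, n, runs = 0.5, 20, 5000
significant_first, replicated = 0, 0

for _ in range(runs):
    p1 = ttest_ind(rng.normal(effect, 1, n), rng.normal(0, 1, n)).pvalue
    if p1 < .01:                      # the 'original' study reached p < .01
        significant_first += 1
        p2 = ttest_ind(rng.normal(effect, 1, n), rng.normal(0, 1, n)).pvalue
        replicated += p2 < .05        # did an exact replication also 'work'?

print(replicated / significant_first)  # far below the 0.99 the fallacy suggests
```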

In science and everyday life, statistical methods have changed whatever they touched. The most dramatic change brought about by statistics was the 'probabilistic revolution'. In the natural sciences, the term statistical began to refer to the nature of theories, not to the evaluation of data.

How statistics changed theories: the probabilistic revolution

The probabilistic revolution upset the ideal of determinism shared by most European thinkers. It differs from other revolutions because it did not replace any system in its own field, but it did upset theories in fields outside of mathematics. The social sciences inspired the probabilistic revolution in physics, yet the social and medical sciences themselves were reluctant to abandon the ideal of simple, deterministic causes. Social theorists hesitated to think of probability as more than an error term in the equation observation = true value + error.

The term inference revolution refers to a change in scientific method that was institutionalized in psychology and in other social sciences. The qualifier 'inference' indicates that inference from a sample to a population came to be considered the most crucial part of research.

To understand how deeply the inference revolution changed the social sciences, it is helpful to realize that routine statistical tests, such as calculations of p values or other inferential statistics, are not common in the natural sciences.

The first known test of a null hypothesis was by Arbuthnott and is strikingly similar to the 'null ritual' that was later institutionalized in the social sciences. He observed "that the external accidents to which males are subject do make a great havock of them, and that this loss exceeds far that of the other Sex". This first null hypothesis test impressed no one, but that does not mean that statistical methods played no role in the social sciences. To summarize, statistical inference played little role, and Bayesian inference virtually none, in research before roughly 1940. Automatic inference was unknown before the inference revolution, with the exception of the use of the critical ratio (the ratio of the obtained difference to its standard deviation).

The Null Ritual

The most prominent creation of a seemingly universal inference method is the null ritual:

  1. Set up a null hypothesis of 'no mean differences' or 'zero correlation'. Do not specify the predictions of your own research hypothesis.
  2. Use 5% as a convention for rejecting the null. If significant, accept your research hypothesis. Report the result as p < .05, p < .01, or p < .001, whichever comes next to the obtained p value.
  3. Always perform this procedure.
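
A minimal sketch of the ritual in action is given below, using simulated data and assuming numpy and scipy are available; it only illustrates the mechanical sequence of the three steps above, not a recommended practice.

```python
# Sketch of the null ritual applied mechanically: test against a null of 'no
# mean difference' at the conventional 5% level and report the result with
# the nearest conventional threshold. The data are simulated, not real.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
group_a = rng.normal(0.6, 1, 30)   # invented scores for two groups
group_b = rng.normal(0.0, 1, 30)

p = ttest_ind(group_a, group_b).pvalue
threshold = next(t for t in (.001, .01, .05) if p < t) if p < .05 else None
print(f"p = {p:.4f}", f"report 'p < {threshold}'" if threshold else "report 'n.s.'")
```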

In psychology, this ritual became institutionalized in curricula, editorials and professional associations. But the null ritual does not exist in statistics proper. The null ritual is also often confused with Fisher's theory of null hypothesis testing. For example, it has become common to use the term NHST (null hypothesis significance testing) without distinguishing between the two. Contrary to what that misleading term suggests, 'level of significance' has three meanings: (a) a mere convention, (b) the alpha level, or (c) the exact level of significance.

The three meanings of significance

In the Neyman-Pearson approach, the alpha level is the long-run relative frequency of mistakenly rejecting hypothesis H1 if it is true, also known as the Type I error rate. The beta level is the long-run relative frequency of mistakenly rejecting hypothesis H2 if it is true, also known as the Type II error rate; power equals 1 − beta. The procedure is:

  1. Set up two statistical hypotheses, H1 and H2 and decide on the alpha, beta and sample size before the experiment.
  2. If the data falls into the rejection region of H1, accept H2; otherwise accept H1.
  3. The usefulness of this procedure is limited to, among others, situations where there is a disjunction of specified hypotheses (either H1 or H2 is true) and where there is repeated sampling.
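
As a contrast to the null ritual, the sketch below illustrates the Neyman-Pearson logic of step 1: alpha, beta (via power) and the smallest effect size of interest are fixed before the experiment, and the required sample size follows from them. It assumes the statsmodels library is available, and the specific values (alpha = .05, power = .80, d = 0.5) are illustrative assumptions, not recommendations.

```python
# Sketch of Neyman-Pearson planning: fix alpha, beta and the effect size you
# care about *before* the experiment, and derive the required sample size.
from statsmodels.stats.power import TTestIndPower

alpha, power, effect = 0.05, 0.80, 0.5   # beta = 1 - power = 0.20
n_per_group = TTestIndPower().solve_power(effect_size=effect,
                                          alpha=alpha, power=power)
print(round(n_per_group))                # roughly 64 participants per group
```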

Fisher eventually refined his earlier position. The result was a third definition of the level of significance, alongside the convention and the alpha level:

  1. Set up a statistical null hypothesis. The null need not be a nil hypothesis (i.e., zero difference).
  2. Report the exact level of significance; do not use a conventional 5% level every time.
  3. Use this procedure only if you know little about the problem at hand.
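
A minimal sketch of the reporting difference in step 2 is shown below, again with simulated data and assuming numpy and scipy: the exact p value is reported as obtained, rather than being rounded down to a conventional threshold.

```python
# Sketch of Fisher's alternative in step 2: report the exact level of
# significance instead of a conventional cut-off. The data are simulated.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
treated = rng.normal(0.4, 1, 25)   # invented measurements
control = rng.normal(0.0, 1, 25)

p = ttest_ind(treated, control).pvalue
print(f"exact p = {p:.3f}")        # report the value itself, not 'p < .05'
```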

Fisher's procedure differs fundamentally from the null ritual: first, one should not automatically use the same level of significance, and second, one should not use this procedure for all problems. Moreover, step 1 of the null ritual contains the misinterpretation that 'null' means 'nil', such as a zero difference.

The problem of conflicting methods

When textbook writers learned about Neyman and Pearson, they faced a problem: how should they deal with the conflicting methods? The solution would have been to present a toolbox of different approaches, but writers such as Guilford and Nunnally mixed the concepts and presented the muddle as a single, universal method. This idol of a universal method also left no place for Bayesian statistics.

Bayesianism and the new quest for a universal method

Fisher, Neyman and Pearson have also been victims of social scientists' desire for a single tool, a desire that produced a surrogate number for inferring what good research is. The potential danger of Bayesian statistics lies in the subjective interpretation of probability, which sanctions its universal application to all situations of uncertainty.

The 'Bayesian revolution' had a slow start. To begin with, Bayes' essay was only published posthumously and was then largely ignored by scientists. Just as the null ritual replaced the three interpretations of the level of significance with a single one, the currently dominant version of Bayesianism replaces the earlier pluralism of interpretations of probability with a single, universal subjective interpretation. Classically, probability could be:

  1. a relative frequency in the long run, such as in mortality tables used for calculating insurance premiums
  2. a propensity, that is, the physical design of an object, such as that of a die or a billiard table
  3. a reasonable degree of subjective belief, such as in the attempts of courts to quantify the reliability of witness testimony.

In Bayes' essay, his notion of probability is ambiguous and can be read in all three ways. This ambiguity, however, is typical of his time, in which the classical theory of probability reigned.

If probability is thought of as a relative frequency in the long run, it immediately becomes clear that Bayes' rule has a limited range of applications. The economist Knight used the term risk for these situations (i.e., probabilities that can be reliably measured in terms of frequency or propensity), as opposed to uncertainty. Subjective probability, in contrast, can be applied to situations of uncertainty and to singular events, such as the probability that Michael Jackson is still alive. There is now a new generation of Bayesians who believe that Bayesianism is the only game in town. The article uses the term 'universal Bayes' for the view that all uncertainties can or should be represented by subjective probabilities, a view that explicitly rejects Knight's distinction between risk and uncertainty.
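
A small worked example may clarify when Bayes' rule is unproblematic even on a frequency view: when every probability involved is a measurable long-run frequency (a situation of risk in Knight's terms). The numbers below (base rate, hit rate, false-alarm rate) are invented for illustration.

```python
# Worked sketch of Bayes' rule in a situation of 'risk', where each
# probability is a measurable long-run frequency (all numbers are invented).
base_rate = 0.01        # P(condition)
hit_rate = 0.80         # P(positive test | condition)
false_alarm = 0.10      # P(positive test | no condition)

p_positive = hit_rate * base_rate + false_alarm * (1 - base_rate)
p_condition_given_positive = hit_rate * base_rate / p_positive
print(round(p_condition_given_positive, 3))   # about 0.075
```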

Risk versus uncertainty

What the universal Bayesians do not seem to realize is that, in theory, Bayesian updating can be optimal in a world of risk, but it is of uncertain value when not all relevant information is or can be known, or when probabilities have to be estimated from small, unreliable samples. Plain common sense also suggests that complex optimization algorithms are unreliable in an uncertain world.

The automatic Bayes

As with the null ritual, the universal claim for Bayes' rule tends to go together with automatic use. One version of automatic Bayes is the routine interpretation of Bayes factors using Jeffreys' scale of verbal labels. A second version of automatic Bayes can be found in the heuristics-and-biases research program, which is widely taught in business education courses. In short, the automatic use of Bayes' rule is a dangerously beautiful idol. But Bayesianism does not exist in the singular: there are many Bayesian approaches, not one.
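
To show what such an 'automatic' Bayes factor can look like, the sketch below computes a standard textbook Bayes factor for a point null (theta = 0.5) against a uniform prior, for invented coin-flip data; this construction is a common illustration and is not taken from Gigerenzer's article. The resulting number is exactly the kind of value that is then mechanically mapped onto Jeffreys-style verbal labels.

```python
# Sketch of a textbook Bayes factor: point null H0 (theta = 0.5) versus a
# uniform prior on theta under H1, for k successes in n trials (invented data).
# Under the uniform prior the marginal likelihood has the closed form 1/(n+1).
from math import comb

n, k = 20, 15                          # hypothetical coin-flip data
likelihood_h0 = comb(n, k) * 0.5**n    # P(data | theta = 0.5)
marginal_h1 = 1 / (n + 1)              # integral of P(data | theta) over the uniform prior
bf10 = marginal_h1 / likelihood_h0
print(round(bf10, 2))                  # about 3.2; such numbers are then read off verbal scales
```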

The statistical toolbox

The view of this article is that the alternative to these universal and automatic uses is to regard Bayes' rule as part of a larger statistical toolbox. In this toolbox, Bayes' rule has its value but, like any other tool, does not work for all problems.

How to change statistics?

Leibniz had a dream: to discover the universal calculus that could map all ideas into symbols. Such a calculus would put an end to all scholarly bickering. This dream of Leibniz is still alive in the social sciences today. Surrogate science, from the mindless calculation of p values or Bayes factors to citation counts, is not entirely worthless: it fuels a steady stream of work of average quality and keeps researchers busy producing more of the same. But it makes it harder for scientists to be innovative, risk-taking and imaginative, and surrogates also encourage cheating and incomplete or dishonest reporting. Would a Bayesian revolution lead to a better world? The answer depends on what that revolution would be. The real challenge is to prevent the surrogates from taking over once again, for example by replacing routine significance tests with routine interpretations of Bayes factors. Leibniz's beautiful dream of a universal calculus could then easily turn into Bayes' nightmare.
