Summaries per article with Research Methods: theory and ethics at University of Groningen 20/21

False-positive psychology: Undiscovered flexibility in data collection and analysis allows presenting anything as significant - Simmons et al. - 2011 - Article

Introduction

A false positive is likely the most costly error that can be made in science. A false positive is the incorrect rejection of a null hypothesis.

Despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not.

Many researchers stop collecting data on the basis of interim data analyses, and many seem to believe that this practice exerts no more than a trivial influence on false-positive rates.
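
In fact, such optional stopping substantially inflates the false-positive rate. Below is a minimal simulation sketch (my illustration, not code from the article): a researcher tests after every batch of participants and stops as soon as p < .05, even though the null hypothesis is true in every simulated study.

```python
# Minimal sketch of optional stopping under a true null hypothesis.
# All names and parameter values here are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_study(n_start=20, n_max=50, step=10, alpha=0.05):
    """Collect in batches, test after each batch, stop at the first p < alpha."""
    a = list(rng.normal(size=n_start))   # condition A, true effect = 0
    b = list(rng.normal(size=n_start))   # condition B, true effect = 0
    while True:
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True                   # a "significant" false positive
        if len(a) >= n_max:
            return False
        a += list(rng.normal(size=step))
        b += list(rng.normal(size=step))

rate = sum(one_study() for _ in range(2000)) / 2000
print(f"False-positive rate with optional stopping: {rate:.1%}")  # well above the nominal 5%
```

With a fixed sample size and a single test, the same simulation stays near 5%; the inflation comes entirely from the repeated peeks at the data.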

Solutions for authors

The authors of this article offer six requirements for authors as a solution to the problem of false-positive publications:

  1. Before the collection of data begins, authors must decide the rule for terminating data collection and they should report this rule in the article.
  2. At least 20 observations per cell must be collected by the author or else the author should provide a compelling cost-of-data-collection justification.
  3. All variables collected in a study must be listed.
  4. All experimental conditions must be reported, including failed manipulations.
  5. If observations are eliminated, authors must also report what the statistical results are if those observations are included.
  6. If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.

Guidelines for reviewers

The authors of this article also offer four guidelines for reviewers:

  1. Reviewers must make sure that authors follow the requirements.
  2. Reviewers should be more tolerant of imperfections in results.
  3. Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.
  4. Reviewers should require the authors to conduct an exact replication, if justifications of data collection or analysis are not compelling.

Conclusion

The solution offered does not go far enough in the sense that it does not lead to the disclosure of all degrees of freedom. It cannot reveal those arising from reporting only experiments that ‘work’ (i.e., the file-drawer problem).

The solution offered goes too far in the sense that it might prevent researchers from conducting exploratory research. This does not have to be the case if researchers are required to report exploratory research as exploratory research. This also does not have to be the case if researchers are required to complement it with confirmatory research consisting of exact replications of the design and analysis that ‘worked’ in the exploratory phase.

The authors considered a number of alternative ways to address the problem of researcher degrees of freedom. The following are considered and rejected:

  • Correcting the alpha levels. A researcher could consider adjusting the critical alpha level as a function of the number of researcher degrees of freedom employed in each study (a minimal sketch of this idea follows this list).
  • Using Bayesian statistics. Although this approach has many virtues, it actually increases researcher degrees of freedom by offering a new set of analyses and by requiring additional judgments to be made on a case-by-case basis.
  • Conceptual replications. They are misleading as a solution to the problem at hand, because they do not bind researchers to make the same analytic decisions across studies.
  • Posting materials and data. This would impose too high a cost on readers and reviewers to examine the credibility of a particular claim.
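
As a minimal sketch of the rejected alpha-correction idea (my illustration, not the authors' proposal), one could treat each researcher degree of freedom as an implicit extra comparison and shrink the critical alpha Bonferroni-style; the authors reject this approach partly because the number of degrees of freedom actually exercised is rarely known.

```python
# Hypothetical sketch of "correcting alpha" for researcher degrees of freedom.
# The divisor (number of implicit comparisons) is exactly what is usually unknown.
def adjusted_alpha(nominal_alpha=0.05, implicit_comparisons=1):
    return nominal_alpha / max(implicit_comparisons, 1)

# e.g. 2 candidate dependent variables x 2 exclusion rules = 4 implicit comparisons
print(adjusted_alpha(0.05, implicit_comparisons=4))  # 0.0125
```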

The goal of researchers is to discover the truth, not to publish as many articles as they can. For various reasons, researchers can lose sight of this goal.

Why Summaries of Research on Psychological Theories Are Often Uninterpretable – Meehl - 1990 - Article

What is the scientific method?

The scientific method is an empirical and sceptical approach to asking questions and obtaining answers.

Good habits for a researcher include enthusiasm, open-mindedness, common sense, role-taking ability, inventiveness, confidence in own judgement, consistency and care for details, ability to communicate, and being honest.

What is the issue in psychological research?

Evidence shows a deficiency in most social sciences to match the growth and theoretical integration that characterizes the history of more successful scientific disciplines. Meehl presupposes that, with certain exceptions, theories in ‘soft areas’ of psychology tend to go through periods of initial enthusiasm leading to large amounts of empirical investigation with ambiguous overall results. This is followed by various kinds of amendment and the generation of ad hoc hypotheses. In the long run, researchers simply lose interest, rather than the theories actually being falsified. “They never die; they just slowly fade away.”

This discussion concerns research that shares three properties: (a) the theories are in ‘soft areas’, (b) the data are correlational, and (c) a positive finding consists of refuting the null hypothesis. Soft areas include, for example, counselling, personality theory, and social psychology.

What is Meehl’s thesis?

“Null hypothesis testing of correlational predictions from weak substantive theories in soft psychology is subject to the influence of ten obfuscating factors whose effects are usually (1) sizeable, (2) opposed, (3) variable, and (4) unknown. The net effect of these ten obfuscating influences is that the usual research literature review is almost uninterpretable.”

The derivation of observational conditional

Theories, taken with a statement of conditions, entail relations between single observations (particulars). The derivation of a prediction about observational facts involves a conjunction of several premises.

  • Derivation of an observational conditional: (T · A1 · A2 · Cp · Cn) → (O1 ⊃ O2)
    • T: substantive theory of interest
    • A1, A2…: one or more auxiliary theories which aren’t the main focus of the investigator’s interest.
    • Cp: ceteris paribus clause, a negative statement not formulated with concrete content like an auxiliary but states ‘other things being equal’.
    • Cn: statements about the experimental ‘conditions’.
    • O1 & O2: observational statements.

Meehl’s ten obfuscating factors

The combined operation of the ten factors makes it impossible to tell what a score of statistical significance in research proves about a theory’s plausibility.

1. Loose derivation chain

There are very few tight derivation chains in most areas of psychology and almost none in soft psychology. Some parts of the chain are well explained, while others are vague or not explicitly stated. Meehl states that if your observation 2 doesn’t follow observation 1 as expected, you won’t know why it happened and which part of the derivation chain was problematic.

2. Problematic auxiliary theories

Auxiliary theories are explicitly stated, but in soft psychology each auxiliary theory is often almost as problematic as the main theory being tested. When there are multiple problematic auxiliary theories, the joint probability that they all hold can be lower than the prior probability of the theory of interest. We intended to investigate T, but the logical structure leaves us not knowing whether the prediction failed because T was false or because one or more of the conjoined statements (A1, A2, Cp, or Cn) were false. (This reasoning applies to factors 3 and 4 as well.)

3. Problematic ceteris paribus clause

Ceteris paribus means ‘everything else being equal’, but everything else isn’t always equal. The clause asserts that, although individuals may vary on factors we have not controlled but allowed to vary, none of those uncontrolled influences exerts a sizeable effect in a direction opposed to our theory. There is usually a problem of control in the study: there are situations where we can’t, or don’t know how to, control for certain factors, which may to some extent counteract the effect we are trying to induce.

4. Experimenter error

Experimenter error covers mistakes in manipulation, including errors due to biases, expectancies, and the like on the part of the experimenter (or others on the input side of the testing process, such as research assistants). An experimenter who is enthusiastic about their theory may, without consciously intending to, slant results in its favour, e.g. through the way they interact with their participants.

5. Inadequate statistical power

Most tests in psychology can be expected to fail because of insufficient statistical power (usually due to small sample sizes) to detect a ‘real effect’.
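
A minimal sketch of this point, using the statsmodels power routines (my illustration, not Meehl's): for a medium standardized effect (Cohen's d = 0.5), the small per-group samples typical of soft psychology give far less than the conventionally desired 80% power.

```python
# Power of a two-sample t-test for d = 0.5 at various per-group sample sizes.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n_per_group in (10, 20, 50, 64):
    power = analysis.power(effect_size=0.5, nobs1=n_per_group, alpha=0.05)
    print(f"n per group = {n_per_group:3d} -> power = {power:.2f}")

# Sample size per group needed for 80% power at d = 0.5 (roughly 64)
print(analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05))
```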

6. Crud factor

In social sciences, everything correlates to some extent with everything else. Any measured trait is some function of a list of causal factors in the genes and history of the individual. Genetic and environmental factors are also known from empirical research to be correlated. 

7. Pilot study

A pilot study essentially duplicates a main study: a true pilot is a miniature version of the main study, perhaps with a few minor improvements. Pilot studies have two aims: first, to see whether an effect exists at all (to then decide whether to pursue the research on a larger scale); second, to get a rough idea of the relationship between the mean difference and the approximate variability, as a basis for inferring the number of cases needed to achieve statistical significance. If nothing significant is found, the pilot is discarded and never published, even though a null result is evidence against the theory. This means a large body of data is missing from the collective understanding of the human psyche.

8. Selective bias in submitting reports

A researcher is more likely to submit a study if significant results were found. This comes partly from thinking that a null result ‘doesn’t prove much’. This creates bias in the literature available to the population.

9. Selective editorial bias

People who have been editors or referees say the same thing as investigators: they’re somewhat more inclined to publish a clear finding of a refuted null hypothesis than one that simply fails to show a trend. There’s a failure to report everything. Non-significant results are deemed to say that ‘there’s nothing here’.

10. Detached validation claim for psychometric instruments

Testing should be done with valid instruments. Sometimes, theories and instruments are validated at the same time. This isn’t good practice because you should establish validity of an instrument before you use it.

What has happened down here is the winds have changed – Gelman - 2016 - Article

What is the article about?

Andrew Gelman, the author, responds to Susan Fiske’s article in the APS Observer (Mob rule or Wisdom of crowds?) and to the current replication crisis. He mostly discusses how her attitude can be understood in light of the recent history of psychology and its replication crisis. He suggests that the replication crisis has redrawn the topography of science, especially in social psychology, and that some people, like Fiske, experience these changes as catastrophic.

What is research incumbency?

Fiske does not like it when people use social media to publish negative comments on published research. She follows what Gelman calls the research incumbency rule: once an article is published in some approved venue, it should be taken as truth.

Problems with research incumbency: (a) many published papers are in error (e.g. unsuccessful replication), and (b) statistical error draws the line between published and unpublished work.

How did we get here? (a timeline of important events)

To understand Fiske’s attitude, it helps to realize how fast things have changed. In 2011, the replication crisis was barely a cloud on the horizon.

1960s-1970s: Paul Meehl argues that the standard paradigm of experimental psychology does not work. A clever investigator can slowly work his way through a tenuous nomological network, performing many related experiments that appear to the uncritical reader as a fine example of an integrated research program, without once refuting or corroborating a single strand of the network.

1960s: Jacob Cohen studies statistical power, arguing that design and data collection are central to good research in psychology; see his book Statistical Power Analysis for the Behavioral Sciences.

1971: Tversky and Kahneman write ‘Belief in the law of small numbers’ focusing on persistent biases in human cognition and researchers’ misunderstanding of uncertainty and variation.

1980s-1990s: Null hypothesis significance testing becomes more and more controversial in psychology.

2006: Satoshi Kanazawa, a sociologist, publishes a series of papers with provocative claims (e.g. engineers have more sons, nurses have more daughters) that all turn out to be based on statistical errors. This prompts the realization that such research programs are dead on arrival because the signal-to-noise ratio is too low.

2008: Edward Vul, Christine Harris, Piotr Winkielman, and Harold Pashler write a controversial article (voodoo correlations in social neuroscience) arguing that statistical problems are distorting the research field and that many prominent published claims can’t be trusted.

  • Also in 2008: Neuroskeptic: a blog starting to criticize soft targets, then science hype, and then moves to larger criticism of the field.

2011:

  • Joseph Simmons, Leif Nelson, and Uri Simonsohn publish a paper, “False-positive psychology” introducing the terms ‘researcher degrees of freedom’ and ‘p-hacking’.
  • Daryl Bem publishes his article, “Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect”. Had obvious multiple comparisons problem. Earlier work seemed to fit into this larger pattern, that certain methodological flaws in standard statistical practice were not just isolated mistakes. Bem prompted the realization that bad work could be the rule, not the exception.
  • Various cases of scientific misconduct hit the news. Diederik Stapel is kicked out of Tilburg University and Marc Hauser leaves Harvard. Brings attention to the Retraction Watch blog.
    • Researchers who are true believers in their hypotheses, which in turn are vague enough to accommodate any evidence thrown at them (Clarke’s Law).

2012: Gregory Francis publishes “Too good to be true” – arguing that repeated statistically significant results can be a sign of selection bias.

2014: Katherine Button, John Ioannidis, Claire Mokrysz, Brian Nosek, Jonathan Flint, Emma Robinson, and Marcus Munafo publish “Power failure: why small sample size undermines the reliability of neuroscience”. This closes the loop from Cohen’s power analysis to Meehl’s general despair, connecting selection effects to overestimated effect sizes.

2015: “Power pose” research from Dana Carney, Amy Cuddy, and Andy Yap receives adoring media coverage but suffers from the now-familiar problems of uncontrolled researcher degrees of freedom and fails to replicate. The prestigious PPNAS (Proceedings of the National Academy of Sciences) also publishes flawed papers.

2016: Brian Nosek and others organize a large collaborative replication project. Lots of prominent studies don’t replicate.

Barely anything happened for a long time, and even after the first revelations people could still ignore the crisis. Then all of a sudden everything changed. If you were deeply invested in the old system, it would be hard to accept change.

Who is Susan Fiske?

She is the editor of the PPNAS articles in question. She has had issues with her own published work: there were many data errors, and when these were pointed out, Fiske and colleagues refused to reconsider anything. Their theory was so open-ended that it could explain almost any result, and the authors claimed that fixing the errors would not “change the conclusion of the paper”.

The problem is that Fiske is working within a dead paradigm: the paradigm of open-ended theory, of publication in top journals and promotion in the popular and business press, based on ‘p less than 0.05’ results obtained using abundant researcher degrees of freedom.

What is the goal?

The goal is to do good science. It is hard to do so when mistakes aren’t getting flagged and you’re supposed to act like you’ve been right all along, that any data pattern you see is consistent with theory etc. It’s an issue for authors of original work, for researchers following up on incorrect work, and other researchers who want to do careful work but find it hard to compete in a busy publishing environment with authors of flashy, sloppy exercises in noise mining.

When statistical design analysis shows that this research is impossible, or when replication failures show that published conclusions were wrong, then it’s expected that you move forward, not that you keep doing the same thing, insisting you were always right. It’s an inefficient way to do science, for individual researchers to devote their careers to dead ends because they refuse to admit error.

Estimating the reproducibility of psychological science – Open Science Collaboration - 2015 - Article

Introduction

Reproducibility is a core principle of scientific progress. Scientific claims should get credibility by the replicability of their evidence instead of the status of their originator. Debates about transparency of methodology and claims are meaningless if the evidence being debated is not reproducible.

What is replication?

Direct replication tries to recreate conditions believed necessary to obtain a previously observed finding and is a way to establish reproducibility of a finding with new data. A direct replication may not achieve the original result for many reasons:

  • Known or unknown differences between the replication and original study could moderate the size of an observed effect.
  • Original result could have been a false positive.
  • The replication could produce a false negative.

False positives and false negatives provide misleading information about effects, and failure to identify the necessary conditions to reproduce a finding shows an incomplete theoretical understanding. Direct replication provides the chance to assess and improve reproducibility.

Method & Results

The authors came up with a protocol for selecting and conducting high-quality replications. Collaborators joined the project, selected a study for replication from the available studies in the sampling frame, and were guided through the replication process.

The Open Science Collaboration conducted 100 replications (270 contributing authors) of studies published in three psychological journals using high-powered designs and original materials when possible. Through consulting original authors, obtaining original materials, and internal review, replications kept high fidelity to the original designs.

Here, they chose to evaluate replication success using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes. The mean effect size of the replications was half the mean effect size of the original studies, a substantial decrease.

  • 97% of original studies had significant results (P < .05).
  • 36% of replications had significant results.
  • 47% of original effect sizes were in the 95% confidence interval of the replication effect size (this criterion is sketched in code below).
  • 39% of effects were subjectively rated to have replicated the original result.
  • Assuming no bias in original results, combining original and replication results left 68% with statistically significant effects.
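
A minimal sketch (my illustration, not the OSC analysis code) of the confidence-interval criterion above, for correlation effect sizes: the original r counts as covered if it falls inside the replication's 95% confidence interval, computed via the Fisher z transformation.

```python
# Check whether an original correlation lies inside the replication's 95% CI.
import numpy as np

def original_inside_replication_ci(r_orig, r_rep, n_rep):
    z_rep = np.arctanh(r_rep)              # Fisher z of the replication effect
    se = 1.0 / np.sqrt(n_rep - 3)          # standard error of z
    lo, hi = np.tanh(z_rep - 1.96 * se), np.tanh(z_rep + 1.96 * se)
    return lo <= r_orig <= hi

# Hypothetical numbers: a smaller replication effect from a larger sample
print(original_inside_replication_ci(r_orig=0.40, r_rep=0.21, n_rep=120))
```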

Tests suggest that replication success was better predicted by the strength of the original evidence than by characteristics of the original and replication teams.

Does successful replication mean theoretical understanding is correct?

It would be too easy to conclude that. Direct replication mainly provides evidence for the reliability of a result. Understanding is achieved through many, diverse investigations that give converging support for a theoretical interpretation and rule out alternative explanations.

Does failure to replicate mean the original evidence was a false positive?

It would also be too easy to conclude this. Replications can fail if the methodology differs from the original in ways that interfere with observing the effect. The Open Science Collaboration conducted replications designed to minimize a priori reasons to expect different results by using original materials, contacting original authors for review of the designs, and conducting internal reviews. However, unanticipated factors in the sample, setting, or procedure could still have altered the observed effect sizes.

How is reproducibility influenced by publication bias?

There are indications of cultural practices in scientific communication that could be responsible for the observed results. The combination of low-powered research designs and publication bias produces a literature with upwardly biased effect sizes. This predicts that replication effect sizes will regularly be smaller than those of the original studies, not because of differences in implementation but because the original effect sizes are influenced by publication and reporting bias and the replications are not.

Consistent with this expectation, most replication effects were smaller than the originals, and reproducibility success correlated with indicators of the strength of the initial evidence, like lower original P values and larger effect sizes. This suggests that publication, selection, and report biases are possible explanations for the difference in original and replication effects. The replication studies reduced these biases because replication preregistration and pre-analysis plans ensured confirmatory tests and reporting of all results.

Furthermore, repeated replication efforts that fail to identify conditions under which the original finding can be observed reliably could reduce confidence in the original finding.

Conclusion

The five indicators examined to describe replication success are not the only ways to evaluate reproducibility. But the results offer a clear, collective conclusion: a large portion of replications produced weaker evidence for original findings despite using materials provided by the original authors, internal reviews, and high statistical power to detect the original effect sizes.

Correlational evidence is consistent with the conclusion that variation in the strength of initial evidence was more predictive of replication success than variation in the characteristics of the teams conducting the research.

Reproducibility is not well understood because of the incentives for individual scientists to prioritize novelty over replication. Innovation is an engine of discovery and is vital for a productive, effective scientific enterprise. But journal reviewers and editors might dismiss a new test of a published idea as unoriginal. Innovation shows that paths are possible, replication shows that paths are likely; progress relies on both. Scientific progress is a process of uncertainty reduction that can only succeed if science remains sceptical of its explanatory claims.

Making replication mainstream – Zwaan et al. - 2018 - Article

Why are replications important?

Being able to effectively replicate research findings is important for scientific progress. What defines science is that researchers don’t accept claims without critical evaluation of the evidence, and part of this evaluation process is the independent replication of findings. Yet the value of replication as a routine feature of psychology has recently become a controversial topic.

Replications are important for falsifying hypotheses. Under Lakatos’ notion of sophisticated falsificationism, an auxiliary hypothesis can be formed that allows the expanded theory to accommodate a troublesome result. If more falsifications arise, more auxiliary hypotheses must be formed to account for the unsupported predictions, and problems begin to pile up for the theory – this is degenerative. If the auxiliary hypotheses are themselves empirically successful, the program gains explanatory power – this is progressive. Replications are a tool for distinguishing between progressive and degenerative research programs.

What is this review’s purpose?

This paper aims to educate readers on the value of replications and to integrate recent discussions about them to give a foundation for future replication efforts. The authors hope that this will make replication studies more routine in research, which should increase the credibility of findings.

What is some background information on this topic?

Popper said that an ‘effect’ that’s found once but can’t be reproduced does not qualify as a scientific discovery, it’s chimeric (hoped for but impossible to achieve). There are two important insights that inform scientific thinking:

  1. A finding needs to be repeatable to count as a scientific discovery.
  2. Research needs to be reported in a way that others can reproduce the procedures.

Therefore, scientific discovery needs a consistent effect and a comprehensive description of the method used to produce the result in the first place. However, this does not imply that all replications are expected to be successful or that no expertise is required to conduct them.

What are issues concerning replicability?

Concerns about the replicability of findings exist in various disciplines. Problems with replicability and false positives can happen for many reasons:

  • Publication bias – process in which research findings are chosen based on how much support they give for a hypothesis.
  • Growing body of meta-scientific research showing the effects of researcher degrees of freedom (latitude in how research is conducted, analyzed, and reported). The existence of researcher degrees of freedom allows investigators to try different analytic options until they find a combination that gives a significant result, especially when there is pressure to publish significant findings.
    • Confirmation bias: can convince investigators that the procedures that led to the significant result were the ‘best’ approach in the first place.
  • HARKing – hypothesizing after the results are known.

Researcher degrees of freedom and publication bias favouring statistically significant results have produced overestimations of effect sizes in literature. Replication is important to provide more accurate estimates of effect size.
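
A minimal simulation sketch (my illustration, not from the article) of this overestimation: when only studies that reach p < .05 are "published", the published effect sizes are systematically larger than the true effect, especially with small samples.

```python
# Effect-size inflation under selection for significance (publication bias).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n, published = 0.2, 20, []

for _ in range(5000):
    a = rng.normal(true_d, 1.0, n)   # treatment group, true standardized effect 0.2
    b = rng.normal(0.0, 1.0, n)      # control group
    _, p = stats.ttest_ind(a, b)
    d_observed = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    if p < 0.05:                     # only "significant" studies get published
        published.append(d_observed)

print(f"true d = {true_d}, mean published d = {np.mean(published):.2f}")
```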

Which types of replication studies exist?

Replication studies serve many purposes, and those objectives determine how a study is designed and interpreted. Schmidt (2009) identified five functions:

  1. Address sampling error (i.e. false-positive detection)
  2. Control for artifacts
  3. Address researcher fraud
  4. Test generalizations to different populations
  5. Test the same hypothesis of a previous study using a different procedure

Multiple definitions have been proposed for direct and conceptual replications.

  • Direct replication: study that tries to recreate the critical elements (samples, procedures, measures) of an original study. It does not have to duplicate all aspects but only elements believed to be necessary for producing the original effect. Useful for reducing false positives.
    • Theoretical commitment: researchers should agree on what those critical elements are.
  • Conceptual replication: a study in which changes are made to the original procedures that could make a difference regarding the observed effect. It is designed to test whether an effect extends to different populations, given theoretical reasons to assume it will be weaker or stronger for different groups.

What are some concerns about replicability?

The interpretation of replications has produced disagreement and controversy. Here we consider the most frequent concerns.

Concern 1: Context is too variable

Perhaps the most voiced concern about direct replications is that the conditions under which an effect is initially observed may not hold when a replication attempt is performed – change in context.

Proponents of this concern suggest that it is too hard to specify all the contextual factors and that it is extremely difficult for independent researchers to recreate these conditions with precision. Consequently, it is never possible to determine whether a ‘failed’ replication is due to the original demonstration being a false positive or to the context having changed so much that the effect was wiped out.

Response C1

Context changes can and should be considered as a possible explanation for why a replication failed to obtain results similar to the original. In psychology it is rare that context does not matter or does not play a role in the outcome.

Nevertheless, the post hoc reliance on context sensitivity as an explanation for failed replication attempts is problematic for science. The fact that contextual factors vary between studies means that post hoc, context-based explanations are always possible to generate. Reliance on context sensitivity as an explanation, without committing to collecting new empirical evidence to test the idea, makes the original theory unfalsifiable – degenerative research.

An uncritical acceptance of these post hoc explanations ignores the possibility that the original finding was a false positive. The post hoc consideration of differences in features should lead to new testable hypotheses rather than to dismissals of replication results.

Two strategies for solving the concerns outlined in this section are to (1) raise standards in reporting of experimental detail, so that the original papers include replication recipes, and (2) find ways to encourage original authors to identify potential boundaries/caveats in their research.

Concern 2: The theoretical value of direct replication is limited

Many arguments against replications agree on a general claim that direct replications aren’t necessary because they either have limited informational value or are misleading. The concern is that direct replications provide a false sense of certainty about the robustness of an underlying idea. Furthermore, replications can be unreliable just like original studies can, meaning that one can be sceptical about the value of any individual replication study.

Response C2

This concern implies that neither successful nor failed direct replications make novel contributions to theory. Some find it unworthy to do work that doesn’t advance a theory. But repeatedly showing that a theoretically predicted effect isn’t empirically supported adds knowledge to the field. Research that leads to identifying moderators and boundary conditions adds knowledge.

Direct replications are also necessary if researchers want to further explore a finding that emerged in exploratory research (e.g. a pilot study).

Procedures that can be used to accomplish some aims of direct replications include preregistration. It can reduce or prevent researcher degrees of freedom, consequently reducing false positives. In preregistration, a researcher details a study design and analysis on a website before data are collected. Public preregistration can at least reduce publication bias.

Concern 3: Direct replications aren’t feasible in certain domains

It’s argued that replications may not be desirable or possible because of practical concerns. Certain studies capitalize on extremely rare events like a natural disaster or an astronomical event, and replications to test the effects of these events are impossible. So if being able to replicate a finding is what makes something ‘scientific’, then a lot of research would be excluded, and some topics would be privileged as more scientific or rigorous than others (a caste system). Replication studies are also more common in areas where studies are easier to conduct (e.g. at universities) – a bias.

Response C3

Just because replications aren’t always possible doesn’t undermine their value. Researchers working in areas where replication is difficult should be alert to such concerns and make efforts to avoid the resulting problems.

The costs of replication will be borne by researchers in that area. However, the goal of replication isn’t to prove the robustness of a whole field but to test the robustness of a particular effect.

Concern 4: Replications are a distraction

This concern holds the view that problems existing in the field may be so severe that attempts to replicate studies that currently exist will be a waste of time and could distract from bigger problems faced by psychology.

Related argument: the main problem in the accumulation of scientific knowledge is publication bias. Failed replications exist but aren’t being published. Once the omission of these studies is addressed, meta-analyses won’t be compromised and will provide an efficient means to identify the most reliable findings in the field. The idea is that even if replication studies tell us something useful, there are more efficient strategies to improve the field that have fewer negative consequences.

Response C4

Direct replications have unique benefits. It’s clear that failures to replicate past research findings have received most attention, but large-scale successful replications also have rhetorical power, showing that the field is capable of producing robust findings on which future work can build.

Replications also provide a simple metric by which we can evaluate the extent of the problem and the degree to which certain solutions work.

Concern 5: Replications affect reputations

Some debates about replication studies concern their reputational effects. Authors of studies that fail to replicate could face questions about their competence and feel victimized. Replications also create reputational concerns for the replicators, who deserve credit for their effort in addressing the robustness of the original finding. Some argue that the replication crisis has created a career niche for ‘bad experimenters’.

Another reputational concern comes from the fact that several of the most visible replication projects to date have involved large groups of researchers. How does one determine the contributions of, and assign credit to, the authors of a multi-authored article?

Response C5

Replicators should go out of their way to describe their results and their implications for the original work carefully, objectively, and without exaggeration. It can be useful for replicators and original authors to be in contact, and in some cases to collaborate – that is, a cooperative effort undertaken by two investigators who hold different views on an empirical question.

As more replications are conducted, the experience of having a study fail to replicate will become more normative, and hopefully less unpleasant.

Concern 6: There is no standard method to evaluate replication results

This concern involves the interpretation of replication results. Two researchers can look at the same study and come to different conclusions regarding whether the original effect was successfully duplicated. So what’s the point of running replication studies if the field can’t agree on which ones are successful?

Response C6

There’s growing consensus on which analyses are most likely to give reasonable answers to the question of whether a replication provides results consistent with those from an original study – frequentist estimation and Bayesian hypothesis testing. Investigators should consider multiple approaches and preregister their analytic plans. Two approaches are especially promising:

  • Small telescopes approach: focus on interpreting confidence intervals from the replication. The idea is to consider what effect size the original study would have had 33% power to detect and then use this value as a benchmark for the replication. If the 90% CI from the replication excludes this value, then we say the original study could not have meaningfully examined the effect. The focus is on the design of the original study (a minimal sketch follows this list).
  • Replication Bayes factor approach: a number representing the amount by which new data (replication) shift the balance of evidence between two hypotheses. The extent of the shift depends on how accurately the competing hypotheses predict observed data. In a replication, the researcher compares statistical hypotheses that map to (1) a hypothetical optimistic theoretical proponent of the original effect (predicting replication effect size to be away from 0) and (2) a hypothetical sceptic who thinks the original effect doesn’t exist (predicts replication effect size to be close to 0). The replication Bayes factor can compare the accuracy of these predictions.
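
A minimal sketch of the small-telescopes logic (assumed two-group design and helper names of my own, not from the article): compute the effect size d33 that the original study had 33% power to detect, then check whether the replication's 90% confidence interval falls entirely below it.

```python
# Small-telescopes check for a standardized mean difference (two equal groups).
import numpy as np
from statsmodels.stats.power import TTestIndPower

def small_telescopes(n_orig, d_rep, n_rep, alpha=0.05):
    # Effect size the original design could detect with 33% power
    d33 = TTestIndPower().solve_power(nobs1=n_orig, power=0.33, alpha=alpha)
    # Approximate 90% CI for the replication's standardized effect
    se_rep = np.sqrt(2.0 / n_rep)
    ci = (d_rep - 1.645 * se_rep, d_rep + 1.645 * se_rep)
    return d33, ci, ci[1] < d33    # True -> original could not have meaningfully detected this effect

d33, ci, excluded = small_telescopes(n_orig=20, d_rep=0.10, n_rep=80)
print(f"d33 = {d33:.2f}, replication 90% CI = ({ci[0]:.2f}, {ci[1]:.2f}), excludes d33: {excluded}")
```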

Summary and conclusions

Repeatability is essential to science. Arguably, a finding isn’t scientifically meaningful until it can be replicated with the same procedures that produced it initially. Direct replication is the mechanism by which replicability is assessed and a tool for distinguishing progressive from degenerative research.

A Nature survey reported that 52% of scientists believe their field has a significant replication crisis, and 38% believe there is a slight crisis. 70% of researchers have tried and failed to reproduce another scientist’s findings.

Many concerns have been raised:

  • When replications should be expected to fail.
  • What informational value they provide to a field that hopes to push a theory forward.
  • The fairness and reputational consequences of replications.
  • The difficulty in deciding when a replication has succeeded or failed.

Replication helps clarify which findings in the field we should be confident in as we move forward.

The preregistration revolution – Nosek et al. - 2018 - Article

What is the aim of scientific progress?

Scientific progress is characterized by reducing uncertainty about nature. Models explaining prior and predicting future observations are constantly improved by reducing the prediction error. When the prediction error decreases, certainty about what will happen increases.

What is prediction and postdiction?

Scientists improve models by generating hypotheses based on existing observations, and testing them. We use general terms to capture the distinction – postdiction and prediction.

  • Prediction: the hypothesis is generated before the data are known; data are gathered to test ideas about what will happen and are used to confront the chance that the hypothesis is wrong. Uncertainty is assessed by observing how well predictions account for new data.
  • Postdiction: the data are already known, and hypotheses are generated to explain why they occurred. Postdiction is vital for discovering possibilities not yet considered.

Why is this distinction important?

Not knowing the difference can lead to overconfidence in post hoc explanations (postdictions) and increase the likelihood of believing that there is evidence for a finding when in fact there is none. Presenting postdictions as predictions can increase the attractiveness and publishability of findings by falsely reducing uncertainty, thereby decreasing reproducibility.

Mistaking postdiction as prediction underestimates the uncertainty of outcomes, producing psychological overconfidence in resulting findings.

What are some mental constraints (biases) in distinguishing predictions and postdictions?

The dynamism of research and limits of human reasoning make it easy to mistake postdiction as prediction.

  • Circular reasoning – generating a hypothesis based on observed data, then evaluating the validity of the hypothesis based on the same data.
  • Hindsight bias (I-knew-it-all-along effect) – tending to see outcomes as more predictable after knowing data compared with before. Thinking you would have anticipated that explanation in advance. Common case – researcher’s prediction is vague enough that many outcomes can be rationalized.

How do novel findings lead to postdictions?

Novel (new findings), positive (finding an effect), and clean (providing a strong narrative) results are rewarded more and are better for launching science into new domains of inquiry. They lead to postdictions because scientists are motivated to explain results after the fact. If certain results are more rewarded than others, researchers are motivated to obtain results that are likely to be rewarded, regardless of accuracy.

Lack of clarity between postdiction and prediction gives the chance to select, rationalize, and report tests that prioritize reward over accuracy.

How do standard tools of statistical inference assume prediction?

Null hypothesis significance testing (NHST) is designed to test hypotheses (prediction). The prevalence of NHST and the P value implies either that most research is prediction or that postdiction is often mistaken for prediction.

What is the garden of forking paths?

There are many choices that can be made when analyzing data. If they’re made during analysis, observing the data may make some paths more likely than others. In the end, it can be impossible to know which paths would have been chosen if the data had been different, or whether analytic decisions were influenced by certain biases.

How can preregistration distinguish prediction and postdiction?

Preregistration is committing to analytic steps without previous knowledge of research outcomes. Prediction is achieved because test selection isn’t influenced by observed data. Preregistration constrains how the data will be used to confront research questions.

  • Inferences from preregistered analysis should be more replicable than those not preregistered – analysis choices can’t be influenced by motivation, memory, or reasoning biases.
  • Preregistration draws a clear line between prediction and postdiction, preserving the diagnosticity of NHST inference for prediction and clarifying the role of postdiction in generating possible explanations to test as future predictions.

Preregistration doesn’t favour prediction over postdiction, but aims to clarify which is which. In an ideal world preregistration would look like this: observation about the world > research question > analysis plan > confront hypothesis > explore data for potential discoveries, generating hypotheses after the fact > interesting postdictions are converted into predictions for the next study, and the cycle repeats.

This ideal model is a simplification of how most research actually happens.

What are some challenges in preregistration?

Challenge 1: Changes to Procedure During Study Administration

Good plans can still be hard to achieve. Example: Jolene preregisters an experimental design using infants as participants. She plans to collect 100 observations but only gets 60. Some infants also fall asleep during the study, which she didn’t anticipate, and her preregistered analysis plan does not say to exclude sleeping infants. She can document changes to her preregistration without diminishing diagnosticity by being transparent in reporting the changes and explaining why they were made. Most of the plan is still preserved, and because most deviations are transparently reported, it remains possible to assess their impact – a benefit.

There’s an increased risk of bias when deviating from analysis plans after observing the data, even with transparent reporting. With transparent reporting, however, observers can assess the deviations and their rationale; such an assessment is only possible because a preregistration exists to deviate from.

Challenge 2: Discovery of Assumption Violations During Analysis

An example: Courtney discovers that the distribution of one of her variables has a ceiling effect and that another is not normally distributed, violating the assumptions of her preregistered tests. Nevertheless, strategies are available to deal with such contingencies in data-analytic methods without undermining diagnosticity, e.g. blinding, preregistering a decision tree, or defining standard operating procedures.
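
A minimal sketch (hypothetical rule and thresholds, not from the article) of a preregistered decision tree for a situation like Courtney's: the choice of test is fixed by a rule written down in advance, so discovering an assumption violation does not become an opportunity to pick whichever analysis "works".

```python
# Pre-specified analysis rule: fall back to a non-parametric test if normality fails.
from scipy import stats

def preregistered_test(a, b, normality_alpha=0.05):
    normal = (stats.shapiro(a).pvalue > normality_alpha and
              stats.shapiro(b).pvalue > normality_alpha)
    if normal:
        return "t-test", stats.ttest_ind(a, b).pvalue
    return "Mann-Whitney U", stats.mannwhitneyu(a, b).pvalue
```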

Challenge 3: Data are Preexisting

An example: Ian uses data provided by others to do his research. In most cases he does not know which variables or data are available to analyze until after data collection, making it difficult to preregister. The extent to which it is possible to test predictions on preexisting data depends on whether the analysis-plan decisions are blind to the data. ‘Pure’ preregistration is possible if nobody has seen the data; e.g. an economist can register predictions about existing government data that have not yet been released. But researchers who read summaries of reports, or who get advice on how to approach a data set from prior analyses, might be influenced by those analyses.

Challenge 4: Longitudinal Studies and Large, Multivariate Datasets

An example: Lily leads a project making yearly observations of many variables over a 20-year period. Her group conducts dozens of investigations, and Lily could not have preregistered the entire design and analysis plan for all future papers at the start.

Challenge 5: Many Experiments

An example: Natalie’s lab quickly gathers data, running multiple experiments a week. Preregistering every experiment seems burdensome for their efficient workflow. Teams that run many experiments usually do so in a paradigm where each experiment varies some key aspects of a common procedure. In this situation preregistration can be as efficient as the design of the experiments.

Challenge 6: A Program of Research

An example: Brandon’s research is high risk and most of his research outcomes are null results. Once in a while he gets a positive result with huge implications. Though he preregisters everything, he can’t be completely confident in his statistical inferences; there is still a chance of false positives. This points to another key element of preregistration – all outcomes of the analysis plan must be reported, to avoid selective reporting.

Challenge 7: Few a Priori Expectations

An example: Matt doesn’t think that preregistration is useful to him because he regards his research as discovery science. It is common to start research with few predictions; it is less common for research to stay exploratory through a whole sequence of studies. If the data are used to formulate hypotheses rather than to claim evidence for hypotheses, then the paper may legitimately embrace post hoc explanations to open new areas of inquiry. But there are reasons to believe that postdiction is sometimes recast as prediction. In exploratory research, P values have unknown diagnosticity, and using them can imply testing (prediction) rather than hypothesis generation (postdiction). Preserving the diagnosticity of P values means only reporting them when testing predictions.

Challenge 8: Competing Predictions

An example: Rusty and Melanie agree on the study design and analysis of their project but have competing predictions. This isn’t a challenge for preregistration; competing predictions have desirable characteristics that can lead to strong inference favouring one theory over another. Preregistered research can hold multiple predictions simultaneously.

Challenge 9: Narrative Inferences and Conclusions

An example: Alexandra preregistered her study, reported all preregistered outcomes, and distinguished tested predictions from postdictions generated from the data. Some of her tests gave more interesting results than others, and so her narrative naturally focused on the interesting ones. One can follow a preregistration and still capitalize on chance in the interpretation of results: you could conduct 10 analyses while the narrative of the paper focuses on two of them, increasing how the paper is applied and cited through inferential error. This can be addressed statistically, e.g. by applying alpha corrections so that the narrative focus on positive results is not associated with an inflation of false positives.

How can preregistration become the norm?

The culture today provides the means (biases and statistical misuse), motive (incentives to publish positive, novel results), and opportunity (no prior commitment to predictions) for dysfunctional research practices. This is now shifting to provide means, motive, and opportunity for robust practices through preregistration.

How can means be advanced for preregistration?

  • A barrier to preregistration is insufficient/ineffective training of good statistical/methodological practices – online education modules available to facilitate learning.
  • Research domains that require submission for ethics review for research on humans and animals should specify some of the methodology before doing the research.
  • Thesis proposals for students often require comprehensive design and analysis plans that can easily become preregistrations.

How could motive be advanced for preregistration?

  • Today there are relatively weak incentives for research rigor and reproducibility, but this is changing. Preregistration is required by US law for clinical trials and is necessary for publication.
  • Journals and funders are beginning to adopt expectations for preregistration, and researchers’ behaviour is expected to follow.
  • Some journals have come up with badges for preregistration as incentives to give credit for having preregistered with explicit designation on a published article.
  • Incentives for preregistration in publishing. The preregistration challenge offers $1,000 awards to researchers that publish results of a preregistered study.
  • Researchers can submit their research questions and methodology to journals for peer assessment before the outcomes are observed (the registered reports model).

How could opportunity be advanced for preregistration?

  • Existing domain specific and general registries make it possible for researchers in any discipline to preregister their research.

Conclusion

Sometimes researchers use existing observations to generate ideas about how the world works – postdiction. Other times they have an idea about how the world works and make new observations to test whether the idea is a reasonable explanation – prediction. To make confident inferences, it’s important to know the difference. Preregistration solves this challenge by making researchers state how they will analyze the data before they observe it, allowing them to confront a prediction with the chance of being wrong.

A Manifesto for Reproducible Science – Munafo et al. - 2017 - Article

This paper argues for the adoption of measures to optimize key elements of the scientific process: methods, reporting and dissemination, reproducibility, evaluation, and incentives.

What is the problem?

Scientific creativity is characterized by identifying novel and unexpected patterns in data. A major challenge is to be open to new and important insights while avoiding being misled by our tendency to see structures in randomness. Apophenia (seeing patterns in random data), confirmation bias (focusing on evidence that’s in line with our expectations/favoured explanation), and hindsight bias (seeing an event as having been predictable after it has occurred) combined can lead us to false conclusions.

If several potential analytic pipelines can be applied to high-dimensional data, false-positives are highly likely.

What are measures that can be implemented when performing research? (methods)

Protecting against cognitive biases

An effective solution for diminishing self-deception and unwanted biases is blinding. Preregistration is an effective form of blinding because, at the time of planning, the data don’t yet exist and the outcomes aren’t yet known.

Improving methodological training

Threats to the robustness of science could be reduced by better statistical training. Training in research practices that protect against cognitive biases and the influence of distorted incentives is most important. Without formal requirements for continuing education, the best solution may be developing educational resources that are accessible, easy to digest, and immediately and effectively applicable to research (e.g. brief, web-based modules on specific topics).

Implementing independent methodological support

Many clinical trials have multidisciplinary trial steering committees to give advice on and oversee the design and conduct of a trial. The need for these committees came from the fact that financial conflicts of interest can exist in clinical trials (e.g. sponsors may be companies manufacturing the product being tested and may (un)intentionally influence design, analysis, and interpretation). Including independent researchers can alleviate these influences.

Encouraging collaboration and team science

This is a solution to the lack of resources individual investigators have for improving statistical power. Distributed collaboration across study sites facilitates high-powered designs and offers more potential for testing generalizability (as opposed to relying on the limited resources of single investigators).

What are measures that can be implemented when communicating research? (reporting and dissemination)

Promoting study preregistration

Can include the registration of a basic study design as well as a detailed pre-specification of study procedures, outcomes, and analysis plan. This was introduced to solve two problems: publication bias and analytical flexibility (outcome switching).

  • Publication bias: file drawer problem, the fact that more studies are conducted than published.
  • Outcome switching refers to changing the outcomes of interest in the study depending on the observed result.

Improving the quality of reporting

Preregistration improves discoverability of research, but this does not guarantee usability. Improving the quality and transparency in reporting research is necessary to address this. The Transparency and Openness Promotion (TOP) guidelines offer standards as a basis for journals and funders to incentivize or require more transparency in planning and reporting research.

Registered reports (RRs) are an initiative to eliminate various forms of bias in hypothesis-driven research, specifically the evaluation of a study based on its results. RRs divide the peer review process into two stages, before and after the results are known. First, reviewers assess a detailed protocol that includes the study rationale, procedure, and an analysis plan. Publication of the study outcomes is guaranteed if the authors adhere to the approved protocol, meet pre-specified quality checks, and draw conclusions that are appropriately evidence-bound. RRs prevent publication bias by accepting articles before the results are known, and they neutralize P-hacking because the hypotheses and analysis plans are known in advance.

  • The main objection against RRs is that the format limits exploration or creativity by making authors follow a pre-specified methodology. But RRs place no restrictions on creative analysis practices or serendipity.

Authors can report the outcomes of any unregistered exploratory analyses, as long as those tests are clearly labelled as post-hoc.

What are measures that can be implemented to support verification of research? (reproducibility)

Promoting transparency and open science

The credibility of scientific claims is based on the evidence supporting them, which includes the methodology applied, the data acquired, and the process of implementing the methodology, analyzing the data, and interpreting the outcomes.

Open science is the process of making the content and process of producing evidence and claims transparent and accessible to others.

  • There are barriers to meeting these ideals, including vested financial interests (e.g. in scholarly publishing) and few incentives for researchers to pursue open practices.
  • Commercial and non-profit organizations are building new infrastructures like the Open Science Framework to make transparency easy and desirable for researchers.

What are measures that can be implemented when evaluating research? (evaluation)

Diversifying peer review

Pre- and post-publication peer review mechanisms accelerate and expand the evaluation process. Sharing preprints enables researchers to get quick feedback on their work from a diverse community instead of waiting months for a few reviews in the conventional, closed peer review process.

Data sharing includes sharing data in public repositories, offering advantages in terms of accountability, data longevity, efficiency, and quality (reanalysis could catch crucial mistakes or fabrication).

  • Badges acknowledging open science practices – the Center for Open Science suggested that journals assign badges to articles with open data (and other open practices such as preregistration and open materials).
  • The Peer Reviewers’ Openness Initiative – researchers who sign this initiative pledge not to offer a comprehensive review for any manuscript that does not make its data publicly available or provide a clear justification for withholding it.
  • Requirements from funding agencies – NIH intends to make public access to digital scientific data the standard for all NIH-funded research. NSF requires submission of a data-management plan outlining how data will be stored and shared.

What role do incentives play?

Publication is the currency of academic sciences and increases likelihood of employment, funding, promotion, and tenure. Positive, novel, clean results are more likely to be published than negative results, replications and results with loose ends. Consequently, researchers are incentivized to produce the former, even at the cost of accuracy. Ultimately, incentives increase the likelihood of false positives in published literature. Shifting the incentives offers a chance to increase credibility and reproducibility in published results. There will always be incentives for innovative outcomes, but there should be incentives and rewards for transparent and reproducible research.

Conclusion

Challenges to reproducible science are systemic and cultural, but that does not mean they cannot be met. The measures described offer practical and achievable steps to improve rigor and reproducibility.

When will ‘Open Science’ Become Simply ‘Science’? – Watson - 2015 - Article

Open science means doing scientific research in a transparent way, and making results available to everyone.

What is the problem?

Right now, open science is seen as optional, e.g. open access to articles is offered at an extra cost. Imagine the opposite (e.g. having to pay to make your work closed).

The ‘mobile phone paradox’: mobile phones are a world-changing invention that allows people to connect from wherever they are, yet sometimes they do not work because of signal issues. Should we not have invented mobile phones because they sometimes fail? No. The same is true for open science: it will not always work, but it is the right thing to do.

There are six commonly accepted pillars of open science.

What is open data?

Releasing the raw and processed data from our experiments, allowing others to analyze them without restriction. The author thinks all raw data generated throughout an experiment should be released (especially discarded data), or at least enough to completely regenerate the analysis that was done (replicability). If the Human Genome Project had only published the ‘interesting’ data, many scientific discoveries would have been delayed.

Moral argument to disclose all data: the data do not belong to the scientists; they belong to the funders (e.g. taxpayers). Datasets should be freely available to those who funded them.

What is open access?

A model in which papers are available for anyone to read without paying, under a licence that allows secondary use such as text-mining. It is immoral to expect those who funded the research to pay to access its results.

What is open methodology?

A methodology that has been described in enough detail to allow researchers to repeat and apply the work elsewhere. A main reason we publish is so that others can learn from what we have done, showing how you carried out an experiment is essential to that.

What is open sourcing?

Refers to open and free access to the blueprint of a product. If you use software in your experiment, the source code should be available to read. It is part of the methods section and is the easiest part to share (online). Software should drive the open-science movement.

What is open peer-review?

It is about transforming the peer-review process, making it a collaboration between authors and reviewers built on constructive criticism. Part of it is removing anonymity. Evidence suggests that open review does no damage to the quality of peer review.

What is open education?

Refers to the open and free availability of educational resources. You can still charge for education, but the resources used to educate should be made freely available.

Code of Ethics for Research in the Social and Behavioural Sciences - 2018 - Article

The Code of Ethics gives guidelines for research in the social and behavioural sciences involving human participants. Guiding theme: apply or explain.

What are the principles?

Code of ethics is based on the following principles:

  • Respect the dignity of humans and their environment. Avoiding exploitation, treating participants with respect/care, protecting those with diminished autonomy.
  • Strive towards minimization of harm: a just distribution of benefits and burdens, respecting the potentially conflicting interests of diverse participants, communities, etc.
  • Adopt an ethical attitude where researchers are mindful of the meaning, implications, and consequences of research for anyone affected by it.
  • Demonstrate the ethical attitude by:
    • Active reflection on potential ethical issues arising during or as a consequence of the research.
    • Initiating assessment of potential drawbacks of research for individuals, communities, and society.
    • Monitoring for developments that could impact ethical aspects of the research.
  • Account for and communicate on ethical reflection vis-à-vis different stakeholders, e.g. participants and communities, peers, students, funders etc.
  • Conduct scientifically valid research, plausibly leading to relevant insights in social and behavioural sciences.

How these principles are safeguarded varies depending on the field of research.

What are some definitions?

  • Social and behavioural sciences: field of science studying patterns/causes of human behaviour.
  • Participant: person taking part in research where data is collected from them.
  • Institute: university faculty, research institute, or graduate school in the social sciences adhering to the code of ethics.
  • Board: the board of the institute.
  • Research plan: document addressing the rationale, background, objective, methodology, analyses, and relevant ethical aspects of a research project involving human participants.
  • Ethics review committee: committee of experts assigned by the board, responsible for reviewing research plans on their ethical aspects.
  • Personal data: data that can identify a person (participant).

What are the general procedures?

  • All institutes of social and behavioural sciences at Dutch universities subscribe to the guidelines.
  • Research on humans must be carried out according to a research plan.
  • The research plan identifies potential costs and benefits to all stakeholders, with emphasis on the consequences for participants/communities.
  • Positive review of a research plan must be obtained from an ethics review committee established by the institute where the research is conducted, or those who carry main responsibility for the research.
  • Ethics review must be done before research starts. When this isn’t possible, it has to be done as soon as possible. Researcher is responsible for acting according to ethical principles.
  • Ethics review committee evaluates the research plan based on ethical guidelines/local implementations. Based on this: either approval/positive advice is issued or withheld.
  • Ethics review is conducted in line with relevant laws. Abroad: principal investigator is responsible for ensuring that research is conducted with regard for local laws, habits, and customs.
  • In case of unclear/conflicting laws or values, nature and circumstances of the dilemma are clearly documented, together with a resolution plan.
  • The ethics review committee can suspend or revoke a positive review of a research plan if there are reasonable grounds to assume that continuation would lead to unacceptable harm or burden to the human participants involved.
  • Research must be covered by the regular legal liability insurance of the institute where the research is conducted or of the body with primary responsibility for conducting the research.

What is the scientific relevance, necessity, and validity?

  • Research conducted will lead to relevant insights.
  • Research can be done for training purposes if participants are informed of the training purpose.
  • Insights can’t be gained by less intrusive means.
  • Research is carried out in suitable locations and supervised by people with necessary skills.
  • Research uses sound methodology.

How does informed consent work?

  • Participants must be given the chance to understand the nature, purpose, and consequences of participation, voluntariness of participation etc.
  • Consent must be obtained to collect and register personal data.
  • Mentally incompetent participant: informed consent obtained from their legal representative.
  • Minors:
    • Minors aged 11-16: informed consent from both the minor and a parent/legal representative.
    • Younger minors: consent from one parent/legal representative is enough.
    • Over 16: consent from the participant only; it is good practice to inform the parent/legal representative.
  • Participants are monitored for signs of discontent before, during, and where possible after the research, and any discontent is alleviated appropriately.
  • Informed consent should be obtained for recording voices or images of participants.
  • Information is provided to the participant sufficiently in advance. The higher the impact/burden, the longer the time period.
  • Information is provided and consent asked in a comprehensible way for participant.
  • Informed consent is active (deliberate act of the participant – ‘opt-in’) unless there are special circumstances that call for passive consent – ‘opt-out’.
  • Deliberate or plausibly demonstrable acts of consent can be valid – writing, digitally, verbally etc.
  • Keep adequate records of when, how, and by whom, informed consent was given.
  • Indicate how the interests of third parties are protected – those who are not actively participating and from whom informed consent has not been obtained.
  • Additional informed consent must be obtained when there are changes in the nature, duration, focus etc. of a study.

Exceptions: when is withholding information, deception, passive consent, or no consent acceptable?

  • When preserving the integrity of research outweighs participant’s interest or is shown to be in the public interest. If information is withheld, participant will be given information after participation in a way that their consent remains intact.
  • Deception can only be employed if no other option is available and is justified by the value of the study.
  • Deception or withholding information isn’t allowed with procedures that are expected to cause physical/mental harm.
  • Any deception or withheld information must be explained to the participant as early as possible (right after participation). Participants must then be informed of their right to withdraw their data without negative consequences.
  • Passive consent is possible if: (a) active consent leads to disadvantages of validity or to the participants’ interest, (b) there’s little risk/burden, (c) participants are informed, (d) the opt-out procedure is straightforward.
  • No consent necessary for public observations if data collection happens anonymously – no registration of personal data.
  • Observation of specific groups only need consent from group members or an appropriate representative.
  • Collection of personal data calls for informed consent of the individual unless there’s a justified cause. This cause is justified by the institute’s legal office.
  • Re-use of data for new research, when informed consent cannot be obtained from the original participants, calls for review by the ethics review committee, which decides whether the re-use is justified.

How does compensation work?

  • Any compensation/benefits offered is fair.
  • Compensation doesn’t have a disproportionate effect on whether participants decide to participate.
  • Adequate compensation is provided when local resources are used.
  • The person conducting the research and the institute where it is being conducted receive reasonable compensation.

What about data protection and privacy?

  • Processing, storage, and publication that can lead to identification is guarded by relevant laws. Special care regarding highly sensitive personal data.
  • Protect those extra vulnerable to harm from being identified.

What is the ethics review committee?

  • Advisory body with at least 5 members, established by and reporting to the board of an institute.
  • The board appoints an executive secretary to the committee.
  • The board is responsible for instrumentation, financial support, and proper recording of all ethical reviews carried out.
  • The chair (or vice-chair, if appointed) and the executive secretary form the executive board of the committee.
  • The expertise of the members must cover the major disciplines of the institute and the typical ethical issues involved.
  • Committee is responsible for acquiring/maintaining knowledge about current ethical issues and evaluate new developments.
  • Committee strives towards raising ethical awareness through constructive dialogue and timely information.
  • Committee must have access to ethical and legal expertise.
  • Working methods and procedure must be specified in regulation available to all stakeholders.

What is the complaints procedure?

  • Complaints are filed with the board; an appeal can be lodged against a committee’s advice or decision according to the institute’s regulations.
  • There’s a publicly available complaints procedure for participants.

How does it work with generalized validity, multi-center research, and research at external institutions or locations?

  • If a decision is reached, it’s valid for all other Dutch institutes of behavioural and social sciences. No new review is necessary if research moves to another institute.
  • Ethical review responsibility lies mainly with principal investigator.
  • Multi-center research: review of different parts can be obtained separately from different institutes.
  • If research is conducted at an external organization, the research should (a) get permission from responsible authorities of external organization, or explain why unnecessary, and (b) check local ethical guidelines and procedures, in case of conflicting values, principles etc. check with the ethics review committee of home institute.
  • If there is an absent local scientific and ethical infrastructure – assess how the research plan fits with local values, customs, traditions etc.
Science and Ethics in Conducting, Analyzing and Reporting Psychological Research – Rosenthal - 1994 - Article

The article discusses scientific and ethical issues relevant to conducting psychological research. Looking at considerations of research design, procedures, and recruitment of human participants.

What are issues of design?

Even a research proposal that poses no risk to participants can be ethically questionable because of design issues. For example: research hypothesizing that private schools improve children’s intellectual functioning more than public schools do. Such a design does not allow reasonable causal inference because there is no randomization or consideration of other hypotheses. Research is questionable if:

  • Participants’ time is taken from more profitable experiences/treatments.
  • Poor quality design leads to unwarranted/inaccurate conclusions possibly harming the society funding the research.
  • Giving money and time to poor quality research keeps resources from better science.

What are issues of recruitment?

Hyperclaiming: telling prospective participants (+ granting agencies, colleagues etc.) that research is likely to achieve goals that it is unlikely to achieve. Colleagues and administrators can evaluate our claims fairly but our participants cannot. We should be honest with them about the realistic goals of the study.

Causism: tendency to imply a causal relationship where it has not been established.

Characteristics of causism:

  • No appropriate evidential base.
  • Presence of language implying cause (‘the effect of’, ‘as a result of’ etc.) where appropriate language would be ‘was related to’ or ‘could be inferred from’.
  • Self-serving benefits to the causist because it makes the result appear more important than it really is.

If the causist is unaware of the causism, it reflects poor scientific training. If they are aware of it, reflects unethical misrepresentation and deception.

A description of a proposed research study using causal language represents an unfair recruitment device used to increase potential participation rates.

How does bad science make for bad ethics?

The author proposes that institutional review boards should consider the technical scientific competence of investigators whose proposals they evaluate. Poor quality research can make for poor quality education. Asking a participant to participate in bad research increases the likelihood of them acquiring misconceptions about the nature of science and psychology rather than benefitting educationally.

What are costs and utilities?

When presented with questionable research proposals, investigators and review boards employ a cost-utility analysis in which the costs of doing a study (time, money, negative effects on participants) are weighed against its utilities (benefits to participants, science, the world, etc.). High-quality studies and studies on important topics have utilities that outweigh their costs, but it is hard to decide whether a study should be done when costs and utilities are roughly equal. The costs of failing to do the research should also be evaluated – these concern the benefits forgone for future generations or for the participants themselves. Example: if people can receive free care during a study that they otherwise could not afford, is it ethical not to conduct this research?

What does data dropping entail?

The goal to have more support for your hypothesis:

  1. Outlier rejection: researchers are more likely to drop outliers that are inconsistent with their hypothesis than outliers that fall in line with it. Outlier rejection should be reported, and if outliers are dropped, the results with the outliers included should be reported as well.
  2. Subject selection: a different type of data dropping, in which a subset of the data is not included in the analysis. Even when there are good technical reasons, ethical issues remain, for example when only the subsets that fail to support the researcher’s hypothesis are dropped. If subsets are dropped, readers should be informed of this and of what the results were. Similar considerations apply when results for one or more variables are not reported.

Is data exploitation always bad?

This issue has subtler ethical implications. We are taught that it is improper to snoop around in our data (analyze and reanalyze them). The author argues that this prohibition makes for bad science: although snooping affects p values, it is likely to turn up something new. It also makes for bad ethics: data are expensive in time, effort, and money, and not looking further into the data means missing things you would not otherwise have found.

The author says that if the research is worth conducting, it is also worth taking a closer look at the data. Replications are needed anyway, whether you snoop or not. Bonferroni adjustments can help with the significant p values that you may find after exploring your data.
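As a concrete illustration of how a Bonferroni adjustment works, here is a minimal sketch with hypothetical p values from a set of exploratory follow-up tests (the numbers are made up for illustration, not taken from the article):

```python
# Bonferroni adjustment: multiply each p-value by the number of tests performed,
# capping the result at 1. The values below are purely illustrative.
p_values = [0.012, 0.030, 0.200, 0.004]
m = len(p_values)

adjusted = [min(1.0, p * m) for p in p_values]

for p, p_adj in zip(p_values, adjusted):
    print(f"raw p = {p:.3f} -> adjusted p = {p_adj:.3f}")

# Only findings whose adjusted p stays below .05 survive the correction, which
# guards against the inflated false-positive risk of analyzing the data many ways.
```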

Can meta-analysis be used as an ethical imperative?

Meta-analyses are a set of concepts and procedures used to summarize any domain of research. We see them as more accurate, comprehensive, and statistically powerful than traditional literature reviews because they incorporate more information. This leads to:

  • More accurate estimates of effect sizes and relationships.
  • More accurate estimates of overall significance levels of the research domain.
  • More useful information about the variables moderating the magnitude of the effect.
  • Increase of utilities: time, effort, costs are all more justified when datasets are in a meta-analysis because we can learn more from our data.
  • Ethical implications of not doing meta-analyses: failing to employ meta-analytic procedures means losing the opportunity to make use of past research.

Meta-analyses try to explain the variation in effect sizes across different studies. It seems no longer acceptable to fund research intended to resolve a controversy unless an investigator has already done a meta-analysis to decide whether there really is a controversy.

Pseudocontroversies: meta-analysis resolves controversies because it eliminates two common problems in evaluating replications:

  1. Judging a replication to have failed when it does not reach significance – failure to replicate should instead be measured by the size of the difference between the effect sizes of the two studies.
  2. Believing that if an effect truly exists, each study of that situation will show a significant effect – in fact, the chance of finding a significant effect when there really is one (the power) is often quite low.

Significance testing: meta-analysis records the actual level of significance obtained (instead of merely whether a study reached a conventional level), typically as the signed standard normal deviate (Z score) corresponding to the p value (a sketch of one way to combine such deviates follows the list below). The use of signed normal deviates:

  • Increase a study’s informational value.
  • Increases a study’s utility.
  • Changes a study’s cost-utility ratio and ethical value.
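One classic procedure built on these deviates is Stouffer’s method for combining the results of independent studies: convert each one-tailed p value to its Z score (keeping the sign of the effect) and divide their sum by the square root of the number of studies. A minimal sketch, with purely illustrative p values:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical one-tailed p-values from k independent studies of the same effect,
# all in the predicted direction (so the signed deviates are all positive).
p_values = [0.04, 0.20, 0.08, 0.35]

# Convert each p-value to its standard normal deviate (Z score).
z_scores = [norm.isf(p) for p in p_values]   # isf(p) gives z such that P(Z > z) = p

# Stouffer's method: combined Z is the sum of the Z scores divided by sqrt(k).
k = len(z_scores)
z_combined = sum(z_scores) / sqrt(k)
p_combined = norm.sf(z_combined)             # one-tailed p for the combined result

print(f"combined Z = {z_combined:.2f}, combined one-tailed p = {p_combined:.3f}")
```

Here three of the four hypothetical studies are individually non-significant, yet the combined evidence is (roughly Z = 2.2, p = .014) – exactly the kind of pseudocontroversy the article argues meta-analysis can resolve.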

Meta-analyses increase research utility and ethical justification, providing accurate effect size estimates.

What are issues in reporting psychological research?

Misrepresenting findings

Some misrepresentations of findings are more obviously unethical than others.

  • Intentional misrepresentation: includes fabricating data and intentionally or knowingly allocating subjects to the experimental and control conditions in a way that supports the hypothesis. It also includes recording responses while not blind to the treatment condition, and having research assistants record responses while knowing the hypothesis and treatment conditions.
  • Unintentional misrepresentation: recording, computational, and data analytic errors can all lead to inaccurate results. This includes causist language and questionable generalizability. Errors in data diminish research utility and shift cost-utility ratio in an unfavourable direction.

Misrepresenting credit

  • Problems of authorship: many papers are multi-authored making it difficult to allocate authorship credit. Who becomes a coauthor vs a footnote? Who is assigned first or last coauthor in the listing?
  • Problems of priority: an issue between research groups. Who got the idea first?

Failing to report/publish

What was not reported and why? The two biggest forms of failure to report are self-censoring and external censoring.

  • Self-censoring: can be admirable when a study is done badly, could be a service to science to start over. Less admirable reasons could be failing to report data that contradict earlier research or personal values – poor science and ethics.
    • Good practice is to report all results that give information on the hypothesis and give data that other researchers could use.
  • External censoring: both progress and the slowing of progress in science depend on external censoring. Science would perhaps be more chaotic were it not for censorship by peers who keep bad research from being released. There are two major bases for external censorship:
    • Evaluation of methodology used in a study.
    • Evaluation of results obtained in a study.

What can we conclude?

The ethical quality of our research is not independent of the scientific quality of our research.

On the Social Psychology of the Psychological Experiment: Demand Characteristics and their Implications - Orne - 2002 - Article

What is the problem?

The experimental method tries to abstract certain variables from complicated situations in nature, and reproduce parts of them in the lab in order to determine the effect of certain variables. This allows for generalization from information obtained in the lab back to the original situation.

Subjects being observed are usually viewed as passive responders instead of active participants. Additionally, experimental stimuli tend to be defined in terms of what is done to the subject, not how they react (limited dynamism). These factors combined lead to issues in reproducibility and ecological validity, which are necessary for meaningful experimentation.

What are demand characteristics?

A participant actively wants to contribute to research and will react to ‘demand characteristics’. A participant’s behaviour in an experiment is determined by perceived demand characteristics of the setting as well as by experimental variables.

Demand characteristics are the totality of cues that convey a hypothesis to participants and determine their behaviour (rumours about the experiment, explicit or implicit information given during the experiment, the laboratory setting, views based on the participant’s previous knowledge and experience, etc.). The perceived demand characteristics can vary from one experiment to another.

What factors affect a subject’s reaction to stimuli in an experimental setting?

  • Motivation to comply with experimenter’s instructions.
  • Perception of behaviour research.
  • Nature of cues likely picked up etc.

It is proposed that these factors must be further elaborated and the parameters of an experimental setting should be more carefully defined so that sufficient controls can be designed to isolate the effects of the setting from the effects of the experimental variables.

In our culture the roles of subject and experimenter are well understood and carry well-defined mutual role expectations within an experimental setting:

  • Subjects put themselves under experimenter’s control.
  • Subjects tolerate boredom/discomfort if required.
  • A shared assumption about the purpose of the experiment justifies requesting many things from a participant.
  • Effort and discomfort are justified by a higher purpose.

It is hard to design an experiment that tests the degree of social control in an experiment because of the already high degree of control in an experimental situation.

Why do people comply with such a wide range of instructions/requests? (motivational aspects)

  • They want the experiment to succeed and be a ‘useful’ participant.
  • They attribute meaning to their actions.
  • They identify with scientific goals.
  • Subjects are concerned with their performance: reinforcing self-image, concerned with the usefulness of their performance.
  • Subjects will act in line with ‘being a good participant’ and in a way to validate the perceived/assumed hypothesis.

In what kind of contexts and circumstances do demand characteristics become significant in determining the behaviour of subjects in experimental situations?

  • Demand characteristics can not be eliminated, but you can study their effect.
  • Studies need to take the effects of demand characteristics into account.
  • Response to demand characteristics is not just conscious compliance – perception of experimenter’s hypothesis tends to be a more accurate predictor of actual performance than a statement about what the participants think they have done on a task.
  • The most powerful demand characteristics, in determining behaviour, convey the experiment’s purpose effectively, but not obviously. If the subject knows the experimenter’s expectations, there may be a tendency for biasing in the opposite direction.

What are experimental techniques to study demand characteristics?

  1. Studying each participant’s perception of the experimental hypothesis. Determine how they correlate with observed behaviour.
  2. Determine which demand characteristics are perceived, post-experimental inquiry – inquiry procedures required to not provide expectation cues.
    • Criticisms of this inquiry: (a) the inquiry procedure is itself subject to demand characteristics, and (b) participants’ own behaviour can influence their perception of the purpose of the study. So a correlation between participants’ expectations and the experimenter’s hypothesis may not have much to do with what determined their behaviour.
  3. Pre-experimental inquiry (approximation, not proof) – show the procedure, materials, and requirements, then ask participants what they think the hypothesis is without them having done the experimental tasks. Also ask how they would have acted if they had been part of the experiment.
    • Leads to formulating possible hypotheses without knowing about their own reactions to testing.
    • Participant’s own behaviour in the experiment can’t influence their perception of the purpose of the study because they did not take part in it yet.
    • Participants of pre-experimental inquiry can’t take part in the actual experiment anymore, but must be drawn from the same population.
    • If subjects describe behaviour matching the behaviour actually performed at testing, it becomes plausible that demand characteristics are responsible for that behaviour. It is possible that demand characteristics, rather than the experimental variables, account for the behaviour of an experimental group.
  4. Hold the demand characteristics constant and get rid of the experimental variable. Simulating participants: instructed to act as if exposed to experimental variable.
  5. Blind designs control for experimenter bias, which is important because experimenter bias will also influence the participant’s assumed hypothesis.

All these techniques depend on active cooperation of the control participants. This suggests that the subject does not just respond in these control situations but rather is required to actively solve a problem. We should view experiments as a special form of social interaction. It is suggested that control of demand characteristics could lead to greater replicability and generalizability of psychological experiments.

Summary and outlook

  • Experiments as a special form of social interaction.
  • Subject should be recognized as an active participant.
  • Subject’s behaviour in an experiment is a function of the totality of the situation, including the experimental variables and demand characteristics.
  • Control for demand characteristics is suggested to lead to higher replicability and ecological validity.
  • Study and control of demand characteristics are not just a matter of good research technique; it is an empirical issue to figure out in what context they significantly influence participants’ experimental behaviour.
A Power Primer: Tutorials in Quantitative Methods for Psychology – Cohen - 1992 - Article

What is the problem?

One reason why statistical power analysis continues to be ignored in behavioural science is the difficulty of the standard reference material. The probability of obtaining a significant result has not increased over the last 25 years. Why?

Everyone agrees on the importance of power analysis, and there are many ways to estimate sample sizes, so part of the reason could be a low level of consciousness about effect size: it is as if the only concern about magnitude in much psychological research is with the statistical test result and its p value, not with the size of the psychological phenomenon being studied. Some blame this on the precedence of Fisher’s null-hypothesis testing, with its clear-cut go/no-go decision at p = .05. The author suggests that the neglect of power analysis reflects the slow movement of methodological advance, and that researchers find the standard reference material for power analysis too complicated.

What are the components of power analysis?

Statistical power analysis uses the relationships between four variables involved in statistical inference:

  1. The significance criterion α: the risk of falsely rejecting the null hypothesis (H0) and thereby committing a Type I error. α represents a policy: the maximum acceptable risk of such a rejection.
  2. Power: the statistical power of a significance test is the long-term probability, given the ES, N, and α, of rejecting H0. When the ES does not equal zero, H0 is false, so failing to reject it is also an error (a Type II error, with probability β). Power equals 1 – β (the probability of rejecting a false H0). With α = 0.05, a power of 0.80 gives a β:α ratio of 4:1 (0.20 to 0.05) between the two risks.
  3. Sample size (N): in planning, the researcher needs to know the N required to achieve the desired power for the specified α and the hypothesized ES. N increases when:
    1. the desired power increases,
    2. the ES decreases,
    3. α decreases.
  4. Population effect size (ES): the degree to which H0 is believed to be false. Neither N nor power can be determined without the ES. The degree to which H0 is false is indexed by the discrepancy between H0 and H1 (the ES); for every test, H0 states that the ES is 0.

d – the ES index for the t-test of the difference between two independent means (the difference between the means divided by the population standard deviation). H0 is d = 0 (no difference between group means). The small, medium, and large ESs (H1s) are d = 0.20, 0.50, and 0.80.

Using Cohen’s table, we can find the necessary N’s for different powers and ESs (a computational sketch follows the examples below).

  • To detect a medium difference between two independent sample means at α = 0.05 requires N = 64 in each group.
  • For a significance test of a sample r at α = 0.01 when the population r is large, a sample size of 41 is required. At an α = 0.05 a sample size of 28.
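The numbers in such tables can be approximated with the standard normal-approximation formula for a two-sample t-test, n per group ≈ 2 · ((z for α/2 + z for power) / d)². A rough sketch of this calculation (Cohen’s exact tables use the noncentral t distribution, so the result differs slightly):

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group N for a two-sided two-sample t-test
    (normal approximation; Cohen's tables use the noncentral t)."""
    z_alpha = norm.isf(alpha / 2)   # critical deviate for a two-sided test
    z_power = norm.isf(1 - power)   # deviate corresponding to the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Medium effect (d = 0.50) at alpha = .05 with power = .80:
print(n_per_group(0.50))            # ~63, close to the 64 per group in Cohen's table
```

For exact table values, a power library such as statsmodels (TTestIndPower().solve_power) can be used; it solves the noncentral-t version of the same relationship.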
Power Failure: Why Small Sample Size Undermines the Reliability of Neuroscience – Button et al. - 2013 - Article

What is the purpose of this paper?

A study with low power is less likely to detect a true effect, but it is less well appreciated that low power also decreases the likelihood that a significant result reflects a true effect. This paper shows that the average statistical power of studies in the neurosciences is low. Consequences include overestimated effect sizes and low reproducibility. There is also an ethical dimension: unreliable research is inefficient and wasteful.

The paper discusses issues that arise when low-powered research is frequent. They are divided into two categories: (1) concerning problems that are mathematically expected even if the research is otherwise perfect, (2) concerning problems that reflect biases tending to co-occur with studies of low power or become worse in small, underpowered studies.

What are the three main problems?

The main problems contributing to producing unreliable findings with low power are:

  • Low probability of finding true effects
  • Low positive predictive value (PPV) when an effect is claimed
  • An exaggerated estimate of the effect size when a true effect is discovered

What are some key statistical terms?

  • CAMARADES: collaborative approach to meta-analysis and review of animal data from experimental studies. A collaboration aiming to decrease bias and improve method quality and reporting in animal research. Promotes data-sharing.
  • Effect size: standardized measure quantifying the size of a difference between two groups or strength of an association between two variables.
  • Excess significance: phenomenon where literature has an excess of statistically significant results due to biases in reporting. Mechanisms contributing to bias: study publication bias, selective outcome reporting, and selective analysis bias.
  • Fixed and random effects: a fixed-effect meta-analysis assumes that the underlying effect is the same in all studies and that any variation is due to sampling error; a random-effects meta-analysis allows the true effect to vary across studies.
  • Meta-analysis: statistical methods for contrasting and combining results from different studies to give more powerful estimates of the true ES.
  • Positive predictive value (PPV): probability that a ‘positive’ finding reflects a true positive effect; depends on prior probability of it being true.
  • Proteus phenomenon: the situation in which the first published study is often the most biased towards an extreme result (the winner’s curse); replication studies tend to be less biased towards the extreme, finding smaller or even contradictory effects.
  • Statistical power: the probability that a test will correctly reject a false H0, i.e. the probability of not making a Type II error (the probability of a Type II error is β, and power = 1 – β).
  • Winner’s curse: the ‘lucky’ scientist who makes a discovery is cursed by finding an inflated estimate of the effect. The curse is most severe when significance thresholds are strict and studies are small and underpowered (a simulation sketch follows this list).
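A small simulation, under assumed parameters (true d = 0.3, 15 participants per group), can illustrate the winner’s curse: if only significant results are considered, the average estimated effect size is far larger than the true effect.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
true_d, n, n_sims = 0.3, 15, 5000      # assumed true effect size and per-group N

significant_ds = []
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    t_stat, p = ttest_ind(treatment, control)
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    d_obs = (treatment.mean() - control.mean()) / pooled_sd   # observed Cohen's d
    if p < 0.05:
        significant_ds.append(d_obs)

print(f"true d = {true_d}")
print(f"mean estimated d among significant results = {np.mean(significant_ds):.2f}")
# With this little power, the 'winning' studies report an effect far larger than 0.3 on average.
```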

What are additional biases associated with low power?

Firstly, low-powered studies are more likely to give a wide range of estimates of the magnitude of an effect. This is known as ‘vibration of effects’: situation where a study obtains different estimates of an effect size depending on the analytical options it implements. Results can vary depending on the analysis strategy, especially in small studies.

Secondly, publication bias, selective data analysis, and selective reporting of outcomes are likely to affect low-powered studies.

Third, small studies may be lower quality in other aspects of their design as well. These factors can further exacerbate the low reliability of evidence attained in studies with low power.

What are the implications for the likelihood that a research finding reflects a true effect?

Trying to establish the average power in neuroscience is hampered by the issue that the ES is unknown. One solution may be using data from meta-analysis (estimate ES). Studies contributing to meta-analysis, however, are subject to the same problem of unknown ES.

Results show that the average power of studies in neuroscience is probably no higher than 8%-31%. This has major implications for the field.

  • The likelihood that any significant finding is actually a true effect is small (PPV decreases as power decreases; see the sketch after this list).
  • Ethical implications: inefficient and wasteful in animal/human testing.
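The first point can be sketched with the standard formula the paper builds on, PPV = (power × R) / (power × R + α), where R is the pre-study odds that a probed effect is actually non-null (the value of R below is purely illustrative):

```python
def ppv(power, alpha=0.05, prior_odds=0.25):
    """Positive predictive value: probability that a significant finding is a true effect.
    prior_odds (R) is the assumed pre-study odds that the tested effect is real."""
    return (power * prior_odds) / (power * prior_odds + alpha)

# Compare a well-powered study with the 8-31% power range reported for neuroscience.
for power in (0.80, 0.31, 0.08):
    print(f"power = {power:.2f} -> PPV = {ppv(power):.2f}")
```

With these assumptions, PPV falls from about .80 at 80% power to about .29 at 8% power: at low power, most ‘positive’ findings would not reflect true effects.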

What recommendations are there for researchers?

  • Perform an a priori power calculation: estimate ES you are looking for and design your study accordingly.
  • Disclose methods and findings transparently.
  • Pre-register your study and analysis plan: this clarifies confirmatory and exploratory analyses, reducing opportunities for non-transparency.
  • Make study materials and data available – improving quality of replication.
  • Work collaboratively to increase power and replicate findings; combining data increases total N (and power) while minimizing labour/resources.

What can we conclude?

Researchers can improve confidence in published reports by noting in the text how they determined their sample size, all data exclusions, all data manipulations, and all measures in the study. Where full disclosure is not possible, stating the rationale for and justification of deviations will improve readers’ understanding and interpretation of the reported effects, and of what level of confidence in them is appropriate.

The Nature and History of Experimental Control - Boring - 1954 - Article

What does ‘control’ mean?

Control has three meanings:

  1. A check: verification – a standard of comparison used to check inferences deduced from an experiment.
  2. A restraint: checking in the sense of maintaining constancy – keeping experimental conditions constant and also altering the independent variable according to a precise, known predetermination.
  3. A guide/direction: producing a precisely predetermined change, i.e. a constant and restrained change.

What are Mill’s four methods of experimental inquiry? (Logic, 1843)

  1. Method of agreement: if A is always followed by a, then A presumably is the cause of a. Mere agreement does not provide rigorous proof. Inference of causation is not safe when based on just agreement.
  2. Method of difference: if A is always followed by a, and not-A is always followed by not-a, then A certainly causes a. This is equivalent to adding the control observation. This method gives control as a verifying check.
  3. Method of residue: if ABC is a known cause of abc, and BC a cause of bc, then A must cause a, even though A can’t be produced without BC nor a without bc.
  4. Method of concomitant variations: exists when there is a series of differences, and in any pair of concomitances, one concomitance furnishes a comparison or control for the others.

What is control as restraint/guidance?

The meaning of control as restraint/guidance is the common meaning of the term. In science it applies to keeping experimental conditions constant and to altering the independent variable according to a precise, known predetermination.

What is control as a check/standard of comparison?

Control observations, series, and experiments use ‘control’ in its original meaning as a check. It was in the 1870s that the word control started to be used in the sense of a check or standard of comparison against which an expected difference is assessed.

  • Around this time the word was also used in this sense in biology by Darwin.
  • Wundt was ready enough to control a piece of apparatus with an objective check, but the only control he admitted as valid for the human observer was rigorous training in psychological observation.
  • As experimental psychology developed, the design of its experiments became more elaborate. Speaking approximately, we could say that the formal design first developed in psychophysics, then in reaction experiments, then in memory experiments.
  • The first investigation to use the full design (fore-test – practice – after-test) in the experimental group to be compared with (fore-test – nothing – after-test) in the control group was Winch in 1908, and the general use of control groups in these experiments begins then.
Thinking Clearly About Correlations and Causation: Graphical Causal Models for Observational Data - Rohrer - 2018 - Article

What is the purpose of the article?

The article discusses causal inference based on observational data, introducing readers to graphical causal models that can provide a powerful tool for thinking more clearly about the interrelations between variables.

Researchers from different areas use different strategies to deal with weak observational data (manipulating the independent variable can sometimes be unfeasible, unethical, or impossible).

  • E.g. using surrogates can lead to valuable insights but also comes with a trade-off of decreased external validity.
  • Some researchers try to cautiously avoid causal language.
  • Many have tried to statistically control for third variables. Often these attempts lack proper justification.

This article aims to give psychologists a primer on a more principled approach to making causal inferences from observational data. It discusses how causal inference from observational data can be improved through the use of directed acyclic graphs (DAGs), which provide visual representations of causal assumptions.

What are directed acyclic graphs?

DAGs consist of nodes (representing the variables in the research) and arrows (indicating the direction of a causal relationship). They can display the direction of causation or a lurking (confounding) variable, and they contain only single-headed (directed) arrows. They are ‘acyclic’ because they do not allow cyclic paths in which variables become their own ancestors: a variable cannot causally affect itself.

What is a back-door path?

If we want to derive a valid causal conclusion we need a causal DAG that’s complete because it includes all common causes of all pairs of variables that are included in the DAG. After such a DAG is built, ‘back-door paths’ can be recognized. These are (non-causal) paths starting with an arrow pointing to the independent variable and ending with an arrow pointing to the dependent variable (indicating there may be a common factor affecting both treatment and outcome). They are problematic whenever they convey an association and can show a spurious association.

The purpose of third variable control is to block back-door paths. If all back-door paths between variables can be blocked, the causal effect connecting the independent and dependent variables can be identified.

How do we control for a variable?

  • Stratified analysis: stratify the sample controlling for confounders. Maybe unfeasible if the third variable needing control has many levels, if it is continuous, or if multiple third variables and interactions need to be accounted for.
  • Including third variables in regression models: the dependent variable can be regressed on both the independent variable and the covariate to control for the covariate’s effects (see the sketch after this list). This does not guarantee adequate adjustment for the covariate.
  • Matching: when there is a need to control for many third variables. Propensity-score matching is popular in social sciences but it fails to properly identify the causal effect.
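A minimal simulation, with hypothetical variable names, of why regressing the outcome on both the independent variable and a measured confounder can block a back-door path: here Z causes both X and Y, and X has no causal effect on Y at all.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Back-door path X < Z > Y: the confounder Z causes both X and Y; X does not cause Y.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 0.8 * z + rng.normal(size=n)

# Leaving the back-door path open: X and Y are clearly (spuriously) associated.
print("raw correlation of x and y:", round(np.corrcoef(x, y)[0, 1], 2))

# Conditioning on Z by including it in the regression blocks the path:
design = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print("coefficient on x controlling for z:", round(coef[1], 2))   # ~0
```

This works here only because Z is measured without error and the relationships are linear; as the summary notes, including a covariate does not guarantee adequate adjustment in general.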

What are examples of collider bias?

  • Nonresponse bias: for example, a researcher analyzes only completed questionnaires, and the variables of interest are associated with questionnaire completion. Suppose we are interested in the association between grit and intelligence and our assessment is burdensome: grit and intelligence both make it easier for respondents to push through and finish it. Questionnaire completion is therefore a collider between grit and intelligence (see the simulation after this list).
  • Attrition bias: systematic errors caused by unequal loss of participants. If only remaining respondents are included in analysis, spurious associations may arise and open up back-door paths between variables of interest.
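The grit/intelligence example can be made concrete with a short simulation (variable names and effect sizes are assumed for illustration): grit and intelligence are independent in the population, but among respondents who completed the burdensome questionnaire (the collider), a spurious negative association appears.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

grit = rng.normal(size=n)
intelligence = rng.normal(size=n)        # independent of grit in the population

# Completing the questionnaire depends on both traits: completion is a collider.
completed = (grit + intelligence + rng.normal(size=n)) > 0.5

print("correlation in the full population:",
      round(np.corrcoef(grit, intelligence)[0, 1], 2))                        # ~0
print("correlation among completers only:",
      round(np.corrcoef(grit[completed], intelligence[completed])[0, 1], 2))  # < 0
```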

What are some definitions?

  • Ancestor: a variable causally affecting another variable, influencing it directly (ancestor > X) or indirectly (ancestor > mediator > X). Direct ancestors are called parents.
  • Blocked path: a path containing (a) a collider that the analysis has not been conditioned on, or (b) a non-collider (confounder or mediator) that the analysis has been conditioned on. It does not transmit an association between variables.
  • Causal path: a path consisting only of chains, which can convey a causal association if unblocked.
  • Chain: causal structure of the form A > B > C.
  • Collider: a variable in the middle of an inverted fork (A > collider < C).
  • Conditioning on a variable: the process of introducing information about a variable into an analysis (e.g., statistical control or sample selection).
  • Confounder: a variable in the middle of a fork (A < confounder > C).
  • Descendant: a variable causally affected by another variable, directly (X > descendant) or indirectly (X > mediator > descendant). Direct descendants are called children.
  • Fork: structure of the form A < B > C.
  • Inverted fork: structure of the form A > B < C.
  • Mediator: a variable in the middle of a chain (A > mediator > C).
  • Node: a variable in a DAG.
  • Non-causal path: a path containing at least one fork or inverted fork, which can convey a non-causal association when unblocked.

What are some examples of DAGs?

  • Educational attainment < intelligence > grades (fork).
  • Intelligence > educational attainment > grades (chain).
  • Intelligence > grades

Summary

The practice of making causal inferences based on observational data depends on awareness of potential confounders and meaningful statistical control (or non-control) taking into account estimation issues like nonlinear confounding and measurement error. Back-door paths should be considered before data is collected to make sure all relevant variables are measured. Additionally, variables that should not be controlled for (colliders and mediators) need to be considered.
