Wednesday, 18 July 2012

The Dirty Dozen: A wish list for psychology and cognitive neuroscience

It’s been quite a month in science. 

On the bright side, we probably discovered the Higgs boson (or at least something that smells pretty Higgsy), and in the last few days the UK Government and EU Commission have made a strong commitment to supporting open-access publishing. In two years, so they say, all published science in Britain will be freely available to the public rather than being trapped behind corporate paywalls. This is a tremendous move and I applaud David Willetts for his political courage and long-term vision.

On the not-so-bright side, we’ve seen a flurry of academic fraud cases. Barely a day seems to pass without yet another researcher caught spinning yarns that, on reflection, did sound pretty far-fetched in the first place. What’s that? Riding up rather than down an escalator makes you more charitable? Dirty bus stops make you more racist? Academic fraudsters are more likely to have ground-floor offices? Ok, I made that last one up (or rather, Neuroskeptic did) but if such findings sound like bullshit to you, well funnily enough they actually are. Who says science isn’t self-correcting?

We owe a great debt to Uri Simonsohn, the one-man internal affairs bureau, for judiciously uncovering at least three cases of fraudulent practice in psychological research. So far his investigations have led to two resignations and counting. Bravo. This is a thankless task that will win him few friends, and for that alone I admire him.

And as if to remind us that fraud is by no means unique to psychology, enter the towering Godzilla of mega-fraud – Japanese anaesthesiologist, Yoshitaka Fujii, who has achieved notoriety by becoming the most fraudulently productive scientist ever known.

(As an aside, has anyone ever noticed how the big frauds in science always seem to be perpetrated by men? Are women more honest or do they just make savvier fraudsters?)

Along with all the talk of fraud in psychology, we have had to tolerate the usual line-up of ‘psychology isn’t science’ rants from those who ought to learn something before setting hoof to keyboard. Fortunately we have Dave Nussbaum to sort these guys out, which he does with a steady hand and a sharp blade. Thank you, Dave!

With psychological science facing challenges and shake-ups on so many different fronts, the time seems ripe for some self-reflection. I used to believe we had a firm grasp on methodology and best practice. Lately I’ve come to think otherwise.

So here’s a dirty dozen of suggested fixes for psychology and cognitive neuroscience research that I’ve been mulling over for some time. I want to stress that I deserve no credit for these ideas, which have all been proposed by others.

1. Mandatory inclusion of raw data with manuscript submissions

No ifs. No buts. No hiding behind the lack of ethics approval, which can be readily obtained, or the vagaries of the Data Protection Act. Everyone knows data can be anonymised.

2. Random data inspections

We should conduct fraud checks on a random fraction of submitted data, perhaps using the methodology developed by Uri Simonsohn (once it is peer reviewed and judged statistically sound – as I write this, the technique hasn’t yet been published). Any objective test for fraud must have a very low false discovery rate because the very worst thing would be for an innocent scientist to be wrongly indicted. Fraudsters tend to repeat their behaviour, so the likelihood of false positives in multiple independent data sets from the same researcher should (hopefully) be infinitesimally small.
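To put a rough number on that last point, here is a minimal sketch in Python. It assumes the checks on each data set are statistically independent and that the test has a fixed false positive rate per data set; the 1% figure and the function name are illustrative assumptions of mine, not properties of Simonsohn’s method.

```python
def joint_false_positive_rate(per_test_rate: float, n_datasets: int) -> float:
    """Chance that an innocent researcher is falsely flagged on ALL of
    n independent data sets, given a per-data-set false positive rate."""
    if not 0.0 <= per_test_rate <= 1.0:
        raise ValueError("per_test_rate must be a probability")
    if n_datasets < 1:
        raise ValueError("n_datasets must be at least 1")
    # Independence assumption: the joint probability is the product.
    return per_test_rate ** n_datasets


# Hypothetical 1% false positive rate per data set.
for k in (1, 2, 3, 5):
    print(k, "data sets:", joint_false_positive_rate(0.01, k))
```

If the data sets are not truly independent (same lab habits, same analysis pipeline), the real figure would be larger than this product suggests.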

3. Registration of research methodology prior to publication

Some time ago, Neuroskeptic proposed that all publishable research should be pre-registered prior to being conducted. That way, we would at least know from the absence of published studies how big the file-drawer is. My first thoughts on reading this were: why wouldn’t researchers just game the system, “pre” registering their research after the experiments are conducted? And what about off-the-cuff experiments conjured up over a beer in the pub?

As Neuroskeptic points out, the first problem could be solved by introducing a minimum 6-month delay between pre-registration and data submission. Also, all prospective co-authors of a pre-registration submission would need to co-sign a letter stating that the research has not yet been conducted.

The second problem is more complicated, but also tractable. My favourite solution is one posed by Jon Brock. Empirical publications could be divided into two categories, Experiments and Observations. Experiments would be the gold standard of hypothesis-driven research. They would be pre-registered with methods (including sample size) and proposed analyses pre-reviewed and unchangeable without further re-review. Observations would be publishable but have a lower weight. They could be submitted without pre-registration, and to protect against false positives, each experiment from which a conclusion is drawn would be required to include a direct internal replication.

4. Greater emphasis on replication

It’s a tired cliché, but if we built aircraft the way we do psychological research, every new plane would start life exciting and interesting before ending in an equally exciting fireball. Replication in psychology is dismally undervalued, and I can’t really figure out why this is when everyone, even journal editors, admits how crucial it is. It’s as though we’re trapped in some kind of groupthink and can’t get out. One solution, proposed by Nosek, Spies and Motyl, is the development of a metric called the Replication Value (RV). The RV would tell us which effects are most worth replicating. To quote directly from their paper, which I highly recommend:

Metrics to identify what is worth replicating. Even if valuation of replication increased, it is not feasible – or advisable – to replicate everything. The resources required would undermine innovation. A solution to this is to develop metrics for identifying Replication Value (RV)– what effects are more worthwhile to replicate than others? The Open Science Collaboration (2012b) is developing an RV metric based on the citation impact of a finding and the precision of the existing evidence of the effect. It is more important to replicate findings with a high RV because they are becoming highly influential and yet their truth value is still not precisely determined. Other metrics might be developed as well. Such metrics could provide guidance to researchers for research priorities, to reviewers for gauging the “importance” of the replication attempt, and to editors who could, for example, establish an RV threshold that their journal would consider as sufficiently important to publish in its pages.

I think this is a great idea. As part of the manuscript reviewing process, reviewers could assign an RV to specific experiments. Then, on a rolling basis, the accepted studies that are assigned the highest weightings would be collated and announced. Journals could have special issues focusing on replication of leading findings, with specific labs invited to perform direct replications and the results published regardless of the outcome. This method could also bring in adversarial collaborations, in which labs with opposing agendas work together in an attempt to reproduce each other’s results.

5. Standardise acceptable analysis practices

Neuroimaging analyses have too many moving parts, and it is easy to delude ourselves that the approach which ends up ‘working’ (after countless reanalyses) is the one we originally intended. Psychological analyses have fewer degrees of freedom but this is still a major problem. We need to formulate a consensus view on gold standard practices for excluding outliers, testing and reporting covariates, and inferential approaches in different situations. Where multiple legitimate options exist, supplementary information should include analyses of them all, and raw data should be available to readers (see point 1).

6. Institute standard practices for data peeking

Data peeking isn't necessarily bad, but if we do it then we need to correct for it. Uncorrected peeking runs riot in psychology and neuroimaging because the pressure to publish and the dependence of publication on significant results have made chasing p-values the norm. We can see it in other areas of science too. Take the Higgs. Following initial hints at 3-sigma last year, the physicists kept adding data until they reached 5-sigma. The fact that their alpha is so stringent in the first place provides reassurance that they have genuinely discovered something. But if they peeked and chased then it simply isn’t the 5-sigma discovery that was advertised. (As a side note: how about we ditch Fisher-based stats altogether and go Bayesian? That way we can actually test that pesky null hypothesis.)
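For anyone who doubts how much uncorrected peeking matters, here is a rough simulation sketch in Python. Everything in it is an assumption chosen for illustration: there is no true effect, the data are standard normal, the test is a two-sided z-test with known variance, and the peeking researcher checks after every 10 participants up to a maximum of 100, stopping as soon as p < .05.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used to turn z into a p-value


def p_value(samples):
    """Two-sided z-test of 'mean == 0', assuming known unit variance."""
    n = len(samples)
    z = (sum(samples) / n) * n ** 0.5
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def run_study(rng, peek, batch=10, max_n=100, alpha=0.05):
    """Simulate one study with NO true effect; return True if it ends up
    'significant'. With peek=True we test after every batch and stop early."""
    samples = []
    while len(samples) < max_n:
        samples.extend(rng.gauss(0.0, 1.0) for _ in range(batch))
        if peek and p_value(samples) < alpha:
            return True  # stop collecting data and declare victory
    return p_value(samples) < alpha


def false_positive_rate(peek, n_studies=2000, seed=1):
    """Fraction of null-effect studies that come out 'significant'."""
    rng = random.Random(seed)
    hits = sum(run_study(rng, peek) for _ in range(n_studies))
    return hits / n_studies


print("fixed sample size:", false_positive_rate(peek=False))
print("peeking every 10: ", false_positive_rate(peek=True))
```

The fixed-sample rate should sit near the nominal 5%, while the peeking rate should come out well above it. Correcting for peeking means tightening the threshold at each look so that the overall rate returns to the level we advertise.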

7. Officially recognise quality of publications over quantity

Everyone agrees that quality of publications is paramount, but we still chase quantity and value ‘prolific’ researchers. So how about setting a cap on the number of publications each researcher or lab can publish per year? That way we would truly have an incentive to make sure of results before publishing them. It would also encourage us to publish single papers with multiple experiments and more definitive conclusions.

8. Ditch impact factor and let us never speak of it again

As scientists who purportedly know something about numbers, we should be collectively ashamed of ourselves for being conned by journal impact factors (IF). Nowhere is the ludicrous doublethink of the IF culture more apparent than in the current REF, where the advice from universities amounts to “IF of journals is not taken into account in assessing quality of your REF submissions” while simultaneously advising us to “ensure that your four submissions are from the highest impact journals”. Complete with helpful departmental emails reminding us which journals are going up in IF (which is all of them as far as I can tell), the situation really is quite stupid and embarrassing. Here’s a fact shown by Bjorn Brembs: IF correlates better with retraction rate than citation rate. We should replace IF with article-specific merits such as post-publication ratings, article citation count, or – shock horror – considered assessment of the article after reading the damn thing.

9. Open access publication

Much has been said and written in the last few days about open access, with the Government making important steps toward an open scientific future in the UK (I recommend following the blogs of Stephen Curry and Mike Taylor for the latest developments and analysis). For my part, I think the sooner we eliminate corporate publishers the better. I simply don’t see what value they add when all of the reviewing and editing is done by us at zero cost.

10. Stop conflating research inputs with research outputs

Getting a research grant is great, but we need to stop counting grants as outputs. They are inputs. We need to start assessing the quality of science by balancing outputs against inputs, not by adding them together.

11. Rethink authorship

Academic authorship is antiquated and not designed for collaborative teams. By rank-ordering authors from first to last, we make it impossible for multiple co-authors to make a genuinely equal contribution (Ah, I hear you cry, what about that little asterisk that flags equal contributions? Well, sorry, but…um…nobody really takes much notice of those).

I think a better approach would be to list authors alphabetically on all papers and simply assign % contributions to different areas, such as experimental design, analysis, data collection, interpretation of results, and manuscript preparation. Some journals already do this in some form, but I would like to see this completely replace the current form of authorship.

12. Revise the peer review system

Independent peer review may be the best mechanism we currently have for triaging science, but it still sucks. For one thing, it’s usually not independent. I often get asked to review papers by scientists I know or have even worked with. I’ve even been asked to review my own papers on occasion, and was once asked to review my own grant application! (You’ll be glad to know I declined all such instances of self-review). The review process is random and noisy, and based on such a pitifully small sample of comments that the notion of it providing meaningful information is, statistically speaking, quite ridiculous.

I personally favour the idea of cutting down on the number of detailed reviewers per manuscript and instead calling on a larger number of ‘speed reviewers’, who would simply rate the paper according to various criteria, without having to write any comments. As a reviewer, I often find that I can form an opinion of an article relatively quickly – it is writing the review that takes the most time.
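The statistical case for more raters can be sketched in a few lines of Python. This assumes, purely for illustration, that reviewers rate independently on a common scale and that their ratings scatter with a standard deviation of 2 points; both the figure and the function name are hypothetical.

```python
import math


def standard_error(rating_sd: float, n_reviewers: int) -> float:
    """Standard error of the mean rating across n independent reviewers."""
    if n_reviewers < 1:
        raise ValueError("need at least one reviewer")
    return rating_sd / math.sqrt(n_reviewers)


# Hypothetical spread of 2 points between reviewers.
for n in (2, 3, 8, 16):
    print(n, "reviewers:", round(standard_error(2.0, n), 2))
```

Under these assumptions the noise in the average rating shrinks with the square root of the number of reviewers, so going from 2 to 8 raters halves it.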

Last week, Paul Knoepfler wrote a provocative blog post proposing an innovation in peer review in which authors review the reviewers. Could this help improve quality of reviews? Unfortunately, I don’t think Paul’s system would work (see my comment on his post here), but perhaps some kind of independent meta-review of reviewers could also be a good idea in a limited number of cases. 

What do you think? Got better ideas? Please leave any comments below. 

** Update 18/7/12, 14:30: On the issue of the gender imbalance in academic fraud, Mark Baxter has kindly reminded me of this case involving Karen M. Ruggiero.