In search of lost infinities
What is the "n" in big data?
Stephen Senn, Consultant Statistician, Edinburgh
In designing complex experiments, agricultural scientists, with the help of their statistician collaborators, soon came to
realise that variation at different levels had very different consequences for estimating different treatment effects,
depending on how the treatments were mapped onto the underlying block structure. This was a key feature of
the Rothamsted approach to design and analysis and a strong thread running through the work of Fisher, Yates and Nelder,
being expressed in topics such as split-plot designs, recovering inter-block information and fractional factorials.
The null block-structure of an experiment is key to this philosophy of design and analysis. However, modern techniques for
analysing experiments stress models rather than symmetries and this approach requires much greater care in analysis with the
consequence that you can easily make mistakes and often will.
In this talk I shall underline the obvious, but often overlooked, fact that understanding
variation at the various levels at which it occurs is crucial to analysis. I shall take three examples: an application
of John Nelder's theory of general balance to Lord's Paradox, the use of historical data in drug development, and a hybrid
randomised/non-randomised clinical trial, the TARGET study. These show that data that many, including those promoting
a so-called causal revolution, assume to be 'big' may actually be rather 'small'. The consequence is that there is a danger
that the size of standard errors will be underestimated or even that the appropriate regression coefficients for adjusting for
confounding may not be identified correctly.
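As a hypothetical illustration of this danger (not an example from the talk itself), the following sketch simulates data with variation at two levels: 10,000 observations drawn from only 20 clusters (think centres, batches or historical studies). A naive standard error that treats every observation as independent is far smaller than one that respects the cluster level of variation; the 'big' n of 10,000 behaves much more like the 'small' n of 20. All numbers here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented example: 20 clusters, 500 observations each.
k, m = 20, 500                # clusters, observations per cluster
sigma_b, sigma_w = 1.0, 1.0   # between- and within-cluster SDs

cluster_means = rng.normal(0.0, sigma_b, size=k)
y = np.repeat(cluster_means, m) + rng.normal(0.0, sigma_w, size=k * m)

# Naive SE: treats all k*m = 10,000 values as independent.
naive_se = y.std(ddof=1) / np.sqrt(k * m)

# Cluster-aware SE: the overall mean is really a mean of 20 cluster
# means, so its variance is dominated by sigma_b**2 / k, however
# large m becomes.
means = y.reshape(k, m).mean(axis=1)
cluster_se = means.std(ddof=1) / np.sqrt(k)

print(f"naive SE:         {naive_se:.4f}")
print(f"cluster-aware SE: {cluster_se:.4f}")
```

With these settings the naive standard error is more than an order of magnitude too small: increasing m without increasing k buys almost nothing, because the between-cluster component of variation never shrinks.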
I conclude that an old but powerful experimental design approach holds important lessons for observational data
about limitations in interpretation that mere numbers cannot overcome. Small may be beautiful, after all.