The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities.
Evaluation of OpenAI’s GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters, finds that model performance and calibration both improve with scale but remain poor in absolute terms.
For this paper, I worked on curating MinuteMysteriesQA, a dataset of short “Whodunnit?” crime stories designed to evaluate how well models perform complex reasoning. The dataset has been accepted to BIG-bench, and the BIG-bench paper is currently available on arXiv. The dataset can be found here.
- Srivastava, A. et al. (2022). Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv, abs/2206.04615.