Look, Google's latest paper - Beyond The Imitation Game: Quantifying And Extrapolating The Capabilities Of Language Models.

2025/05/02 04:06:34

Baijiao, reporting from Aofei Temple
QbitAI | Official account QbitAI

One AI paper, 442 authors.

It even has a chapter dedicated to author contributions.

About half of its 100 pages are references...

Wait, is this the kind of paper that is popular now?

Look, Google's latest paper - Beyond The Imitation Game: Quantifying And Extrapolating The Capabilities Of Language Models.

So the author list ended up looking like this...

Researchers from 132 institutions spent two years developing BIG-bench, a new benchmark for large language models.

On this basis, they evaluated OpenAI's GPT models, Google-internal dense Transformer architectures, and others, with model sizes spanning six orders of magnitude.

The final results show that although model performance improves as scale increases, it is still far from human performance.

Jeff Dean retweeted and liked this work: Great Work.

A new benchmark for large language models

Let's take a look at what exactly the paper says.

As models scale up, their performance and quality have improved. This may bring transformative impacts, but these capabilities have not been well characterized before.

Some existing benchmarks have limitations: their evaluation scope is relatively narrow, and performance scores saturate quickly.

For example, on SuperGLUE, models achieved "superhuman" performance within 18 months of the benchmark's launch.

BIG-bench was born against this background.

It currently consists of 204 tasks, covering topics such as linguistics, child development, mathematics, common sense reasoning, biology, physics, social bias, software development, and more.

In addition, a panel of human expert raters performed all the tasks to provide a baseline.

To make evaluation easier for more institutions, the researchers also released BIG-bench Lite, a small but representative subset of tasks that allows faster evaluation.

They also open-sourced the code implementing the benchmark API, which supports task evaluation on publicly available models and lightweight creation of new tasks.

The final evaluation results show that, across model sizes spanning six orders of magnitude, overall performance on BIG-bench improves as model scale increases and as the number of training samples increases.

Compared with the human baseline, however, performance is still poor.
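To give a rough feel for what "lightweight creation of new tasks" and task evaluation can look like, here is a minimal sketch in Python. It is not the BIG-bench API: the task dictionary is only loosely modeled on the JSON task layout in the BIG-bench repository, and the field names, the `multiple_choice_accuracy` helper, and the `always_first` stand-in model are all assumptions made up for illustration. Check the repository documentation for the actual schema.

```python
from typing import Callable, Dict, List

# Hypothetical toy task. The layout is loosely modeled on BIG-bench's JSON
# tasks; the exact field names are an assumption, not the official schema.
TASK = {
    "name": "toy_arithmetic",
    "description": "Answer simple addition questions.",
    "keywords": ["arithmetic", "multiple choice"],
    "metrics": ["multiple_choice_grade"],
    "examples": [
        {"input": "What is 2 + 3?", "target_scores": {"5": 1, "6": 0}},
        {"input": "What is 4 + 4?", "target_scores": {"7": 0, "8": 1}},
        {"input": "What is 1 + 9?", "target_scores": {"10": 1, "11": 0}},
    ],
}

# A "model" here is any function that scores each candidate answer.
Scorer = Callable[[str, List[str]], Dict[str, float]]


def multiple_choice_accuracy(task: dict, scorer: Scorer) -> float:
    """Fraction of examples where the top-scored choice is marked correct."""
    correct = 0
    for example in task["examples"]:
        choices = list(example["target_scores"])
        scores = scorer(example["input"], choices)
        best = max(choices, key=lambda c: scores[c])
        if example["target_scores"][best] == 1:
            correct += 1
    return correct / len(task["examples"])


def always_first(prompt: str, choices: List[str]) -> Dict[str, float]:
    """Stand-in model (hypothetical): always prefers the first choice."""
    return {c: (1.0 if i == 0 else 0.0) for i, c in enumerate(choices)}


if __name__ == "__main__":
    print(multiple_choice_accuracy(TASK, always_first))
```

In this sketch a new task is just data, so adding one means writing a few examples. A real model would replace `always_first` with a function that returns, for instance, the log-probability it assigns to each choice.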

On some tasks, model performance improves steadily as scale increases. On others, breakthroughs appear suddenly at a specific scale.

The benchmark can also be used to evaluate social biases present in a model.

The researchers also unexpectedly found that models can pick up some hidden skills, for example how to make legal moves in chess.

14 pages of author contributions

It is worth mentioning that, perhaps because there are so many authors, the paper ends with a dedicated chapter on author contributions.

It runs to a full 14 pages, covering core contributors, reviewers, task providers...

The rest is 50 pages of references.

That's all. Interested readers can follow the links below to take a look at the paper.

Paper link:
https://arxiv.org/abs/2206.04615
GitHub link:
https://github.com/google/BIG-bench
Reference link:
https://twitter.com/jaschasd/status/1535055886913220608

— End —
