A Decade of Data Misconceptions: DuckDB Exposes the "Big Data" Illusion

May 25, 2025, 20:42 · Technology


Say "big data" and, for many people, the first things that come to mind are distributed systems, Hadoop, data warehouses, and elaborate data-lake architectures. For more than a decade these terms have shaped how we think about data processing and steered the industry's infrastructure choices. Yet DuckDB puts it bluntly: we have been fooled by "big data" for a full ten years.

So where did things go wrong? Why was what was really a small-data era misread as a "big data" battlefield? In a post titled "The Lost Decade of Small Data?" (https://duckdb.org/2025/05/19/the-lost-decade-of-small-data.html), the official DuckDB blog raises exactly this bold question and reflects on it at length.

The full article follows, inviting you to revisit that decade-long journey through the "data-scale illusion".


TL;DR: We benchmark DuckDB on a 2012 MacBook Pro to decide: did we lose a decade chasing distributed architectures for data analytics?


Much has been said, not in the very least by ourselves, about how data is actually not that "Big" and how the speed of hardware innovation is outpacing the growth of useful datasets. We may have gone so far as to predict a data singularity in the near future, where 99% of useful datasets can be comfortably queried on a single node. As recently shown, the median scan in Amazon Redshift and Snowflake reads a doable 100 MB of data, and the 99.9-percentile reads less than 300 GB. So the singularity might be closer than we think.


But we started wondering, when did this development really start? When did personal computers like the ubiquitous MacBook Pro, usually condemned to running Chrome, become the data processing powerhouses that they really are today?


Let's turn our attention to the 2012 Retina MacBook Pro, a computer many people (myself included) bought at the time because of its gorgeous “Retina” display. Millions were sold. Despite being unemployed at the time, I had even splurged for the 16 GB RAM upgrade. But there was another often-forgotten revolutionary change in this machine: it was the first MacBook with a built-in Solid-State Disk (SSD) and a competitive 4-core 2.6 GHz “Core i7” CPU. It's funny to watch the announcement again, where they do stress the performance aspect of the “all-flash architecture” as well.



Side note: the MacBook Air was actually the first MacBook with an (optional) built-in SSD already back in 2008. But it did not have the CPU firepower of the Pro, sadly.


Coincidentally, I still have this laptop in the DuckDB Labs office, currently used by my kids to type their names in a massive font size or watch Bluey on YouTube when they're around. But can this relic still run modern-day DuckDB? How will its performance compare to modern MacBooks? And could we have had the data revolution that we are seeing now already back in 2012? Let's find out!


Software

First, what about the operating system? In order to make the comparison fair(er) to the decades, we actually downgraded the operating system on the Retina to OS X 10.8.5 “Mountain Lion”, the operating system version that shipped just a few weeks after the laptop itself in July 2012. Even though the Retina can actually run 10.15 (Catalina), we felt a true 2012 comparison should also use an operating system from the era. Below is a screenshot of the user interface for those of us who sometimes feel a little old.


[Screenshot: the OS X 10.8.5 "Mountain Lion" user interface]

Moving on to DuckDB itself: here at DuckDB we are more than a little religious about portability and dependencies – or rather the lack thereof. This means that very little had to happen to make DuckDB run on the ancient Mountain Lion: the stock DuckDB binary is built by default with backwards-compatibility down to OS X 11.0 (Big Sur), but simply changing that flag and recompiling turned out to be enough to make DuckDB 1.2.2 run on Mountain Lion. We would have loved to also use a 2012 compiler to build DuckDB, but, alas, C++11 was unsurprisingly simply too new in 2012 to be fully supported by compilers. Either way, the binary runs fine and could also have been produced by working around the compiler bugs. Or we could have just hand-coded Assembly like others have done.


Benchmarks

But we're not interested in synthetic CPU scores, we're interested in synthetic SQL scores instead! To see how the old machine is holding up when performing serious data crunching, we used the by now rather tired but well-known TPC-H benchmark at scale factor 1000. This means that the two main tables, lineitem and orders, contain 6 and 1.5 billion rows, respectively. When stored as a DuckDB database, the database has a size of ca. 265 GB.

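For reference, the dataset can be generated with DuckDB itself via its tpch extension. The sketch below is illustrative only: the database file name is a placeholder, and it uses a small scale factor, since producing the full SF1000 file in one go takes far more time, memory, and disk than this example suggests.

    import duckdb

    # Open (or create) a DuckDB database file; "tpch.db" is just a placeholder name.
    con = duckdb.connect("tpch.db")

    # The official tpch extension bundles the TPC-H data generator.
    con.execute("INSTALL tpch")
    con.execute("LOAD tpch")

    # Generate the benchmark tables; sf=1 (~1 GB) keeps the example small.
    # The article's database used sf=1000, roughly 265 GB on disk.
    con.execute("CALL dbgen(sf=1)")

    # Sanity-check the row counts of the two largest tables.
    print(con.execute("SELECT count(*) FROM lineitem").fetchone())
    print(con.execute("SELECT count(*) FROM orders").fetchone())
    con.close()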

From the audited results on the TPC website, we can see that running the benchmark on this scale factor on a single node seems to require hardware costing hundreds of thousands of Dollars.


We ran each of the 22 benchmark queries five times, and took the median runtime to remove noise. However, because the amount of RAM (16 GB) is very much smaller than the database size (265 GB), no significant amount of the input data can be cached in the buffer manager, so those are not really what people sometimes call "hot" runs.

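As a rough illustration of that measurement procedure, the sketch below runs each of the 22 queries five times and keeps the median. It assumes the tpch extension's PRAGMA tpch(N) shortcut for executing query N and a locally available database file; both the file name and the harness itself are ours, not the authors'.

    import duckdb
    import statistics
    import time

    # Open the pre-built TPC-H database read-only; the file name is a placeholder.
    con = duckdb.connect("tpch-sf1000.db", read_only=True)
    con.execute("LOAD tpch")  # provides PRAGMA tpch(N), i.e. the 22 benchmark queries

    RUNS = 5
    medians = {}
    for q in range(1, 23):  # TPC-H queries 1..22
        timings = []
        for _ in range(RUNS):
            start = time.perf_counter()
            con.execute(f"PRAGMA tpch({q})").fetchall()
            timings.append(time.perf_counter() - start)
        medians[q] = statistics.median(timings)
        print(f"Q{q:02d}: {medians[q]:.1f} s")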


Below are the per-query results in seconds:


TPC-H query    time (s)
 1                142.2
 2                 23.2
 3                262.7
 4                167.5
 5                185.9
 6                127.7
 7                278.3
 8                248.4
 9                675.0
10               1266.1
11                 33.4
12                161.7
13                384.7
14                215.9
15                197.6
16                100.7
17                243.7
18               2076.1
19                283.9
20                200.1
21               1011.9
22                 57.7

But what do those cold numbers actually mean? The hidden sensation is that we have numbers at all: this old computer could actually complete all benchmark queries using DuckDB! If we look at the timings a bit more closely, we see the queries take anywhere between a minute and half an hour. Those are not unreasonable waiting times for analytical queries on that sort of data in any way. Heck, you would have been waiting way longer back in 2012 for Hadoop YARN to pick up your job in the first place, only to spew stack traces at you at some point.


Improvements

But how do those results stack up against a modern MacBook? As a comparison point, we used a modern ARM-based M3 Max MacBook Pro, which happened to be sitting on the same desk. But between them, the two MacBooks represent more than a decade of hardware development.


Looking at GeekBench 5 benchmark scores alone, we see a ca. 7× difference in raw CPU speed when using all cores, and a ca. 3× difference in single-core speed. Of course there are also big differences in RAM and SSD speeds. Funnily, the display size and resolution are almost unchanged.


Here are the results side-by-side:


TPC-H query   time_2012 (s)   time_2023 (s)   speedup
 1                142.2            19.6         7.26
 2                 23.2             2.0        11.60
 3                262.7            21.8        12.05
 4                167.5            11.1        15.09
 5                185.9            15.5        11.99
 6                127.7             6.6        19.35
 7                278.3            14.9        18.68
 8                248.4            14.5        17.13
 9                675.0            33.3        20.27
10               1266.1            23.6        53.65
11                 33.4             2.2        15.18
12                161.7            10.1        16.01
13                384.7            24.4        15.77
14                215.9             9.2        23.47
15                197.6             8.2        24.10
16                100.7             4.1        24.56
17                243.7            15.3        15.93
18               2076.1            47.6        43.62
19                283.9            23.1        12.29
20                200.1            10.9        18.36
21               1011.9            47.8        21.17
22                 57.7             4.3        13.42

We do see significant speedups, from 7× up to as much as 53×. The geometric mean of the timings improved from 218 to 12 seconds, a ca. 20× improvement.

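To make those summary numbers concrete, here is the arithmetic behind them, re-typed from the table above (so treat the values as illustrative):

    from statistics import geometric_mean

    time_2012 = [142.2, 23.2, 262.7, 167.5, 185.9, 127.7, 278.3, 248.4, 675.0,
                 1266.1, 33.4, 161.7, 384.7, 215.9, 197.6, 100.7, 243.7, 2076.1,
                 283.9, 200.1, 1011.9, 57.7]
    time_2023 = [19.6, 2.0, 21.8, 11.1, 15.5, 6.6, 14.9, 14.5, 33.3, 23.6, 2.2,
                 10.1, 24.4, 9.2, 8.2, 4.1, 15.3, 47.6, 23.1, 10.9, 47.8, 4.3]

    # Per-query speedups range from roughly 7x (Q1) to about 54x (Q10).
    speedups = [old / new for old, new in zip(time_2012, time_2023)]
    print(min(speedups), max(speedups))

    # Geometric means of the two runs; these land close to the ~218 s and ~12 s
    # figures quoted above.
    print(geometric_mean(time_2012), geometric_mean(time_2023))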

Reproducibility

The binary, scripts, queries, and results are available on GitHub for inspection. We also made the TPC-H SF1000 database file available for download so you don't have to generate it. But be warned, it's a large file.


  • https://github.com/hannes/old-macbook-tpch
  • http://blobs.duckdb.org/data/tpch-sf1000.db
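
If you do download the database file, opening it from Python and running an ad-hoc aggregation against the 6-billion-row lineitem table is straightforward; a minimal sketch, assuming the file has been saved locally under its original name:

    import duckdb

    # Open the downloaded SF1000 database read-only; adjust the path as needed.
    con = duckdb.connect("tpch-sf1000.db", read_only=True)

    # A plain aggregation over the 6-billion-row lineitem table.
    print(con.execute("""
        SELECT l_returnflag, count(*) AS row_count, sum(l_extendedprice) AS revenue
        FROM lineitem
        GROUP BY l_returnflag
        ORDER BY l_returnflag
    """).fetchall())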

Discussion

We have seen how the decade-old MacBook Pro Retina has been able to complete a complex analytical benchmark. A newer laptop was able to significantly improve on those times. But absolute speedup numbers are a bit pointless here. The difference is purely quantitative, not qualitative.


From a user perspective, it matters much more that those queries complete in somewhat reasonable time, not if it took 10 or 100 seconds to do so. We can tackle almost the same kind of data problems with both laptops, we just have to be willing to wait a little longer. This is especially true given DuckDB's out-of-core capability, which allows it to spill query intermediates to disks if required.

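In DuckDB this spilling behaviour is governed by a couple of settings: a memory limit, and a directory for the spill files. A minimal sketch follows (the limit and the path here are arbitrary examples, not recommendations):

    import duckdb

    con = duckdb.connect("tpch-sf1000.db", read_only=True)

    # Cap DuckDB's memory use; intermediates that no longer fit are spilled to disk.
    con.execute("SET memory_limit = '12GB'")
    # Directory used for the spill files (arbitrary example path).
    con.execute("SET temp_directory = '/tmp/duckdb_spill'")

    # A memory-hungry aggregation over lineitem can now run out-of-core.
    con.execute("""
        SELECT l_orderkey, sum(l_extendedprice * (1 - l_discount)) AS revenue
        FROM lineitem
        GROUP BY l_orderkey
        ORDER BY revenue DESC
        LIMIT 10
    """).fetchall()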

What is perhaps more interesting is that back in 2012, it would have been completely feasible to have a single-node SQL engine like DuckDB that could run complex analytical SQL queries against a database of 6 billion rows in manageable time – and we did not even have to immerse it in dry ice this time.


History is full of "what if"s: what if something like DuckDB had existed in 2012? The main ingredients were there; vectorized query processing had already been invented in 2005. Would the now somewhat-silly-looking move to distributed systems for data analysis have ever happened? The dataset size of our benchmark database was awfully close to the 99.9-percentile of input data volume for analytical queries in 2024. And while the Retina MacBook Pro was a high-end machine in 2012, by 2014 many other vendors had shifted to offering laptops with built-in SSD storage, and larger amounts of memory became more widespread.


So, yes, we really did lose a full decade.

