A Decade of Data Misconceptions: DuckDB Unmasks the "Big Data" Illusion

Mention "big data" and, for many people, the first things that come to mind are: distributed systems, Hadoop, data warehouses, and complex data-lake architectures. For more than a decade these terms have shaped how we think about data processing and have steered the industry's infrastructure choices. Yet DuckDB puts it bluntly: we have been fooled by "big data" for an entire decade.

So where did things go wrong? Why was an era of small data misjudged as a "big data" battlefield? In a post titled "The Lost Decade of Small Data?" (https://duckdb.org/2025/05/19/the-lost-decade-of-small-data.html), the official DuckDB blog raises bold questions and offers a pointed reflection.

What follows is a translation of that article, inviting you to revisit that decade-long journey through the "data-scale illusion".


TL;DR: We benchmark DuckDB on a 2012 MacBook Pro to decide: did we lose a decade chasing distributed architectures for data analytics?

Much has been said, not in the very least by ourselves, about how data is actually not that "Big" and how the speed of hardware innovation is outpacing the growth of useful datasets. We may have gone so far as to predict a data singularity in the near future, where 99% of useful datasets can be comfortably queried on a single node. As recently shown, the median scan in Amazon Redshift and Snowflake reads a doable 100 MB of data, and the 99.9-percentile reads less than 300 GB. So the singularity might be closer than we think.

But we started wondering, when did this development really start? When did personal computers like the ubiquitous MacBook Pro, usually condemned to running Chrome, become the data processing powerhouses that they really are today?

Let's turn our attention to the 2012 Retina MacBook Pro, a computer many people (myself included) bought at the time because of its gorgeous "Retina" display. Millions were sold. Despite being unemployed at the time, I had even splurged for the 16 GB RAM upgrade. But there was another often-forgotten revolutionary change in this machine: it was the first MacBook with a built-in Solid-State Disk (SSD) and a competitive 4-core 2.6 GHz "Core i7" CPU. It's funny to watch the announcement again, where they do stress the performance aspect of the "all-flash architecture" as well.

Side note: the MacBook Air was actually the first MacBook with an (optional) built-in SSD, already back in 2008. But it did not have the CPU firepower of the Pro, sadly.

Coincidentally, I still have this laptop in the DuckDB Labs office, currently used by my kids to type their names in a massive font size or to watch Bluey on YouTube when they're around. But can this relic still run modern-day DuckDB? How will its performance compare to modern MacBooks? And could we have had the data revolution that we are seeing now already back in 2012? Let's find out!

Software

First, what about the operating system? In order to make the comparison fair(er) to the decades, we actually downgraded the operating system on the Retina to OS X 10.8.5 "Mountain Lion", the operating system version that shipped just a few weeks after the laptop itself in July 2012. Even though the Retina can actually run 10.15 (Catalina), we felt a true 2012 comparison should also use an operating system from the era. Below is a screenshot of the user interface for those of us who sometimes feel a little old.

Moving on to DuckDB itself: here at DuckDB we are more than a little religious about portability and dependencies – or rather the lack thereof. This means that very little had to happen to make DuckDB run on the ancient Mountain Lion: the stock DuckDB binary is built by default with backwards-compatibility down to OS X 11.0 (Big Sur), but simply changing that flag and recompiling turned out to be enough to make DuckDB 1.2.2 run on Mountain Lion. We would have loved to also use a 2012 compiler to build DuckDB, but, alas, C++11 was unsurprisingly simply too new in 2012 to be fully supported by compilers. Either way, the binary runs fine and could also have been produced by working around the compiler bugs. Or we could have just hand-coded Assembly like others have done.
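
A trivial way to verify such a rebuilt binary is to start the shell and ask for its release; this is just a minimal smoke test using the built-in version() function, not part of the benchmark itself:

```sql
-- Minimal smoke test for the recompiled binary on Mountain Lion:
-- if the shell starts and this prints v1.2.2, the build works.
SELECT version();
```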

Benchmarks

But we're not interested in synthetic CPU scores, we're interested in synthetic SQL scores instead! To see how the old machine holds up when performing serious data crunching, we used the at-this-point rather tired but well-known TPC-H benchmark at scale factor 1000. This means that the two main tables, lineitem and orders, contain 6 and 1.5 billion rows, respectively. When stored as a DuckDB database, the database has a size of ca. 265 GB.
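
For anyone wanting to build such a database themselves, DuckDB's tpch extension can generate the schema and data at a chosen scale factor. A minimal sketch; note that sf = 1000 needs hundreds of gigabytes of disk and many hours, so sf = 1 is a saner first try:

```sql
INSTALL tpch;
LOAD tpch;

-- Populate the TPC-H schema at the chosen scale factor
-- (sf = 1000 yields the ~265 GB database used here; sf = 1 is a quick trial).
CALL dbgen(sf = 1000);

-- Sanity checks: row counts of the two main tables and the on-disk size.
SELECT count(*) FROM lineitem;  -- ~6 billion rows at sf = 1000
SELECT count(*) FROM orders;    -- ~1.5 billion rows at sf = 1000
PRAGMA database_size;
```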

From the audited results on the TPC website, we can see that running the benchmark at this scale factor on a single node seems to require hardware costing hundreds of thousands of dollars.

We ran each of the 22 benchmark queries five times and took the median runtime to remove noise. However, because the amount of RAM (16 GB) is very much smaller than the database size (ca. 265 GB), no significant amount of the input data can be cached in the buffer manager, so these are not really what people sometimes call "hot" runs.
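
Driving such a run from the DuckDB CLI is straightforward: the tpch extension exposes the 22 queries as PRAGMA tpch(n), and the shell's .timer directive prints per-query wall-clock times. A sketch, where the timings table holding the five repetitions is hypothetical:

```sql
-- Time a single benchmark query in the DuckDB shell.
LOAD tpch;
.timer on
PRAGMA tpch(1);  -- TPC-H query 1 against the attached SF1000 database

-- Given a (hypothetical) table timings(query, run, seconds) collected from
-- five repetitions per query, reduce it to the medians reported below.
SELECT query, median(seconds) AS median_s
FROM timings
GROUP BY query
ORDER BY query;
```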

Below are the per-query results in seconds:

TPC-H query    time (s)
1                 142.2
2                  23.2
3                 262.7
4                 167.5
5                 185.9
6                 127.7
7                 278.3
8                 248.4
9                 675.0
10               1266.1
11                 33.4
12                161.7
13                384.7
14                215.9
15                197.6
16                100.7
17                243.7
18               2076.1
19                283.9
20                200.1
21               1011.9
22                 57.7

But what do those cold numbers actually mean? The hidden sensation is that we have numbers at all: this old computer could actually complete all benchmark queries using DuckDB! If we look at the timings a bit more closely, we see the queries take anywhere between a minute and half an hour. Those are not unreasonable waiting times for analytical queries on that sort of data in any way. Heck, back in 2012 you would have been waiting way longer for Hadoop YARN to pick up your job in the first place, only for it to spew stack traces at you at some point.

Improvements

But how do those results stack up against a modern MacBook? As a comparison point, we used a modern ARM-based M3 Max MacBook Pro, which happened to be sitting on the same desk. Between them, the two MacBooks represent more than a decade of hardware development.

Looking at GeekBench 5 benchmark scores alone, we see a ca. 7× difference in raw CPU speed when using all cores, and a ca. 3× difference in single-core speed. Of course, there are also big differences in RAM and SSD speeds. Funnily, the display size and resolution are almost unchanged.

Here are the results side-by-side:

TPC-H query    time_2012 (s)    time_2023 (s)    speedup
1                  142.2             19.6           7.26
2                   23.2              2.0          11.60
3                  262.7             21.8          12.05
4                  167.5             11.1          15.09
5                  185.9             15.5          11.99
6                  127.7              6.6          19.35
7                  278.3             14.9          18.68
8                  248.4             14.5          17.13
9                  675.0             33.3          20.27
10                1266.1             23.6          53.65
11                  33.4              2.2          15.18
12                 161.7             10.1          16.01
13                 384.7             24.4          15.77
14                 215.9              9.2          23.47
15                 197.6              8.2          24.10
16                 100.7              4.1          24.56
17                 243.7             15.3          15.93
18                2076.1             47.6          43.62
19                 283.9             23.1          12.29
20                 200.1             10.9          18.36
21                1011.9             47.8          21.17
22                  57.7              4.3          13.42

We do see significant speedups, from 7× up to as much as 53×. The geometric mean of the timings improved from 218 to 12 seconds, a ca. 20× improvement.
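
DuckDB has no dedicated geometric-mean aggregate, but these figures can be reproduced with exp(avg(ln(x))). A sketch, assuming the side-by-side table above has been loaded as a table named results(query, t2012, t2023):

```sql
-- Geometric means of both timing columns and of the per-query speedups.
-- Assumes the table above was loaded as results(query, t2012, t2023).
SELECT
    exp(avg(ln(t2012)))         AS geomean_2012_s,  -- ca. 218
    exp(avg(ln(t2023)))         AS geomean_2023_s,  -- ca. 12
    exp(avg(ln(t2012 / t2023))) AS geomean_speedup  -- ca. 20x
FROM results;
```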

Reproducibility

The binary, scripts, queries, and results are available on GitHub for inspection. We also made the TPC-H SF1000 database file available for download so you don't have to generate it. But be warned, it's a large file. A minimal sketch for querying the downloaded file follows the links below.

  • https://github.com/hannes/old-macbook-tpch
  • http://blobs.duckdb.org/data/tpch-sf1000.db
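
A minimal sketch for querying the downloaded file, assuming tpch-sf1000.db sits in the current directory; ATTACH ... (READ_ONLY) and the tpch extension's PRAGMA are standard DuckDB features:

```sql
-- Open the downloaded database file read-only and run one benchmark query.
ATTACH 'tpch-sf1000.db' AS sf1000 (READ_ONLY);
USE sf1000;

INSTALL tpch;
LOAD tpch;
PRAGMA tpch(6);  -- TPC-H query 6 against the SF1000 data
```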

Discussion

We have seen how the decade-old MacBook Pro Retina was able to complete a complex analytical benchmark. A newer laptop was able to significantly improve on those times. But absolute speedup numbers are a bit pointless here: the difference is purely quantitative, not qualitative.

From a user perspective, it matters much more that those queries complete in somewhat reasonable time than whether that took 10 or 100 seconds. We can tackle almost the same kind of data problems with both laptops, we just have to be willing to wait a little longer. This is especially true given DuckDB's out-of-core capability, which allows it to spill query intermediates to disk if required.
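
The spilling behaviour can be exercised deliberately by capping DuckDB's memory and pointing it at a scratch directory; memory_limit and temp_directory are real DuckDB settings, though the values below are merely illustrative:

```sql
-- Cap the buffer pool at roughly the 2012 machine's RAM and allow
-- query intermediates to spill to disk once they no longer fit.
SET memory_limit = '16GB';
SET temp_directory = '/tmp/duckdb_spill';  -- illustrative path

PRAGMA tpch(18);  -- a heavy aggregation/join that may exceed RAM
```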

What is perhaps more interesting is that back in 2012, it would have been completely feasible to have a single-node SQL engine like DuckDB that could run complex analytical SQL queries against a database of 6 billion rows in manageable time – and we did not even have to immerse it in dry ice this time.

History is full of "what if"s. What if something like DuckDB had existed in 2012? The main ingredients were there: vectorized query processing had already been invented in 2005. Would the now somewhat-silly-looking move to distributed systems for data analysis ever have happened? The dataset size of our benchmark database was awfully close to the 99.9th percentile of input data volume for analytical queries in 2024. And while the Retina MacBook Pro was a high-end machine in 2012, by 2014 many other vendors had shifted to offering laptops with built-in SSD storage, and larger amounts of memory became more widespread.

So, yes, we really did lose a full decade.