A Decade of Data Misconceptions: DuckDB Dispels the "Big Data" Illusion

2025-05-25 20:42:20 · Tech

Mention "Big Data" and the first things that come to most minds are distributed systems, Hadoop, data warehouses, and elaborate data-lake architectures. For more than a decade these buzzwords have shaped how we think about data processing and steered the industry's infrastructure choices. Yet DuckDB puts it bluntly: "Big Data" has been fooling us for a full ten years.

So where did things go wrong? Why was what was really a small-data era misread as a "Big Data" battlefield? In a post titled "The Lost Decade of Small Data?" (https://duckdb.org/2025/05/19/the-lost-decade-of-small-data.html), the official DuckDB blog raises exactly this bold question and reflects on it.

The post follows below, inviting us to revisit that decade-long journey through the "data-scale illusion".


TL;DR: We benchmark DuckDB on a 2012 MacBook Pro to decide: did we lose a decade chasing distributed architectures for data analytics?

Much has been said, not in the very least by ourselves, about how data is actually not that "Big" and how the speed of hardware innovation is outpacing the growth of useful datasets. We may have gone so far as to predict a data singularity in the near future, where 99% of useful datasets can be comfortably queried on a single node. As recently shown, the median scan in Amazon Redshift and Snowflake reads a doable 100 MB of data, and the 99.9-percentile reads less than 300 GB. So the singularity might be closer than we think.

But we started wondering, when did this development really start? When did personal computers like the ubiquitous MacBook Pro, usually condemned to running Chrome, become the data processing powerhouses that they really are today?

Let's turn our attention to the 2012 Retina MacBook Pro, a computer many people (myself included) bought at the time because of its gorgeous "Retina" display. Millions were sold. Despite being unemployed at the time, I had even splurged for the 16 GB RAM upgrade. But there was another often-forgotten revolutionary change in this machine: it was the first MacBook with a built-in Solid-State Disk (SSD) and a competitive 4-core 2.6 GHz "Core i7" CPU. It's funny to watch the announcement again, where they do stress the performance aspect of the "all-flash architecture" as well.

Side note: the MacBook Air was actually the first MacBook with an (optional) built-in SSD already back in 2008. But it did not have the CPU firepower of the Pro, sadly.

Coincidentally, I still have this laptop in the DuckDB Labs office, currently used by my kids to type their names in a massive font size or watch Bluey on YouTube when they're around. But can this relic still run modern-day DuckDB? How will its performance compare to modern MacBooks? And could we have had the data revolution that we are seeing now already back in 2012? Let's find out!

Software

First, what about the operating system? In order to make the comparison fair(er) to the decades, we actually downgraded the operating system on the Retina to OS X 10.8.5 "Mountain Lion", the operating system version that shipped just a few weeks after the laptop itself in July 2012. Even though the Retina can actually run 10.15 (Catalina), we felt a true 2012 comparison should also use an operating system from the era. Below is a screenshot of the user interface for those of us who sometimes feel a little old.

[Screenshot: the OS X 10.8 "Mountain Lion" user interface]

Moving on to DuckDB itself: here at DuckDB we are more than a little religious about portability and dependencies – or rather the lack thereof. This means that very little had to happen to make DuckDB run on the ancient Mountain Lion: the stock DuckDB binary is built by default with backwards-compatibility to OS X 11.0 (Big Sur), but simply changing the flag and recompiling turned out to be enough to make DuckDB 1.2.2 run on Mountain Lion. We would have loved to also use a 2012 compiler to build DuckDB, but, alas, C++ 11 was unsurprisingly simply too new in 2012 to be fully supported by compilers. Either way, the binary runs fine and could also have been produced by working around the compiler bugs. Or we could have just hand-coded Assembly like others have done.

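The post does not show the exact build invocation. As a rough sketch, lowering the minimum targeted OS version in a CMake-based DuckDB checkout could look like the following; CMAKE_OSX_DEPLOYMENT_TARGET is the standard CMake variable for this, but the flag the authors actually changed may be plumbed differently:

```shell
# Hypothetical sketch: rebuild DuckDB targeting an older macOS.
# CMAKE_OSX_DEPLOYMENT_TARGET is the generic CMake variable for the
# minimum supported OS version; the post does not name the exact flag.
git clone https://github.com/duckdb/duckdb
cd duckdb
cmake -B build -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_OSX_DEPLOYMENT_TARGET=10.8
cmake --build build --parallel
```
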
Benchmarks

But we're not interested in synthetic CPU scores, we're interested in synthetic SQL scores instead! To see how the old machine is holding up when performing serious data crunching, we used the, at this point rather tired but well-known, TPC-H benchmark at scale factor 1000. This means that the two main tables, lineitem and orders, contain 6 and 1.5 billion rows, respectively. When stored as a DuckDB database, the database has a size of ca. 265 GB.

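For context, the 22 queries are fixed by the TPC-H specification. Query 6, one of the simplest, is a single filtered aggregation that still has to churn through the 6-billion-row lineitem table; with the spec's standard substitution parameters it reads:

```sql
-- TPC-H Q6: revenue change from discounted 1994 shipments
SELECT sum(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= DATE '1994-01-01'
  AND l_shipdate < DATE '1995-01-01'
  AND l_discount BETWEEN 0.05 AND 0.07
  AND l_quantity < 24;
```
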
From the audited results on the TPC website, we can see that running the benchmark at this scale factor on a single node seems to require hardware costing hundreds of thousands of dollars.

We ran each of the 22 benchmark queries five times and took the median runtime to remove noise. However, because the amount of RAM (16 GB) is much smaller than the database size (265 GB), no significant amount of the input data can be cached in the buffer manager, so these are not really what people sometimes call "hot" runs.

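The measurement procedure is simple enough to sketch in a few lines of Python. Note that median_runtime and the stand-in workload below are our own illustrative names, not part of the actual benchmark scripts:

```python
import time
from statistics import median

def median_runtime(run_query, repeats=5):
    """Time `run_query` `repeats` times and return the median wall-clock
    duration in seconds, suppressing outliers the same way the benchmark
    above does (median of five runs)."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # in the benchmark, this would execute one TPC-H query
        times.append(time.perf_counter() - start)
    return median(times)

# Stand-in workload; a real run would fire a SQL query at the database
print(f"median runtime: {median_runtime(lambda: sum(range(100_000))):.4f} s")
```
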
Below are the per-query results in seconds:

| TPC-H query | time (s) |
|---|---|
| 1 | 142.2 |
| 2 | 23.2 |
| 3 | 262.7 |
| 4 | 167.5 |
| 5 | 185.9 |
| 6 | 127.7 |
| 7 | 278.3 |
| 8 | 248.4 |
| 9 | 675.0 |
| 10 | 1266.1 |
| 11 | 33.4 |
| 12 | 161.7 |
| 13 | 384.7 |
| 14 | 215.9 |
| 15 | 197.6 |
| 16 | 100.7 |
| 17 | 243.7 |
| 18 | 2076.1 |
| 19 | 283.9 |
| 20 | 200.1 |
| 21 | 1011.9 |
| 22 | 57.7 |

But what do those cold numbers actually mean? The hidden sensation is that we have numbers at all: this old computer could actually complete all benchmark queries using DuckDB! If we look at the times a bit closer, we see the queries take anywhere between a minute and half an hour. Those are not unreasonable waiting times for analytical queries on that sort of data. Heck, back in 2012 you would have waited far longer just for Hadoop YARN to pick up your job in the first place, only for it to spew stack traces at you at some point.

Improvements

But how do those results stack up against a modern MacBook? As a comparison point, we used a modern ARM-based M3 Max MacBook Pro, which happened to be sitting on the same desk. But between them, the two MacBooks represent more than a decade of hardware development.

Looking at GeekBench 5 benchmark scores alone, we see a ca. 7× difference in raw CPU speed when using all cores, and a ca. 3× difference in single-core speed. Of course there are also big differences in RAM and SSD speeds. Funnily, the display size and resolution are almost unchanged.

Here are the results side-by-side:

| TPC-H query | time 2012 (s) | time 2023 (s) | speedup |
|---|---|---|---|
| 1 | 142.2 | 19.6 | 7.26 |
| 2 | 23.2 | 2.0 | 11.60 |
| 3 | 262.7 | 21.8 | 12.05 |
| 4 | 167.5 | 11.1 | 15.09 |
| 5 | 185.9 | 15.5 | 11.99 |
| 6 | 127.7 | 6.6 | 19.35 |
| 7 | 278.3 | 14.9 | 18.68 |
| 8 | 248.4 | 14.5 | 17.13 |
| 9 | 675.0 | 33.3 | 20.27 |
| 10 | 1266.1 | 23.6 | 53.65 |
| 11 | 33.4 | 2.2 | 15.18 |
| 12 | 161.7 | 10.1 | 16.01 |
| 13 | 384.7 | 24.4 | 15.77 |
| 14 | 215.9 | 9.2 | 23.47 |
| 15 | 197.6 | 8.2 | 24.10 |
| 16 | 100.7 | 4.1 | 24.56 |
| 17 | 243.7 | 15.3 | 15.93 |
| 18 | 2076.1 | 47.6 | 43.62 |
| 19 | 283.9 | 23.1 | 12.29 |
| 20 | 200.1 | 10.9 | 18.36 |
| 21 | 1011.9 | 47.8 | 21.17 |
| 22 | 57.7 | 4.3 | 13.42 |

We do see significant speedups, from 7× up to as much as 53×. The geometric mean of the timings improved from 218 to 12 seconds, a ca. 20× improvement.

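The geometric mean is the right aggregate here because it weighs the per-query ratios evenly; conveniently, the ratio of the two geometric means equals the geometric mean of the per-query speedups. As a quick sanity check (ours, not part of the original post), recomputing from the tables above:

```python
from statistics import geometric_mean

# Median per-query runtimes in seconds, copied from the tables above
t_2012 = [142.2, 23.2, 262.7, 167.5, 185.9, 127.7, 278.3, 248.4, 675.0,
          1266.1, 33.4, 161.7, 384.7, 215.9, 197.6, 100.7, 243.7, 2076.1,
          283.9, 200.1, 1011.9, 57.7]
t_2023 = [19.6, 2.0, 21.8, 11.1, 15.5, 6.6, 14.9, 14.5, 33.3, 23.6, 2.2,
          10.1, 24.4, 9.2, 8.2, 4.1, 15.3, 47.6, 23.1, 10.9, 47.8, 4.3]

gm_2012 = geometric_mean(t_2012)  # ca. 218 s
gm_2023 = geometric_mean(t_2023)  # ca. 12 s
print(f"{gm_2012:.0f} s -> {gm_2023:.0f} s ({gm_2012 / gm_2023:.1f}x)")
```
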
Reproducibility

The binary, scripts, queries, and results are available on GitHub for inspection. We also made the TPC-H SF1000 database file available for download so you don't have to generate it. But be warned, it's a large file.

  • https://github.com/hannes/old-macbook-tpch
  • http://blobs.duckdb.org/data/tpch-sf1000.db

Discussion

We have seen how the decade-old MacBook Pro Retina has been able to complete a complex analytical benchmark. A newer laptop was able to significantly improve on those times. But absolute speedup numbers are a bit pointless here. The difference is purely quantitative, not qualitative.

From a user perspective, it matters much more that those queries complete in somewhat reasonable time than whether doing so took 10 or 100 seconds. We can tackle almost the same kind of data problems with both laptops; we just have to be willing to wait a little longer. This is especially true given DuckDB's out-of-core capability, which allows it to spill query intermediates to disk if required.

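For reference, DuckDB exposes this behavior through a pair of settings; memory_limit and temp_directory are real DuckDB configuration options, though the values below are only illustrative:

```sql
-- Cap the buffer manager so large intermediates spill instead of failing
SET memory_limit = '8GB';
-- Directory where spilled query intermediates are written
SET temp_directory = '/tmp/duckdb_spill';
```
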
What is perhaps more interesting is that back in 2012, it would have been completely feasible to have a single-node SQL engine like DuckDB that could run complex analytical SQL queries against a database of 6 billion rows in manageable time – and we did not even have to immerse it in dry ice this time.

History is full of "what if"s. What if something like DuckDB had existed in 2012? The main ingredients were there; vectorized query processing had already been invented in 2005. Would the now somewhat-silly-looking move to distributed systems for data analysis ever have happened? The dataset size of our benchmark database was awfully close to the 99.9-percentile of input data volume for analytical queries in 2024. And while the Retina MacBook Pro was a high-end machine in 2012, by 2014 many other vendors had shifted to offering laptops with built-in SSD storage, and larger amounts of memory became more widespread.

So, yes, we really did lose a full decade.
