Columnar Storage
Published: 2019-06-19


You’re going to hear a lot about columnar storage formats in the next few months, as a variety of distributed execution engines are beginning to consider them for their IO efficiency, and the optimisations that they open up for query execution. In this post, I’ll explain why we care so much about IO efficiency and show how columnar storage – which is a simple idea – can drastically improve performance for certain workloads.

 

Disks are still the biggest bottleneck for queries over big data.

So the biggest problem that disk-based query engines need to solve is still how to read data from disk more efficiently.
One approach is to improve disk seek times and raw throughput, but that is hard and progress is slow.
More promising is to make sure that, as far as possible, only useful data is read from disk, i.e. to reduce the amount of disk IO a query needs.

Disks are still the major bottleneck in query execution over large datasets. 

Therefore, the best thing an engineer can do to improve the performance of disk-based query engines (like RDBMSs and Impala) is usually to improve the performance of reading bytes from disk.

1. The traditional way to improve disk bandwidth has been to wait, and allow disks to get faster

However, disks are not getting faster very quickly (having settled at roughly 100 MB/s, with ~12 disks per server), and SSDs can’t yet achieve the storage density to be directly competitive with HDDs on a per-server basis.

 

2. The other way to improve disk performance is to maximise the ratio of ‘useful’ bytes read to total bytes read. The idea is not to read more data than is absolutely necessary to serve a query, so the useful bandwidth realised is increased without actually improving the performance of the IO subsystem.
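As a back-of-the-envelope illustration (the numbers here are assumed for the example, not taken from the post): if a table has 20 equally sized columns and a query needs only 2 of them, a format that must read whole rows delivers at most 2/20 = 10% useful bytes, whereas a format that can read just those 2 columns approaches 100% useful bytes for the same query.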

The rest of this post follows that second idea: how do we maximise the proportion of useful data among the bytes read from disk?

Columns VS Rows

The traditional row-based storage layout

Traditional database file formats store data in rows, where each row consists of a contiguous collection of column values.

On disk, that looks roughly like the following:

This row-major layout usually has a header for each row that describes, for example, which columns in the row are NULL. Each column value is then stored contiguously after the header, followed by another row with its own header, and so on.
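To make the layout concrete, here is a minimal Python sketch of such a row-major file. The 1-byte NULL-bitmap header, int32-only values, and the name pack_row are assumptions made for illustration, not details from the post.

import struct

# Hypothetical row-major layout: a 1-byte NULL bitmap header per row,
# followed by the non-NULL int32 column values packed contiguously.
def pack_row(values):
    null_bitmap = 0
    for i, v in enumerate(values):
        if v is None:
            null_bitmap |= 1 << i
    payload = b"".join(struct.pack("<i", v) for v in values if v is not None)
    return struct.pack("<B", null_bitmap) + payload

# Rows are written one after another; any scan walks all of these bytes.
rows = [(1, 10, None), (2, None, 30), (3, 33, 35)]
file_bytes = b"".join(pack_row(row) for row in rows)
print(len(file_bytes))  # 1 header byte per row + 4 bytes per non-NULL value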

 

A row-major layout is a good fit when queries touch entire rows. The simplest example is:

SELECT * FROM table

which requires returning every single column of every single row in the table. A query of the form

SELECT <col_set> FROM table WHERE <predicate_set>

is also well served by a row-major format when:

a) the set of rows returned by the evaluation of the predicates over the table is a large proportion of the total set of rows (i.e. the selectivity is high), and

b) either the predicate_set requires reading a large subset of the set of columns, or col_set is a large subset of the set of columns (i.e. the projectivity is high).

 

More simply, a query is going to be efficient if it requires reading most of the columns of most of the rows. 

In these cases, row-major formats allow the query execution engine to achieve good IO efficiency. 
To summarise simply: with a row-based layout, the whole row has to be read no matter how many columns are actually needed.
So if a query really does need most of the columns (whether in the SELECT list or the WHERE clause), a row-major format is the efficient choice.

 

The column-based storage layout

However, there is a general consensus that these SELECT * kinds of queries are not representative of typical analytical workloads; instead either a large number of columns are not projected, or they are projected only for a small subset of rows where only a few columns are required to decide which rows to return.

Coupled with a general trend towards very wide tables with high column counts, the total number of bytes that are required to satisfy a query are often a relatively small fraction of the size on disk of the target table. In these cases, row-major formats often are quite wasteful in the amount of IO they require to execute a query.

In practice, however, many workloads only need a small fraction of the columns, and a row-major layout is then very inefficient.

The natural idea follows: if only a few columns are needed, it makes more sense to store and read each column separately.

Instead of a format that makes it efficient to read entire rows, it’s advantageous for analytical workloads to make it efficient to read entire columns at once. 

Based on our understanding of what makes disks efficient, we can see that the obvious approach is to store column values densely and contiguously on disk. 
This is the basic idea behind columnar file formats. The following diagram shows what this looks like on disk:
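In code, the same idea looks roughly like the minimal sketch below; the int32 values, absence of NULLs, and the name column_blobs are assumptions made for illustration.

import struct

# Minimal sketch of a column-major layout: each column's values are packed
# densely and contiguously, so scanning one column only touches its bytes.
rows = [(1, 10, 5), (2, 20, 30), (3, 33, 35)]
columns = list(zip(*rows))  # transpose rows into per-column tuples

column_blobs = [b"".join(struct.pack("<i", v) for v in col) for col in columns]

# Reading a single column means reading one contiguous blob, not every row.
blob_b = column_blobs[1]
col_b = [struct.unpack_from("<i", blob_b, offset)[0]
         for offset in range(0, len(blob_b), 4)]
print(col_b)  # [10, 20, 33]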

 

QUERY EXECUTION

The diagrams below show what a simple query plan for SELECT col_b FROM table WHERE col_a > 5 looks like in each case.

Reading from a traditional row-major file format

With a row-major layout, every full row has to be read to answer this query, which is clearly inefficient when rows contain many columns.

Reading from columnar storage

 

This, then, is the basic idea of columnar storage: we recognise that analytical workloads rarely require full scans of all table data, but do often require full scans of a small subset of the columns, and so we arrange to make column scans cheap at the expense of extra cost reading individual rows.
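A rough sketch of that trade-off for the query above; the table shape (1,000,000 rows, 50 columns, 4-byte values) is assumed purely for illustration, and only the relative byte counts matter.

# Assumed table shape for the comparison.
NUM_ROWS, NUM_COLS, VALUE_SIZE = 1_000_000, 50, 4

# Row-major: evaluating the predicate on col_a means reading every row,
# and each row drags along all 50 column values.
row_major_bytes = NUM_ROWS * NUM_COLS * VALUE_SIZE

# Columnar: read col_a in full for the predicate, then col_b for the
# projection (pessimistically also in full, ignoring any row skipping).
columnar_bytes = 2 * NUM_ROWS * VALUE_SIZE

print(row_major_bytes / columnar_bytes)  # 25.0x fewer bytes read in this scenario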

 

THE COST OF COLUMNAR

Is this a free lunch? Should every analytical database go out and change every file format to be column-major? 

Obviously the story is more complicated than that. There are some query archetypes that suffer when data is stored in a columnar format.

There is no free lunch: in the big-data world, no single design fits every workload.

The main costs of the columnar approach are as follows:

1. It is expensive to reassemble a row, since the separate values that comprise it are spread far across the disk. Every column included in a projection implies an extra disk seek, and this can add up when the projectivity of a query is high. So columnar formats really only suit queries that read a small number of columns; otherwise the cost of reassembling rows becomes high.

2. Row-major formats have good in-memory spatial locality, and there are common operations that benefit enormously from this. For example, adding two columns together is straightforward with a row-major layout, while the columnar logic is considerably more involved.

For cases like these, systems such as HBase adopt a column-family strategy: columns that are frequently accessed together are stored together, which works very well as an optimisation (see the sketch below).
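A small sketch of both costs, using in-memory lists to stand in for on-disk column files (an assumption made for brevity; a real engine pays roughly one seek per column file touched).

# Toy columns standing in for separate on-disk column files.
col_a = [1, 2, 3]
col_b = [10, 20, 30]
col_c = [100, 200, 300]
columns = [col_a, col_b, col_c]

# 1. Reassembling row i touches every projected column once -- cheap here,
#    but on disk each access is a separate read/seek.
def reassemble_row(i):
    return tuple(col[i] for col in columns)

print(reassemble_row(1))  # (2, 20, 200)

# 2. Adding two columns: in a row-major layout both operands of each addition
#    sit side by side within one row; here they come from two separate regions,
#    which is why co-accessed columns are sometimes grouped together
#    (e.g. HBase column families) to recover spatial locality.
sums = [a + b for a, b in zip(col_a, col_b)]
print(sums)  # [11, 22, 33]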

This article is reproduced from 博客园 (cnblogs); the original was published on 2013-03-26.

Repost source: http://mlovx.baihongyu.com/
