
ClickHouse Data Dictionaries

2021-01-18 09:29:24 Author: Internet

7. Data Dictionaries

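A data dictionary is a storage medium provided by ClickHouse that defines data as a mapping from keys to attributes.

Dictionaries stay resident in memory, which makes them well suited to constants or frequently used dimension-table data (this reduces JOIN queries).

Dictionaries fall into two categories: built-in dictionaries and extended dictionaries (called external dictionaries in the ClickHouse configuration).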

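7.1 Built-in Dictionaries

Built-in dictionaries ship with ClickHouse. Currently there is only one, the Yandex.Metrica dictionary, which provides fast access to geo (geographic region) data.

Built-in dictionaries are disabled by default.

To enable them, set the following in config.xml: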
	 <path_to_regions_hierarchy_file>/opt/geo/regions_hierarchy.txt</path_to_regions_hierarchy_file>
     <path_to_regions_names_files>/opt/geo/</path_to_regions_names_files> 

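Loading is lazy: the dictionary is loaded only when it is first queried.
The geo data that populates the Yandex.Metrica dictionary is made up of the two models configured above.

1. path_to_regions_hierarchy_file

path_to_regions_hierarchy_file corresponds to the main table of the region data. It consists of one regions_hierarchy.txt file and any number of regions_hierarchy_[name].txt region hierarchy files, where [name] is a region identifier, similar to an i18n locale.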

Name	Type	Required	Description
Region ID	UInt32	Yes	Region ID
Parent Region ID	UInt32	Yes	ID of the parent region
Region Type	UInt8	Yes	Region type: 1 = continent, 3 = country, 4 = federal district, 5 = region, 6 = city
Population	UInt32	No	Population
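
To make the hierarchy table concrete, here is a minimal Python sketch of how such rows can be walked from a city up to its country. It assumes the file is tab-separated with no header and the four columns listed above; the sample rows are hypothetical, not real Yandex.Metrica data, and the function names are made up for illustration.

```python
import csv
import io

# Hypothetical sample rows (not real geo data):
# region_id, parent_region_id, region_type, population
SAMPLE = "1\t0\t1\t0\n2\t1\t3\t1000000\n3\t2\t6\t50000\n"

COUNTRY = 3  # region type code for "country" in the table above


def load_hierarchy(text):
    """Parse tab-separated rows into {region_id: (parent_id, type, population)}."""
    regions = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue
        region_id, parent_id, region_type = int(row[0]), int(row[1]), int(row[2])
        # Population is optional, so tolerate a missing or empty fourth column.
        population = int(row[3]) if len(row) > 3 and row[3] else 0
        regions[region_id] = (parent_id, region_type, population)
    return regions


def region_to_country(regions, region_id):
    """Walk up the parent chain until a country is found; 0 means not found."""
    seen = set()  # guards against cycles in malformed data
    while region_id in regions and region_id not in seen:
        seen.add(region_id)
        parent_id, region_type, _ = regions[region_id]
        if region_type == COUNTRY:
            return region_id
        region_id = parent_id
    return 0


regions = load_hierarchy(SAMPLE)
print(region_to_country(regions, 3))  # city 3 -> country 2
```

The parent chain is what makes the hierarchy file the "main table": every lookup about a region's country or continent reduces to following Parent Region ID links.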

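2. path_to_regions_names_files

path_to_regions_names_files corresponds to the dimension data of the regions: it records the region name for each region ID. The names are stored in six regions_names_[name].txt files, where [name] is a region identifier matching regions_hierarchy_[name].txt. The identifiers are ru, en, ua, by, kz and tr. All six must be defined, because the first load reads the files for all six identifiers at once. If any one is missing, the built-in dictionary throws an exception and fails to initialize.

Name	Type	Required	Description
Region ID	UInt32	Yes	Region ID
Region Name	String	Yes	Region name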

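7.2 Extended Dictionaries

Extended dictionaries are user-defined dictionaries, configured by the user and implemented as plug-ins. They currently support 7 memory layouts and 4 kinds of data source.

7.2.1 Preparing the Dictionary Data

The examples use the following source files: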

organization.csv

Organization data, used for the flat, hashed, cache, complex_key_hashed and complex_key_cache dictionaries (columns: id, code, name)

1,"a0001","研发部"
2,"a0002","产品部"
3,"a0003","数据部"
4,"a0004","测试部"
5,"a0005","运维部"
6,"a0006","规划部"

asn.csv

ASN data, used to demonstrate the ip_trie dictionary

"82.118.230.0/24","AS42831","GB"
"148.163.0.0/17","AS53755","US"
"178.93.0.0/18","AS6849","UA"
"200.69.95.0/24","AS262186","CO"
"154.9.160.0/20","AS174","US"

sales.csv

Sales data, used to demonstrate the range_hashed dictionary (columns: id, start, end, price)

1,2016-01-01,2017-01-10,100
2,2016-05-01,2017-07-01,200
3,2014-03-05,2018-01-20,300
4,2018-08-01,2019-10-01,400
5,2017-03-01,2017-06-01,500
6,2017-04-09,2018-05-30,600
7,2018-06-01,2019-01-25,700
8,2019-08-01,2019-12-12,800

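7.2.2 Elements of an Extended Dictionary Configuration File

The extended dictionary configuration files are specified by the dictionaries_config setting in config.xml: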

	<!-- Configuration of external dictionaries. See:
   		  https://clickhouse.yandex/docs/en/dicts/external_dicts/
	-->
	<dictionaries_config>*_dictionary.xml</dictionaries_config>

vim /etc/clickhouse-server/test_dictionary.xml
<dictionaries>
    <dictionary>
        <name>test</name>
        <source>
        <odbc>
            <connection_string>DSN=pg;UID=;PWD=;HOST=;PORT=5432;DATABASE=</connection_string>
            <table>product</table>
        </odbc>
        </source>
        <lifetime>
            <min>300</min>
            <max>360</max>
        </lifetime>
        <layout>
            <hashed/>
        </layout>
        <structure>
            <id>
                <name>id</name>
            </id>
            <attribute>
                <name>del_flag</name>
                <type>UInt64</type>
                <null_value>0</null_value>
            </attribute>
        </structure>
    </dictionary>
</dictionaries>

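The dictionary element has five child elements:

1. name: the name of the dictionary. It uniquely identifies the dictionary and must be globally unique; no two dictionaries may share a name.

2. structure: the data structure of the dictionary.

3. layout: the type (memory layout) of the dictionary.

4. source: the data source of the dictionary.

5. lifetime: the update interval of the dictionary.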
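
As a quick sanity check before deploying a configuration file, the five child elements can be verified with a few lines of Python. This is only a local helper sketch (the function name `missing_elements` is made up here) and says nothing about how ClickHouse itself validates the file.

```python
import xml.etree.ElementTree as ET

# The five child elements listed above.
REQUIRED = ("name", "structure", "layout", "source", "lifetime")

CONFIG = """
<dictionaries>
    <dictionary>
        <name>test</name>
        <source>
            <file>
                <path>/home/clickhouse/organization.csv</path>
                <format>CSV</format>
            </file>
        </source>
        <lifetime><min>300</min><max>360</max></lifetime>
        <layout><hashed/></layout>
        <structure>
            <id><name>id</name></id>
            <attribute>
                <name>code</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
        </structure>
    </dictionary>
</dictionaries>
"""


def missing_elements(xml_text):
    """Return {dictionary name: [missing child elements]} for each <dictionary>."""
    report = {}
    for d in ET.fromstring(xml_text).findall("dictionary"):
        name = d.findtext("name", default="<unnamed>")
        report[name] = [tag for tag in REQUIRED if d.find(tag) is None]
    return report


print(missing_elements(CONFIG))  # {'test': []}
```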

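7.2.3 Data Structure of an Extended Dictionary

The data structure is defined by the structure element and has two parts: the key (which identifies each record) and the attributes (the data fields).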

<structure>
    <!-- key definition: use <id> or <key> -->
    <id>
        <name>id</name>
    </id>
    <!-- attribute (field) definition -->
    <attribute>
        <name>del_flag</name>
        <type>UInt64</type>
        <null_value>0</null_value>
    </attribute>
</structure>

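1. key

Defines the key of the dictionary. Every dictionary must have a key, which identifies each record, much like a primary key in a database table.

Keys are either numeric or composite.

Numeric: a numeric key is a UInt64 integer, defined with the id element. It is supported by the flat, hashed, range_hashed and cache dictionary types.

Composite: a composite key is a Tuple, defined with the key element, and may consist of one or more fields, like a composite primary key in a database. It is supported only by the complex_key_hashed, complex_key_cache and ip_trie dictionary types.

2. attribute

Defines the attribute fields of the dictionary. A dictionary can have one or more attributes.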

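Setting	Required	Default	Description
name	Yes	-	Field name
type	Yes	-	Field type
null_value	Yes	-	Default value returned when the queried key has no matching element
expression	No	no expression	An expression; may call functions or use operators
hierarchical	No	false	Whether the attribute supports a hierarchy
injective	No	false	Whether the injective-mapping optimization is enabled
is_object_id	No	false	Whether the MongoDB optimization is enabled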
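
The role of null_value is easiest to see in a lookup that misses. The following Python sketch models a dictionary as a plain mapping and returns the attribute's null_value when the key is absent. The rows and the `dict_get` helper are illustrative only and are not ClickHouse's implementation.

```python
# Illustrative records: id -> (code, name).
ROWS = {1: ("a0001", "研发部"), 2: ("a0002", "产品部")}

# Position of each attribute inside a record.
COLUMNS = {"code": 0, "name": 1}

# Per-attribute null_value, as it would be declared in <structure>.
NULL_VALUES = {"code": "", "name": ""}


def dict_get(attr, key):
    """Return the attribute for key, or the attribute's null_value if absent."""
    row = ROWS.get(key)
    if row is None:
        return NULL_VALUES[attr]
    return row[COLUMNS[attr]]


print(dict_get("name", 1))         # 研发部
print(repr(dict_get("name", 99)))  # '' because key 99 does not exist
```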

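7.2.4 Types of Extended Dictionaries

The type of an extended dictionary is defined by the layout element; 7 types are currently supported. The dictionary type determines how its data is laid out in memory and which kind of key it supports.

By key type, they fall into two groups:

Single numeric key: flat, hashed, range_hashed and cache;

Composite key: complex_key_hashed, complex_key_cache and ip_trie.

7.2.4.1 flat

flat is the fastest dictionary type. It accepts only a UInt64 numeric key and stores data in an array with an initial size of 1024 and an upper limit of 500000 elements. Because the key is used as the array index, the dictionary cannot hold keys beyond this limit; if the data exceeds the limit when the dictionary is created, creation fails.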

cat /etc/clickhouse-server/test_dictionary.xml 

<?xml version="1.0"?>
<dictionaries>
    <dictionary>
        <name>test_flat_dict</name>
        <!-- data source -->
        <source>
            <file>
                <path>/home/clickhouse/organization.csv</path>
                <format>CSV</format>
            </file>
        </source>

        <!-- dictionary type -->
        <layout>
            <flat/>
        </layout>

        <!-- structure matching the source data -->
        <structure>
            <id>
                <name>id</name>
            </id>
            
            <attribute>
                <name>code</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>

            <attribute>
                <name>name</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
        </structure>

        <lifetime>
            <min>300</min>
            <max>360</max>
        </lifetime>
    </dictionary>
</dictionaries>

select * from system.dictionaries\G

Row 1:
──────
database:                    
name:                        test_flat_dict
uuid:                        00000000-0000-0000-0000-000000000000
status:                      NOT_LOADED
origin:                      /etc/clickhouse-server/test_dictionary.xml
type:                        
key:                         
attribute.names:             []
attribute.types:             []
bytes_allocated:             0
query_count:                 0
hit_rate:                    0
element_count:               0
load_factor:                 0
source:                      
lifetime_min:                0
lifetime_max:                0
loading_start_time:          1970-01-01 08:00:00
last_successful_update_time: 1970-01-01 08:00:00
loading_duration:            0
last_exception:              

1 rows in set. Elapsed: 0.005 sec. 



SELECT dictGet('test_flat_dict', 'name', toUInt64(1));

┌─dictGet('test_flat_dict', 'name', toUInt64(1))─┐
│ 研发部                                         │
└────────────────────────────────────────────────┘


select * from system.dictionaries\G

Row 1:
──────
database:                    
name:                        test_flat_dict
uuid:                        00000000-0000-0000-0000-000000000000
status:                      LOADED
origin:                      /etc/clickhouse-server/test_dictionary.xml
type:                        Flat
key:                         UInt64
attribute.names:             ['code','name']
attribute.types:             ['String','String']
bytes_allocated:             41328
query_count:                 2
hit_rate:                    1
element_count:               6
load_factor:                 0.005859375
source:                      File: /home/clickhouse/organization.csv CSV
lifetime_min:                300
lifetime_max:                360
loading_start_time:          2020-12-24 13:58:29
last_successful_update_time: 2020-12-24 13:58:29
loading_duration:            0.001
last_exception:              

1 rows in set. Elapsed: 0.007 sec.
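
The array-based storage can be pictured with a small Python model. It is a sketch under stated assumptions: the initial size (1024) and the limit (500000) come from the text, while the doubling growth policy and the class name `FlatDict` are assumptions made for illustration. Note that 6 loaded elements in 1024 slots gives 6 / 1024 = 0.005859375, which matches the load_factor in the output above.

```python
INITIAL_SIZE = 1024   # initial array size stated in the text
MAX_SIZE = 500000     # upper limit stated in the text


class FlatDict:
    """Toy flat layout: the key is used directly as an array index."""

    def __init__(self, null_value=""):
        self.null_value = null_value
        self.slots = [None] * INITIAL_SIZE

    def load(self, rows):
        for key, value in rows:
            if key >= MAX_SIZE:
                raise ValueError(f"key {key} exceeds flat limit {MAX_SIZE}")
            while key >= len(self.slots):
                # Grow by doubling, capped at the limit (growth policy is assumed).
                new_size = min(len(self.slots) * 2, MAX_SIZE)
                self.slots.extend([None] * (new_size - len(self.slots)))
            self.slots[key] = value

    def get(self, key):
        if 0 <= key < len(self.slots) and self.slots[key] is not None:
            return self.slots[key]
        return self.null_value

    def load_factor(self):
        used = sum(1 for slot in self.slots if slot is not None)
        return used / len(self.slots)


d = FlatDict()
d.load([(i, f"a000{i}") for i in range(1, 7)])
print(d.get(1))          # a0001
print(d.load_factor())   # 0.005859375
```

Lookups are a single index operation, which is why this layout is fast, and also why large or sparse keys are a poor fit for it.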

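7.2.4.2 hashed

The hashed dictionary differs from flat in its storage: flat stores data in an array, whereas hashed uses a hash table and has no upper limit on size.

The following is a configuration example for a hashed dictionary: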

<?xml version="1.0"?>
<dictionaries>
    <dictionary>
        <name>test_hashed_dict</name>
        <!-- data source -->
        <source>
            <file>
                <path>/home/clickhouse/organization.csv</path>
                <format>CSV</format>
            </file>
        </source>

        <!-- dictionary type: the only part that differs -->
        <layout>
            <hashed/>
        </layout>

        <!-- structure matching the source data -->
        <structure>
            <id>
                <name>id</name>
            </id>

            <attribute>
                <name>code</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>

            <attribute>
                <name>name</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
        </structure>

        <lifetime>
            <min>300</min>
            <max>360</max>
        </lifetime>
    </dictionary>
</dictionaries>

SELECT dictGet('test_hashed_dict', 'name', toUInt64(1));

┌─dictGet('test_hashed_dict', 'name', toUInt64(1))─┐
│ 研发部                                           │
└──────────────────────────────────────────────────┘



select * from system.dictionaries\G

Row 1:
──────
database:                    
name:                        test_hashed_dict
uuid:                        00000000-0000-0000-0000-000000000000
status:                      LOADED
origin:                      /etc/clickhouse-server/test_dictionary.xml
type:                        Hashed
key:                         UInt64
attribute.names:             ['code','name']
attribute.types:             ['String','String']
bytes_allocated:             20880
query_count:                 1
hit_rate:                    1
element_count:               12
load_factor:                 0.046875
source:                      File: /home/clickhouse/organization.csv CSV
lifetime_min:                300
lifetime_max:                360
loading_start_time:          2020-12-24 14:05:29
last_successful_update_time: 2020-12-24 14:05:29
loading_duration:            0
last_exception:              

1 rows in set. Elapsed: 0.005 sec.

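7.2.4.3 range_hashed

range_hashed extends the hashed dictionary with time ranges: data is stored in a hash table and sorted by time. The range is specified with the range_min and range_max elements, and the fields they refer to must be of type Date or DateTime.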

<?xml version="1.0"?>
<dictionaries>
    <dictionary>
        <name>test_range_hashed_dict</name>
        <!-- data source -->
        <source>
            <file>
                <path>/home/clickhouse/sales.csv</path>
                <format>CSV</format>
            </file>
        </source>

        <!-- dictionary type: changed for this example -->
        <layout>
            <range_hashed/>
        </layout>

        <!-- structure matching the source data -->
        <structure>
            <id>
                <name>id</name>
            </id>

            <range_min>
                <name>start</name>
                <!-- if type is not specified, Date is used by default -->
            </range_min>

            <range_max>
                <name>end</name>
            </range_max>

            <attribute>
                <name>price</name>
                <type>Float32</type>
                <null_value></null_value>
            </attribute>
        </structure>

        <lifetime>
            <min>300</min>
            <max>360</max>
        </lifetime>
    </dictionary>
</dictionaries>


select name,type,key,attribute.names,attribute.types from system.dictionaries;

┌─name───────────────────┬─type────────┬─key────┬─attribute.names─┬─attribute.types─┐
│ test_range_hashed_dict │ RangeHashed │ UInt64 │ ['price']       │ ['Float32']     │
└────────────────────────┴─────────────┴────────┴─────────────────┴─────────────────┘

SELECT *
FROM system.dictionaries

Row 1:
──────
database:                    
name:                        test_range_hashed_dict
uuid:                        00000000-0000-0000-0000-000000000000
status:                      LOADED
origin:                      /etc/clickhouse-server/test_dictionary.xml
type:                        RangeHashed
key:                         UInt64
attribute.names:             ['price']
attribute.types:             ['Float32']
bytes_allocated:             8352
query_count:                 0
hit_rate:                    1
element_count:               8
load_factor:                 0.03125
source:                      File: /home/clickhouse/sales.csv CSV
lifetime_min:                300
lifetime_max:                360
loading_start_time:          2020-12-24 15:12:42
last_successful_update_time: 2020-12-24 15:12:42
loading_duration:            0
last_exception:              

1 rows in set. Elapsed: 0.006 sec.
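
The section shows no lookup against the range dictionary, so here is a hedged Python sketch of the idea: a value is selected by id plus a date that must fall inside the row's [start, end] interval. The rows follow the id, start, end, price structure configured above with illustrative values; inclusive bounds and the helper names are assumptions. In ClickHouse itself the corresponding call passes the date as an extra argument, along the lines of dictGet('test_range_hashed_dict', 'price', toUInt64(1), toDate('2016-06-01')).

```python
from datetime import date

# Illustrative rows: id, start, end, price.
ROWS = [
    (1, date(2016, 1, 1), date(2017, 1, 10), 100.0),
    (2, date(2016, 5, 1), date(2017, 7, 1), 200.0),
    (3, date(2014, 3, 5), date(2018, 1, 20), 300.0),
]


def build(rows):
    """Group (start, end, price) intervals by id, sorted by start date."""
    index = {}
    for key, start, end, price in rows:
        index.setdefault(key, []).append((start, end, price))
    for intervals in index.values():
        intervals.sort()
    return index


def range_get(index, key, day, null_value=0.0):
    """Return the price whose [start, end] interval contains day."""
    for start, end, price in index.get(key, []):
        if start <= day <= end:  # inclusive bounds are an assumption
            return price
    return null_value


index = build(ROWS)
print(range_get(index, 1, date(2016, 6, 1)))  # 100.0
print(range_get(index, 1, date(2018, 1, 1)))  # 0.0, outside the interval
```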

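7.2.4.4 cache

A cache dictionary keeps its data in memory in a fixed-length array of cells. The length is a power of two and is rounded up automatically. Unlike the other dictionaries, it does not load all data into memory at once on the first query; instead, entries are loaded on demand as individual keys are queried. Its performance is therefore the least stable of all types and depends entirely on the cache hit rate (hit rate = number of hits / number of queries).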
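
A toy model helps show why the hit rate dominates performance. The sketch below is not ClickHouse's implementation: the cell placement (key modulo size) and the class name `CacheDict` are assumptions, and a plain Python dict stands in for the real data source. The parameter name `size_in_cells` follows the setting used to size a cache layout.

```python
def round_up_to_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    size = 1
    while size < n:
        size *= 2
    return size


class CacheDict:
    """Toy cache layout: fixed number of cells, filled from the source on a miss."""

    def __init__(self, source, size_in_cells):
        self.source = source  # stands in for the real data source
        self.size = round_up_to_power_of_two(size_in_cells)
        self.cells = [None] * self.size  # each cell holds (key, value) or None
        self.queries = 0
        self.hits = 0

    def get(self, key, null_value=""):
        self.queries += 1
        cell = key % self.size  # placement policy is an assumption
        entry = self.cells[cell]
        if entry is not None and entry[0] == key:
            self.hits += 1
            return entry[1]
        value = self.source.get(key, null_value)  # miss: go back to the source
        self.cells[cell] = (key, value)           # may evict another key
        return value

    @property
    def hit_rate(self):
        return self.hits / self.queries if self.queries else 0.0


cache = CacheDict({1: "研发部", 2: "产品部"}, size_in_cells=3)
for key in (1, 1, 2, 1):
    cache.get(key)
print(cache.size)      # 4, since 3 is rounded up to a power of two
print(cache.hit_rate)  # 0.5: two hits out of four queries
```

Every miss costs a round trip to the source, so a workload with many distinct keys and a small cache degrades toward querying the source directly.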

7.2.4.5 complex_key_hashed

This dictionary type is functionally identical to hashed, except that the single numeric key is replaced by a composite key.

<?xml version="1.0"?>
<dictionaries>
    <dictionary>
        <name>test_complex_hashed_dict</name>
        <!-- data source -->
        <source>
            <file>
                <path>/home/clickhouse/organization.csv</path>
                <format>CSV</format>
            </file>
        </source>

        <!-- dictionary type: changed for this example -->
        <layout>
            <complex_key_hashed/>
        </layout>

        <!-- structure matching the source data -->
        <structure>
            <key>
                <attribute>
                    <name>id</name>
                    <type>UInt64</type>
                </attribute>
                <attribute>
                    <name>code</name>
                    <type>String</type>
                </attribute>
            </key>

            <attribute>
                <name>name</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
        </structure>

        <lifetime>
            <min>300</min>
            <max>360</max>
        </lifetime>
    </dictionary>
</dictionaries>


SELECT dictGet('test_complex_hashed_dict', 'name', (toUInt64(1), 'a0001'))

┌─dictGet('test_complex_hashed_dict', 'name', tuple(toUInt64(1), 'a0001'))─┐
│ 研发部                                                                   │
└──────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.006 sec. 



SELECT *
FROM system.dictionaries

Row 1:
──────
database:                    
name:                        test_complex_hashed_dict
uuid:                        00000000-0000-0000-0000-000000000000
status:                      LOADED
origin:                      /etc/clickhouse-server/test_dictionary.xml
type:                        ComplexKeyHashed
key:                         (UInt64, String)
attribute.names:             ['name']
attribute.types:             ['String']
bytes_allocated:             18728
query_count:                 1
hit_rate:                    1
element_count:               6
load_factor:                 0.0234375
source:                      File: /home/clickhouse/organization.csv CSV
lifetime_min:                300
lifetime_max:                360
loading_start_time:          2020-12-24 14:37:35
last_successful_update_time: 2020-12-24 14:37:35
loading_duration:            0.002
last_exception:              

1 rows in set. Elapsed: 0.007 sec.

7.2.4.6 complex_key_cache

Built on the cache dictionary, with the single numeric key replaced by a composite key.

7.2.4.7 ip_trie

Designed specifically for IP prefix lookups.

<?xml version="1.0"?>
<dictionaries>
    <dictionary>
        <name>test_ip_trie_dict</name>
        <!-- data source -->
        <source>
            <file>
                <path>/home/clickhouse/asn.csv</path>
                <format>CSV</format>
            </file>
        </source>

        <!-- dictionary type -->
        <layout>
            <ip_trie/>
        </layout>

        <!-- structure matching the source data -->
        <structure>
            <key>
                <attribute>
                    <name>prefix</name>
                    <type>String</type>
                </attribute>
            </key>

            <attribute>
                <name>asn</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>

            <attribute>
                <name>country</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
        </structure>

        <lifetime>
            <min>300</min>
            <max>360</max>
        </lifetime>
    </dictionary>
</dictionaries>


SELECT dictGet('test_ip_trie_dict', 'country', tuple(IPv4StringToNum('148.163.0.0')))

┌─dictGet('test_ip_trie_dict', 'country', tuple(IPv4StringToNum('148.163.0.0')))─┐
│ US                                                                             │
└────────────────────────────────────────────────────────────────────────────────┘
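
An IP prefix lookup means finding the most specific network that contains a given address. The Python sketch below reproduces that behaviour on the asn.csv rows with the standard ipaddress module. It uses a linear scan rather than a real trie, so it illustrates the result of the lookup, not the data structure; the `lookup` helper is made up for this example.

```python
import ipaddress

# Rows mirror asn.csv: prefix, asn, country.
ROWS = [
    ("82.118.230.0/24", "AS42831", "GB"),
    ("148.163.0.0/17", "AS53755", "US"),
    ("178.93.0.0/18", "AS6849", "UA"),
    ("200.69.95.0/24", "AS262186", "CO"),
    ("154.9.160.0/20", "AS174", "US"),
]

NETWORKS = [(ipaddress.ip_network(p), asn, c) for p, asn, c in ROWS]


def lookup(ip, null_value=("", "")):
    """Return (asn, country) of the longest prefix containing ip."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net, asn, country in NETWORKS:
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn, country)
    return (best[1], best[2]) if best else null_value


print(lookup("148.163.0.0"))  # ('AS53755', 'US')
print(lookup("8.8.8.8"))      # ('', '') because no prefix matches
```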

Summary

Name	Storage structure	Key type	Supported data sources
flat	Array	UInt64	Local file, Executable file, HTTP, DBMS
hashed	Hash table	UInt64	Local file, Executable file, HTTP, DBMS
range_hashed	Hash table sorted by time	UInt64 plus a time range	Local file, Executable file, HTTP, DBMS
complex_key_hashed	Hash table	Composite key	Local file, Executable file, HTTP, DBMS
ip_trie	Hierarchical structure (trie)	Composite key (a single String)	Local file, Executable file, HTTP, DBMS
cache	Fixed-size array	UInt64	Executable file, HTTP, ClickHouse, MySQL
complex_key_cache	Fixed-size array	Composite key	Executable file, HTTP, ClickHouse, MySQL

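7.2.5 Data Sources for Extended Dictionaries

File sources

Local file: defined with the file element. path gives the absolute path and format gives the data format (CSV or TabSeparated).

Executable file: defined with the executable element. command gives the command to run, with an absolute path, and format gives the data format.

Remote file: defined with the http element. url gives the address of the data and format gives the data format.

Database sources

MySQL: data is extracted from the specified database and used as the dictionary's source.

ClickHouse: prepare the test data in the source table, then write the configuration file.

MongoDB: when the test data is inserted, the corresponding schema is created automatically and the data is written.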

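7.2.6 Update Strategy for Extended Dictionaries

Extended dictionaries support online updates; no server restart is needed. The update frequency is set by the lifetime element in the configuration file, in seconds. It also serves as the cache expiry time.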

    <lifetime>
        <min>300</min>
        <max>360</max>
    </lifetime>

min and max set the lower and upper bounds of the update interval. ClickHouse triggers an update at some point within this range.
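
One way to picture an interval with two bounds is a delay drawn at random between them, which spreads refreshes out when many dictionaries share the same settings. The sketch below assumes a uniform draw; the function name `next_update_delay` is hypothetical and the code only models the idea.

```python
import random


def next_update_delay(min_seconds, max_seconds, rng=random):
    """Pick the delay before the next refresh, uniformly within [min, max]."""
    if min_seconds > max_seconds:
        raise ValueError("min must not exceed max")
    return rng.uniform(min_seconds, max_seconds)


delay = next_update_delay(300, 360)
print(f"next refresh in {delay:.1f} s")  # some value between 300 and 360
```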

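File sources

Whether an update actually reloads the data depends on a value called previous, which is compared between checks. For file sources, the previous value comes from the file's modification time on the file system.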

stat test_dictionary.xml 
  File: ‘test_dictionary.xml’
  Size: 1077      	Blocks: 8          IO Block: 4096   regular file
Device: 803h/2051d	Inode: 202620427   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-12-24 15:26:21.089166098 +0800
Modify: 2020-12-24 15:26:16.121165975 +0800
Change: 2020-12-24 15:26:16.121165975 +0800

A data update is triggered only when two successive previous values differ.
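
The comparison of successive modification times can be sketched in Python as follows. The class name `FileSourceWatcher` is made up, and the temporary file stands in for a dictionary source file; this models the decision only, not ClickHouse's background process.

```python
import os
import tempfile


class FileSourceWatcher:
    """Remember the last seen modification time and report whether it changed."""

    def __init__(self, path):
        self.path = path
        self.previous = os.path.getmtime(path)

    def is_modified(self):
        current = os.path.getmtime(self.path)
        changed = current != self.previous
        self.previous = current  # the new value becomes "previous" for the next check
        return changed


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "organization.csv")
    with open(path, "w") as f:
        f.write('1,"a0001","x"\n')
    watcher = FileSourceWatcher(path)
    print(watcher.is_modified())  # False: nothing changed since construction
    # Simulate a later edit by moving the modification time forward.
    os.utime(path, (0, watcher.previous + 10))
    print(watcher.is_modified())  # True: the modification time differs
```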

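MySQL (InnoDB), ClickHouse and ODBC

For these sources, the previous value comes from the SQL statement defined in invalidate_query.

MySQL (MyISAM)

The modification time is read with the SHOW TABLE STATUS command; if the Update_time value differs between two checks, the source data is considered to have changed.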

7.2.7 Basic Operations on Extended Dictionaries

Querying metadata

select name,type,key,attribute.names,attribute.types,source from system.dictionaries;

Row 1:

name:            test_ip_trie_dict
type:            Trie
key:             (String)
attribute.names: ['asn','country']
attribute.types: ['String','String']
source:          File: /home/clickhouse/asn.csv CSV



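name:                        dictionary name
status:                      dictionary status
origin:                      configuration file the dictionary was loaded from
type:                        dictionary type
key:                         key type of the dictionary; data is located by key
attribute.names:             attribute names
attribute.types:             attribute types
bytes_allocated:             bytes of memory occupied by the loaded data
query_count:                 number of times the dictionary has been queried
hit_rate:                    hit rate of dictionary lookups
element_count:               number of rows loaded
load_factor:                 load factor of the data
source:                      data source information
last_exception:              exception information; the first thing to check when something fails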

Querying data

Dictionaries can be queried with the dictionary functions (for example dictGet) or through system.dictionaries.

Source: https://blog.csdn.net/weixin_45320660/article/details/112761808