redis6.0.5之HyperLogLog阅读笔记1-基数估算前言翻译
作者:互联网
/* The Redis HyperLogLog implementation is based on the following ideas: Redis的超对数实现是基于以下的想法: * * The use of a 64 bit hash function as proposed in [1], in order to don't * limited to cardinalities up to 10^9, at the cost of just 1 additional * bit per register. 使用文1推荐的64位的哈希函数,仅用对每个寄存器增加一个比特的代价,就可以突破基数个数10的9次方的限制。 * * The use of 16384 6-bit registers for a great level of accuracy, using * a total of 12k per key. 为了获得一个比较高的精度,使用16384个6比特的寄存器,每个键值使用了12k(2的14次方 * 6 = 12k) * * The use of the Redis string data type. No new type is introduced. * * No attempt is made to compress the data structure as in [1]. Also the * algorithm used is the original HyperLogLog Algorithm as in [2], with * the only difference that a 64 bit hash function is used, so no correction * is performed for values near 2^32 as in [1]. 使用的是redis的字符串类型,没有引入新的数据类型。 没有采用文1中压缩数据结构的算法。同时使用了文2中原始的超对数算法,唯一不同的是使用了64位的哈希函数, 所以不用对类似文1中接近2^32的值做修正。 * [1] Heule, Nunkesser, Hall: HyperLogLog in Practice: Algorithmic * Engineering of a State of The Art Cardinality Estimation Algorithm. * * [2] P. Flajolet, Eric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The * analysis of a near-optimal cardinality estimation algorithm. (这两篇文章留到后面再慢慢翻译) * Redis uses two representations: redis使用两种表示方式: * 1) A "dense" representation where every entry is represented by * a 6-bit integer. 密集表示方式 每个实体用6比特整数来表示 * 2) A "sparse" representation using run length compression suitable * for representing HyperLogLogs with many registers set to 0 in * a memory efficient way. 稀疏表示方式 使用行程长的压缩方式表示具有很多个0值寄存器的超级日志,这是一中非常有效的内存表示方法 * * HLL header HLL的头部 * === * * Both the dense and sparse representation have a 16 byte header as follows: 密集和稀疏的表示都拥有一个16个字节的如下头部: * +------+---+-----+----------+ * | HYLL | E | N/U | Cardin. | * +------+---+-----+----------+ * * The first 4 bytes are a magic string set to the bytes "HYLL". * "E" is one byte encoding, currently set to HLL_DENSE or * HLL_SPARSE. N/U are three not used bytes. 开始的4个字节是一个魔法数,被设置为HYLL. E表示一个字节的编码,当前用来设置HLL_DENSE或者HLL_SPARSE。N/U是3个还么有被使用的字节。 * The "Cardin." field is a 64 bit integer stored in little endian format * with the latest cardinality computed that can be reused if the data * structure was not modified since the last computation (this is useful * because there are high probabilities that HLLADD operations don't * modify the actual data structure and hence the approximated cardinality). Cardin.字段 是一个小端格式保存的最近计算得到64位的集合中元素个数。 如果自从上次计算之后数据结构没有发生变化,那么可以这个数值可以重复使用。 (这个是很有用的,因为HLLADD操作大概率不会修改实际的数据结构,从而不需要修改近似的集合中元素个数) * When the most significant bit in the most significant byte of the cached * cardinality is set, it means that the data structure was modified and * we can't reuse the cached value that must be recomputed. 一旦缓存基数的最重要字节的最重要的比特位被设置, 那就意味着数据结构被修改并且我们不能使用缓存的值,必须要重新计算 * Dense representation 密集表示 * === * * The dense representation used by Redis is the following: redis使用的密集表示如下 * +--------+--------+--------+------// //--+ * |11000000|22221111|33333322|55444444 .... | * +--------+--------+--------+------// //--+ * * The 6 bits counters are encoded one after the other starting from the * LSB to the MSB, and using the next bytes as needed. 6个比特位从lsb到msb依次编码,根据情况使用下一个字节的比特位。 * Sparse representation 稀疏表示 * === * * The sparse representation encodes registers using a run length * encoding composed of three opcodes, two using one byte, and one using * of two bytes. The opcodes are called ZERO, XZERO and VAL. 稀疏表示编码寄存器使用了包含3种操作符的行程长编码,其中两个操作符使用1个字节,一个操作符使用两个字节。 这三个操作符分别为 ZERO, XZERO and VAL. * ZERO opcode is represented as 00xxxxxx. The 6-bit integer represented * by the six bits 'xxxxxx', plus 1, means that there are N registers set * to 0. This opcode can represent from 1 to 64 contiguous registers set * to the value of 0. ZERO操作符表示为00xxxxxx。6比特的整数用六个比特的'xxxxxx'表示,还要加上1,即意味着有N个寄存器被设置为0. 这个操作符可以表示从1到64个连续为0的寄存器。 * XZERO opcode is represented by two bytes 01xxxxxx yyyyyyyy. The 14-bit * integer represented by the bits 'xxxxxx' as most significant bits and * 'yyyyyyyy' as least significant bits, plus 1, means that there are N * registers set to 0. This opcode can represent from 0 to 16384 contiguous * registers set to the value of 0. XZERO操作符用两个字节01xxxxxx yyyyyyyy来表示。14个比特的整数用高6位的比特'xxxxxx'和低8位的比特'yyyyyyyy'表示。 同时还需要加1,即意味着N个寄存器设置为0.这个操作符能够代表从0(加1的话,这里应该为1)到16384个连续为0的寄存器。 * VAL opcode is represented as 1vvvvvxx. It contains a 5-bit integer * representing the value of a register, and a 2-bit integer representing * the number of contiguous registers set to that value 'vvvvv'. * To obtain the value and run length, the integers vvvvv and xx must be * incremented by one. This opcode can represent values from 1 to 32, * repeated from 1 to 4 times. VAL操作符由1vvvvvxx表示,其中'vvvvv'5比特的整数表示寄存器的值,'xx'2比特的整数表示连续的值为'vvvvv'的寄存器个数。 为了获取值和行程长,整数 vvvvv 和 xx 必须加1.这个操作符表示从1到32的值,和1到4的重复次数。 * The sparse representation can't represent registers with a value greater * than 32, however it is very unlikely that we find such a register in an * HLL with a cardinality where the sparse representation is still more * memory efficient than the dense representation. When this happens the * HLL is converted to the dense representation. 稀疏表示不能表示值超过值为32的寄存器,然而,不可能在HLL算法中找到这样一个寄存器,稀疏表示任然比密集表示更加内存有效。 如果这种情况发生,那么HLL算法就改用密集表示。 * The sparse representation is purely positional. For example a sparse * representation of an empty HLL is just: XZERO:16384. 稀疏表示纯粹是位置表示。举一个稀疏表示的例子,一个空的HLL表示,就是 XZERO:16384 * An HLL having only 3 non-zero registers at position 1000, 1020, 1021 * respectively set to 2, 3, 3, is represented by the following three * opcodes: 一个HLL只有3个非零的寄存器,分别在位置1000, 1020, 1021,对应的值为2,3,3, 可以用如下的操作符来表示 * XZERO:1000 (Registers 0-999 are set to 0) 从0到999的寄存器都是0 * VAL:2,1 (1 register set to value 2, that is register 1000) 寄存器1000的值是2 * ZERO:19 (Registers 1001-1019 set to 0) 寄存器1001到1019的寄存器都是0 * VAL:3,2 (2 registers set to value 3, that is registers 1020,1021) 寄存器1020,1021的值都是3 * XZERO:15362 (Registers 1022-16383 set to 0) 寄存器1022到16383的值都是0 * * In the example the sparse representation used just 7 bytes instead * of 12k in order to represent the HLL registers. In general for low * cardinality there is a big win in terms of space efficiency, traded * with CPU time since the sparse representation is slower to access: 在上面的例子中,稀疏表示值使用了7个字节而不是12K表示了HLL寄存器。一般来说,对于低基数,在空间效率方面有很大优势, 因为稀疏表示访问速度比较慢,可以用CPU的时间来交换。 * The following table shows average cardinality vs bytes used, 100 * samples per cardinality (when the set was not representable because * of registers with too big value, the dense representation size was used * as a sample). 下面的表格展示了平均基数 和 使用字节的对比, 每基数100个样本(当集合由于值太大而无法表示时,使用密集表示大小作为样本) * 100 267 * 200 485 * 300 678 * 400 859 * 500 1033 * 600 1205 * 700 1375 * 800 1544 * 900 1713 * 1000 1882 * 2000 3480 * 3000 4879 * 4000 6089 * 5000 7138 * 6000 8042 * 7000 8823 * 8000 9500 * 9000 10088 * 10000 10591 * * The dense representation uses 12288 bytes, so there is a big win up to * a cardinality of ~2000-3000. For bigger cardinalities the constant times * involved in updating the sparse representation is not justified by the * memory savings. The exact maximum length of the sparse representation * when this implementation switches to the dense representation is * configured via the define server.hll_sparse_max_bytes. */ 密集表示法使用12288字节,因此在基数为~2000-3000的情况下有很大的优势。 对更大的基数,更新稀疏表示的常量时间不是简单的内存节省。 当实现切换到密集表示的时,准确的稀疏表示的最大长度被配置在变量 server.hll_sparse_max_bytes中 config.c中 默认的初始化配置为3000 createSizeTConfig("hll-sparse-max-bytes", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.hll_sparse_max_bytes, 3000, MEMORY_CONFIG, NULL, NULL)
标签:registers,redis6.0,HyperLogLog,HLL,set,sparse,寄存器,representation,前言 来源: https://www.cnblogs.com/cquccy/p/14519280.html