python – h5py not sticking to chunking specification?

2019-09-26 11:55:40 Author: Internet

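Question:
I have existing netCDF4 files (about 5000), each typically holding 96x3712x3712 data points (float32). The first dimension of these files is time (one file per day); the second and third are spatial dimensions.
Currently, slicing along the first dimension (even a partial slice) takes a long time, for the following reasons:

> The netCDF files are chunked with a chunk size of 1x3712x3712. Slicing along the time dimension basically reads the entire file.
> Looping over all the smaller files (even in multiple processes) also takes a lot of time.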
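
The first point can be checked with a little arithmetic. The following is a minimal sketch (the helper `chunks_touched` and the pixel indices are made up for illustration) that counts how many chunks a selection overlaps, using the daily file shape and chunk shape stated above:

```python
from math import prod

def chunks_touched(start, stop, chunks):
    """Count the chunks overlapped by the half-open selection [start, stop)."""
    per_axis = [
        (hi - 1) // c - lo // c + 1
        for lo, hi, c in zip(start, stop, chunks)
    ]
    return prod(per_axis)

daily_shape = (96, 3712, 3712)
daily_chunks = (1, 3712, 3712)

# Total number of chunks in one daily file.
total = chunks_touched((0, 0, 0), daily_shape, daily_chunks)

# Full time series of a single pixel (row and column chosen arbitrarily).
pixel_series = chunks_touched((0, 100, 200), (96, 101, 201), daily_chunks)

print(total, pixel_series)
```

Both counts come out equal, which is another way of saying that a one-pixel time series has to pull in every chunk of the file.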

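My goal:

> Create monthly files (about 2900x3712x3712 data points)
> Optimize them for slicing along the time dimension (chunk size of 2900x1x1, or slightly larger in the spatial dimensions)

Other requirements:

> The files should be appendable by a single timestamp (1x3712x3712), and this update should take less than 15 minutes
> Queries should be fast enough: a full time slice (i.e. 2900x1x1) in less than one second, which is really not that much data…
> Preferably, the files should be readable by multiple processes while being updated
> Processing the historical data (the other 5000 daily files) should preferably take less than a couple of weeks.

I have tried several approaches:

> Concatenating the netCDF files and rechunking them ==> takes too much memory and too much time…
> Writing them from pandas to an HDF file (using pytables) ==> creates a wide table with a huge index. This would eventually take too long to read as well, and because of metadata constraints it requires the dataset to be tiled over the spatial dimensions.
> My last approach was writing them to an HDF5 file using h5py:

Here is the code to create a single monthly file: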

import logging
import os

import h5py
import numpy as np
import pandas as pd
from netCDF4 import Dataset, num2date

logger = logging.getLogger(__name__)

def create_h5(fps):
    nodays = 31  # assumed from the 31-day reference period below; undefined in the original
    timestamps=pd.date_range("20050101",periods=nodays*96,freq='15min') #Reference time period
    output_fp = r'/data/test.h5'
    f = h5py.File(output_fp, 'a',libver='latest')
    try:
        shape = 96*nodays, 3712, 3712
        d = f.create_dataset('variable', shape=(1,3712,3712), maxshape=(None,3712,3712),dtype='f', compression='gzip', compression_opts=9,chunks=(1,29,29))
        f.swmr_mode = True
        for fp in fps:
            nc=Dataset(fp)  # opened before the try so that nc always exists in the finally clause
            try:
                times = num2date(nc.variables['time'][:], nc.variables['time'].units)
                # position of each timestamp of this file within the monthly reference period
                indices=np.searchsorted(timestamps, times)
                for j,time in enumerate(times):
                    logger.debug("File: {}, timestamp: {:%Y%m%d %H:%M}, pos: {}, new_pos: {}".format(os.path.basename(fp),time,j,indices[j]))
                    # grow the dataset along the time axis, then write one time step
                    d.resize((indices[j]+1,shape[1],shape[2]))
                    d[indices[j]]=nc.variables['variable'][j:j+1]
                    f.flush()
            finally:
                nc.close()
    finally:
        f.close()
    return output_fp

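I am using the latest version of HDF5 in order to have the SWMR option. The fps argument is a list of file paths of the daily netCDF4 files. It creates the file in about 2 hours (on an SSD, but I see that creating the file is mainly CPU bound), which is acceptable.

I have compression set up to keep the file size within limits. In earlier tests without compression, file creation was slightly faster, but slicing did not take noticeably longer with compression. h5py automatically chunks the dataset in 1x116x116 chunks.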
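
One way to see which chunk shape actually ended up in the file is to read the dataset's `chunks` attribute after creation. This is only a sketch: the file path and dataset names are hypothetical, and the spatial dimensions are shrunk to 64x64 so that it runs quickly.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file; small spatial dimensions keep the example quick.
path = os.path.join(tempfile.mkdtemp(), "chunk_check.h5")

with h5py.File(path, "w") as f:
    explicit = f.create_dataset(
        "explicit", shape=(1, 64, 64), maxshape=(None, 64, 64),
        dtype="f", chunks=(1, 16, 16), compression="gzip",
    )
    # chunks=True lets h5py guess a chunk shape on its own.
    guessed = f.create_dataset(
        "guessed", shape=(1, 64, 64), maxshape=(None, 64, 64),
        dtype="f", chunks=True,
    )
    explicit_chunks = explicit.chunks
    guessed_chunks = guessed.chunks

print("explicit:", explicit_chunks)
print("guessed: ", guessed_chunks)
```

If the reported shape differs from the one passed to `create_dataset`, the file being inspected was probably not created by the call one expects, for example because an older dataset already existed in a file opened in append mode.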

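Now the problem: on a NAS with a RAID 6 setup, it takes about 20 seconds to slice the time dimension, even though it is in a single chunk…

I figure that even though it is in a single chunk in the file, it must be fragmented because I wrote all the values in a loop (I do not know how this process works, though). That is why I tried h5repack, one of the HDF5 command-line tools, to rewrite the data into a new file with the same chunks, hoping it would reorder the values so that a query could read them in a more sequential order, but no luck. The process took 6 hours to run, yet it had no effect on query speed.

If my calculations are right, reading one chunk (2976x32x32) is only a few MB (11 MB uncompressed, and I think only a bit more than 1 MB compressed). How can this take so long? What am I doing wrong? I would be glad if someone could shed light on what is actually going on behind the scenes…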
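
The size estimate can be reproduced, and compared with what a read would cost if the stored chunks really are 1x116x116 as mentioned earlier. This is a rough sketch with a made-up helper `nbytes`; it only counts uncompressed bytes and ignores compression and caching.

```python
FLOAT32_BYTES = 4

def nbytes(shape, itemsize=FLOAT32_BYTES):
    """Uncompressed size in bytes of an array with the given shape."""
    total = itemsize
    for n in shape:
        total *= n
    return total

n_times = 31 * 96  # 2976 timestamps in the reference period

# The block size assumed in the question.
block = nbytes((n_times, 32, 32))

# A single-pixel time series if the stored chunks are 1x116x116:
# every timestamp lives in a different chunk, and each chunk is read whole.
chunks_read = n_times
bytes_read = chunks_read * nbytes((1, 116, 116))

print(f"2976x32x32 block: {block / 1024**2:.1f} MiB")
print(f"pixel series with 1x116x116 chunks: {chunks_read} chunks, "
      f"{bytes_read / 1024**2:.1f} MiB")
```

Under that assumption a time slice is not one read of a few MB but thousands of separate chunk reads, which would fit the slow queries on the NAS.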

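Solution:

Influence of chunk size

In the worst case, reading or writing one chunk can be regarded as a random read/write operation. The main advantage of an SSD is its speed at reading or writing small blocks of data. An HDD is much slower at this task (a factor of 100 can be observed), and a NAS can be much slower still than an HDD.

So the solution to the problem is a larger chunk size. Some benchmarks on my system (Core i5-4690).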

Example_1 (chunk size (1,29,29) = 3.4 kB):

import numpy as np
import tables #needed for blosc
import h5py as h5
import time
import h5py_cache as h5c

def original_chunk_size():
    File_Name_HDF5='some_Path'
    #Array=np.zeros((1,3712,3712),dtype=np.float32)
    Array=np.random.rand(1,3712,3712)  # one time step, matching the d[i:i+1,:,:] assignment below

    f = h5.File(File_Name_HDF5, 'a',libver='latest')
    nodays=1

    shape = 96*nodays, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(1,29,29),compression=32001,compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
    # Switch to SWMR mode only after the dataset exists; the documented SWMR workflow creates all objects first
    f.swmr_mode = True

    #Writing
    t1=time.time()
    for i in range(0,96*nodays):
        d[i:i+1,:,:]=Array

    f.close()
    print(time.time()-t1)

    #Reading
    f = h5.File(File_Name_HDF5, 'a',libver='latest')
    f.swmr_mode = True
    d=f['variable']

    for i in range(0,3712,29):
        for j in range(0,3712,29):
            A=np.copy(d[:,i:i+29,j:j+29])

    # t1 is not reset after writing, so this prints the total time since writing started
    print(time.time()-t1)

Results (write/read):

SSD: 38s / 54s

HDD: 40s / 57s

NAS: 252s / 823s

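In the second example I will use h5py_cache, because I want to keep writing slabs of (1,3712,3712). The default chunk cache size is only 1 MB, so it has to be increased to avoid multiple read/write operations on the same chunks.
https://pypi.python.org/pypi/h5py-cache/1.0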
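
As an alternative sketch, and assuming h5py 2.9 or later, the chunk cache can be set through the `rdcc_nbytes` argument of `h5py.File` without an extra package. The example below is not the benchmark from this answer: it uses a small stand-in dataset and a hypothetical scratch file, and sizes the cache so that all chunks touched by one time-step write fit at once. The last lines repeat the arithmetic for the full-size layout.

```python
import os
import tempfile

import h5py
import numpy as np

shape = (8, 64, 64)      # small stand-in for (96, 3712, 3712)
chunks = (8, 16, 16)     # small stand-in for (96, 58, 58)
itemsize = np.dtype("float32").itemsize

# Writing one time step touches every chunk in the spatial plane,
# so the cache should be able to hold all of them at once.
chunks_per_plane = (shape[1] // chunks[1]) * (shape[2] // chunks[2])
cache_bytes = chunks_per_plane * int(np.prod(chunks)) * itemsize

path = os.path.join(tempfile.mkdtemp(), "cache_demo.h5")  # hypothetical scratch file
slab = np.random.rand(1, shape[1], shape[2]).astype("float32")

with h5py.File(path, "w", rdcc_nbytes=cache_bytes) as f:
    d = f.create_dataset("variable", shape, dtype="f", chunks=chunks,
                         compression="gzip")
    for i in range(shape[0]):
        d[i:i + 1, :, :] = slab  # one time step per write

with h5py.File(path, "r", rdcc_nbytes=cache_bytes) as f:
    first_pixel_series = f["variable"][:, 0, 0]

# The same arithmetic for the full-size (96, 58, 58) layout.
full_cache = (3712 // 58) ** 2 * 96 * 58 * 58 * 4
print(cache_bytes, round(full_cache / 1024**3, 2), "GiB")
```

For the full-size layout the result stays below the 6 GB cache used in Example_2. With this many chunks in the cache, the `rdcc_nslots` argument may also need to be raised; that is not shown here.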

Example_2 (chunk size (96,58,58) = 1.3 MB):

import numpy as np
import tables #needed for blosc
import h5py as h5
import time
import h5py_cache as h5c

def modified_chunk_size():
    File_Name_HDF5='some_Path'
    Array=np.random.rand(1,3712,3712)

    f = h5c.File(File_Name_HDF5, 'a',libver='latest', 
    chunk_cache_mem_size=6*1024**3)
    nodays=1

    shape = 96*nodays, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression=32001,compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
    # Switch to SWMR mode only after the dataset exists; the documented SWMR workflow creates all objects first
    f.swmr_mode = True

    #Writing
    t1=time.time()
    for i in range(0,96*nodays):
        d[i:i+1,:,:]=Array

    f.close()
    print(time.time()-t1)

    #Reading
    f = h5c.File(File_Name_HDF5, 'a',libver='latest', chunk_cache_mem_size=6*1024**3) #6 GB chunk cache
    f.swmr_mode = True
    d=f['variable']

    for i in range(0,3712,58):
        for j in range(0,3712,58):
            A=np.copy(d[:,i:i+58,j:j+58])

    # t1 is not reset after writing, so this prints the total time since writing started
    print(time.time()-t1)

Results (write/read):

SSD: 10s / 16s

HDD: 10s / 16s

NAS: 13s / 20s
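
To put the two sets of results side by side, the sketch below counts how many chunk accesses the read loops of Example_1 and Example_2 perform and how large each chunk is. The helper `read_loop_stats` is made up for this illustration and only models the tile-by-tile loop `d[:, i:i+cy, j:j+cx]`; it says nothing about latency itself.

```python
def read_loop_stats(shape, chunks, itemsize=4):
    """Chunk accesses and chunk size for a loop reading d[:, i:i+cy, j:j+cx] tile by tile."""
    nt, ny, nx = shape
    ct, cy, cx = chunks
    tiles = (ny // cy) * (nx // cx)   # iterations of the two nested loops
    chunks_per_tile = -(-nt // ct)    # ceiling division along the time axis
    chunk_bytes = ct * cy * cx * itemsize
    return tiles * chunks_per_tile, chunk_bytes

shape = (96, 3712, 3712)
stats = {}
for chunks in [(1, 29, 29), (96, 58, 58)]:
    stats[chunks] = read_loop_stats(shape, chunks)
    accesses, chunk_bytes = stats[chunks]
    print(chunks, accesses, "chunk reads of", chunk_bytes, "bytes")
```

Both loops move the same total amount of data, but the small-chunk layout splits it into several hundred times more individual reads, which is where a high-latency device such as a NAS would be expected to suffer most.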

Read/write speed can be improved further by minimizing API calls (reading and writing larger blocks of chunks).
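
A minimal sketch of that idea, assuming the (96,58,58) chunk layout from Example_2: generate slices that each span several whole chunks, so that one call fetches a larger chunk-aligned block. The helper `aligned_blocks` and the choice of 8 chunks per read are illustrative, not part of the original benchmark.

```python
def aligned_blocks(length, chunk, chunks_per_read):
    """Yield slices covering range(length), each spanning whole chunks."""
    step = chunk * chunks_per_read
    for start in range(0, length, step):
        yield slice(start, min(start + step, length))

# Reading 8 chunks per call along each spatial axis of a 3712-wide dataset
# with 58-wide chunks gives 8 x 8 = 64 calls instead of 64 x 64 = 4096.
rows = list(aligned_blocks(3712, 58, 8))
cols = list(aligned_blocks(3712, 58, 8))
print(len(rows) * len(cols))

# Hypothetical usage with an open h5py dataset `d`:
# for r in rows:
#     for c in cols:
#         block = d[:, r, c]
```

Each such block of 96x464x464 float32 values is about 83 MB uncompressed, so the number of chunks per read is a trade-off between call overhead and memory.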

I also want to mention the compression method here. Blosc can achieve up to 1 GB/s throughput (CPU-bound); gzip is slower but provides better compression ratios.

d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=3)

20s / 30s, file size: 101 MB

d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=6)

50s / 58s, file size: 87 MB

d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=9)

50s / 60s, file size: 64 MB
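
The effect of the gzip level can be explored outside HDF5 with Python's `zlib` module, since the HDF5 gzip filter is based on the same DEFLATE algorithm. The sketch below uses a synthetic payload (one random 58x58 float32 tile repeated 96 times, similar to how the benchmark above writes the same random slab at every time step); timings on a single in-memory buffer will not match HDF5 throughput, so treat the numbers only as a relative comparison.

```python
import array
import random
import time
import zlib

random.seed(0)
# Synthetic float32 payload: one random 58x58 tile repeated 96 times,
# mimicking a (96, 58, 58) chunk whose time steps are identical.
tile = array.array("f", (random.random() for _ in range(58 * 58)))
payload = tile.tobytes() * 96

results = {}
for level in (3, 6, 9):
    t0 = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - t0
    assert zlib.decompress(compressed) == payload  # lossless round trip
    results[level] = (len(compressed), elapsed)
    print(f"level {level}: {len(compressed)} bytes in {elapsed * 1000:.1f} ms")
```

Real measurement data will usually compress very differently from this repeated random tile, so it is worth running such a comparison on a sample of the actual data.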

Now a benchmark for a whole month (30 days). Writing is slightly optimized: the data is written in slabs of (96,3712,3712).

def modified_chunk_size():
    File_Name_HDF5='some_Path'

    Array_R=np.random.rand(1,3712,3712)
    Array=np.zeros((96,3712,3712),dtype=np.float32)
    for j in range(0,96):
        Array[j,:,:]=Array_R

    f = h5.File(File_Name_HDF5, 'a',libver='latest')
    nodays=30

    shape = 96, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression=32001,compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
    # Switch to SWMR mode only after the dataset exists; the documented SWMR workflow creates all objects first
    f.swmr_mode = True

    #Writing
    t1=time.time()
    for i in range(0,96*nodays,96):
        # grow the dataset to hold this day before writing, so no empty slab is left at the end
        d.resize((i+96,shape[1],shape[2]))
        d[i:i+96,:,:]=Array

    f.close()
    print(time.time()-t1)

    #Reading
    f = h5.File(File_Name_HDF5, 'a',libver='latest')
    f.swmr_mode = True
    d=f['variable']
    for i in range(0,3712,58):
        for j in range(0,3712,58):
            A=np.copy(d[:,i:i+58,j:j+58])

    # t1 is not reset after writing, so this prints the total time since writing started
    print(time.time()-t1)

133s / 301s with blosc

432s / 684s with gzip, compression_opts=3

I had the same problems when accessing data on a NAS. I hope this helps…

Tags: python, h5py, netcdf4
Source: https://codeday.me/bug/20190926/1820054.html