NumPy: computing the size of the row-wise intersection between two arrays

Author: Internet

I have 2 arrays filled with integers smaller than 100. A number cannot appear twice within the same row.

- Array1: nrow = 100,000; ncol = 5
- Array2: nrow = 50,000; ncol = 5

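I want to create a third array (Intersection) that holds, for every row of Array1 and every row of Array2, the number of elements the two rows have in common.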

import numpy as np

def Intersection(array1, array2):
    # result[i, j] = number of values shared by row i of array1 and row j of array2
    result = np.empty([array1.shape[0], array2.shape[0]], dtype=np.int8)
    for i in range(array1.shape[0]):
        for j in range(array2.shape[0]):
            result[i, j] = len(set(array1[i]).intersection(array2[j]))
    return result

Here is an example:

array1 = np.array([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [7,8,9,10,11] ])
array2 = np.array([[1, 3, 7, 20, 21], [1, 43, 104, 115, 116], [6,30,91,110,121] ])
# Expected result:
array([[2, 1, 0],
       [1, 0, 1],
       [1, 0, 0]], dtype=int8)
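
For a tiny input like this example, the expected result can be cross-checked with a dense broadcasting comparison. This is only a sketch: the names `matches` and `dense_counts` are illustrative and not from the original post, and the approach relies on the stated assumption that no value repeats within a row. The intermediate array has one entry per pair of elements, so it is not practical at the full problem size.

```python
import numpy as np

array1 = np.array([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [7, 8, 9, 10, 11]])
array2 = np.array([[1, 3, 7, 20, 21], [1, 43, 104, 115, 116], [6, 30, 91, 110, 121]])

# Compare every element of every row of array1 with every element of every
# row of array2; the intermediate boolean array has shape (3, 3, 5, 5).
matches = array1[:, None, :, None] == array2[None, :, None, :]

# Counting matching pairs equals the intersection size only because
# no value repeats within a row (assumption taken from the question).
dense_counts = matches.sum(axis=(2, 3)).astype(np.int8)
print(dense_counts)
```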

This naive solution with nested loops is very slow. How can I vectorize it?

Solution:

Here is an approach that takes on the order of a second for inputs of size 100000 and 50000:

import numpy as np
import scipy.sparse

def intersection_counts(x,y):
    # find the size of the input arrays
    n_x, n_d = x.shape
    n_y, n_d = y.shape

    # get a list of the unique values appearing in x and y, as well
    # as a list of inverse indices (see docs for np.unique)
    values, ix = np.unique(np.vstack((x,y)), return_inverse=True)
    n_unique = len(values)

    # reshape the inverse array. ix_x_hat will be an array the same size
    # as x, where ix_x_hat[i,j] gives the index of x[i,j] in values. That
    # is to say, values[ix_x_hat[i,j]] == x[i,j]
    ix_hat = ix.reshape(-1, n_d)
    ix_x_hat = ix_hat[:n_x]
    ix_y_hat = ix_hat[n_x:]

    # create a sparse matrix where entry [i,j] is 1 if and only if
    # row i of x contains values[j]
    x_hat = scipy.sparse.lil_matrix((n_x, n_unique), dtype=int)
    x_hat[np.arange(n_x)[:,None], ix_x_hat] = 1

    # create a sparse matrix where entry [i,j] is 1 if and only if
    # row i of y contains values[j]
    y_hat = scipy.sparse.lil_matrix((n_y, n_unique), dtype=int)
    y_hat[np.arange(n_y)[:,None], ix_y_hat] = 1

    # the dot product gives the solution
    return x_hat.dot(y_hat.T)
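
As a possible variant (not the code from the answer above), the indicator matrices can also be built directly in CSR format from row and column index arrays instead of by assignment into a `lil_matrix`. The function name `intersection_counts_csr` and the helper `indicator` are hypothetical; the sketch below checks the idea on the example from the question.

```python
import numpy as np
import scipy.sparse

def intersection_counts_csr(x, y):
    """Hypothetical variant: same idea, indicator matrices built as CSR."""
    n_x, n_d = x.shape
    n_y = y.shape[0]

    # Map every entry of x and y to its index in the sorted unique values.
    values, inverse = np.unique(np.vstack((x, y)), return_inverse=True)
    inverse = inverse.reshape(-1, n_d)
    n_unique = len(values)

    def indicator(col_idx, n_rows):
        # One (row, column) pair per entry of the input array.
        rows = np.repeat(np.arange(n_rows), n_d)
        data = np.ones(rows.size, dtype=np.int32)
        m = scipy.sparse.csr_matrix(
            (data, (rows, col_idx.ravel())), shape=(n_rows, n_unique)
        )
        # A value repeated within a row would be summed; force entries back to 1.
        m.sum_duplicates()
        m.data[:] = 1
        return m

    x_hat = indicator(inverse[:n_x], n_x)
    y_hat = indicator(inverse[n_x:], n_y)
    return x_hat.dot(y_hat.T)

array1 = np.array([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [7, 8, 9, 10, 11]])
array2 = np.array([[1, 3, 7, 20, 21], [1, 43, 104, 115, 116], [6, 30, 91, 110, 121]])
counts = intersection_counts_csr(array1, array2)
print(counts.toarray())
```

Calling `.toarray()` is fine for a 3 x 3 result, but converting the full-size result to a dense array would bring back the memory problem.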

Here is the idea: suppose every entry of x and y takes its value from some small set, for example values = [1,3,6,9,11,15,28,40]. Consider a row of x:

x[0] = [11, 6, 40, 1, 3]

and a row of y:

y[0] = [3, 11, 6, 9, 15]

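We can represent x[0] as a sparse vector with the same length as values. The ith entry is 1 if the ith value appears in x[0], and y[0] is encoded the same way: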

#          [1, 3, 6, 9,11,15,28,40]
x_hat[0] = [1, 1, 1, 0, 1, 0, 0, 1]
y_hat[0] = [0, 1, 1, 1, 1, 1, 0, 0]

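How many elements are in the intersection of x[0] and y[0]? It is simply the dot product of x_hat[0] and y_hat[0]: 3. The code above just does this in batch.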
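
The worked example can be reproduced with a few lines of NumPy. This is a small dense sketch for illustration only; the variable names `x_hat0`, `y_hat0`, and `overlap` are chosen here and do not appear in the function above.

```python
import numpy as np

values = np.array([1, 3, 6, 9, 11, 15, 28, 40])
x0 = np.array([11, 6, 40, 1, 3])
y0 = np.array([3, 11, 6, 9, 15])

# Entry i is 1 if values[i] appears in the row, 0 otherwise.
x_hat0 = np.isin(values, x0).astype(int)
y_hat0 = np.isin(values, y0).astype(int)

# The dot product counts positions where both indicator vectors are 1.
overlap = int(x_hat0 @ y_hat0)
print(x_hat0, y_hat0, overlap)
```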

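The function works with sparse matrices, and the result is itself a sparse matrix, which saves memory. Note that a dense 100000 x 50000 array of int32 values already takes 20 GB, which may or may not exceed your RAM. For help working with sparse arrays, see the scipy.sparse documentation.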
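
The 20 GB figure follows from simple arithmetic, sketched below. The helper `dense_nbytes` is a hypothetical name used only for this illustration.

```python
import numpy as np

def dense_nbytes(n_rows, n_cols, dtype):
    """Bytes needed for a dense n_rows x n_cols array of the given dtype."""
    return n_rows * n_cols * np.dtype(dtype).itemsize

int32_bytes = dense_nbytes(100_000, 50_000, np.int32)  # 4 bytes per entry
int8_bytes = dense_nbytes(100_000, 50_000, np.int8)    # 1 byte per entry
print(int32_bytes / 1e9, "GB as int32")
print(int8_bytes / 1e9, "GB as int8")
```

Since each count is at most 5, a 1-byte integer type is enough to hold it, but even then a dense result would need 5 GB.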

I tested the code above by generating the arrays x and y with:

x = np.random.randint(0,1000,(100000,5))
y = np.random.randint(0,1000,(50000,5))

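It finished in just 2 seconds on a 5-year-old machine with 24 GB of main memory. Here, 1000 is the size of the range of values that x and y can take. Making it smaller means the matrices involved will be less sparse, and the code will take longer.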
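
To get a feel for how the value range affects sparsity, the fraction of nonzero entries in the result can be estimated on a small random sample. This is a rough sketch under assumptions: the function `estimated_density` and its default sizes are made up for illustration, and it uses a dense comparison that is only suitable for small samples.

```python
import numpy as np

def estimated_density(n_values, n_rows=200, n_cols=5, seed=0):
    """Estimate the fraction of row pairs that share at least one value."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, n_values, (n_rows, n_cols))
    y = rng.integers(0, n_values, (n_rows, n_cols))
    # shares[i, j] is True when row i of x and row j of y have a common value.
    shares = (x[:, None, :, None] == y[None, :, None, :]).any(axis=(2, 3))
    return float(shares.mean())

for n_values in (1000, 100):
    print(n_values, estimated_density(n_values))
```

A smaller value range gives a larger share of nonzero entries, so the sparse result needs more memory and the product takes longer to compute.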

Tags: vectorization, arrays, python, numpy
Source: https://codeday.me/bug/20191121/2048475.html