Why does this hierarchical Poisson model not match the true parameters of the generated data?

2019-11-19 22:57:02 Author: Internet

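I am trying to fit a hierarchical Poisson regression to estimate time_delay for each group and globally. I am confused about whether pymc automatically applies a log link function to mu, or whether I have to do that explicitly: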

with pm.Model() as model:
    alpha = pm.Gamma('alpha', alpha=1, beta=1)
    beta = pm.Gamma('beta', alpha=1, beta=1)

    a = pm.Gamma('a', alpha=alpha, beta=beta, shape=n_participants)

    mu = a[participants_idx]
    y_est = pm.Poisson('y_est', mu=mu, observed=messages['time_delay'].values)

    start = pm.find_MAP(fmin=scipy.optimize.fmin_powell)
    step = pm.Metropolis(start=start)
    trace = pm.sample(20000, step, start=start, progressbar=True)
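
On the link-function question: to my knowledge, pm.Poisson treats mu as the expected count itself and applies no link, so a log link would have to be written out by exponentiating the parameter. The sketch below illustrates the difference with NumPy only, independent of pymc; the rate of 15 and the name eta are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Identity parameterization: the value passed in IS the Poisson mean.
rate = 15.0
draws_identity = rng.poisson(lam=rate, size=100_000)

# Log link: the parameter lives on the log scale and must be
# exponentiated before it is used as the Poisson mean.
eta = np.log(15.0)  # hypothetical log-scale parameter
draws_log_link = rng.poisson(lam=np.exp(eta), size=100_000)

# Both sample means should be close to 15.
print(draws_identity.mean(), draws_log_link.mean())
```

In the model above, a is given a Gamma prior and is therefore already positive, so using it directly as mu corresponds to the identity parameterization.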

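The trace plot below shows the estimates of a. You can see group estimates ranging from 0 to 750.

My confusion starts when I plot the hyperparameter Gamma distribution using the posterior means of alpha and beta as its parameters. The distribution below has support roughly between 0 and 5, which is not what I expect given the estimates above. What does a represent? Is it log(a) or something else?

Thanks for any guidance.

Adding an example with fake data, as requested in the comments. This example has only one group, so it should be easier to see whether the hyperparameters can plausibly produce that group's Poisson distribution.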

import numpy as np
import pandas as pd

test_data = []
model = []

for i in np.arange(1):
    # between 1 and 100 messages per conversation (integer count)
    num_messages = np.random.randint(1, 101)
    # true mean delay for this conversation: Gamma with shape 15, scale 1
    avg_delay = np.random.gamma(15, 1)
    for j in np.arange(num_messages):
        delay = np.random.poisson(avg_delay)

        test_data.append([i, j, delay, i])

    model.append([i, avg_delay])

model_df = pd.DataFrame(model, columns=['conversation_id', 'synthetic_mean_delay'])
test_df = pd.DataFrame(test_data, columns=['conversation_id', 'message_id', 'time_delay', 'participants_str'])
test_df.head()
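
Before fitting anything, it can help to confirm that the synthetic data behaves as intended. The following is a minimal sketch using NumPy only: for Poisson data the sample mean should sit close to the Gamma-drawn rate. The seed and the fixed message count of 5000 are my own assumptions, chosen so the check is stable; they are not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(42)

# True rate for the single group, drawn as in the example (shape 15, scale 1).
avg_delay = rng.gamma(shape=15, scale=1)

# Assumed message count, larger than in the post so the estimate is tight.
num_messages = 5000
delays = rng.poisson(avg_delay, size=num_messages)

# For Poisson data the sample mean estimates the rate, and the
# sample variance should be roughly equal to the mean.
print(avg_delay, delays.mean(), delays.var())
```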

# Estimate parameters of model using test data
# convert categorical variables to integer
le = preprocessing.LabelEncoder()
test_participants_map = le.fit(test_df['participants_str'])
test_participants_idx = le.fit_transform(test_df['participants_str'])
n_test_participants = len(test_df['participants_str'].unique())

with pm.Model() as model:
    alpha = pm.Gamma('alpha', alpha=1, beta=1)    
    beta = pm.Gamma('beta', alpha=1, beta=1)

    a = pm.Gamma('a', alpha=alpha, beta=beta, shape=n_test_participants)

    mu = a[test_participants_idx]

    y = test_df['time_delay'].values
    y_est = pm.Poisson('y_est', mu=mu, observed=y)

    start = pm.find_MAP(fmin=scipy.optimize.fmin_powell)
    step = pm.Metropolis(start=start)
    trace = pm.sample(20000, step, start=start, progressbar=True)

I don't see how the hyperparameters below could produce a Poisson distribution whose parameter lies between 13 and 17.

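Solution:

Answer: pymc parameterizes the Gamma distribution differently from scipy. scipy uses alpha and scale, whereas pymc uses alpha and beta, where beta is a rate (beta = 1/scale). The model below works as expected: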

with pm.Model() as model:
    alpha = pm.Gamma('alpha', alpha=1, beta=1)    
    scale = pm.Gamma('scale', alpha=1, beta=1)

    a = pm.Gamma('a', alpha=alpha, beta=1.0/scale, shape=n_test_participants)

    #mu = T.exp(a[test_participants_idx])
    mu = a[test_participants_idx]

    y = test_df['time_delay'].values
    y_est = pm.Poisson('y_est', mu=mu, observed=y)

    start = pm.find_MAP(fmin=scipy.optimize.fmin_powell)
    step = pm.Metropolis(start=start)
    trace = pm.sample(20000, step, start=start, progressbar=True)
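
The rate versus scale distinction can be checked numerically without pymc. This is a minimal sketch with NumPy, whose gamma sampler takes a scale as its second argument; the scale of 2.0 is a hypothetical value picked so that the two readings give visibly different means.

```python
import numpy as np

rng = np.random.default_rng(1)

alpha = 15.0        # shape, as in np.random.gamma(15, 1) in the question
scale = 2.0         # hypothetical scale, chosen so that rate != scale
rate = 1.0 / scale  # the equivalent rate (beta)

# NumPy's second argument is a scale, so the mean is alpha * scale.
samples = rng.gamma(shape=alpha, scale=scale, size=200_000)

mean_if_scale = alpha * scale  # 30.0, matches the samples
mean_if_rate = alpha / scale   # 7.5, the result of misreading the scale as a rate

print(samples.mean(), mean_if_scale, mean_if_rate)

# With beta = 1 / scale, the rate form alpha / beta gives the same mean.
print(alpha / rate)
```

Passing a value meant as a scale where a rate is expected shrinks or inflates the implied mean by a factor of scale squared, which is the kind of mismatch the question describes.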

Tags: mcmc, pymc, python
Source: https://codeday.me/bug/20191119/2039624.html