
paddlehub(3)


This project is a solution to the competition on recognizing netizen sentiment during the COVID-19 epidemic. It uses PaddleHub with a pretrained Chinese model (the code below loads chinese-roberta-wwm-ext) to classify the sentiment of Weibo posts from the epidemic period.

Data Analysis

First, unzip the datasets.

# Unzip the datasets
!cd data/data22724 && unzip -o test_dataset.zip
!cd data/data22724 && unzip -o "train_ dataset.zip"
Archive:  test_dataset.zip
  inflating: nCov_10k_test.csv       
Archive:  train_ dataset.zip
  inflating: nCoV_100k_train.labled.csv  
  inflating: nCoV_900k_train.unlabled.csv  

The data is encoded in GB2312, so we first read each file, convert it to UTF-8, and write it back in place, which makes subsequent processing with pandas easier.

# Re-encode a file from GB2312 to UTF-8 in place
def re_encode(path):
    # Read with the original codec; errors='ignore' drops undecodable bytes
    with open(path, 'r', encoding='GB2312', errors='ignore') as file:
        lines = file.readlines()
    # Write back as UTF-8
    with open(path, 'w', encoding='utf-8') as file:
        file.write(''.join(lines))
        
re_encode('data/data22724/nCov_10k_test.csv')
re_encode('data/data22724/nCoV_100k_train.labled.csv')
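As a side note, this round-trip is lossy (errors='ignore' silently drops undecodable bytes) and can often be avoided: pandas accepts a codec directly, and GB18030 is a superset of GB2312. A minimal sketch of a direct read, assuming the files decode cleanly as GB18030 (not what this run does):

# Alternative sketch: read directly with GB18030 (a superset of GB2312),
# skipping the re-encoding step entirely
import pandas as pd
train_labled = pd.read_csv('data/data22724/nCoV_100k_train.labled.csv',
                           encoding='gb18030', engine='python')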

Data Preview

Read the data and inspect its shape and column names.

# Read the data
import pandas as pd
train_labled = pd.read_csv('data/data22724/nCoV_100k_train.labled.csv', engine='python')
test = pd.read_csv('data/data22724/nCov_10k_test.csv', engine='python')

print(train_labled.shape)
print(test.shape)
print(train_labled.columns)
train_labled.head(3)
               微博id        微博发布时间      发布人账号  \
0  4456072029125500  01月01日 23:50     存曦1988   
1  4456074167480980  01月01日 23:58   LunaKrys   
2  4456054253264520  01月01日 22:39  小王爷学辩论o_O   

                                              微博中文内容  \
0  写在年末冬初孩子流感的第五天,我们仍然没有忘记热情拥抱这2020年的第一天。带着一丝迷信,早...   
1    开年大模型…累到以为自己发烧了腰疼膝盖疼腿疼胳膊疼脖子疼#Luna的Krystallife#?   
2  邱晨这就是我爹,爹,发烧快好,毕竟美好的假期拿来养病不太好,假期还是要好好享受快乐,爹,新年...   

                                                微博图片 微博视频 情感倾向  
0  ['https://ww2.sinaimg.cn/orj360/005VnA1zly1gah...   []    0  
1                                                 []   []   -1  
2  ['https://ww2.sinaimg.cn/thumb150/006ymYXKgy1g...   []    1  


(100000, 7)
(10000, 6)
Index(['微博id', '微博发布时间', '发布人账号', '微博中文内容', '微博图片', '微博视频', '情感倾向'], dtype='object')

Label Distribution

The labels fall into three classes: 1 (positive), 0 (neutral), and -1 (negative).

# Label distribution
%matplotlib inline
train_labled['情感倾向'].value_counts(normalize=True).plot(kind='bar');

 <Figure size 432x288 with 1 Axes>

# Drop rows with anomalous labels
train_labled = train_labled[train_labled['情感倾向'].isin(['-1','0','1'])]
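To see what this removes, one can inspect the raw values before running the filter; a quick sketch (not in the original):

# Run before the filter above: shows any stray labels outside
# {'-1', '0', '1'} as well as missing values
print(train_labled['情感倾向'].value_counts(dropna=False))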

Text Length

The longest text in the training set is 241 characters; the mean is about 87.

train_labled['微博中文内容'].str.len().describe()
count    99560.000000
mean        87.276416
std         49.355898
min          1.000000
25%         42.000000
50%         86.000000
75%        139.000000
max        241.000000
Name: 微博中文内容, dtype: float64
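These statistics also inform the max_seq_len=128 used later. A quick check of how many texts fit once the [CLS] and [SEP] special tokens are accounted for (a sketch, not in the original; character count is only a rough proxy for token count):

# Fraction of training texts at most 126 characters long
# (128 minus the two special tokens)
coverage = (train_labled['微博中文内容'].str.len() <= 126).mean()
print('{:.2%} of texts fit within max_seq_len=128'.format(coverage))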

Data Preparation

We split the data into training and validation sets at an 8:2 ratio, then save each as a text file with the two columns separated by a tab.

# Split off a validation set; file format: text[\t]label
from sklearn.model_selection import train_test_split

train_labled = train_labled[['微博中文内容', '情感倾向']]
train, valid = train_test_split(train_labled, test_size=0.2, random_state=2020)
train.to_csv('/home/aistudio/data/data22724/train.txt', index=False, header=False, sep='\t')
valid.to_csv('/home/aistudio/data/data22724/valid.txt', index=False, header=False, sep='\t')
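Since the classes are imbalanced, a stratified split would keep the -1/0/1 proportions identical in both subsets. A variant of the call above (not what this run used):

# Variant: stratified split so both subsets preserve the label ratios
train, valid = train_test_split(
    train_labled, test_size=0.2, random_state=2020,
    stratify=train_labled['情感倾向'])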

Custom Dataset Loading

To load a custom text dataset, you only need to subclass BaseNLPDataset and specify the dataset location and the label list.

Since there is no labeled test set here, we simply reuse the validation file "valid.txt" as test_file.

# Custom dataset
import os
import codecs
import csv

from paddlehub.dataset.base_nlp_dataset import BaseNLPDataset

class MyDataset(BaseNLPDataset):
    """Weibo sentiment dataset"""
    def __init__(self):
        # Dataset location
        self.dataset_dir = "/home/aistudio/data/data22724"
        super(MyDataset, self).__init__(
            base_path=self.dataset_dir,
            train_file="train.txt",
            dev_file="valid.txt",
            test_file="valid.txt",
            train_file_with_header=False,
            dev_file_with_header=False,
            test_file_with_header=False,
            # The set of dataset labels
            label_list=["-1", "0", "1"])

dataset = MyDataset()
for e in dataset.get_train_examples()[:3]:
    print("{}\t{}\t{}".format(e.guid, e.text_a, e.label))
0    【#吃中药后新冠肺炎患者紧张心理得到缓解#生病时,你一般吃中药还是西药?】国家中医药管理局医疗救治专家组组长、中国工程院院士、中国中医科学院院长黄璐琦14日介绍,在新冠肺炎患者救治中,老百姓对中医药有种迫切需求,吃了中医药后紧张心理得到一定程度缓解。生病时,你一般吃中药还是西药?    0
1    又是上班的一天今天依然很爱很爱我的宝贝?    1
2    //@淡梦就是爱吐槽:还有患渐冻症不离一线的张定宇,高龄赶赴现场的钟南山、李兰娟,抗击疫病的医生,

Load the Model

# Load the pretrained module
import paddlehub as hub
module = hub.Module(name="chinese-roberta-wwm-ext")
[2020-08-30 17:40:14,124] [    INFO] - Installing chinese-roberta-wwm-ext module
Downloading chinese-roberta-wwm-ext
[==================================================] 100.00%
Uncompress /home/aistudio/.paddlehub/tmp/tmpdrngwfdm/chinese-roberta-wwm-ext
[==================================================] 100.00%
[2020-08-30 17:52:25,274] [    INFO] - Successfully installed chinese-roberta-wwm-ext-1.0.0

Build the Reader

Build a text-classification reader. The reader preprocesses the dataset: it tokenizes each text, then packs it into the format the model expects for training.

# Build the reader
reader = hub.reader.ClassifyReader(
    dataset=dataset,
    vocab_path=module.get_vocab_path(),
    sp_model_path=module.get_spm_path(),
    word_dict_path=module.get_word_dict_path(),
    max_seq_len=128)
[2020-08-30 17:52:25,291] [    INFO] - Dataset label map = {'-1': 0, '0': 1, '1': 2}

Finetune Strategy

# Finetune strategy
strategy = hub.AdamWeightDecayStrategy(
    weight_decay=0.01,
    warmup_proportion=0.1,
    learning_rate=5e-5)
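The training log later describes this strategy as "warmup, linear decay, slanted triangle learning rate". A small sketch of how such a schedule behaves (an illustration of the shape only, not PaddleHub's exact implementation):

# Slanted-triangle schedule: lr rises linearly over the first
# warmup_proportion of steps, then decays linearly to zero
def lr_at(step, total_steps, base_lr=5e-5, warmup_proportion=0.1):
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

print(lr_at(100, 2497), lr_at(2000, 2497))  # mid-warmup vs. late decay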

Run Configuration

Set the training parameters: number of epochs, batch size, checkpoint directory, and so on.

Here we train for num_epoch = 1, save checkpoints to checkpoint_dir="model_bert", and evaluate on the validation set every 100 steps (eval_interval), keeping the best-scoring model.

# Run configuration
config = hub.RunConfig(
    use_cuda=True,
    num_epoch=1,
    checkpoint_dir="model_bert",
    batch_size=32,
    eval_interval=100,
    strategy=strategy)
[2020-08-30 17:52:25,337] [    INFO] - Checkpoint dir: model_bert

Build the Finetune Task

For text classification, we need the model's pooled output, with a fully connected layer attached on top to perform the classification.

So we first obtain the module's context, including its input and output variables, take the pooled output from it as the text feature, and then attach a fully connected layer to generate the Task.

The competition is evaluated by F1, so we set metrics_choices=["f1"].
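For reference, a multi-class F1 can also be computed offline with scikit-learn; whether PaddleHub's "f1" uses the same averaging is an assumption here. A toy sketch:

# Hypothetical offline check of macro F1 (y_true / y_pred are
# illustrative '-1'/'0'/'1' strings, not real model output)
from sklearn.metrics import f1_score
y_true = ['-1', '0', '1', '0', '1']
y_pred = ['-1', '0', '0', '0', '1']
print(f1_score(y_true, y_pred, average='macro'))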

# Finetune Task
inputs, outputs, program = module.context(
    trainable=True, max_seq_len=128)

# Use "pooled_output" for classification tasks on an entire sentence.
pooled_output = outputs["pooled_output"]

feed_list = [
    inputs["input_ids"].name,
    inputs["position_ids"].name,
    inputs["segment_ids"].name,
    inputs["input_mask"].name,
]

cls_task = hub.TextClassifierTask(
        data_reader=reader,
        feature=pooled_output,
        feed_list=feed_list,
        num_classes=dataset.num_labels,
        config=config,
        metrics_choices=["f1"])
[2020-08-30 17:52:29,000] [    INFO] - Load pretraining parameters from /home/aistudio/.paddlehub/modul

Start Finetuning

Calling the finetune_and_eval interface starts training; during finetuning, the model is evaluated periodically.

# finetune
run_states = cls_task.finetune_and_eval()
[2020-08-30 17:52:32,906] [    INFO] - Strategy with warmup, linear decay, slanted triangle learning rate, weight decay regularization, 
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py:795: UserWarning: The current program is empty.
  warnings.warn(error_info)
[2020-08-30 17:52:38,879] [    INFO] - Try loading checkpoint from model_bert/ckpt.meta
[2020-08-30 17:52:38,881] [    INFO] - PaddleHub model checkpoint not found, start from scratch...
[2020-08-30 17:52:38,957] [    INFO] - PaddleHub finetune start
[2020-08-30 17:52:42,695] [   TRAIN] - step 10 / 2497: loss=1.01098 f1=0.85000 [step/sec: 2.88]
[2020-08-30 17:52:45,246] [   TRAIN] - step 20 / 2497: loss=0.97925 f1=0.87943 [step/sec: 3.92]
[2020-08-30 17:52:47,831] [   TRAIN] - step 30 / 2497: loss=0.98047 f1=0.87037 [step/sec: 3.89]
[2020-08-30 17:52:50,411] [   TRAIN] - step 40 / 2497: loss=1.01527 f1=0.84422 [step/sec: 3.89]
[2020-08-30 17:52:53,007] [   TRAIN] - step 50 / 2497: loss=0.97841 f1=0.85195 [step/sec: 3.87]
[2020-08-30 17:52:55,597] [   TRAIN] - step 60 / 2497: loss=0.84828 f1=0.90353 [step/sec: 3.87]
[2020-08-30 17:52:58,183] [   TRAIN] - step 70 / 2497: loss=0.81307 f1=0.88778 [step/sec: 3.88]
[2020-08-30 17:53:00,820] [   TRAIN] - step 80 / 2497: loss=0.73095 f1=0.89620 [step/sec: 3.82]
[2020-08-30 17:53:03,420] [   TRAIN] - step 90 / 2497: loss=0.75200 f1=0.84367 [step/sec: 3.86]
[2020-08-30 17:53:06,008] [   TRAIN] - step 100 / 2497: loss=0.78772 f1=0.86647 [step/sec: 3.88]
[2020-08-30 17:53:06,009] [    INFO] - Evaluation on dev dataset start
share_vars_from is set, scope is ignored.
[2020-08-30 17:53:55,638] [    EVAL] - [dev dataset evaluation result] loss=0.69270 f1=0.87430 [step/sec: 12.82]
[2020-08-30 17:53:55,639] [    EVAL] - best model saved to model_bert/best_model [best f1=0.87430]
[2020-08-30 17:54:00,583] [   TRAIN] - step 110 / 2497: loss=0.71252 f1=0.86544 [step/sec: 3.87]
[2020-08-30 17:54:03,182] [   TRAIN] - step 120 / 2497: loss=0.71817 f1=0.86262 [step/sec: 3.85]
[2020-08-30 17:54:05,778] [   TRAIN] - step 130 / 2497: loss=0.67883 f1=0.88298 [step/sec: 3.86]
[2020-08-30 17:54:08,372] [   TRAIN] - step 140 / 2497: loss=0.71395 f1=0.87613 [step/sec: 3.86]
[2020-08-30 17:54:10,967] [   TRAIN] - step 150 / 2497: loss=0.63634 f1=0.90230 [step/sec: 3.86]
[2020-08-30 17:54:13,561] [   TRAIN] - step 160 / 2497: loss=0.69675 f1=0.89634 [step/sec: 3.86]
[2020-08-30 17:54:16,157] [   TRAIN] - step 170 / 2497: loss=0.65184 f1=0.89164 [step/sec: 3.86]
[2020-08-30 17:54:18,755] [   TRAIN] - step 180 / 2497: loss=0.65731 f1=0.89655 [step/sec: 3.85]
[2020-08-30 17:54:21,347] [   TRAIN] - step 190 / 2497: loss=0.57865 f1=0.89736 [step/sec: 3.86]
[2020-08-30 17:54:23,948] [   TRAIN] - step 200 / 2497: loss=0.62045 f1=0.87647 [step/sec: 3.85]
[2020-08-30 17:54:23,949] [    INFO] - Evaluation on dev dataset start
[2020-08-30 17:55:11,823] [    EVAL] - [dev dataset evaluation result] loss=0.62273 f1=0.88890 [step/sec: 13.06]
[2020-08-30 17:55:11,824] [    EVAL] - best model saved to model_bert/best_model [best f1=0.88890]
[2020-08-30 17:55:17,542] [   TRAIN] - step 210 / 2497: loss=0.67096 f1=0.89655 [step/sec: 3.82]
[2020-08-30 17:55:20,158] [   TRAIN] - step 220 / 2497: loss=0.64818 f1=0.87460 [step/sec: 3.84]
[2020-08-30 17:55:22,767] [   TRAIN] - step 230 / 2497: loss=0.62531 f1=0.87834 [step/sec: 3.84]
[2020-08-30 17:55:25,377] [   TRAIN] - step 240 / 2497: loss=0.63616 f1=0.85993 [step/sec: 3.84]
[2020-08-30 17:55:27,994] [   TRAIN] - step 250 / 2497: loss=0.72842 f1=0.88949 [step/sec: 3.84]
[2020-08-30 17:55:30,605] [   TRAIN] - step 260 / 2497: loss=0.62879 f1=0.88462 [step/sec: 3.84]
[2020-08-30 17:55:33,227] [   TRAIN] - step 270 / 2497: loss=0.58715 f1=0.88136 [step/sec: 3.84]
[2020-08-30 17:55:35,815] [   TRAIN] - step 280 / 2497: loss=0.69536 f1=0.89914 [step/sec: 3.87]
[2020-08-30 17:55:38,425] [   TRAIN] - step 290 / 2497: loss=0.64157 f1=0.85897 [step/sec: 3.83]
[2020-08-30 17:55:41,027] [   TRAIN] - step 300 / 2497: loss=0.58968 f1=0.87135 [step/sec: 3.85]
[2020-08-30 17:55:41,029] [    INFO] - Evaluation on dev dataset start
[2020-08-30 17:56:29,689] [    EVAL] - [dev dataset evaluation result] loss=0.65638 f1=0.88808 [step/sec: 12.85]
[2020-08-30 17:56:32,335] [   TRAIN] - step 310 / 2497: loss=0.62983 f1=0.89944 [step/sec: 3.81]
[2020-08-30 17:56:34,964] [   TRAIN] - step 320 / 2497: loss=0.66090 f1=0.85512 [step/sec: 3.83]
[2020-08-30 17:56:37,585] [   TRAIN] - step 330 / 2497: loss=0.58643 f1=0.92183 [step/sec: 3.83]
[2020-08-30 17:56:40,207] [   TRAIN] - step 340 / 2497: loss=0.63230 f1=0.86726 [step/sec: 3.83]
[2020-08-30 17:56:42,832] [   TRAIN] - step 350 / 2497: loss=0.57614 f1=0.91560 [step/sec: 3.82]
[2020-08-30 17:56:45,450] [   TRAIN] - step 360 / 2497: loss=0.60857 f1=0.88827 [step/sec: 3.83]
[2020-08-30 17:56:48,085] [   TRAIN] - step 370 / 2497: loss=0.68148 f1=0.86053 [step/sec: 3.82]
[2020-08-30 17:56:50,723] [   TRAIN] - step 380 / 2497: loss=0.61660 f1=0.88050 [step/sec: 3.80]
[2020-08-30 17:56:53,342] [   TRAIN] - step 390 / 2497: loss=0.70684 f1=0.87390 [step/sec: 3.83]
[2020-08-30 17:56:55,983] [   TRAIN] - step 400 / 2497: loss=0.56253 f1=0.90237 [step/sec: 3.80]
[2020-08-30 17:56:55,985] [    INFO] - Evaluation on dev dataset start
[2020-08-30 17:57:44,236] [    EVAL] - [dev dataset evaluation result] loss=0.61991 f1=0.89283 [step/sec: 12.96]
[2020-08-30 17:57:44,237] [    EVAL] - best model saved to model_bert/best_model [best f1=0.89283]
[2020-08-30 17:57:49,938] [   TRAIN] - step 410 / 2497: loss=0.61865 f1=0.91613 [step/sec: 3.79]
[2020-08-30 17:57:52,559] [   TRAIN] - step 420 / 2497: loss=0.65938 f1=0.87536 [step/sec: 3.83]
[2020-08-30 17:57:55,190] [   TRAIN] - step 430 / 2497: loss=0.66385 f1=0.91077 [step/sec: 3.83]
[2020-08-30 17:57:57,813] [   TRAIN] - step 440 / 2497: loss=0.56204 f1=0.92079 [step/sec: 3.83]
[2020-08-30 17:58:00,443] [   TRAIN] - step 450 / 2497: loss=0.60971 f1=0.90265 [step/sec: 3.81]
[2020-08-30 17:58:03,069] [   TRAIN] - step 460 / 2497: loss=0.61614 f1=0.87385 [step/sec: 3.83]
[2020-08-30 17:58:05,690] [   TRAIN] - step 470 / 2497: loss=0.58309 f1=0.89286 [step/sec: 3.83]
[2020-08-30 17:58:08,314] [   TRAIN] - step 480 / 2497: loss=0.67171 f1=0.87385 [step/sec: 3.82]
[2020-08-30 17:58:10,931] [   TRAIN] - step 490 / 2497: loss=0.70945 f1=0.83735 [step/sec: 3.83]
[2020-08-30 17:58:13,558] [   TRAIN] - step 500 / 2497: loss=0.56513 f1=0.91971 [step/sec: 3.83]
[2020-08-30 17:58:13,559] [    INFO] - Evaluation on dev dataset start
[2020-08-30 17:59:02,593] [    EVAL] - [dev dataset evaluation result] loss=0.59540 f1=0.89030 [step/sec: 12.75]
[2020-08-30 17:59:05,216] [   TRAIN] - step 510 / 2497: loss=0.58815 f1=0.86782 [step/sec: 3.83]
[2020-08-30 17:59:07,839] [   TRAIN] - step 520 / 2497: loss=0.56237 f1=0.88254 [step/sec: 3.82]
[2020-08-30 17:59:10,465] [   TRAIN] - step 530 / 2497: loss=0.56160 f1=0.88485 [step/sec: 3.82]
[2020-08-30 17:59:13,087] [   TRAIN] - step 540 / 2497: loss=0.64510 f1=0.90196 [step/sec: 3.82]
[2020-08-30 17:59:15,716] [   TRAIN] - step 550 / 2497: loss=0.53364 f1=0.91831 [step/sec: 3.82]
[2020-08-30 17:59:18,370] [   TRAIN] - step 560 / 2497: loss=0.53434 f1=0.87948 [step/sec: 3.80]
[2020-08-30 17:59:20,996] [   TRAIN] - step 570 / 2497: loss=0.63068 f1=0.86624 [step/sec: 3.82]
[2020-08-30 17:59:23,630] [   TRAIN] - step 580 / 2497: loss=0.54734 f1=0.89908 [step/sec: 3.81]
[2020-08-30 17:59:26,263] [   TRAIN] - step 590 / 2497: loss=0.60503 f1=0.90032 [step/sec: 3.81]
[2020-08-30 17:59:28,896] [   TRAIN] - step 600 / 2497: loss=0.68594 f1=0.82779 [step/sec: 3.81]
[2020-08-30 17:59:28,897] [    INFO] - Evaluation on dev dataset start
[2020-08-30 18:00:17,374] [    EVAL] - [dev dataset evaluation result] loss=0.60199 f1=0.88333 [step/sec: 12.90]
[2020-08-30 18:00:20,015] [   TRAIN] - step 610 / 2497: loss=0.55796 f1=0.91071 [step/sec: 3.80]
[2020-08-30 18:00:22,661] [   TRAIN] - step 620 / 2497: loss=0.61698 f1=0.87948 [step/sec: 3.79]
[2020-08-30 18:00:25,299] [   TRAIN] - step 630 / 2497: loss=0.66055 f1=0.87879 [step/sec: 3.80]
[2020-08-30 18:00:27,950] [   TRAIN] - step 640 / 2497: loss=0.61488 f1=0.89865 [step/sec: 3.79]
[2020-08-30 18:00:30,603] [   TRAIN] - step 650 / 2497: loss=0.62346 f1=0.86740 [step/sec: 3.78]
[2020-08-30 18:00:33,252] [   TRAIN] - step 660 / 2497: loss=0.51533 f1=0.91922 [step/sec: 3.79]
[2020-08-30 18:00:35,897] [   TRAIN] - step 670 / 2497: loss=0.64104 f1=0.89444 [step/sec: 3.78]
[2020-08-30 18:00:38,553] [   TRAIN] - step 680 / 2497: loss=0.54168 f1=0.92222 [step/sec: 3.80]
[2020-08-30 18:00:41,180] [   TRAIN] - step 690 / 2497: loss=0.55047 f1=0.89614 [step/sec: 3.82]
[2020-08-30 18:00:43,814] [   TRAIN] - step 700 / 2497: loss=0.60918 f1=0.87293 [step/sec: 3.81]
[2020-08-30 18:00:43,816] [    INFO] - Evaluation on dev dataset start
[2020-08-30 18:01:32,832] [    EVAL] - [dev dataset evaluation result] loss=0.57981 f1=0.89660 [step/sec: 12.76]
[2020-08-30 18:01:32,838] [    EVAL] - best model saved to model_bert/best_model [best f1=0.89660]
[2020-08-30 18:01:38,531] [   TRAIN] - step 710 / 2497: loss=0.63056 f1=0.86957 [step/sec: 3.79]
[2020-08-30 18:01:41,160] [   TRAIN] - step 720 / 2497: loss=0.57240 f1=0.89021 [step/sec: 3.82]
[2020-08-30 18:01:43,785] [   TRAIN] - step 730 / 2497: loss=0.58741 f1=0.89544 [step/sec: 3.82]
[2020-08-30 18:01:46,422] [   TRAIN] - step 740 / 2497: loss=0.62016 f1=0.88089 [step/sec: 3.80]
[2020-08-30 18:01:49,049] [   TRAIN] - step 750 / 2497: loss=0.56928 f1=0.87791 [step/sec: 3.82]
[2020-08-30 18:01:51,673] [   TRAIN] - step 760 / 2497: loss=0.63097 f1=0.87838 [step/sec: 3.82]
[2020-08-30 18:01:54,294] [   TRAIN] - step 770 / 2497: loss=0.64109 f1=0.88506 [step/sec: 3.85]
[2020-08-30 18:01:56,924] [   TRAIN] - step 780 / 2497: loss=0.63887 f1=0.88219 [step/sec: 3.82]
[2020-08-30 18:01:59,551] [   TRAIN] - step 790 / 2497: loss=0.63866 f1=0.88764 [step/sec: 3.83]
[2020-08-30 18:02:02,175] [   TRAIN] - step 800 / 2497: loss=0.63746 f1=0.88166 [step/sec: 3.82]
[2020-08-30 18:02:02,177] [    INFO] - Evaluation on dev dataset start
[2020-08-30 18:02:52,189] [    EVAL] - [dev dataset evaluation result] loss=0.57340 f1=0.89606 [step/sec: 12.51]
[2020-08-30 18:02:54,873] [   TRAIN] - step 810 / 2497: loss=0.53743 f1=0.90544 [step/sec: 3.75]
[2020-08-30 18:02:57,530] [   TRAIN] - step 820 / 2497: loss=0.67242 f1=0.87037 [step/sec: 3.78]
[2020-08-30 18:03:00,225] [   TRAIN] - step 830 / 2497: loss=0.60064 f1=0.87540 [step/sec: 3.73]
[2020-08-30 18:03:02,910] [   TRAIN] - step 840 / 2497: loss=0.65536 f1=0.90244 [step/sec: 3.74]
[2020-08-30 18:03:05,574] [   TRAIN] - step 850 / 2497: loss=0.65970 f1=0.88184 [step/sec: 3.77]
[2020-08-30 18:03:08,221] [   TRAIN] - step 860 / 2497: loss=0.51493 f1=0.92958 [step/sec: 3.79]
[2020-08-30 18:03:10,889] [   TRAIN] - step 870 / 2497: loss=0.60079 f1=0.88587 [step/sec: 3.78]
[2020-08-30 18:03:13,551] [   TRAIN] - step 880 / 2497: loss=0.61113 f1=0.88089 [step/sec: 3.78]
[2020-08-30 18:03:16,196] [   TRAIN] - step 890 / 2497: loss=0.55213 f1=0.90426 [step/sec: 3.79]
[2020-08-30 18:03:18,831] [   TRAIN] - step 900 / 2497: loss=0.61766 f1=0.89256 [step/sec: 3.81]
[2020-08-30 18:03:18,833] [    INFO] - Evaluation on dev dataset start
[2020-08-30 18:04:08,740] [    EVAL] - [dev dataset evaluation result] loss=0.56590 f1=0.89698 [step/sec: 12.53]
[2020-08-30 18:04:08,741] [    EVAL] - best model saved to model_bert/best_model [best f1=0.89698]
[2020-08-30 18:04:14,517] [   TRAIN] - step 910 / 2497: loss=0.51508 f1=0.91343 [step/sec: 3.79]
[2020-08-30 18:04:17,177] [   TRAIN] - step 920 / 2497: loss=0.59504 f1=0.89973 [step/sec: 3.77]
[2020-08-30 18:04:19,842] [   TRAIN] - step 930 / 2497: loss=0.61012 f1=0.89937 [step/sec: 3.77]
[2020-08-30 18:04:22,505] [   TRAIN] - step 940 / 2497: loss=0.59346 f1=0.87697 [step/sec: 3.77]
[2020-08-30 18:04:25,163] [   TRAIN] - step 950 / 2497: loss=0.62892 f1=0.90789 [step/sec: 3.77]
[2020-08-30 18:04:27,767] [   TRAIN] - step 960 / 2497: loss=0.66066 f1=0.90341 [step/sec: 3.85]
[2020-08-30 18:04:30,411] [   TRAIN] - step 970 / 2497: loss=0.58365 f1=0.89595 [step/sec: 3.79]
[2020-08-30 18:04:33,056] [   TRAIN] - step 980 / 2497: loss=0.64528 f1=0.86957 [step/sec: 3.81]
[2020-08-30 18:04:35,682] [   TRAIN] - step 990 / 2497: loss=0.58250 f1=0.90860 [step/sec: 3.82]
[2020-08-30 18:04:38,320] [   TRAIN] - step 1000 / 2497: loss=0.61420 f1=0.87147 [step/sec: 3.80]
[2020-08-30 18:04:38,324] [    INFO] - Evaluation on dev dataset start
[2020-08-30 18:05:27,889] [    EVAL] - [dev dataset evaluation result] loss=0.57238 f1=0.88845 [step/sec: 12.62]
[2020-08-30 18:05:30,513] [   TRAIN] - step 1010 / 2497: loss=0.51664 f1=0.90588 [step/sec: 3.82]
[2020-08-30 18:05:33,149] [   TRAIN] - step 1020 / 2497: loss=0.61562 f1=0.87943 [step/sec: 3.80]
[2020-08-30 18:05:35,772] [   TRAIN] - step 1030 / 2497: loss=0.59241 f1=0.87240 [step/sec: 3.83]
[2020-08-30 18:05:38,417] [   TRAIN] - step 1040 / 2497: loss=0.57977 f1=0.85714 [step/sec: 3.79]
[2020-08-30 18:05:41,051] [   TRAIN] - step 1050 / 2497: loss=0.53221 f1=0.90566 [step/sec: 3.81]
[2020-08-30 18:05:43,684] [   TRAIN] - step 1060 / 2497: loss=0.55388 f1=0.90208 [step/sec: 3.81]
[2020-08-30 18:05:46,311] [   TRAIN] - step 1070 / 2497: loss=0.56919 f1=0.88372 [step/sec: 3.82]
[2020-08-30 18:05:48,949] [   TRAIN] - step 1080 / 2497: loss=0.55938 f1=0.88068 [step/sec: 3.81]
[2020-08-30 18:05:51,576] [   TRAIN] - step 1090 / 2497: loss=0.58166 f1=0.90306 [step/sec: 3.82]
[2020-08-30 18:05:54,205] [   TRAIN] - step 1100 / 2497: loss=0.54135 f1=0.91553 [step/sec: 3.81]
[2020-08-30 18:05:54,207] [    INFO] - Evaluation on dev dataset start
[2020-08-30 18:06:43,573] [    EVAL] - [dev dataset evaluation result] loss=0.57407 f1=0.89583 [step/sec: 12.67]
[2020-08-30 18:06:46,207] [   TRAIN] - step 1110 / 2497: loss=0.70064 f1=0.88050 [step/sec: 3.82]
[2020-08-30 18:06:48,840] [   TRAIN] - step 1120 / 2497: loss=0.60383 f1=0.87097 [step/sec: 3.81]
[2020-08-30 18:06:51,484] [   TRAIN] - step 1130 / 2497: loss=0.57912 f1=0.86310 [step/sec: 3.80]
[2020-08-30 18:06:54,124] [   TRAIN] - step 1140 / 2497: loss=0.53215 f1=0.90341 [step/sec: 3.81]
[2020-08-30 18:06:56,759] [   TRAIN] - step 1150 / 2497: loss=0.56618 f1=0.88636 [step/sec: 3.81]
[2020-08-30 18:06:59,397] [   TRAIN] - step 1160 / 2497: loss=0.54743 f1=0.88515 [step/sec: 3.81]
[2020-08-30 18:07:02,035] [   TRAIN] - step 1170 / 2497: loss=0.55637 f1=0.90361 [step/sec: 3.80]
[2020-08-30 18:07:04,671] [   TRAIN] - step 1180 / 2497: loss=0.57800 f1=0.90704 [step/sec: 3.81]
[2020-08-30 18:07:07,294] [   TRAIN] - step 1190 / 2497: loss=0.60205 f1=0.88474 [step/sec: 3.82]
[2020-08-30 18:07:09,922] [   TRAIN] - step 1200 / 2497: loss=0.53961 f1=0.90303 [step/sec: 3.81]
[2020-08-30 18:07:09,924] [    INFO] - Evaluation on dev dataset start

Prediction

After finetuning completes, calling the predict interface runs inference.

The input must be a two-dimensional list:

[['text 1'], ['text 2'], [...], ...]

# Predict
import numpy as np

inv_label_map = {val: key for key, val in reader.label_map.items()}

# Data to be predicted
data = test[['微博中文内容']].fillna(' ').values.tolist()

run_states = cls_task.predict(data=data)
results = [run_state.run_results for run_state in run_states]

Generate the Submission

# Generate the prediction results
proba = np.vstack([r[0] for r in results])
prediction = list(np.argmax(proba, axis=1))
prediction = [inv_label_map[p] for p in prediction]

submission = pd.DataFrame()
submission['id'] = test['微博id'].values
# A trailing space keeps the 16-digit ids from being mangled as numbers
submission['id'] = submission['id'].astype(str) + ' '
submission['y'] = prediction
np.save('proba.npy', proba)
submission.to_csv('result_bert.csv', index=False)
submission.head()
submission['text'] = test[['微博中文内容']].fillna(' ').values
# The predicted labels are the strings '-1'/'0'/'1', so map string keys
submission['label'] = submission['y'].map({'-1': 'negative', '0': 'neutral', '1': 'positive'})
display(submission[['text', 'label']][176:181])

As shown, the model successfully identifies netizen sentiment. The final online submission scored F1 = 0.7128.

Summary

Over this period of summer practice, I learned the basics of Python and how to run basic projects on PaddlePaddle. I ran into some difficulties along the way, but gradually resolved them by looking things up online. There is still plenty of room for improvement, so I will keep working and keep practicing.
