
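Introduction to Data Analysis (Academic Frontier Trend Analysis), Task 1: Paper Statistics

Author: Internet

This competition is a beginner-level introduction to data analysis (academic frontier trend analysis). It uses the public arXiv paper metadata to practice common analysis operations. The tasks cover counting papers, author frequencies, and papers that release source code, as well as classifying papers and modeling the relationships between authors.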

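Contents

Ⅰ. Data and Background

Ⅱ. Dataset Overview

Ⅲ. Implementation

Ⅰ. Environment Setup

Ⅱ. Data Preprocessing

Ⅲ. Data Analysis and Visualization

Ⅳ. References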


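Ⅰ. Data and Background

  1. Topic: counting papers
  2. Content: understand the task, then learn to read the data with Pandas and compute statistics
  3. Dataset: arXiv is a major open-access repository of scholarly papers and an important tool for searching, browsing, and downloading them. It covers a very wide range of fields: the many branches of physics and the subdisciplines of computer science, as well as mathematics, statistics, electrical engineering, quantitative biology, and economics.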

Ⅱ. Dataset Overview

Dataset link

Detailed dataset description link

Part of the dataset description is reproduced below:

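Field | Description
id | arXiv ID, which can be used to access the paper
submitter | the person who submitted the paper
authors | the authors of the paper
title | the paper title
comments | additional information such as page count and number of figures
journal-ref | information about the journal in which the paper was published
doi | Digital Object Identifier
report-no | report number
categories | the categories (tags) the paper belongs to in the arXiv system
license | the license of the article
abstract | the paper abstract
versions | the version history of the paper
authors_parsed | parsed author information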


Category code | Category name
astro-ph | Astrophysics
astro-ph.CO | Cosmology and Nongalactic Astrophysics
astro-ph.EP | Earth and Planetary Astrophysics
astro-ph.GA | Astrophysics of Galaxies
cs.AI | Artificial Intelligence
cs.AR | Hardware Architecture
cs.CC | Computational Complexity
cs.CE | Computational Engineering, Finance, and Science
cs.CV | Computer Vision and Pattern Recognition
cs.CY | Computers and Society
cs.DB | Databases
cs.DC | Distributed, Parallel, and Cluster Computing
cs.DL | Digital Libraries
cs.NA | Numerical Analysis
cs.NE | Neural and Evolutionary Computing
cs.NI | Networking and Internet Architecture
cs.OH | Other Computer Science
cs.OS | Operating Systems

Ⅲ. Implementation

Ⅰ. Environment Setup

The required modules are: seaborn, BeautifulSoup (bs4), requests, json, re, pandas, and matplotlib.

Ⅱ. Data Preprocessing

Read the data:

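import seaborn as sns # plotting
from bs4 import BeautifulSoup # parsing the HTML scraped from arXiv
import re # regular expressions, for matching string patterns
import requests # HTTP requests, for fetching a page by URL
import json # parsing the data, which is stored as JSON (one record per line)
import pandas as pd # data processing and analysis
import matplotlib.pyplot as plt # plotting
# Load the data
data  = [] # start with an empty list
# A with block closes the file handle automatically, even if reading raises an exception
with open(r"D:/xyc/competPractice/dataAnalysis2101/archive/arxiv-metadata-oai-snapshot.json", 'r', encoding='utf-8') as f: 
    for line in f: 
        data.append(json.loads(line))
        
data = pd.DataFrame(data) # convert the list to a DataFrame so it can be analyzed with pandas
print(data.shape) # show the size of the data
data.head() # show the first five rows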

Result:

(1796911, 14)
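
Loading all 1.8 million records with every field takes a lot of memory. The sketch below is one possible way to keep only the fields the analysis needs while reading; it is not part of the original code. The helper name `iter_records` and the two sample records are made up for illustration, and an in-memory `io.StringIO` stands in for the real snapshot file.

```python
import io
import json


def iter_records(lines, fields, limit=None):
    """Yield one dict per JSON line, keeping only the requested fields.

    `lines` is any iterable of JSON strings (for example an open file).
    `limit` optionally stops after that many lines, which is handy for testing.
    """
    for n, line in enumerate(lines):
        if limit is not None and n >= limit:
            break
        record = json.loads(line)
        yield {key: record.get(key) for key in fields}


# Tiny in-memory stand-in for the snapshot file (made-up records).
sample = io.StringIO(
    '{"id": "0001", "categories": "cs.AI", "update_date": "2019-02-20", "abstract": "..."}\n'
    '{"id": "0002", "categories": "math.ST stat.TH", "update_date": "2020-05-01", "abstract": "..."}\n'
)
records = list(iter_records(sample, fields=["id", "categories", "update_date"]))
print(records[0])
```

With the real file, the same helper could be used as `pd.DataFrame(list(iter_records(f, fields)))` inside the `with open(...)` block.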

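First, get a rough summary of the paper categories:

'''
count: number of non-null elements in the column;
unique: number of distinct values in the column;
top: the most frequent value in the column;
freq: how many times the most frequent value occurs;
'''

data["categories"].describe()

Result: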

count      1796911
unique       62055
top       astro-ph
freq         86914
Name: categories, dtype: object

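These results show that the dataset contains 1,796,911 papers and 62,055 distinct values in the categories column. Each value is the full category string of a paper, so a combination such as "solv-int nlin.SI" counts as a value of its own. The most frequent value is astro-ph (Astrophysics), which appears 86,914 times.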
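
To see what the four numbers mean, here is a small sketch on a made-up Series (the four values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up category strings; note that "cs.CV cs.LG" is one value, not two.
toy = pd.Series(["astro-ph", "cs.CV", "astro-ph", "cs.CV cs.LG"], name="categories")

summary = toy.describe()
print(summary)
```

Here `count` is 4, `unique` is 3 because the combined string counts separately from plain "cs.CV", `top` is "astro-ph", and `freq` is 2.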

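Because many papers belong to more than one category, the summary above is far from enough. Next, count how many distinct individual categories appear in the dataset:

# All distinct individual categories

unique_categories = {c for cats in data["categories"] for c in cats.split(' ')}
print(len(unique_categories))
print(unique_categories)

Detailed list of the distinct categories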

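The result shows 176 distinct categories, more than the number listed on the official site, which means some categories are not included in the official list. The computer science categories, however, are unchanged: there are still the 40 listed officially.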
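
The set above only says which categories exist. If you also want to know how many papers carry each individual category, one option is `collections.Counter`. This is a minimal sketch on made-up category strings, not part of the original analysis:

```python
from collections import Counter

# Made-up examples in the same format as data["categories"].
category_strings = ["astro-ph", "cs.CV cs.LG", "cs.CV", "math.ST stat.TH"]

# Split every string on whitespace and count each individual code.
per_category = Counter(code for s in category_strings for code in s.split())

print(len(per_category))            # number of distinct individual categories
print(per_category.most_common(3))  # the most frequent codes
```

With the real data, `category_strings` would simply be `data["categories"]`.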

This task is about frontier research trends, so the analysis uses only the papers from the last two years:

# Data from 2019 onward
data["year"] = pd.to_datetime(data["update_date"]).dt.year # parse update_date strings such as "2019-02-20" into datetimes and extract the year
del data["update_date"] # drop the update_date column, which is no longer needed
data = data[data["year"] >= 2019].reset_index(drop=True) # keep the rows from 2019 onward and renumber them

# Show the result
print("Data from 2019 onward:")
data

Data from 2019 onward:

 | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | authors_parsed | year
0 | 0704.0297 | Sung-Chul Yoon | Sung-Chul Yoon, Philipp Podsiadlowski and Step... | Remnant evolution after a carbon-oxygen white ... | 15 pages, 15 figures, 3 tables, submitted to M... | None | 10.1111/j.1365-2966.2007.12161.x | None | astro-ph | None | We systematically explore the evolution of t... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... | 2019
1 | 0704.0342 | Patrice Ntumba Pungu | B. Dugmore and PP. Ntumba | Cofibrations in the Category of Frolicher Spac... | 27 pages | None | None | None | math.AT | None | Cofibrations are defined in the category of ... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Dugmore, B., ], [Ntumba, PP., ]] | 2019
2 | 0704.0360 | Zaqarashvili | T.V. Zaqarashvili and K Murawski | Torsional oscillations of longitudinally inhom... | 6 pages, 3 figures, accepted in A&A | None | 10.1051/0004-6361:20077246 | None | astro-ph | None | We explore the effect of an inhomogeneous ma... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Zaqarashvili, T. V., ], [Murawski, K, ]] | 2019
3 | 0704.0525 | Sezgin Ayg\"un | Sezgin Aygun, Ismail Tarhan, Husnu Baysal | On the Energy-Momentum Problem in Static Einst... | This submission has been withdrawn by arXiv ad... | Chin.Phys.Lett.24:355-358,2007 | 10.1088/0256-307X/24/2/015 | None | gr-qc | None | This paper has been removed by arXiv adminis... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa... | 2019
4 | 0704.0535 | Antonio Pipino | Antonio Pipino (1,3), Thomas H. Puzia (2,4), a... | The Formation of Globular Cluster Systems in M... | 32 pages (referee format), 9 figures, ApJ acce... | Astrophys.J.665:295-305,2007 | 10.1086/519546 | None | astro-ph | None | The most massive elliptical galaxies show a ... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M... | 2019
...
395118 | quant-ph/9911051 | Stephen A. Fulling | S. A. Fulling | Large Numbers, the Chinese Remainder Theorem, ... | 9 pages; Plain TeX with vanilla.sty and pictex... | Phys. Rev. Applied 13, 024016 (2020) | 10.1103/PhysRevApplied.13.024016 | None | quant-ph | None | This is a pedagogical article cited in the f... | [{'version': 'v1', 'created': 'Thu, 11 Nov 199... | [[Fulling, S. A., ]] | 2020
395119 | solv-int/9511005 | Wen-Xiu Ma | Wen-Xiu Ma, Benno Fuchssteiner | Explicit and Exact Solutions to a Kolmogorov-P... | 14pages, Latex, to appear in Intern. J. Nonlin... | None | 10.1016/0020-7462(95)00064-X | None | solv-int nlin.SI | None | Some explicit traveling wave solutions to a ... | [{'version': 'v1', 'created': 'Tue, 14 Nov 199... | [[Ma, Wen-Xiu, ], [Fuchssteiner, Benno, ]] | 2019
395120 | solv-int/9809008 | Victor Enolskii | J C Eilbeck, V Z Enol'skii, V B Kuznetsov, D V... | Linear r-Matrix Algebra for a Hierarchy of One... | plain LaTeX, 28 pages | None | None | None | solv-int nlin.SI | None | We consider a hierarchy of many-particle sys... | [{'version': 'v1', 'created': 'Wed, 2 Sep 1998... | [[Eilbeck, J C, ], [Enol'skii, V Z, ], [Kuznet... | 2019
395121 | solv-int/9909010 | Pierre van Moerbeke | M. Adler, T. Shiota and P. van Moerbeke | Pfaff tau-functions | 42 pages | None | None | None | solv-int adap-org hep-th nlin.AO nlin.SI | None | Consider the evolution $$ \frac{\pl m_\iy}{\... | [{'version': 'v1', 'created': 'Wed, 15 Sep 199... | [[Adler, M., ], [Shiota, T., ], [van Moerbeke,... | 2019
395122 | solv-int/9909014 | David Fairlie | D.B. Fairlie and A.N. Leznov | The General Solution of the Complex Monge-Amp\... | 13 pages, latex, no figures | None | 10.1088/0305-4470/33/25/307 | None | solv-int nlin.SI | None | A general solution to the Complex Monge-Amp\... | [{'version': 'v1', 'created': 'Thu, 16 Sep 199... | [[Fairlie, D. B., ], [Leznov, A. N., ]] | 2019

395123 rows × 14 columns

# Data from 2020 onward
data20 = data[data["year"] >= 2020].reset_index(drop=True) # keep the rows from 2020 onward and renumber them
print("Data from 2020 onward:")
data20

Data from 2020 onward:

 | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | authors_parsed | year
0 | 0704.0752 | Davoud Kamani | Davoud Kamani | Actions for the Bosonic String with the Curved... | 8 pages, Latex, no figure, Some minor changes ... | Braz. J. Phys. 38, 268-271 (2008) | 10.1590/S0103-97332008000200010 | None | hep-th | None | At first we introduce an action for the stri... | [{'version': 'v1', 'created': 'Thu, 5 Apr 2007... | [[Kamani, Davoud, ]] | 2020
1 | 0704.0880 | Qiuping A. Wang | Q. A. Wang (ISMANS), F. Tsobnang (ISMANS), S. ... | Stochastic action principle and maximum entropy | This work is a further development of the idea... | Chaos, Solitons and Fractals, 40(2009)2550-2556 | None | None | cond-mat.stat-mech | None | A stochastic action principle for stochastic... | [{'version': 'v1', 'created': 'Fri, 6 Apr 2007... | [[Wang, Q. A., , ISMANS], [Tsobnang, F., , ISM... | 2020
2 | 0704.1403 | Alberto S. Cattaneo | Alberto S. Cattaneo, Florian Schaetz | Equivalences of Higher Derived Brackets | 16 pages; minor changes; corrected typos; to a... | J. Pure Appl. Algebra, 212, 2450-2460 (2008) | 10.1016/j.jpaa.2008.03.013 | None | math.QA math.DG math.SG | None | This note elaborates on Th. Voronov's constr... | [{'version': 'v1', 'created': 'Wed, 11 Apr 200... | [[Cattaneo, Alberto S., ], [Schaetz, Florian, ]] | 2020
3 | 0704.2498 | Daniel H. Lenz | Daniel Lenz, Nicolae Strungaru | Pure Point spectrum for measure dynamical syst... | 22 pages | Journal de Math\'ematiques Pures et Appliqu\'e... | 10.1016/j.matpur.2009.05.013 | None | math-ph math.MP | http://arxiv.org/licenses/nonexclusive-distrib... | We show equivalence of pure point diffractio... | [{'version': 'v1', 'created': 'Thu, 19 Apr 200... | [[Lenz, Daniel, ], [Strungaru, Nicolae, ]] | 2020
4 | 0704.2967 | Serhiy Samokhvalov E. | Serhiy E. Samokhvalov | Group-theoretic Description of Riemannian Spaces | 14 pages | Ukrainian Math. J., v.55 (2003), 1238-1248 | 10.1023/B:UKMA.0000018010.14309.76 | None | math.DG math.GR | None | It is shown that a locally geometrical struc... | [{'version': 'v1', 'created': 'Mon, 23 Apr 200... | [[Samokhvalov, Serhiy E., ]] | 2020
...
224500 | quant-ph/9802022 | Masanao Ozawa | Masanao Ozawa (Nagoya University) | On the Concept of Quantum State Reduction: Inc... | 15 pages, LaTeX, 1 Postscript figure | Annals of the Japan Association for Philosophy... | 10.4288/jafpos1956.11.107 | None | quant-ph | None | The argument is re-examined that the program... | [{'version': 'v1', 'created': 'Mon, 9 Feb 1998... | [[Ozawa, Masanao, , Nagoya University]] | 2020
224501 | quant-ph/9806088 | Jens Eisert | J. Eisert, M. Wilkens, and M. Lewenstein | Quantum Games and Quantum Strategies | 4 pages, 4 figures, typographic sign error in ... | Phys. Rev. Lett. 83, 3077 (1999) | 10.1103/PhysRevLett.83.3077 | None | quant-ph | http://arxiv.org/licenses/nonexclusive-distrib... | We investigate the quantization of non-zero ... | [{'version': 'v1', 'created': 'Fri, 26 Jun 199... | [[Eisert, J., ], [Wilkens, M., ], [Lewenstein,... | 2020
224502 | quant-ph/9807034 | Jens Eisert | J. Eisert (U. Potsdam, Germany), M. B. Plenio ... | A comparison of entanglement measures | 6 pages (RevTeX), 4 figures | J. Mod. Opt. 46, 145 (1999) | 10.1080/09500349908231260 | J. Mod. Opt. 46, 145-154 (1999) | quant-ph | None | We compare the entanglement of formation wit... | [{'version': 'v1', 'created': 'Mon, 13 Jul 199... | [[Eisert, J., , U. Potsdam, Germany], [Plenio,... | 2020
224503 | quant-ph/9910035 | Pavel Exner | P.Duclos, P.Exner, and D. Krejcirik | Locally curved quantum layers | LaTeX2e, 15 pages, to appear in the Ukrainian ... | Ukrainian J. Phys. 45 (2000), 595-601 | None | None | quant-ph cond-mat math-ph math.MP | None | We consider a quantum particle constrained t... | [{'version': 'v1', 'created': 'Fri, 8 Oct 1999... | [[Duclos, P., ], [Exner, P., ], [Krejcirik, D.... | 2020
224504 | quant-ph/9911051 | Stephen A. Fulling | S. A. Fulling | Large Numbers, the Chinese Remainder Theorem, ... | 9 pages; Plain TeX with vanilla.sty and pictex... | Phys. Rev. Applied 13, 024016 (2020) | 10.1103/PhysRevApplied.13.024016 | None | quant-ph | None | This is a pedagogical article cited in the f... | [{'version': 'v1', 'created': 'Thu, 11 Nov 199... | [[Fulling, S. A., ]] | 2020

224505 rows × 14 columns
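
Note that the year used above comes from update_date, which is why papers first created in 2007 or even the 1990s show up with year 2019 or 2020. If you would rather filter by the year a paper was first submitted, the `versions` field could be used instead. The sketch below is only an illustration: the helper name `first_submission_year` is made up, and it assumes the `created` strings follow the RFC 2822 style date format ("Tue, 3 Apr 2007 ... GMT"); the table above shows them truncated, so the time part in the example is invented.

```python
from email.utils import parsedate_to_datetime


def first_submission_year(versions):
    """Return the year of the earliest entry in an arXiv `versions` list."""
    created = [parsedate_to_datetime(v["created"]) for v in versions]
    return min(created).year


# Made-up example in the assumed format: first submitted in 2007, revised in 2019.
example_versions = [
    {"version": "v1", "created": "Tue, 3 Apr 2007 09:15:00 GMT"},
    {"version": "v2", "created": "Mon, 18 Feb 2019 11:00:00 GMT"},
]
print(first_submission_year(example_versions))
```

On the real DataFrame this could be applied with `data["versions"].apply(first_submission_year)`, provided the format assumption holds.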

We now have all the papers from 2019 onward and from 2020 onward. To pick out the computer science papers, we first scrape the full category list (source: arXiv Category Taxonomy):

# Scrape all the categories, used later to pick out the computer science papers
import random # standard-library random, used to pick a User-Agent
user_agent_list = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15"
    ]
headers = {
    'User-Agent':'Mozilla/5.0',
    'Content-Type':'application/json',
    'method':'GET',
    'Accept':'application/vnd.github.cloak-preview'
}
headers['User-Agent'] = random.choice(user_agent_list) # pose as a regular browser
# verify=False skips TLS certificate verification
website_url = requests.get('https://arxiv.org/category_taxonomy', headers = headers, verify = False).text # fetch the page HTML as text
soup = BeautifulSoup(website_url,'html.parser') # parse the HTML
root = soup.find('div',{'id':'category_taxonomy_list'}) # entry point: the div that holds the taxonomy
tags = root.find_all(["h2","h3","h4","p"], recursive=True) # collect the heading and paragraph tags in document order

# Initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []

# Walk through the tags: h2 = group, h3 = archive, h4 = category, p = category description
for t in tags:
    if t.name == "h2":
        level_1_name = t.text    
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text
        level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw) # regex: pattern (.*)\((.*)\); replacement \2 (the text inside the parentheses); input string: raw
        level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw) # replacement \1 (the text before the parentheses)
    elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)",r"\1",raw)
        level_3_name = re.sub(r"(.*) \((.*)\)",r"\2",raw)
    elif t.name == "p":
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)

# Build a DataFrame from the collected lists
df_taxonomy = pd.DataFrame({
    'group_name' : level_1_names,
    'archive_name' : level_2_names,
    'archive_id' : level_2_codes,
    'category_name' : level_3_names,
    'categories' : level_3_codes,
    'category_description': level_3_notes
    
})

df_taxonomy
 | group_name | archive_name | archive_id | category_name | categories | category_description
0 | Computer Science | Computer Science | Computer Science | Artificial Intelligence | cs.AI | Covers all areas of AI except Vision, Robotics...
1 | Computer Science | Computer Science | Computer Science | Hardware Architecture | cs.AR | Covers systems organization and hardware archi...
2 | Computer Science | Computer Science | Computer Science | Computational Complexity | cs.CC | Covers models of computation, complexity class...
3 | Computer Science | Computer Science | Computer Science | Computational Engineering, Finance, and Science | cs.CE | Covers applications of computer science to the...
4 | Computer Science | Computer Science | Computer Science | Computational Geometry | cs.CG | Roughly includes material in ACM Subject Class...
...
150 | Statistics | Statistics | Statistics | Computation | stat.CO | Algorithms, Simulation, Visualization
151 | Statistics | Statistics | Statistics | Methodology | stat.ME | Design, Surveys, Model Selection, Multiple Tes...
152 | Statistics | Statistics | Statistics | Machine Learning | stat.ML | Covers machine learning papers (supervised, un...
153 | Statistics | Statistics | Statistics | Other Statistics | stat.OT | Work in statistics that does not fit into the ...
154 | Statistics | Statistics | Statistics | Statistics Theory | stat.TH | stat.TH is an alias for math.ST. Asymptotics, ...

155 rows × 6 columns
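
The regular expressions in the scraping code split a heading of the form "code (name)" into its two parts. The following sketch shows the same idea in isolation; the helper name `split_heading` is made up, and the sample heading text is an assumption about what the h4 headings look like, based on the codes and names in the table above.

```python
import re

# Same pattern as in the h4 branch: text, a space, then text in parentheses.
HEADING = re.compile(r"(.*) \((.*)\)")


def split_heading(text):
    """Split 'cs.AI (Artificial Intelligence)' into ('cs.AI', 'Artificial Intelligence')."""
    match = HEADING.fullmatch(text.strip())
    if match is None:
        # No parentheses found: return the text unchanged with an empty second part.
        return text.strip(), ""
    return match.group(1), match.group(2)


code, name = split_heading("cs.AI (Artificial Intelligence)")
print(code, "|", name)

# The re.sub form used in the scraping code gives the same first part.
print(re.sub(r"(.*) \((.*)\)", r"\1", "cs.AI (Artificial Intelligence)"))
```

Using `match.group` makes it explicit which part is which, and the `None` check avoids silently returning the whole heading twice when a heading has no parentheses.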

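[Note] Some problems I ran into and how I solved them

Problem 1: the code originally looked like this

soup = BeautifulSoup(website_url,'lxml') # parse the HTML with the lxml parser, which is faster

It raised the error

FeatureNotFound

I fixed it by changing the line to

soup = BeautifulSoup(website_url,'html.parser') # parse the HTML with the built-in parser

The error means that Beautiful Soup could not find the lxml parser. lxml is a separate package that has to be installed, whereas html.parser ships with Python and needs no installation. Article I consulted: bs4 FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?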
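
If you want to keep the speed of lxml where it is available but still run on machines without it, one option is to fall back automatically. This is a minimal sketch, not the original code; the helper name `make_soup` and the one-line HTML sample are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is installed, otherwise with the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser is always available.
        return BeautifulSoup(html, "html.parser")


# Made-up snippet shaped like the taxonomy page.
sample_html = "<div id='category_taxonomy_list'><h4>cs.AI (Artificial Intelligence)</h4></div>"
soup = make_soup(sample_html)
print(soup.find("h4").text)
```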

Problem 2: an error like the following appeared

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='baike.baidu.com', port=443):
 Max retries exceeded with url: https://baike.baidu.com/item/%E5%88%98%E5%BE%B7%E5%8D%8E/114923
 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb51433af98>:
 Failed to establish a new connection: [Errno -2] Name or service not known',))

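This error comes up often. In my experience it can be caused by an unstable network or by the IP being blocked after too many requests, both of which are common when scraping. My fix was to add the following "disguise" headers: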

user_agent_list = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15"
    ]

headers = {
    'User-Agent':'Mozilla/5.0',
    'Content-Type':'application/json',
    'method':'GET',
    'Accept':'application/vnd.github.cloak-preview'
}

headers['User-Agent'] = random.choice(user_agent_list)

Articles I consulted:

requests.exceptions.ConnectionError: (‘Connection aborted.’, RemoteDisconnected(‘Remote end closed c

Python crawler: requests exception requests.exceptions.ConnectionError: HTTPSConnectionPool Max retries exceeded
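
Since an unstable network is one of the likely causes, automatic retries with a timeout can help alongside the headers. The sketch below uses the retry support that requests exposes through urllib3; it is an addition for illustration, not what the original code did. The helper names `build_session` and `fetch_text`, the retry count, and the timeout are all assumptions to adjust as needed.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries=3, backoff_factor=1.0):
    """Return a Session that retries failed connections and some HTTP errors."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,  # wait longer between successive attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def fetch_text(url, session, headers=None, timeout=10):
    """Fetch a page and return its text, raising on HTTP error codes."""
    response = session.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text


session = build_session()
# Example call (needs network access):
# html = fetch_text("https://arxiv.org/category_taxonomy", session)
```

Retries only help with temporary failures; if the address cannot be resolved at all or the IP stays blocked, the request will still fail after the last attempt.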

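Ⅲ. Data Analysis and Visualization

First, look at how the papers are distributed across the top-level groups:

# Paper counts per top-level group (2019 onward)
_df = (
    data.merge(df_taxonomy, on="categories", how="left")
    .drop_duplicates(["id", "group_name"])
    .groupby("group_name")
    .agg({"id": "count"})
    .sort_values(by="id", ascending=False)
    .reset_index()
)

_df

The two tables are merged on their shared "categories" column, the rows are grouped by "group_name", and the number of papers in each group is stored in the "id" column and sorted in descending order. Because the join key is the full categories string, only papers whose categories value is exactly one code from the taxonomy find a match; papers with several categories get no group_name and are left out of the counts. The result: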

 | group_name | id
0 | Physics | 79985
1 | Mathematics | 51567
2 | Computer Science | 40067
3 | Statistics | 4054
4 | Electrical Engineering and Systems Science | 3297
5 | Quantitative Biology | 1994
6 | Quantitative Finance | 826
7 | Economics | 576
Likewise, the data for 2020 onward:

# Paper counts per top-level group (2020 onward)
_df20 = (
    data[data['year'] >= 2020]
    .merge(df_taxonomy, on="categories", how="left")
    .drop_duplicates(["id", "group_name"])
    .groupby("group_name")
    .agg({"id": "count"})
    .sort_values(by="id", ascending=False)
    .reset_index()
)

_df20
 | group_name | id
0 | Physics | 41606
1 | Mathematics | 27072
2 | Computer Science | 21980
3 | Statistics | 2252
4 | Electrical Engineering and Systems Science | 1926
5 | Quantitative Biology | 1108
6 | Quantitative Finance | 474
7 | Economics | 403

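Visualize the result with a pie chart:

# Pie chart (2019 onward)
fig = plt.figure(figsize=(15,12))
explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1) # offset of each slice from the center; the five smallest groups are pulled outward
plt.pie(_df["id"],  labels=_df["group_name"], autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()
plt.show()

Figure: share of papers by discipline, 2019 onward

The pie chart for 2020 onward can be drawn in the same way:

Figure: share of papers by discipline, 2020 onward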
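
The percentages that `autopct` prints on the chart are each group's count divided by the total. As a quick check without plotting, the shares can be computed directly from the 2019-onward counts listed above (a plain-Python sketch):

```python
# Counts copied from the 2019-onward result table.
counts = {
    "Physics": 79985,
    "Mathematics": 51567,
    "Computer Science": 40067,
    "Statistics": 4054,
    "Electrical Engineering and Systems Science": 3297,
    "Quantitative Biology": 1994,
    "Quantitative Finance": 826,
    "Economics": 576,
}

total = sum(counts.values())
shares = {name: 100 * n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
```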

Next, count the papers in each computer science subfield for 2019 and 2020:

# Paper counts per computer science subfield, 2019 and 2020
group_name="Computer Science"
cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year",values="id") 
category_name | 2019 | 2020
Artificial Intelligence | 558 | 757
Computation and Language | 2153 | 2906
Computational Complexity | 131 | 188
Computational Engineering, Finance, and Science | 108 | 205
Computational Geometry | 199 | 216
Computer Science and Game Theory | 281 | 323
Computer Vision and Pattern Recognition | 5559 | 6517
Computers and Society | 346 | 564
Cryptography and Security | 1067 | 1238
Data Structures and Algorithms | 711 | 902
Databases | 282 | 342
Digital Libraries | 125 | 157
Discrete Mathematics | 84 | 81
Distributed, Parallel, and Cluster Computing | 715 | 774
Emerging Technologies | 101 | 84
Formal Languages and Automata Theory | 152 | 137
General Literature | 5 | 5
Graphics | 116 | 151
Hardware Architecture | 95 | 159
Human-Computer Interaction | 420 | 580
Information Retrieval | 245 | 331
Logic in Computer Science | 470 | 504
Machine Learning | 177 | 538
Mathematical Software | 27 | 45
Multiagent Systems | 85 | 90
Multimedia | 76 | 66
Networking and Internet Architecture | 864 | 783
Neural and Evolutionary Computing | 235 | 279
Numerical Analysis | 40 | 11
Operating Systems | 36 | 33
Other Computer Science | 67 | 69
Performance | 45 | 51
Programming Languages | 268 | 294
Robotics | 917 | 1298
Social and Information Networks | 202 | 325
Software Engineering | 659 | 804
Sound | 7 | 4
Symbolic Computation | 44 | 36
Systems and Control | 415 | 133

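Most categories have more papers in 2020 than in 2019. In both years the most productive category is Computer Vision and Pattern Recognition, by a wide margin over every other category, so computer vision and pattern recognition remain the mainstream direction of current research.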
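
To put a number on "more papers in 2020", the year-over-year change can be computed per category. The sketch below does this for four rows of the table above, in plain Python; the same formula applies to any other row.

```python
# (2019 count, 2020 count) for a few subfields from the table above.
counts = {
    "Computer Vision and Pattern Recognition": (5559, 6517),
    "Computation and Language": (2153, 2906),
    "Cryptography and Security": (1067, 1238),
    "Networking and Internet Architecture": (864, 783),
}

# Percentage change from 2019 to 2020; negative means fewer papers in 2020.
growth = {
    name: (y2020 - y2019) / y2019 * 100
    for name, (y2019, y2020) in counts.items()
}

for name, pct in sorted(growth.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {pct:+.1f}%")
```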

Ⅳ. References

Datawhale data analysis training camp handbook (academic frontier trend analysis), Task 1: paper statistics

Detailed guide to the basic attributes of Pandas DataFrame

Getting started with Pandas (2): DataFrame structure and common operations

bs4 FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

requests.exceptions.ConnectionError: (‘Connection aborted.’, RemoteDisconnected(‘Remote end closed c

Python crawler: requests exception requests.exceptions.ConnectionError: HTTPSConnectionPool Max retries exceeded

Source: https://blog.csdn.net/xyc_undermoon/article/details/112552113