数据分析入门(学术前沿趋势分析)Task1-论文数据统计
作者:互联网
此次赛题是零基础入门数据分析(学术前沿趋势分析),使用公开的arXiv论文完成对应的数据分析操作。赛题内容包括对论文数量、作者出现频率、论文源码的统计,对论文进行分类以及对论文作者的关系进行建模。
目录
Ⅰ、数据及背景
- 主题:统计论文数量
- 内容:理解赛题、学习利用 Pandas 读取数据并进行统计
- 数据集:arXiv 重要的学术公开网站,也是搜索、浏览和下载学术论文的重要工具。arXiv论文涵盖的范围非常广,涉及物理学的庞大分支和计算机科学的众多子学科,如数学、统计学、电气工程、定量生物学和经济学等等。
Ⅱ、数据集介绍
数据集部分介绍如下:
- 数据集格式:
id | arXiv,可用于访问论文 |
submitter | 论文提交者 |
authors | 论文作者 |
title | 论文标题 |
comments | 论文页数和图表等其他信息 |
journal-ref | 论文发表的期刊的信息 |
doi | 数字对象标识符 |
report-no | 报告编号 |
categories | 论文在arXiv系统的所属类别或标签 |
license | 文章的许可证 |
abstract | 论文摘要 |
versions | 论文版本 |
authors_parsed | 作者的信息 |
- 论文部分类别
astro-ph | 天体物理学(Astrophysics) |
astro-ph.CO | 宇宙学与非银河系天体物理学(Cosmology and Nongalactic Astrophysics) |
astro-ph.EP | 地球与行星天体物理学(Earth and Planetary Astrophysics) |
astro-ph.GA | 星系(Astrophysics of Galaxies) |
cs.AI | Artificial Intelligence |
cs.AR | 硬件体系结构(Hardware Architecture) |
cs.CC | 计算复杂度(Computational Complexity) |
cs.CE | 计算工程、金融与科学(Computational Engineering, Finance, and Science) |
cs.CV | 计算机视觉与模式识别(Computer Vision and Pattern Recognition) |
cs.CY | 计算机与社会(Computers and Society) |
cs.DB | 数据库(Databases) |
cs.DC | 分布式、并行与集群计算(Distributed, Parallel, and Cluster Computing) |
cs.DL | 数字图书馆(Digital Libraries) |
cs.NA | 数值分析(Numerical Analysis) |
cs.NE | 神经进化计算(Neural and Evolutionary Computing) |
cs.NI | 网络与互联网架构(Networking and Internet Architecture) |
cs.OH | 其他计算机科学(Other Computer Science) |
cs.OS | 操作系统(Operating Systems) |
Ⅲ、代码实现
Ⅰ、环境准备
所需要的模块包括:seaborn、BeautifulSoup、requests、json、pandas、matplotlib。
Ⅱ、数据预处理
读取数据:
import seaborn as sns #用于画图
from bs4 import BeautifulSoup #用于爬取arxiv的数据
import re #用于正则表达式,匹配字符串的模式
import requests #用于网络连接,发送网络请求,使用域名获取对应信息
import json #读取数据,我们的数据为json格式的
import pandas as pd #数据处理,数据分析
import matplotlib.pyplot as plt #画图工具
# 数据导入
data = [] #初始化
#使用with语句优势:1.自动关闭文件句柄;2.自动显示(处理)文件读取数据异常
with open(r"D:/xyc/competPractice/dataAnalysis2101/archive/arxiv-metadata-oai-snapshot.json", 'r') as f:
for line in f:
data.append(json.loads(line))
data = pd.DataFrame(data) #将list变为dataframe格式,方便使用pandas进行分析
print(data.shape) #显示数据大小
data.head() #显示数据的前五行
结果如下:
(1796911, 14)
首先粗略统计论文种类信息:
'''
count:一列数据的元素个数;
unique:一列数据中元素的种类;
top:一列数据中出现频率最高的元素;
freq:一列数据中出现频率最高的元素的个数;
'''
data["categories"].describe()
结果如下:
count 1796911 unique 62055 top astro-ph freq 86914 Name: categories, dtype: object
以上结果表明数据集中共有1796911篇论文,分为62055个种类,论文数量最多的种类是天体物理学(astro-ph),该种类一共出现了86914次。
因为很多论文的类别实际上不止一种,所以仅有上面的统计远远不够,下面统计本数据集中共出现了多少种独立的数据集:
# 所有的种类(独立的)
unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])
print(len(unique_categories))
print(unique_categories)
以上结果表明共有176种独立论文种类,比官网上给出的类别数量更多,说明官网上有部分类别并未统计。不过从结果可以看出,计算机方向的论文种类并未改变,仍旧是官方统计的40种。
本次赛题是前沿学术分析,因此选择近两年的论文数据进行分析:
# 2019年后的数据
data["year"] = pd.to_datetime(data["update_date"]).dt.year #将update_date从例如2019-02-20的str变为datetime格式,并提取处year
del data["update_date"] #删除 update_date特征,其使命已完成
# 2019
data = data[data["year"] >= 2019] #找出 year 中2019年以后的数据
# data.groupby(['categories','year']) #以 categories 进行排序,如果同一个categories 相同则使用 year 特征进行排序
data.groupby(['categories', 'year'])
# data.reset_index(drop=True, inplace=True) #重新编号
data.reset_index(drop=True, inplace=True) #重新编号
# data #查看结果
print("19年以后的数据:")
data
19年以后的数据:
id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | authors_parsed | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0704.0297 | Sung-Chul Yoon | Sung-Chul Yoon, Philipp Podsiadlowski and Step... | Remnant evolution after a carbon-oxygen white ... | 15 pages, 15 figures, 3 tables, submitted to M... | None | 10.1111/j.1365-2966.2007.12161.x | None | astro-ph | None | We systematically explore the evolution of t... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,... | 2019 |
1 | 0704.0342 | Patrice Ntumba Pungu | B. Dugmore and PP. Ntumba | Cofibrations in the Category of Frolicher Spac... | 27 pages | None | None | None | math.AT | None | Cofibrations are defined in the category of ... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Dugmore, B., ], [Ntumba, PP., ]] | 2019 |
2 | 0704.0360 | Zaqarashvili | T.V. Zaqarashvili and K Murawski | Torsional oscillations of longitudinally inhom... | 6 pages, 3 figures, accepted in A&A | None | 10.1051/0004-6361:20077246 | None | astro-ph | None | We explore the effect of an inhomogeneous ma... | [{'version': 'v1', 'created': 'Tue, 3 Apr 2007... | [[Zaqarashvili, T. V., ], [Murawski, K, ]] | 2019 |
3 | 0704.0525 | Sezgin Ayg\"un | Sezgin Aygun, Ismail Tarhan, Husnu Baysal | On the Energy-Momentum Problem in Static Einst... | This submission has been withdrawn by arXiv ad... | Chin.Phys.Lett.24:355-358,2007 | 10.1088/0256-307X/24/2/015 | None | gr-qc | None | This paper has been removed by arXiv adminis... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa... | 2019 |
4 | 0704.0535 | Antonio Pipino | Antonio Pipino (1,3), Thomas H. Puzia (2,4), a... | The Formation of Globular Cluster Systems in M... | 32 pages (referee format), 9 figures, ApJ acce... | Astrophys.J.665:295-305,2007 | 10.1086/519546 | None | astro-ph | None | The most massive elliptical galaxies show a ... | [{'version': 'v1', 'created': 'Wed, 4 Apr 2007... | [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M... | 2019 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
395118 | quant-ph/9911051 | Stephen A. Fulling | S. A. Fulling | Large Numbers, the Chinese Remainder Theorem, ... | 9 pages; Plain TeX with vanilla.sty and pictex... | Phys. Rev. Applied 13, 024016 (2020) | 10.1103/PhysRevApplied.13.024016 | None | quant-ph | None | This is a pedagogical article cited in the f... | [{'version': 'v1', 'created': 'Thu, 11 Nov 199... | [[Fulling, S. A., ]] | 2020 |
395119 | solv-int/9511005 | Wen-Xiu Ma | Wen-Xiu Ma, Benno Fuchssteiner | Explicit and Exact Solutions to a Kolmogorov-P... | 14pages, Latex, to appear in Intern. J. Nonlin... | None | 10.1016/0020-7462(95)00064-X | None | solv-int nlin.SI | None | Some explicit traveling wave solutions to a ... | [{'version': 'v1', 'created': 'Tue, 14 Nov 199... | [[Ma, Wen-Xiu, ], [Fuchssteiner, Benno, ]] | 2019 |
395120 | solv-int/9809008 | Victor Enolskii | J C Eilbeck, V Z Enol'skii, V B Kuznetsov, D V... | Linear r-Matrix Algebra for a Hierarchy of One... | plain LaTeX, 28 pages | None | None | None | solv-int nlin.SI | None | We consider a hierarchy of many-particle sys... | [{'version': 'v1', 'created': 'Wed, 2 Sep 1998... | [[Eilbeck, J C, ], [Enol'skii, V Z, ], [Kuznet... | 2019 |
395121 | solv-int/9909010 | Pierre van Moerbeke | M. Adler, T. Shiota and P. van Moerbeke | Pfaff tau-functions | 42 pages | None | None | None | solv-int adap-org hep-th nlin.AO nlin.SI | None | Consider the evolution $$ \frac{\pl m_\iy}{\... | [{'version': 'v1', 'created': 'Wed, 15 Sep 199... | [[Adler, M., ], [Shiota, T., ], [van Moerbeke,... | 2019 |
395122 | solv-int/9909014 | David Fairlie | D.B. Fairlie and A.N. Leznov | The General Solution of the Complex Monge-Amp\... | 13 pages, latex, no figures | None | 10.1088/0305-4470/33/25/307 | None | solv-int nlin.SI | None | A general solution to the Complex Monge-Amp\... | [{'version': 'v1', 'created': 'Thu, 16 Sep 199... | [[Fairlie, D. B., ], [Leznov, A. N., ]] | 2019 |
395123 rows × 14 columns
# 2020年后的数据
data20 = data[data["year"] >= 2020] #找出 year 中2020年以后的数据
data20.groupby(['categories','year']) #以 categories 进行排序,如果同一个categories 相同则使用 year 特征进行排序
data20.reset_index(drop=True, inplace=True) #重新编号
print("20年以后的数据:")
data20
20年以后的数据:
id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | versions | authors_parsed | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0704.0752 | Davoud Kamani | Davoud Kamani | Actions for the Bosonic String with the Curved... | 8 pages, Latex, no figure, Some minor changes ... | Braz. J. Phys. 38, 268-271 (2008) | 10.1590/S0103-97332008000200010 | None | hep-th | None | At first we introduce an action for the stri... | [{'version': 'v1', 'created': 'Thu, 5 Apr 2007... | [[Kamani, Davoud, ]] | 2020 |
1 | 0704.0880 | Qiuping A. Wang | Q. A. Wang (ISMANS), F. Tsobnang (ISMANS), S. ... | Stochastic action principle and maximum entropy | This work is a further development of the idea... | Chaos, Solitons and Fractals, 40(2009)2550-2556 | None | None | cond-mat.stat-mech | None | A stochastic action principle for stochastic... | [{'version': 'v1', 'created': 'Fri, 6 Apr 2007... | [[Wang, Q. A., , ISMANS], [Tsobnang, F., , ISM... | 2020 |
2 | 0704.1403 | Alberto S. Cattaneo | Alberto S. Cattaneo, Florian Schaetz | Equivalences of Higher Derived Brackets | 16 pages; minor changes; corrected typos; to a... | J. Pure Appl. Algebra, 212, 2450-2460 (2008) | 10.1016/j.jpaa.2008.03.013 | None | math.QA math.DG math.SG | None | This note elaborates on Th. Voronov's constr... | [{'version': 'v1', 'created': 'Wed, 11 Apr 200... | [[Cattaneo, Alberto S., ], [Schaetz, Florian, ]] | 2020 |
3 | 0704.2498 | Daniel H. Lenz | Daniel Lenz, Nicolae Strungaru | Pure Point spectrum for measure dynamical syst... | 22 pages | Journal de Math\'ematiques Pures et Appliqu\'e... | 10.1016/j.matpur.2009.05.013 | None | math-ph math.MP | http://arxiv.org/licenses/nonexclusive-distrib... | We show equivalence of pure point diffractio... | [{'version': 'v1', 'created': 'Thu, 19 Apr 200... | [[Lenz, Daniel, ], [Strungaru, Nicolae, ]] | 2020 |
4 | 0704.2967 | Serhiy Samokhvalov E. | Serhiy E. Samokhvalov | Group-theoretic Description of Riemannian Spaces | 14 pages | Ukrainian Math. J., v.55 (2003), 1238-1248 | 10.1023/B:UKMA.0000018010.14309.76 | None | math.DG math.GR | None | It is shown that a locally geometrical struc... | [{'version': 'v1', 'created': 'Mon, 23 Apr 200... | [[Samokhvalov, Serhiy E., ]] | 2020 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
224500 | quant-ph/9802022 | Masanao Ozawa | Masanao Ozawa (Nagoya University) | On the Concept of Quantum State Reduction: Inc... | 15 pages, LaTeX, 1 Postscript figure | Annals of the Japan Association for Philosophy... | 10.4288/jafpos1956.11.107 | None | quant-ph | None | The argument is re-examined that the program... | [{'version': 'v1', 'created': 'Mon, 9 Feb 1998... | [[Ozawa, Masanao, , Nagoya University]] | 2020 |
224501 | quant-ph/9806088 | Jens Eisert | J. Eisert, M. Wilkens, and M. Lewenstein | Quantum Games and Quantum Strategies | 4 pages, 4 figures, typographic sign error in ... | Phys. Rev. Lett. 83, 3077 (1999) | 10.1103/PhysRevLett.83.3077 | None | quant-ph | http://arxiv.org/licenses/nonexclusive-distrib... | We investigate the quantization of non-zero ... | [{'version': 'v1', 'created': 'Fri, 26 Jun 199... | [[Eisert, J., ], [Wilkens, M., ], [Lewenstein,... | 2020 |
224502 | quant-ph/9807034 | Jens Eisert | J. Eisert (U. Potsdam, Germany), M. B. Plenio ... | A comparison of entanglement measures | 6 pages (RevTeX), 4 figures | J. Mod. Opt. 46, 145 (1999) | 10.1080/09500349908231260 | J. Mod. Opt. 46, 145-154 (1999) | quant-ph | None | We compare the entanglement of formation wit... | [{'version': 'v1', 'created': 'Mon, 13 Jul 199... | [[Eisert, J., , U. Potsdam, Germany], [Plenio,... | 2020 |
224503 | quant-ph/9910035 | Pavel Exner | P.Duclos, P.Exner, and D. Krejcirik | Locally curved quantum layers | LaTeX2e, 15 pages, to appear in the Ukrainian ... | Ukrainian J. Phys. 45 (2000), 595-601 | None | None | quant-ph cond-mat math-ph math.MP | None | We consider a quantum particle constrained t... | [{'version': 'v1', 'created': 'Fri, 8 Oct 1999... | [[Duclos, P., ], [Exner, P., ], [Krejcirik, D.... | 2020 |
224504 | quant-ph/9911051 | Stephen A. Fulling | S. A. Fulling | Large Numbers, the Chinese Remainder Theorem, ... | 9 pages; Plain TeX with vanilla.sty and pictex... | Phys. Rev. Applied 13, 024016 (2020) | 10.1103/PhysRevApplied.13.024016 | None | quant-ph | None | This is a pedagogical article cited in the f... | [{'version': 'v1', 'created': 'Thu, 11 Nov 199... | [[Fulling, S. A., ]] | 2020 |
224505 rows × 14 columns
我们得到了2019年以后及2020年以后的所有论文数据,然后分别从中挑选出计算机领域的文章(来源:arXiv Category Taxonomy):
# 挑选出计算机领域的统计数据
from numpy import random
#爬取所有的类别
user_agent_list = [
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15"
]
headers = {
'User-Agent':'Mozilla/5.0',
'Content-Type':'application/json',
'method':'GET',
'Accept':'application/vnd.github.cloak-preview'
}
headers['User-Agent'] = random.choice(user_agent_list) # 伪装浏览器头部
website_url = requests.get('https://arxiv.org/category_taxonomy', headers = headers, verify = False).text #获取网页的文本数据
soup = BeautifulSoup(website_url,'html.parser') #爬取数据
root = soup.find('div',{'id':'category_taxonomy_list'}) #找出 BeautifulSoup 对应的标签入口
tags = root.find_all(["h2","h3","h4","p"], recursive=True) #读取 tags
#初始化 str 和 list 变量
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []
#进行
for t in tags:
if t.name == "h2":
level_1_name = t.text
level_2_code = t.text
level_2_name = t.text
elif t.name == "h3":
raw = t.text
level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw) #正则表达式:模式字符串:(.*)\((.*)\);被替换字符串"\2";被处理字符串:raw
level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw)
elif t.name == "h4":
raw = t.text
level_3_code = re.sub(r"(.*) \((.*)\)",r"\1",raw)
level_3_name = re.sub(r"(.*) \((.*)\)",r"\2",raw)
elif t.name == "p":
notes = t.text
level_1_names.append(level_1_name)
level_2_names.append(level_2_name)
level_2_codes.append(level_2_code)
level_3_names.append(level_3_name)
level_3_codes.append(level_3_code)
level_3_notes.append(notes)
#根据以上信息生成dataframe格式的数据
df_taxonomy = pd.DataFrame({
'group_name' : level_1_names,
'archive_name' : level_2_names,
'archive_id' : level_2_codes,
'category_name' : level_3_names,
'categories' : level_3_codes,
'category_description': level_3_notes
})
#按照 "group_name" 进行分组,在组内使用 "archive_name" 进行排序
df_taxonomy.groupby(["group_name","archive_name"])
df_taxonomy
group_name | archive_name | archive_id | category_name | categories | category_description | |
---|---|---|---|---|---|---|
0 | Computer Science | Computer Science | Computer Science | Artificial Intelligence | cs.AI | Covers all areas of AI except Vision, Robotics... |
1 | Computer Science | Computer Science | Computer Science | Hardware Architecture | cs.AR | Covers systems organization and hardware archi... |
2 | Computer Science | Computer Science | Computer Science | Computational Complexity | cs.CC | Covers models of computation, complexity class... |
3 | Computer Science | Computer Science | Computer Science | Computational Engineering, Finance, and Science | cs.CE | Covers applications of computer science to the... |
4 | Computer Science | Computer Science | Computer Science | Computational Geometry | cs.CG | Roughly includes material in ACM Subject Class... |
... | ... | ... | ... | ... | ... | ... |
150 | Statistics | Statistics | Statistics | Computation | stat.CO | Algorithms, Simulation, Visualization |
151 | Statistics | Statistics | Statistics | Methodology | stat.ME | Design, Surveys, Model Selection, Multiple Tes... |
152 | Statistics | Statistics | Statistics | Machine Learning | stat.ML | Covers machine learning papers (supervised, un... |
153 | Statistics | Statistics | Statistics | Other Statistics | stat.OT | Work in statistics that does not fit into the ... |
154 | Statistics | Statistics | Statistics | Statistics Theory | stat.TH | stat.TH is an alias for math.ST. Asymptotics, ... |
155 rows × 6 columns
【注】此处记录出现过的一些问题及我的解决办法
问题1:最初的代码是这样的
soup = BeautifulSoup(website_url,'lxml') #爬取数据,这里使用lxml的解析器,加速
出现错误
我的解决方法是改成下面的
soup = BeautifulSoup(website_url,'html.parser') #爬取数据
原因是有说法是bs4.0以后不用lxml,这里参考过的文章:bs4 FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
问题2:出现如下错误
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='baike.baidu.com', port=443): Max retries exceeded with url: https://baike.baidu.com/item/%E5%88%98%E5%BE%B7%E5%8D%8E/114923 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb51433af98>: Failed to establish a new connection: [Errno -2] Name or service not known',))
这个错误经常出现,根据我出错的经验,这个错误可能是由于网络不稳定或是因访问频繁而导致IP被封,在爬虫中经常遇到。我的解决办法是添加如下“伪装头”:
user_agent_list = [ "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)", "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15" ] headers = { 'User-Agent':'Mozilla/5.0', 'Content-Type':'application/json', 'method':'GET', 'Accept':'application/vnd.github.cloak-preview' } headers['User-Agent'] = random.choice(user_agent_list)
具体参考文章链接:
requests.exceptions.ConnectionError: (‘Connection aborted.’, RemoteDisconnected(‘Remote end closed c
python爬虫 requests异常requests.exceptions.ConnectionError: HTTPSConnectionPool Max retries exceeded
Ⅲ、数据分析及可视化
首先查看一下所有大类的paper数量分布:
# 所有大类的paper数量分布(2019后)
_df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()
_df
通过共同属性 “categories” 进行两表的合并,以 “group_name” 作为类别进行统计,统计结果放入 “id” 列中并排序,统计结果如下:
group_name | id | |
---|---|---|
0 | Physics | 79985 |
1 | Mathematics | 51567 |
2 | Computer Science | 40067 |
3 | Statistics | 4054 |
4 | Electrical Engineering and Systems Science | 3297 |
5 | Quantitative Biology | 1994 |
6 | Quantitative Finance | 826 |
7 | Economics | 576 |
同理,2020年后的数据如下:
# 所有大类的paper数量分布(2020后)
_df20 = data[data['year']>=2020].merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()
_df20
group_name | id | |
---|---|---|
0 | Physics | 41606 |
1 | Mathematics | 27072 |
2 | Computer Science | 21980 |
3 | Statistics | 2252 |
4 | Electrical Engineering and Systems Science | 1926 |
5 | Quantitative Biology | 1108 |
6 | Quantitative Finance | 474 |
7 | Economics | 403 |
使用饼图进行可视化:
# 饼图可视化(2019后)
fig = plt.figure(figsize=(15,12))
explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1)
plt.pie(_df["id"], labels=_df["group_name"], autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()
plt.show()
同理可画出2020年后的分布饼图:
接下来统计在计算机各个子领域2019年与2020年后的paper数量:
# 计算机各个子领域2019年与2020年的paper数量
group_name="Computer Science"
cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year",values="id")
year category_name | 2019 | 2020 |
---|---|---|
Artificial Intelligence | 558 | 757 |
Computation and Language | 2153 | 2906 |
Computational Complexity | 131 | 188 |
Computational Engineering, Finance, and Science | 108 | 205 |
Computational Geometry | 199 | 216 |
Computer Science and Game Theory | 281 | 323 |
Computer Vision and Pattern Recognition | 5559 | 6517 |
Computers and Society | 346 | 564 |
Cryptography and Security | 1067 | 1238 |
Data Structures and Algorithms | 711 | 902 |
Databases | 282 | 342 |
Digital Libraries | 125 | 157 |
Discrete Mathematics | 84 | 81 |
Distributed, Parallel, and Cluster Computing | 715 | 774 |
Emerging Technologies | 101 | 84 |
Formal Languages and Automata Theory | 152 | 137 |
General Literature | 5 | 5 |
Graphics | 116 | 151 |
Hardware Architecture | 95 | 159 |
Human-Computer Interaction | 420 | 580 |
Information Retrieval | 245 | 331 |
Logic in Computer Science | 470 | 504 |
Machine Learning | 177 | 538 |
Mathematical Software | 27 | 45 |
Multiagent Systems | 85 | 90 |
Multimedia | 76 | 66 |
Networking and Internet Architecture | 864 | 783 |
Neural and Evolutionary Computing | 235 | 279 |
Numerical Analysis | 40 | 11 |
Operating Systems | 36 | 33 |
Other Computer Science | 67 | 69 |
Performance | 45 | 51 |
Programming Languages | 268 | 294 |
Robotics | 917 | 1298 |
Social and Information Networks | 202 | 325 |
Software Engineering | 659 | 804 |
Sound | 7 | 4 |
Symbolic Computation | 44 | 36 |
Systems and Control | 415 | 133 |
可以看出,2020年大部分种类论文数量都大于2019年,同时两年最高产的论文类型都是计算机视觉与模式识别(Computer Vision and Pattern Recognition),并且数量远大于其他类型,看来CV和PR仍然是当前学术研究的主流方向。
Ⅳ、参考资料
Datawhale数据分析训练营学习手册(学术前沿趋势分析)——Task1:论文数据统计
requests.exceptions.ConnectionError: (‘Connection aborted.’, RemoteDisconnected(‘Remote end closed c
python爬虫 requests异常requests.exceptions.ConnectionError: HTTPSConnectionPool Max retries exceeded
标签:数据分析,Task1,name,level,created,version,学术前沿,...,data 来源: https://blog.csdn.net/xyc_undermoon/article/details/112552113