pandas处理json数据
作者:互联网
pandas处理json数据
将json串解析为DataFrame的方式主要有三种:
- 利用pandas自带的read_json直接解析字符串
- 利用json的loads和pandas的json_normalize进行解析
- 利用json的loads和pandas的DataFrame直接构造(这个过程需要手动修改loads得到的字典格式)
由于read_json直接对字符串进行的解析,其效率是最高的,但是其对JSON串的要求也是最高的,需要满足其规定的格式才能够读取。其支持的格式可以在pandas的官网点击打开链接可以看到。然而json_normalize是解析json串构造的字典的,其灵活性比read_json要高很多。但是令人意外的是,其效率还不如我自己解析来得快(自己解析时使用列表解析的功能比普通的for循环快很多)。当然最灵活的还是自己解析,可以在构造DataFrame之前进行一些简单的数据处理。
# -*- coding: UTF-8 -*- from pandas.io.json import json_normalize import pandas as pd import json import time # 读入数据 data_str = open('data.json').read() print data_str # 测试json_normalize start_time = time.time() for i in range(0, 300): data_list = json.loads(data_str) df = json_normalize(data_list) end_time = time.time() print end_time - start_time # 测试自己构造 start_time = time.time() for i in range(0, 300): data_list = json.loads(data_str) data = [[d['timestamp'], d['value']] for d in data_list] df = pd.DataFrame(data, columns=['timestamp', 'value']) end_time = time.time() print end_time - start_time # 测试read_json start_time = time.time() for i in range(0, 300): df = pd.read_json(data_str, orient='records') end_time = time.time() print end_time - start_time
pandas里的read_json函数可以将json数据转化为dataframe。 pandas.read_json的语法如下:
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
第一参数就是json文件路径或者json格式的字符串。
第二参数orient是表明预期的json字符串格式。orient的设置有以下几个值:
(1).'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
这种就是有索引,有列字段,和数据矩阵构成的json格式。key名称只能是index,columns和data。
import pandas as pd s='{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}' print(pd.read_json(s,orient='split'))
运行结果:
a b 1 1 3 2 2 8 3 3 9
(2). 'records' : list like [{column -> value}, ... , {column -> value}]
这种就是成员为字典的列表。构成是列字段为键,值为键值,每一个字典成员就构成了dataframe的一行数据。
import pandas as pd s='[{"name":"xiaomaimiao","age":20},{"name":"xxt","age":18},{"name":"xmm","age":1}]' print(pd.read_json(s,orient='records'))
运行结果:
age name 0 20 xiaomaimiao 1 18 xxt 2 1 xmm
再例如:
# coding=utf-8 import pandas as pd pd.set_option('display.max_rows',500) pd.set_option('display.max_columns',500) pd.set_option('display.width',1000) s=open('a.json', encoding='UTF-8').read() df=pd.read_json(s,orient='records') print(df.head(5)) # df.to_excel('pandas处理json1.xlsx', index=False, columns=["Company", "Job", "Location", "Name", "MajorTag","University"]) df.to_excel('pandas处理json1.xlsx', index=False)
运行结果:
数据: a.json.zip
或: https://raw.githubusercontent.com/lhrbest/Python/master/xxt_test_json.json
(3). 'index' : dict like {index -> {column -> value}}
以索引为key,以列字段构成的字典为键值。如:
import pandas as pd s='{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}' print(pd.read_json(s,orient='index'))
运行结果:
a b 0 1 2 1 9 11
(4). 'columns' : dict like {column -> {index -> value}}
这种处理的就是以列为键,对应一个值字典的对象。这个字典对象以索引为键,以值为键值构成的json字符串。如下图所示:
import pandas as pd s='{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}' print(pd.read_json(s,orient='columns'))
运行结果:
a b 0 1 2 1 9 11
(5). 'values' : just the values array
values这种我们就很常见了。就是一个嵌套的列表。里面的成员也是列表,2层的。
import pandas as pd s='[["a",1],["b",2]]' print(pd.read_json(s,orient='values'))
运行结果:
0 1 0 a 1 1 b 2
要处理的json字符串:
strtext='[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\ {"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\ {"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\ {"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\ {"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'
代码:
strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\ {"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\ {"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\ {"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\ {"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]' df = pd.read_json(strtext, orient='records') df.to_excel('pandas处理json.xlsx', index=False, columns=["ttery", "issue", "code", "code1", "code2", "time"])
运行结果: 最终写入excel如下图:
再例如:
# coding=utf-8 import pandas as pd pd.set_option('display.max_rows',500) pd.set_option('display.max_columns',500) pd.set_option('display.width',1000) s='[{"Location":{"L1":"北京","L2":"北京市","L3":""},"University":null,"StudentFlag":0,"_id":"5cd682d61c0acc002c7a9282","Teams":[{"TeamOid":"5cd685cbe32417002baf7fb8"}],"Name":"soyotec","Avatar":"//cdn.kesci.com/images/avatar/4.jpg","Job":"总经理","Company":"北京树优信息技术有限公司","MajorTag":null},{"Location":{"L1":"上海","L2":"上海市","L3":""},"University":null,"StudentFlag":0,"_id":"5cf094ffa0e904002bae85a1","Teams":[{"TeamOid":"5cf08ec12e1a1d002b3912ed"}],"Name":"ne12212","Avatar":"//cdn.kesci.com/images/avatar/1.jpg","Job":"产品经理","Company":"中国电信股份有限公司上海分公司","MajorTag":null},{"Location":{"L1":"上海","L2":"上海市","L3":""},"University":null,"StudentFlag":0,"_id":"5ce37539e60640002b771cfa","Teams":[{"TeamOid":"5ce3af2f0e87f8002caad07e"}],"Name":"图森未来","Avatar":"https://cdn.kesci.com/upload/image/pruh63s30o.jpg","Job":"AIWIN大赛","Company":"上海图森未来人工智能科技有限公司","MajorTag":null,"Signature":"L4级无人驾驶卡车"}]' print(pd.read_json(s,orient='records'))
结果:
Avatar Company Job Location MajorTag Name Signature StudentFlag Teams University _id 0 //cdn.kesci.com/images/avatar/4.jpg 北京树优信息技术有限公司 总经理 {'L1': '北京', 'L2': '北京市', 'L3': ''} NaN soyotec NaN 0 [{'TeamOid': '5cd685cbe32417002baf7fb8'}] NaN 5cd682d61c0acc002c7a9282 1 //cdn.kesci.com/images/avatar/1.jpg 中国电信股份有限公司上海分公司 产品经理 {'L1': '上海', 'L2': '上海市', 'L3': ''} NaN ne12212 NaN 0 [{'TeamOid': '5cf08ec12e1a1d002b3912ed'}] NaN 5cf094ffa0e904002bae85a1 2 https://cdn.kesci.com/upload/image/pruh63s30o.jpg 上海图森未来人工智能科技有限公司 AIWIN大赛 {'L1': '上海', 'L2': '上海市', 'L3': ''} NaN 图森未来 L4级无人驾驶卡车 0 [{'TeamOid': '5ce3af2f0e87f8002caad07e'}] NaN 5ce37539e60640002b771cfa
示例:
from pandas.io.json import json_normalize data = [{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}}, {'name': {'given': 'Mose', 'family': 'Regner'}}, {'id': 2, 'name': 'Faye Raker'}] print(json_normalize(data))
运行结果:
id name name.family name.first name.given name.last 0 1.0 NaN NaN Coleen NaN Volk 1 NaN NaN Regner NaN Mose NaN 2 2.0 Faye Raker NaN NaN NaN NaN
示例:
from pandas.io.json import json_normalize data = [{'state': 'Florida', 'shortname': 'FL', 'info': { 'governor': 'Rick Scott' }, 'counties': [{'name': 'Dade', 'population': 12345}, {'name': 'Broward', 'population': 40000}, {'name': 'Palm Beach', 'population': 60000}]}, {'state': 'Ohio', 'shortname': 'OH', 'info': { 'governor': 'John Kasich' }, 'counties': [{'name': 'Summit', 'population': 1234}, {'name': 'Cuyahoga', 'population': 1337}]}] result = json_normalize(data, 'counties', ['state', 'shortname',['info', 'governor']]) print(result)
运行结果:
name population state shortname info.governor 0 Dade 12345 Florida FL Rick Scott 1 Broward 40000 Florida FL Rick Scott 2 Palm Beach 60000 Florida FL Rick Scott 3 Summit 1234 Ohio OH John Kasich 4 Cuyahoga 1337 Ohio OH John Kasich
示例:
from pandas.io.json import json_normalize data = {'A': [1, 2]} result=json_normalize(data, 'A', record_prefix='Prefix.') print(result)
运行结果:
Prefix.0 0 1 1 2
示例:
#coding=utf-8 from pandas.io.json import json_normalize import json # 读入数据 data_str = open('a.json',encoding='utf-8').read() data_list = json.loads(data_str) df = json_normalize(data_list) df.to_excel('1.xlsx', index=False ) print(df)
运行结果:
a.json.zip
或: https://raw.githubusercontent.com/lhrbest/Python/master/xxt_test_json.json
示例:
# coding=utf-8 import json import pandas as pd pd.set_option('display.max_rows',500) pd.set_option('display.max_columns',500) pd.set_option('display.width',1000) data_str=open('a.json', encoding='UTF-8').read() data_list = json.loads(data_str) data = [[d["Location"], d["Company"]] for d in data_list] df = pd.DataFrame(data, columns=["Company", "Location"]) print(df.head(5)) # df.to_excel('pandas处理json1.xlsx', index=False, columns=["Company", "Job", "Location", "Name", "MajorTag","University"]) # df.to_excel('pandas处理json1.xlsx', index=False)
结果:
Company Location 0 {'L1': '北京', 'L2': '北京市', 'L3': ''} 北京树优信息技术有限公司 1 {'L1': '上海', 'L2': '上海市', 'L3': ''} 中国电信股份有限公司上海分公司 2 {'L1': '上海', 'L2': '上海市', 'L3': ''} 上海图森未来人工智能科技有限公司 3 {'L1': '上海', 'L2': '上海市', 'L3': ''} None 4 {'L1': '上海', 'L2': '上海市', 'L3': ''} AI
示例:
import json import pandas as pd data_str = '''[{"department": "xxt", "query_result": {"code": "10", "description": "1000"}, "is_invoice": 1, "imageName": "./imgs/8888888.jpeg", "reco_result": {"total": "", "invoice_no": "01111111", "create_date": "", "check_code": "", "invoice_code": ""}}, {"department": "xxt2", "query_result": {}, "is_invoice": 0, "imageName": "./imgs/51111111.jpeg", "reco_result": {}}]''' # 从文件读取 # data_str = open('a.json',encoding='utf-8').read() # data_list = json.loads(data_str) # 从字符串直接获取 data_list = json.loads(data_str) data1_all = [[d["department"], d["is_invoice"], d["imageName"]] for d in data_list] data2_all = [d["query_result"] for d in data_list] data5_all = [d["reco_result"] for d in data_list] # for i in data1_all: # print("data1_0:",i[0]) # print("data5",data5_all) # print("data1",data1_all) def get_data(data1_all, data2_all, data5_all): col_value = [] for data1, data2, data5 in zip(data1_all, data2_all, data5_all): department = data1[0] is_invoice = data1[1] imageName = data1[2] if 'code' in data2: code = str(data2).split(",")[0].split(":")[1].replace("'", "").replace("}", "") description = str(data2).split(",")[1].split(":")[1].replace("'", "").replace("}", "") else: code = "" description = "" if 'total' in data5: total = str(data5).split(",")[0].split(":")[1].replace("'", "").replace("}", "") invoice_no = str(data5).split(",")[1].split(":")[1].replace("'", "").replace("}", "") create_date = str(data5).split(",")[2].split(":")[1].replace("'", "").replace("}", "") check_code = str(data5).split(",")[3].split(":")[1].replace("'", "").replace("}", "") invoice_code = str(data5).split(",")[4].split(":")[1].replace("'", "").replace("}", "") else: total = "" invoice_no = "" create_date = "" check_code = "" invoice_code = "" col_value.append((department, is_invoice, imageName, code, description, total, invoice_no, create_date, check_code, invoice_code)) return col_value col_value = get_data(data1_all, data2_all, data5_all) df = pd.DataFrame(col_value, index=None, columns=["department", "is_invoice", "imageName", "code", "description", "total", "invoice_no", "create_date", "check_code", "invoice_code"]) df.to_excel('excel_pd.xls') print(df)
结果:
department is_invoice ... check_code invoice_code 0 xxt 1 ... 1 xxt2 0 ... [2 rows x 10 columns]
-----自己处理数据
import json import pandas as pd # 从字符串直接获取 # data_str='''{"Location":{"L1":"上海","L2":"上海市","L3":""},"University":null,"StudentFlag":0,"_id":"5ce37539e60640002b771cfa","Teams":[{"TeamOid":"5ce3af2f0e87f8002caad07e"}],"Name":"图森未来","Avatar":"https://cdn.kesci.com/upload/image/pruh63s30o.jpg","Job":"AIWIN大赛","Company":"上海图森未来人工智能科技有限公司","MajorTag":null,"Avatar":"L4级无人驾驶卡车"}''' # data_list = json.loads(data_str) # 从文件读取 data_str = open('a.json',encoding='utf-8').read() data_list = json.loads(data_str) data_s_all = [[ d["StudentFlag"], d["Name"],d["Job"],d["Company"]] for d in data_list] data_Location_all = [d["Location"] for d in data_list] data_University_all = [d["University"] for d in data_list] data_MajorTag_all = [d["MajorTag"] for d in data_list] # for i in data1_all: # print("data1_0:",i[0]) # print("data5",data5_all) # print("data1",data1_all) def get_data(data_s_all, data_Location_all, data_University_all,data_MajorTag_all): col_value = [] for data_s, data_Location, data_University,data_MajorTag in zip(data_s_all, data_Location_all, data_University_all,data_MajorTag_all): StudentFlag = data_s[0] Name = data_s[1] Job = data_s[2] Company = data_s[3] Location_result='' University_result='' Major='' # Location try: if 'L1' in data_Location: L1 = str(data_Location).split(",")[0].split(":")[1].replace("'", "").replace("}", "") L2 = str(data_Location).split(",")[1].split(":")[1].replace("'", "").replace("}", "") L3 = str(data_Location).split(",")[2].split(":")[1].replace("'", "").replace("}", "") Location_result=L1+L2+L3 else: L1 = "" L2 = "" L3 = "" except Exception as e: None # Country try: if 'Country' in data_University: Country = str(data_University).split(",")[0].split(":")[1].replace("'", "").replace("}", "") Province = str(data_University).split(",")[1].split(":")[1].replace("'", "").replace("}", "") University = str(data_University).split(",")[2].split(":")[1].replace("'", "").replace("}", "") University_result=Country+Province+University else: Country = "" Province = "" University = "" except Exception as e: None # Major try: if 'Major' in data_MajorTag: Major=str(data_MajorTag).split(",")[1].split(":")[1].replace("'", "").replace("}", "") else: Major = "" except Exception as e: None col_value.append((Name, StudentFlag, Job, Company, Location_result, University_result, Major)) return col_value col_value = get_data(data_s_all, data_Location_all, data_University_all,data_MajorTag_all) df = pd.DataFrame(col_value, index=None,columns=["Name", "StudentFlag", "Job", "Company", "Location_result", "University_result", "Major"]) df.to_excel('excel_pd2.xls') print(df.head(3))
运行结果:
Name StudentFlag Job ... Location_result University_result Major 0 soyotec 0 总经理 ... 北京 北京市 1 ne12212 0 产品经理 ... 上海 上海市 2 图森未来 0 AIWIN大赛 ... 上海 上海市
技巧 :将复杂的json串整理成以下格式再读取,再使用 data_list = json.loads(data_str) 读取即可
{"error_code":40007,"error_msg":"fail to recognize"}[{"department": "abcdef", "query_result": {"code": "1000", "description": "1000"}, "is_invoice": 1, "imageName": "./imgs/8888888.jpeg", "reco_result": {"total": "", "invoice_no": "123", "create_date": "", "check_code": ""}}]1234567
批量读取 json 文件(中文 json)
./out_file 下两个json文件内容如下:
out_01.txt 内容为:"{"name_ID":"12343","name":"张三","身份编码":"未知"}" out_02.txt 内容为:"{"name_ID":"12344","name":"李四","身份编码":"98983"}"12
import jsonimport osdef img_w_h(text_path): data_str_list = [] img_name_list = [] for filename in os.listdir(text_path): file_path = text_path+'/'+filename print("获取文件:",file_path) data_str = open(file_path,"r",encoding='UTF-8').read() data_str_list.append(data_str) img_name_list.append(filename) print("data_str_list",data_str_list) return data_str_list,img_name_listdef json_to_excel(data_str_list): data_all = [] for data_str in data_str_list: if data_str.startswith(u'\ufeff'): content = data_str.encode('utf8')[3:].decode('utf8') text = json.loads(content[1:-1]) if text["身份编码"] =="未知": data_all.append(text["身份编码"]) return data_allif __name__ == "__main__": text_path = "./out_file" data_str_list, img_name_list = img_w_h(text_path) data_all = json_to_excel(data_str_list) print("data_all:",data_all)输出: 获取文件: ./out_file/out_01.txt 获取文件: ./out_file/out_02.txt data_str_list ['\ufeff"{"name_ID":"12343","name":"张三","身份编码":"98983"}"', '\ufeff"{"name_ID":"12343","name":"张三","身份编码":"未知"}"']data_all: ['未知']123456789101112131415161718192021222324252627282930313233343536
复杂json格式解析 保存 Excel
import jsonimport pandas as pd""" 数据格式一(为方便查看格式化如下): [{"department": "abcdef", "query_result": {"code": "1000", "description": "1000"}, "is_invoice": 1, "imageName": "./imgs/8888888.jpeg", "reco_result": {"total": "", "invoice_no": "01111111", "create_date": "", "check_code": "", "invoice_code": ""}}, {"department": "abcdef", "query_result": {}, "is_invoice": 0, "imageName": "./imgs/51111111.jpeg", "reco_result": {}}, ...] """data_str = open('json_img.json').read()data_list = json.loads(data_str)data1_all = [[d["department"], d["is_invoice"], d["imageName"]] for d in data_list]data2_all = [d["query_result"] for d in data_list]data5_all = [d["reco_result"] for d in data_list]for i in data1_all: print("data1_0:",i[0])print("data5",data5_all)print("data1",data1_all)def get_data(data1_all,data2_all,data5_all): col_value = [] for data1,data2,data5 in zip(data1_all,data2_all,data5_all): department = data1[0] is_invoice = data1[1] imageName = data1[2] if 'code' in data2: code = str(data2).split(",")[0].split(":")[1].replace("'", "").replace("}", "") description = str(data2).split(",")[1].split(":")[1].replace("'", "").replace("}", "") else: code = "NAN" description = "NAN" if 'total' in data5: total = str(data5).split(",")[0].split(":")[1].replace("'", "").replace("}", "") invoice_no = str(data5).split(",")[1].split(":")[1].replace("'", "").replace("}", "") create_date = str(data5).split(",")[2].split(":")[1].replace("'", "").replace("}", "") check_code = str(data5).split(",")[3].split(":")[1].replace("'", "").replace("}", "") invoice_code = str(data5).split(",")[4].split(":")[1].replace("'", "").replace("}", "") else: total = "NAN" invoice_no = "NAN" create_date = "NAN" check_code = "NAN" invoice_code = "NAN" col_value.append((department,is_invoice,imageName, code,description, total, invoice_no, create_date, check_code, invoice_code)) return col_value col_value = get_data(data1_all,data2_all,data5_all)df = pd.DataFrame(col_value, index=None,columns=["department", "is_invoice", "imageName", "code", "description", "total", "invoice_no", "create_date", "check_code", "invoice_code"])df.to_excel('excel_pd.xls')1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465
读取excel中某列的json数据(每个单元格数据格式为上面:数据格式一)
import jsonimport pandas as pdimport xlrd excel_path = "C:\\Users\\Desktop\\test_data.xlsx"def read_excel(excel_path): workbook = xlrd.open_workbook(excel_path) sheet = workbook.sheet_by_name("Sheet1") nrows = sheet.nrows list1 = [] for i in range(1,nrows): list1.append(sheet.row_values(i)[0]) return list1def get_data(excel_path): list1 = read_excel(excel_path) All_data = [] for i in range(len(list1)): #遍历列表数据(相当于遍历该列所有单元格) data_list = json.loads(list1[i]) # print("data_list:", type(data_list)) for i in range(len(data_list)): #遍历该单元格列表中所有json串 # print(type(data_list[i])) data_dict = data_list[i] try: imageNo = data_dict["imageNo"] businessType = data_dict["businessType"] reco_result = data_dict["reco_result"] try: total = reco_result["total"] invoice_no = reco_result["invoice_no"] create_date = reco_result["create_date"] check_code = reco_result["check_code"] invoice_code = reco_result["invoice_code"] except: total = "NAN" invoice_no = "NAN" create_date = "NAN" check_code = "NAN" invoice_code = "NAN" is_invoice = data_dict["is_invoice"] billId = data_dict["billId"] imageName = data_dict["imageName"] applyNum = data_dict["applyNum"] department = data_dict["department"] query_result = data_dict["query_result"] try: code = query_result["code"] description = query_result["description"] except: code = "NAN" description = "NAN" All_data.append((imageNo, businessType, total, invoice_no, create_date, check_code, invoice_code, is_invoice, billId, imageName, applyNum, department, code, description)) except: print("数据格式出错!") pass return All_data All_data = get_data(excel_path)df = pd.DataFrame(All_data, index=None,columns=["imageNo", "businessType", "total","invoice_no", "create_date", "check_code", \ "invoice_code","is_invoice","billId","imageName",\ "applyNum","department","code","description"])df.to_excel('C:\\Users\\Desktop/001.xls')print("done!")12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667
特殊json文件格式化
最原始数据:{"41196516":"{\"type\":\"身份证正面\",\"name\":\"徐XX\",\"sex\":\"男\",\"people\":\"汉\",...,"41196243":"{\"error_code\"处理成如下json文件:(非常不正规){"41196516":"{"type":"身份证正面","name":"徐XX","sex":"男","people":"汉","birthday":"19XX年7月XX日","address":"广州市花都区*****号","id_number":"4401***15","issue_authority":"广州市XXX局","validity":"20XX.XX.13-20XX.XX.13","time_cost":{"recognize":348,"preprocess":28},"complete":true,"border_covered":false,"head_covered":false,"head_blurred":false,"gray_image":true,"error_code":0,"error_msg":"OK"}","41196243":"{"error_code":40007,"error_msg":"fail to recognize"}","41196510":"{"type":"二代身份证","name":"魏XX","sex":"男","people":"汉","birthday":"19XX年9月XX日","address":"江苏省江阴市XXX号","id_number":"320XXX17","time_cost":{"recognize":398,"preprocess":29},"complete":true,"border_covered":false,"head_covered":false,"head_blurred":false,"gray_image":false,"error_code":0,"error_msg":"OK"}","41197139":"{"type":"身份证背面","issue_authority":"佛山市XXX分局","validity":"2005.XX.XX-2025.XX.XX","time_cost":{"recognize":464,"preprocess":48},"complete":true,"error_code":0,"error_msg":"OK"}"}格式化展示:{"41196516":"{"type":"身份证正面", "name":"徐XX", "sex":"男", "people":"汉", "birthday":"19XX年7月XX日", "address":"广州市花都区*****号", "id_number":"4401***15", "issue_authority":"广州市XXX局", "validity":"20XX.XX.13-20XX.XX.13", "time_cost":{"recognize":348,"preprocess":28}, "complete":true, "border_covered":false, "head_covered":false, "head_blurred":false, "gray_image":true, "error_code":0, "error_msg":"OK"}","41196243":"{"error_code":40007,"error_msg":"fail to recognize"}","41196510":"{"type":"二代身份证", "name":"魏XX", "sex":"男", "people":"汉", "birthday":"19XX年9月XX日", "address":"江苏省江阴市XXX号", "id_number":"320XXX17", "time_cost":{"recognize":398,"preprocess":29}, "complete":true, "border_covered":false, "head_covered":false, "head_blurred":false, "gray_image":false, "error_code":0, "error_msg":"OK"}","41197139":"{"type":"身份证背面", "issue_authority":"佛山市XXX分局", "validity":"2005.XX.XX-2025.XX.XX", "time_cost":{"recognize":464,"preprocess":48}, "complete":true, "error_code":0, "error_msg":"OK"}"}123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051
解析代码如下:
import jsonimport pandas as pd data_str = open('D:/XXX/XXX文档/reize_result20181227.txt',"r",encoding="utf-8").read()data_str0 = data_str.replace("\\","")print(data_str0)imgName_list = []def get_data(data_str0): All_data = [] num = data_str0.count("error_code") #统计共有多少个json("error_code"每个json都有) for i in range(num): imgName = data_str0[1:-1].split("\":\"{")[i][-8:] #获取ImageID([1:-1]去除最外层括号) print("imgName", imgName) img_str1 = "{"+data_str0[1:-1].split("\":\"{")[i+1].split("}\",\"")[0]+"}" #获取整个json img_str1 = img_str1.replace("\"}\"}}","\"}") if "\"}\"}" in img_str1 else img_str1 #去除末尾多余的符号 print("img_str1:", img_str1) data_list = json.loads(img_str1) ######################################################################### try: type_ = data_list["type"] except: type_ = "NAN" try: name = data_list["name"] except: name = "NAN" try: sex = data_list["sex"] except: sex = "NAN" try: people = data_list["people"] except: people = "NAN" try: birthday = data_list["birthday"] except: birthday = "NAN" try: address = data_list["address"] except: address = "NAN" try: id_number = data_list["id_number"] except: id_number = "NAN" try: issue_authority = data_list["issue_authority"] except: issue_authority = "NAN" try: validity = data_list["validity"] except: validity = "NAN" try: time_cost = data_list["time_cost"] recognize = time_cost["recognize"] preprocess = time_cost["preprocess"] except: time_cost = "NAN" recognize = "NAN" preprocess = "NAN" try: complete = data_list["complete"] except: complete = "NAN" try: border_covered = data_list["border_covered"] except: border_covered = "NAN" try: head_covered = data_list["head_covered"] except: head_covered = "NAN" try: head_blurred = data_list["head_blurred"] except: head_blurred = "NAN" try: gray_image = data_list["gray_image"] except: gray_image = "NAN" error_code = data_list["error_code"] error_msg = data_list["error_msg"] All_data.append((imgName,type_,name,sex,people,birthday,\ address,id_number,issue_authority,validity,\ recognize,preprocess,complete,border_covered,\ head_covered,head_blurred,gray_image,error_code,error_msg)) return All_data All_data = get_data(data_str0)df = pd.DataFrame(All_data, index=None,columns=["imgName", "type_", "name","sex", "people", "birthday", \ "address","id_number","issue_authority","validity",\ "recognize","preprocess","complete","border_covered",\ "head_covered","head_blurred","gray_image","error_code","error_msg"])df.to_excel('D:/XXX/XXX文档/reize_result20181227.xls')123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107
复杂json解析
- 报错1: json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)
原因在于:字符串里用单引号来标识字符。
解决办法如下:将字符串里的 单引号 替换成 双引号
写json文件
import jsonimport osdef get_img(file_path): img_path = [] for path,dirname,filenames in os.walk(file_path): for filename in filenames: img_path.append(path+"/"+filename) return img_path def json_str(file_path): dict_str = [] img_path = get_img(file_path) for i in img_path: dict_str.append({"ImageName":"/image/bus/"+i,"id":"8abs63twy2001"}) return dict_str file_path = "./image/ocr"dict_str = json_str(file_path)json_str = json.dumps(dict_str)with open("./dict_str_to_json.json","w") as json_file: json_file.write(json_str) print("done!")12345678910111213141516171819202122232425
标签:code,invoice,处理,list,json,str,data,pandas 来源: https://blog.51cto.com/lhrbest/2706533