其他分享
首页 > 其他分享> > pandas处理json数据

pandas处理json数据

作者:互联网

pandas处理json数据


将json串解析为DataFrame的方式主要有三种:

  1. 利用pandas自带的read_json直接解析字符串
  2. 利用json的loads和pandas的json_normalize进行解析
  3. 利用json的loads和pandas的DataFrame直接构造(这个过程需要手动修改loads得到的字典格式)


由于read_json直接对字符串进行的解析,其效率是最高的,但是其对JSON串的要求也是最高的,需要满足其规定的格式才能够读取。其支持的格式可以在pandas的官网点击打开链接可以看到。然而json_normalize是解析json串构造的字典的,其灵活性比read_json要高很多。但是令人意外的是,其效率还不如我自己解析来得快(自己解析时使用列表解析的功能比普通的for循环快很多)。当然最灵活的还是自己解析,可以在构造DataFrame之前进行一些简单的数据处理。

# -*- coding: UTF-8 -*-
from pandas.io.json import json_normalize
import pandas as pd
import json
import time
 
# 读入数据
data_str = open('data.json').read()
print data_str
 
# 测试json_normalize
start_time = time.time()
for i in range(0, 300):
    data_list = json.loads(data_str)
    df = json_normalize(data_list)
end_time = time.time()
print end_time - start_time
 
# 测试自己构造
start_time = time.time()
for i in range(0, 300):
    data_list = json.loads(data_str)
    data = [[d['timestamp'], d['value']] for d in data_list]
    df = pd.DataFrame(data, columns=['timestamp', 'value'])
end_time = time.time()
print end_time - start_time
 
#  测试read_json
start_time = time.time()
for i in range(0, 300):
    df = pd.read_json(data_str, orient='records')
end_time = time.time()
print end_time - start_time

 








pandas里的read_json函数可以将json数据转化为dataframe。   pandas.read_json的语法如下:  

pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')

 

第一参数就是json文件路径或者json格式的字符串。

第二参数orient是表明预期的json字符串格式。orient的设置有以下几个值:

(1).'split' : dict like {index -> [index], columns -> [columns], data -> [values]}

这种就是有索引,有列字段,和数据矩阵构成的json格式。key名称只能是index,columns和data。

import pandas as pd
s='{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'
print(pd.read_json(s,orient='split'))

 

运行结果:

   a  b
1  1  3
2  2  8
3  3  9

 


 (2).   'records' : list like [{column -> value}, ... , {column -> value}]

这种就是成员为字典的列表。构成是列字段为键,值为键值,每一个字典成员就构成了dataframe的一行数据。

import pandas as pd
s='[{"name":"xiaomaimiao","age":20},{"name":"xxt","age":18},{"name":"xmm","age":1}]'
print(pd.read_json(s,orient='records'))

 

运行结果:

   age         name
0   20  xiaomaimiao
1   18          xxt
2    1          xmm

 

再例如:

# coding=utf-8
import pandas as pd
pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)
s=open('a.json', encoding='UTF-8').read()
df=pd.read_json(s,orient='records')
print(df.head(5))
# df.to_excel('pandas处理json1.xlsx', index=False, columns=["Company", "Job", "Location", "Name", "MajorTag","University"])
df.to_excel('pandas处理json1.xlsx', index=False)

 

运行结果:

数据:  icon_rar.gif a.json.zip

或:  https://raw.githubusercontent.com/lhrbest/Python/master/xxt_test_json.json



 (3).  'index' : dict like {index -> {column -> value}}

以索引为key,以列字段构成的字典为键值。如:

import pandas as pd
s='{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}'
print(pd.read_json(s,orient='index'))

 

 运行结果:  

   a   b
0  1   2
1  9  11

 


 (4).   'columns' : dict like {column -> {index -> value}}

这种处理的就是以列为键,对应一个值字典的对象。这个字典对象以索引为键,以值为键值构成的json字符串。如下图所示:

import pandas as pd
s='{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}'
print(pd.read_json(s,orient='columns'))

 

运行结果:

   a   b
0  1   2
1  9  11

 


   (5).    'values' : just the values array

values这种我们就很常见了。就是一个嵌套的列表。里面的成员也是列表,2层的。

import pandas as pd
s='[["a",1],["b",2]]'
print(pd.read_json(s,orient='values'))

 

运行结果:

   0  1
0  a  1
1  b  2

 





要处理的json字符串:

strtext='[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\
{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\
{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\
{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'

 

代码:

strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\
{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\
{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\
{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'
df = pd.read_json(strtext, orient='records')
df.to_excel('pandas处理json.xlsx', index=False, columns=["ttery", "issue", "code", "code1", "code2", "time"])

 

运行结果:  最终写入excel如下图:

 


再例如:

# coding=utf-8
import pandas as pd
pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)
s='[{"Location":{"L1":"北京","L2":"北京市","L3":""},"University":null,"StudentFlag":0,"_id":"5cd682d61c0acc002c7a9282","Teams":[{"TeamOid":"5cd685cbe32417002baf7fb8"}],"Name":"soyotec","Avatar":"//cdn.kesci.com/images/avatar/4.jpg","Job":"总经理","Company":"北京树优信息技术有限公司","MajorTag":null},{"Location":{"L1":"上海","L2":"上海市","L3":""},"University":null,"StudentFlag":0,"_id":"5cf094ffa0e904002bae85a1","Teams":[{"TeamOid":"5cf08ec12e1a1d002b3912ed"}],"Name":"ne12212","Avatar":"//cdn.kesci.com/images/avatar/1.jpg","Job":"产品经理","Company":"中国电信股份有限公司上海分公司","MajorTag":null},{"Location":{"L1":"上海","L2":"上海市","L3":""},"University":null,"StudentFlag":0,"_id":"5ce37539e60640002b771cfa","Teams":[{"TeamOid":"5ce3af2f0e87f8002caad07e"}],"Name":"图森未来","Avatar":"https://cdn.kesci.com/upload/image/pruh63s30o.jpg","Job":"AIWIN大赛","Company":"上海图森未来人工智能科技有限公司","MajorTag":null,"Signature":"L4级无人驾驶卡车"}]'
print(pd.read_json(s,orient='records'))

 


结果:

                                              Avatar           Company      Job                             Location  MajorTag     Name  Signature  StudentFlag                                      Teams  University                       _id
0                //cdn.kesci.com/images/avatar/4.jpg      北京树优信息技术有限公司      总经理  {'L1': '北京', 'L2': '北京市', 'L3': ''}       NaN  soyotec        NaN            0  [{'TeamOid': '5cd685cbe32417002baf7fb8'}]         NaN  5cd682d61c0acc002c7a9282
1                //cdn.kesci.com/images/avatar/1.jpg   中国电信股份有限公司上海分公司     产品经理  {'L1': '上海', 'L2': '上海市', 'L3': ''}       NaN  ne12212        NaN            0  [{'TeamOid': '5cf08ec12e1a1d002b3912ed'}]         NaN  5cf094ffa0e904002bae85a1
2  https://cdn.kesci.com/upload/image/pruh63s30o.jpg  上海图森未来人工智能科技有限公司  AIWIN大赛  {'L1': '上海', 'L2': '上海市', 'L3': ''}       NaN     图森未来  L4级无人驾驶卡车            0  [{'TeamOid': '5ce3af2f0e87f8002caad07e'}]         NaN  5ce37539e60640002b771cfa

 






示例:

from pandas.io.json import json_normalize
data = [{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}},
              {'name': {'given': 'Mose', 'family': 'Regner'}},
              {'id': 2, 'name': 'Faye Raker'}]
print(json_normalize(data))

 

运行结果:

    id        name name.family name.first name.given name.last
0  1.0         NaN         NaN     Coleen        NaN      Volk
1  NaN         NaN      Regner        NaN       Mose       NaN
2  2.0  Faye Raker         NaN        NaN        NaN       NaN

 

示例:

from pandas.io.json import json_normalize
data = [{'state': 'Florida',
               'shortname': 'FL',
               'info': {
                    'governor': 'Rick Scott'
               },
               'counties': [{'name': 'Dade', 'population': 12345},
                           {'name': 'Broward', 'population': 40000},
                           {'name': 'Palm Beach', 'population': 60000}]},
          {'state': 'Ohio',
               'shortname': 'OH',
               'info': {
                    'governor': 'John Kasich'
               },
               'counties': [{'name': 'Summit', 'population': 1234},
                            {'name': 'Cuyahoga', 'population': 1337}]}]
result = json_normalize(data, 'counties', ['state', 'shortname',['info', 'governor']])
print(result)

 

运行结果:


         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

 

示例:

from pandas.io.json import json_normalize
data = {'A': [1, 2]}
result=json_normalize(data, 'A', record_prefix='Prefix.')
print(result)

 

运行结果:

   Prefix.0
0         1
1         2

 


示例:

#coding=utf-8
from pandas.io.json import json_normalize
import json
# 读入数据
data_str = open('a.json',encoding='utf-8').read()
data_list = json.loads(data_str)
df = json_normalize(data_list)
df.to_excel('1.xlsx', index=False )
print(df)

 

运行结果:

icon_rar.gif a.json.zip

或:   https://raw.githubusercontent.com/lhrbest/Python/master/xxt_test_json.json


示例:

# coding=utf-8
import json
import pandas as pd
pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)
data_str=open('a.json', encoding='UTF-8').read()
data_list = json.loads(data_str)
data = [[d["Location"], d["Company"]] for d in data_list]
df = pd.DataFrame(data, columns=["Company", "Location"])
print(df.head(5))
# df.to_excel('pandas处理json1.xlsx', index=False, columns=["Company", "Job", "Location", "Name", "MajorTag","University"])
# df.to_excel('pandas处理json1.xlsx', index=False)

 

结果:

                               Company          Location
0  {'L1': '北京', 'L2': '北京市', 'L3': ''}      北京树优信息技术有限公司
1  {'L1': '上海', 'L2': '上海市', 'L3': ''}   中国电信股份有限公司上海分公司
2  {'L1': '上海', 'L2': '上海市', 'L3': ''}  上海图森未来人工智能科技有限公司
3  {'L1': '上海', 'L2': '上海市', 'L3': ''}              None
4  {'L1': '上海', 'L2': '上海市', 'L3': ''}                AI

 


示例:

import json
import pandas as pd
data_str = '''[{"department": "xxt",
 "query_result": {"code": "10", "description": "1000"}, 
 "is_invoice": 1, 
 "imageName": "./imgs/8888888.jpeg", 
 "reco_result": {"total": "", "invoice_no": "01111111", "create_date": "", "check_code": "", "invoice_code": ""}}, 
 {"department": "xxt2",
 "query_result": {}, 
 "is_invoice": 0, 
 "imageName": "./imgs/51111111.jpeg",
 "reco_result": {}}]'''
# 从文件读取
# data_str = open('a.json',encoding='utf-8').read()
# data_list = json.loads(data_str)
# 从字符串直接获取
data_list = json.loads(data_str)
data1_all = [[d["department"], d["is_invoice"], d["imageName"]] for d in data_list]
data2_all = [d["query_result"] for d in data_list]
data5_all = [d["reco_result"] for d in data_list]
# for i in data1_all:
#     print("data1_0:",i[0])
# print("data5",data5_all)
# print("data1",data1_all)
def get_data(data1_all, data2_all, data5_all):
    col_value = []
    for data1, data2, data5 in zip(data1_all, data2_all, data5_all):
        department = data1[0]
        is_invoice = data1[1]
        imageName = data1[2]
        if 'code' in data2:
            code = str(data2).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
            description = str(data2).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
        else:
            code = ""
            description = ""
        if 'total' in data5:
            total = str(data5).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
            invoice_no = str(data5).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
            create_date = str(data5).split(",")[2].split(":")[1].replace("'", "").replace("}", "")
            check_code = str(data5).split(",")[3].split(":")[1].replace("'", "").replace("}", "")
            invoice_code = str(data5).split(",")[4].split(":")[1].replace("'", "").replace("}", "")
        else:
            total = ""
            invoice_no = ""
            create_date = ""
            check_code = ""
            invoice_code = ""
        col_value.append((department, is_invoice, imageName, code, description, total, invoice_no, create_date,
                          check_code, invoice_code))
    return col_value
col_value = get_data(data1_all, data2_all, data5_all)
df = pd.DataFrame(col_value, index=None,
                  columns=["department", "is_invoice", "imageName", "code", "description", "total", "invoice_no",
                           "create_date", "check_code", "invoice_code"])
df.to_excel('excel_pd.xls')
print(df)

 


结果:

  department  is_invoice     ...      check_code invoice_code
0        xxt           1     ...                             
1       xxt2           0     ...                             
[2 rows x 10 columns]

 





-----自己处理数据

import json
import pandas as pd
# 从字符串直接获取
# data_str='''{"Location":{"L1":"上海","L2":"上海市","L3":""},"University":null,"StudentFlag":0,"_id":"5ce37539e60640002b771cfa","Teams":[{"TeamOid":"5ce3af2f0e87f8002caad07e"}],"Name":"图森未来","Avatar":"https://cdn.kesci.com/upload/image/pruh63s30o.jpg","Job":"AIWIN大赛","Company":"上海图森未来人工智能科技有限公司","MajorTag":null,"Avatar":"L4级无人驾驶卡车"}'''
# data_list = json.loads(data_str)
# 从文件读取
data_str = open('a.json',encoding='utf-8').read()
data_list = json.loads(data_str)
data_s_all = [[ d["StudentFlag"], d["Name"],d["Job"],d["Company"]] for d in data_list]
data_Location_all = [d["Location"] for d in data_list]
data_University_all = [d["University"] for d in data_list]
data_MajorTag_all = [d["MajorTag"] for d in data_list]
# for i in data1_all:
#     print("data1_0:",i[0])
# print("data5",data5_all)
# print("data1",data1_all)
def get_data(data_s_all, data_Location_all, data_University_all,data_MajorTag_all):
    col_value = []
    for data_s, data_Location, data_University,data_MajorTag in zip(data_s_all, data_Location_all, data_University_all,data_MajorTag_all):
        StudentFlag = data_s[0]
        Name = data_s[1]
        Job = data_s[2]
        Company = data_s[3]
        Location_result=''
        University_result=''
        Major=''
        # Location
        try:
            if 'L1' in data_Location:
                L1 = str(data_Location).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
                L2 = str(data_Location).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
                L3 = str(data_Location).split(",")[2].split(":")[1].replace("'", "").replace("}", "")
                Location_result=L1+L2+L3
            else:
                L1 = ""
                L2 = ""
                L3 = ""
        except Exception as e:
            None
        # Country
        try:
            if 'Country' in data_University:
                Country = str(data_University).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
                Province = str(data_University).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
                University = str(data_University).split(",")[2].split(":")[1].replace("'", "").replace("}", "")
                University_result=Country+Province+University
            else:
                Country = ""
                Province = ""
                University = ""
        except Exception as e:
            None
        # Major
        try:
            if 'Major' in data_MajorTag:
                Major=str(data_MajorTag).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
            else:
                Major = ""
        except Exception as e:
            None
        col_value.append((Name, StudentFlag, Job, Company,  Location_result, University_result, Major))
    return col_value
col_value = get_data(data_s_all, data_Location_all, data_University_all,data_MajorTag_all)
df = pd.DataFrame(col_value, index=None,columns=["Name", "StudentFlag", "Job", "Company", "Location_result", "University_result", "Major"])
df.to_excel('excel_pd2.xls')
print(df.head(3))

 


 运行结果:

     Name  StudentFlag      Job  ...  Location_result University_result Major
0  soyotec            0      总经理  ...          北京 北京市                         
1  ne12212            0     产品经理  ...          上海 上海市                         
2     图森未来            0  AIWIN大赛  ...          上海 上海市

 




技巧  :将复杂的json串整理成以下格式再读取,再使用  data_list = json.loads(data_str)  读取即可

{"error_code":40007,"error_msg":"fail to recognize"}[{"department": "abcdef",
 "query_result": {"code": "1000", "description": "1000"}, 
 "is_invoice": 1, 
 "imageName": "./imgs/8888888.jpeg", 
 "reco_result": {"total": "", "invoice_no": "123", "create_date": "", "check_code": ""}}]1234567

 

批量读取 json 文件(中文 json)

./out_file  下两个json文件内容如下:

out_01.txt 内容为:"{"name_ID":"12343","name":"张三","身份编码":"未知"}"
out_02.txt 内容为:"{"name_ID":"12344","name":"李四","身份编码":"98983"}"12

 

import jsonimport osdef img_w_h(text_path):
    data_str_list = []
    img_name_list = []
    for filename in os.listdir(text_path):
        file_path = text_path+'/'+filename        print("获取文件:",file_path)
        data_str = open(file_path,"r",encoding='UTF-8').read()
        data_str_list.append(data_str)
        img_name_list.append(filename)
    print("data_str_list",data_str_list)
    return data_str_list,img_name_listdef json_to_excel(data_str_list):
    data_all = []
    for data_str in data_str_list:
        if data_str.startswith(u'\ufeff'):
            content = data_str.encode('utf8')[3:].decode('utf8')
            text = json.loads(content[1:-1])
            if text["身份编码"] =="未知":
                data_all.append(text["身份编码"])
    return data_allif __name__ == "__main__":
    text_path = "./out_file"
    data_str_list, img_name_list = img_w_h(text_path)
    data_all = json_to_excel(data_str_list)
    print("data_all:",data_all)输出:
获取文件: ./out_file/out_01.txt
获取文件: ./out_file/out_02.txt
data_str_list ['\ufeff"{"name_ID":"12343","name":"张三","身份编码":"98983"}"', '\ufeff"{"name_ID":"12343","name":"张三","身份编码":"未知"}"']data_all: ['未知']123456789101112131415161718192021222324252627282930313233343536

 

复杂json格式解析 保存 Excel

import jsonimport pandas as pd"""
数据格式一(为方便查看格式化如下):
[{"department": "abcdef",
 "query_result": {"code": "1000", "description": "1000"}, 
 "is_invoice": 1, 
 "imageName": "./imgs/8888888.jpeg", 
 "reco_result": {"total": "", "invoice_no": "01111111", "create_date": "", "check_code": "", "invoice_code": ""}}, 
 {"department": "abcdef",
 "query_result": {}, 
 "is_invoice": 0, 
 "imageName": "./imgs/51111111.jpeg", 
 "reco_result": {}},
 ...]
"""data_str = open('json_img.json').read()data_list = json.loads(data_str)data1_all = [[d["department"], d["is_invoice"], d["imageName"]] for d in data_list]data2_all = [d["query_result"] for d in data_list]data5_all = [d["reco_result"] for d in data_list]for i in data1_all:
    print("data1_0:",i[0])print("data5",data5_all)print("data1",data1_all)def get_data(data1_all,data2_all,data5_all):
    col_value = []
    for data1,data2,data5 in zip(data1_all,data2_all,data5_all):
        department = data1[0]
        is_invoice = data1[1]
        imageName = data1[2]
        if 'code' in data2:
            code = str(data2).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
            description = str(data2).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
        else:
            code = "NAN"
            description = "NAN"
        if 'total' in data5:
            total = str(data5).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
            invoice_no = str(data5).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
            create_date = str(data5).split(",")[2].split(":")[1].replace("'", "").replace("}", "")
            check_code = str(data5).split(",")[3].split(":")[1].replace("'", "").replace("}", "")
            invoice_code = str(data5).split(",")[4].split(":")[1].replace("'", "").replace("}", "")
        else:
            total = "NAN"
            invoice_no = "NAN"
            create_date = "NAN"
            check_code = "NAN"
            invoice_code = "NAN"
        col_value.append((department,is_invoice,imageName, code,description, total, invoice_no, create_date, check_code, invoice_code))
    return col_value
col_value = get_data(data1_all,data2_all,data5_all)df = pd.DataFrame(col_value, index=None,columns=["department", "is_invoice", "imageName", "code", "description", "total", "invoice_no", "create_date", "check_code", "invoice_code"])df.to_excel('excel_pd.xls')1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465

 

读取excel中某列的json数据(每个单元格数据格式为上面:数据格式一)

import jsonimport pandas as pdimport xlrd
excel_path = "C:\\Users\\Desktop\\test_data.xlsx"def read_excel(excel_path):
    workbook = xlrd.open_workbook(excel_path)
    sheet = workbook.sheet_by_name("Sheet1")
    nrows = sheet.nrows
    list1 = []
    for i in range(1,nrows):
        list1.append(sheet.row_values(i)[0])
    return list1def get_data(excel_path):
    list1 = read_excel(excel_path)
    All_data = []
    for i in range(len(list1)):          #遍历列表数据(相当于遍历该列所有单元格)
        data_list = json.loads(list1[i])
        # print("data_list:", type(data_list))
        for i in range(len(data_list)): #遍历该单元格列表中所有json串
            # print(type(data_list[i]))
            data_dict = data_list[i]
            try:
                imageNo = data_dict["imageNo"]
                businessType = data_dict["businessType"]
                reco_result = data_dict["reco_result"]
                try:
                    total = reco_result["total"]
                    invoice_no = reco_result["invoice_no"]
                    create_date = reco_result["create_date"]
                    check_code = reco_result["check_code"]
                    invoice_code = reco_result["invoice_code"]
                except:
                    total = "NAN"
                    invoice_no = "NAN"
                    create_date = "NAN"
                    check_code = "NAN"
                    invoice_code = "NAN"
                is_invoice = data_dict["is_invoice"]
                billId = data_dict["billId"]
                imageName = data_dict["imageName"]
                applyNum = data_dict["applyNum"]
                department = data_dict["department"]
                query_result = data_dict["query_result"]
                try:
                    code = query_result["code"]
                    description = query_result["description"]
                except:
                    code = "NAN"
                    description = "NAN"
                All_data.append((imageNo, businessType, total, invoice_no, create_date, check_code,
                                 invoice_code, is_invoice, billId, imageName,
                                 applyNum, department, code, description))
            except:
                print("数据格式出错!")
                pass
    return All_data
All_data = get_data(excel_path)df = pd.DataFrame(All_data, index=None,columns=["imageNo", "businessType", "total","invoice_no", "create_date", "check_code", \                                                 "invoice_code","is_invoice","billId","imageName",\                                                 "applyNum","department","code","description"])df.to_excel('C:\\Users\\Desktop/001.xls')print("done!")12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667

 

特殊json文件格式化

最原始数据:{"41196516":"{\"type\":\"身份证正面\",\"name\":\"徐XX\",\"sex\":\"男\",\"people\":\"汉\",...,"41196243":"{\"error_code\"处理成如下json文件:(非常不正规){"41196516":"{"type":"身份证正面","name":"徐XX","sex":"男","people":"汉","birthday":"19XX年7月XX日","address":"广州市花都区*****号","id_number":"4401***15","issue_authority":"广州市XXX局","validity":"20XX.XX.13-20XX.XX.13","time_cost":{"recognize":348,"preprocess":28},"complete":true,"border_covered":false,"head_covered":false,"head_blurred":false,"gray_image":true,"error_code":0,"error_msg":"OK"}","41196243":"{"error_code":40007,"error_msg":"fail to recognize"}","41196510":"{"type":"二代身份证","name":"魏XX","sex":"男","people":"汉","birthday":"19XX年9月XX日","address":"江苏省江阴市XXX号","id_number":"320XXX17","time_cost":{"recognize":398,"preprocess":29},"complete":true,"border_covered":false,"head_covered":false,"head_blurred":false,"gray_image":false,"error_code":0,"error_msg":"OK"}","41197139":"{"type":"身份证背面","issue_authority":"佛山市XXX分局","validity":"2005.XX.XX-2025.XX.XX","time_cost":{"recognize":464,"preprocess":48},"complete":true,"error_code":0,"error_msg":"OK"}"}格式化展示:{"41196516":"{"type":"身份证正面",
			  "name":"徐XX",
			  "sex":"男",
			  "people":"汉",
			  "birthday":"19XX年7月XX日",
			  "address":"广州市花都区*****号",
			  "id_number":"4401***15",
			  "issue_authority":"广州市XXX局",
			  "validity":"20XX.XX.13-20XX.XX.13",
			  "time_cost":{"recognize":348,"preprocess":28},
			  "complete":true,
			  "border_covered":false,
			  "head_covered":false,
			  "head_blurred":false,
			  "gray_image":true,
			  "error_code":0,
			  "error_msg":"OK"}","41196243":"{"error_code":40007,"error_msg":"fail to recognize"}","41196510":"{"type":"二代身份证",
			 "name":"魏XX",
			 "sex":"男",
			 "people":"汉",
			 "birthday":"19XX年9月XX日",
			 "address":"江苏省江阴市XXX号",
			 "id_number":"320XXX17",
			 "time_cost":{"recognize":398,"preprocess":29},
			 "complete":true,
			 "border_covered":false,
			 "head_covered":false,
			 "head_blurred":false,
			 "gray_image":false,
			 "error_code":0,
			 "error_msg":"OK"}","41197139":"{"type":"身份证背面",
		    "issue_authority":"佛山市XXX分局",
			"validity":"2005.XX.XX-2025.XX.XX",
			"time_cost":{"recognize":464,"preprocess":48},
			"complete":true,
			"error_code":0,
			"error_msg":"OK"}"}123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051

 

解析代码如下:

import jsonimport pandas as pd
data_str = open('D:/XXX/XXX文档/reize_result20181227.txt',"r",encoding="utf-8").read()data_str0 = data_str.replace("\\","")print(data_str0)imgName_list = []def get_data(data_str0):
    All_data = []
    num = data_str0.count("error_code")                        #统计共有多少个json("error_code"每个json都有)
    for i in range(num):
        imgName = data_str0[1:-1].split("\":\"{")[i][-8:]      #获取ImageID([1:-1]去除最外层括号)
        print("imgName", imgName)
        img_str1 = "{"+data_str0[1:-1].split("\":\"{")[i+1].split("}\",\"")[0]+"}"         #获取整个json
        img_str1 = img_str1.replace("\"}\"}}","\"}") if "\"}\"}" in img_str1 else img_str1 #去除末尾多余的符号
        print("img_str1:", img_str1)
        data_list = json.loads(img_str1)
        #########################################################################
        try:
            type_ = data_list["type"]
        except:
            type_ = "NAN"
        try:
            name = data_list["name"]
        except:
            name = "NAN"
        try:
            sex = data_list["sex"]
        except:
            sex = "NAN"
        try:
            people = data_list["people"]
        except:
            people = "NAN"
        try:
            birthday = data_list["birthday"]
        except:
            birthday = "NAN"
        try:
            address = data_list["address"]
        except:
            address = "NAN"
        try:
            id_number = data_list["id_number"]
        except:
            id_number = "NAN"
        try:
            issue_authority = data_list["issue_authority"]
        except:
            issue_authority = "NAN"
        try:
            validity = data_list["validity"]
        except:
            validity = "NAN"
        try:
            time_cost = data_list["time_cost"]
            recognize = time_cost["recognize"]
            preprocess = time_cost["preprocess"]
        except:
            time_cost = "NAN"
            recognize = "NAN"
            preprocess = "NAN"
        try:
            complete = data_list["complete"]
        except:
            complete = "NAN"
        try:
            border_covered = data_list["border_covered"]
        except:
            border_covered = "NAN"
        try:
            head_covered = data_list["head_covered"]
        except:
            head_covered = "NAN"
        try:
            head_blurred = data_list["head_blurred"]
        except:
            head_blurred = "NAN"
        try:
            gray_image = data_list["gray_image"]
        except:
            gray_image = "NAN"
        error_code = data_list["error_code"]
        error_msg = data_list["error_msg"]
        All_data.append((imgName,type_,name,sex,people,birthday,\
                         address,id_number,issue_authority,validity,\
                         recognize,preprocess,complete,border_covered,\
                         head_covered,head_blurred,gray_image,error_code,error_msg))
    return All_data
All_data = get_data(data_str0)df = pd.DataFrame(All_data, index=None,columns=["imgName", "type_", "name","sex", "people", "birthday", \                                                 "address","id_number","issue_authority","validity",\                                                 "recognize","preprocess","complete","border_covered",\                                                 "head_covered","head_blurred","gray_image","error_code","error_msg"])df.to_excel('D:/XXX/XXX文档/reize_result20181227.xls')123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107

 

复杂json解析

写json文件

import jsonimport osdef get_img(file_path):
	img_path = []
	for path,dirname,filenames in os.walk(file_path):
		for filename in filenames:
			img_path.append(path+"/"+filename)
	return img_path	
def json_str(file_path):
	dict_str = []
	img_path = get_img(file_path)
	for i in img_path:
		dict_str.append({"ImageName":"/image/bus/"+i,"id":"8abs63twy2001"})
	return dict_str
	
file_path = "./image/ocr"dict_str = json_str(file_path)json_str = json.dumps(dict_str)with open("./dict_str_to_json.json","w") as json_file:
	json_file.write(json_str)
	print("done!")12345678910111213141516171819202122232425

    

标签:code,invoice,处理,list,json,str,data,pandas
来源: https://blog.51cto.com/lhrbest/2706533