Baidu Maps Crawler: A First Try
Author: Internet
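To crawl data from Baidu Maps, first register on the official Baidu Maps API site and create an application to obtain an AK (access key).
Click "Create application" and set the application name.
For the remaining settings, follow the on-screen prompts, then click Submit.
The application's AK is all we need. If the free quota is not enough, you can apply for developer status to get a larger quota. Now to the main part; here is the code.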
import requests
from tqdm import tqdm

API_KEY = "vSj23PqoC3nFXTOwW9xwRifMGiVjo3bV"  # replace with your own application AK
URL = "http://api.map.baidu.com/place/v2/search"  # endpoint fixed by the API; do not change
TYPES = ['酒店', '美食']  # categories to crawl (hotels, food); adjust to your needs


def baidu_map_search():
    # region.txt holds my search regions separated by '、'; here they are provinces and cities
    with open('region.txt', 'r', encoding='utf-8') as f:
        regions = f.read().strip().split('、')

    # One output file per category: pos_0.txt, pos_1.txt, ...
    for type_index, query in enumerate(TYPES):
        with open('pos_{}.txt'.format(type_index), 'w', encoding='utf-8') as f:
            for region in regions:
                page = 0
                while True:
                    params = {
                        "query": query,
                        "output": "json",
                        "ak": API_KEY,
                        "region": region,
                        "page_size": 20,
                        "page_num": page,
                        "scope": 1,
                    }  # parameter names are fixed by the API
                    response = requests.get(URL, params=params, timeout=10)
                    result = response.json()
                    status = result.get("status")
                    # As in the original script, status 0 and 2 are tolerated; anything else aborts
                    if status != 0 and status != 2:
                        raise Exception(result.get("message"))
                    data = result.get("results", [])
                    if len(data) == 0:
                        break  # no more results for this region; move on to the next one
                    print('{} {} page:{} num:{}'.format(region, query, page, len(data)))
                    for row in data:
                        name = row.get("name", "")
                        if '市' in name:
                            continue
                        # keep only the part before a parenthesis, e.g. drop a branch name
                        f.write(name.split('(')[0] + '\n')
                    page += 1


def merge_results():
    # Merge the per-category files into pos_ch.txt, filtering and de-duplicating names
    results = []
    seen = set()
    for i in range(len(TYPES)):
        with open('pos_{}.txt'.format(i), 'r', encoding='utf-8') as f:
            data = f.read().splitlines()
        for v in tqdm(data):
            if not v:
                continue  # skip empty lines, which would break v[0] below
            if 'A' <= v[0] <= 'z':
                continue  # skip names whose first character is in the ASCII range 'A'..'z'
            if '州' in v or '县' in v or '市' in v or '区' in v:
                continue  # skip names containing administrative-division characters
            if v in seen:
                continue
            seen.add(v)
            results.append(v)
    with open('pos_ch.txt', 'w', encoding='utf-8') as f:
        for v in results:
            f.write(v + '\n')
    print(len(results))


if __name__ == '__main__':
    baidu_map_search()
    merge_results()
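The filtering rules in the script can be tried offline, without an AK or any network request. The following is only a sketch: `clean_names` is a hypothetical helper that is not part of the original script, and the sample names are made up for illustration. It applies the same steps as the code above: cut the name at the first parenthesis, skip names starting with a character in the ASCII range 'A'..'z', skip names containing 州, 县, 市 or 区, and drop duplicates.

```python
def clean_names(raw_names):
    """Apply the script's filters to raw POI names (hypothetical helper)."""
    results = []
    seen = set()
    for raw in raw_names:
        name = raw.split('(')[0]  # keep the part before a parenthesis
        if not name:
            continue  # nothing left after the cut
        if 'A' <= name[0] <= 'z':
            continue  # first character in the ASCII range 'A'..'z'
        if any(ch in name for ch in '州县市区'):
            continue  # contains an administrative-division character
        if name in seen:
            continue  # already collected
        seen.add(name)
        results.append(name)
    return results


# Made-up sample input, for illustration only
sample = ['如家酒店(火车站店)', '如家酒店(机场店)', 'KFC', '杭州大酒店', '海底捞']
print(clean_names(sample))  # ['如家酒店', '海底捞']
```

A set is used for the duplicate check so that the lookup stays fast when the files contain many names.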
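Beyond this, the API supports other kinds of search, such as searching within a radius given in kilometers. For the exact request formats, refer to the official documentation.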
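As a rough illustration of a radius search, the sketch below only builds the request parameters and sends nothing. It assumes, based on my reading of the Place search documentation, that a circular-area search takes a `location` parameter ("latitude,longitude") and a `radius` parameter in meters; please check these names against the official documentation before relying on them. `build_radius_params`, the placeholder AK and the coordinates are hypothetical and for illustration only.

```python
def build_radius_params(ak, query, lat, lng, radius_m, page_num=0):
    """Build parameters for a circular-area search (hypothetical helper).

    Assumes the API accepts `location` as "lat,lng" and `radius` in meters.
    """
    return {
        "query": query,
        "location": "{},{}".format(lat, lng),  # latitude first, then longitude
        "radius": radius_m,  # search radius in meters (assumed unit)
        "output": "json",
        "ak": ak,
        "page_size": 20,
        "page_num": page_num,
    }


# Placeholder AK and example coordinates, for illustration only
params = build_radius_params("YOUR_AK", "酒店", 39.915, 116.404, 2000)
print(params["location"])  # 39.915,116.404
```

The resulting dictionary could be passed to `requests.get` in the same way as the `params` dictionary in the script above.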
If this is not to your taste, please go easy on me. Discussion is welcome.
Source: https://blog.csdn.net/gxf_1999/article/details/117628680