
Python Web Scraping: Day One

2019-02-26 17:53:18 Author: Internet

1. First, install the third-party library requests: pip install requests

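import requests

# Download the Baidu home page; requests wraps the result in a Response object
response = requests.get("http://www.baidu.com")
# dir() lists the attributes and methods of an object
# Brute-force inspection like this is a quick way to learn what a class can do
print(dir(response))
# text decodes the body with the encoding requests infers from the response headers
print(response.text)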

2. The with ... as statement

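# with ... as cleans up automatically, provided the class implements the two magic methods __enter__ and __exit__
with open('baidu.html', 'w', encoding='utf-8') as f:
    # response.content is bytes; decode() turns it back into a string using the given encoding
    f.write(response.content.decode('utf-8'))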
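
To make the __enter__/__exit__ requirement concrete, here is a minimal sketch of a custom context manager. The class name ManagedResource and its events list are hypothetical, introduced only to show the order in which the two methods run.

```python
class ManagedResource:
    """Hypothetical resource that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        # Runs when the with block is entered; the return value is bound by "as"
        self.events.append("enter")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the with block ends, even if an exception was raised inside it
        self.events.append("exit")
        return False  # do not suppress exceptions


resource = ManagedResource()
with resource as r:
    r.events.append("work")

print(resource.events)
```

File objects returned by open() follow the same pattern: their __exit__ closes the file, which is why no explicit f.close() is needed.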

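3. Why garbled text appears

Garbled text appears when bytes are decoded with a different character encoding than the one that was used to encode them.

For example, b'\xe4\xbd\xa0\xe5\xa5\xbd' decodes to '你好' under UTF-8, but to '浣犲ソ' under GBK.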
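
The mismatch can be reproduced in a few lines. This is a small sketch using only built-in string methods; the variable names are illustrative.

```python
original = "你好"
raw = original.encode("utf-8")   # text -> bytes using UTF-8

correct = raw.decode("utf-8")    # same encoding: the text is recovered
garbled = raw.decode("gbk")      # different encoding: garbled text

print(raw)
print(correct)
print(garbled)

# The bytes themselves were never damaged, so reversing the wrong step restores the text
repaired = garbled.encode("gbk").decode("utf-8")
print(repaired)
```

This round trip only works when the wrong decoder happened to accept every byte; in general a wrong decode can raise UnicodeDecodeError or lose information.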

4. Base conversion

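# Convert to binary
bin(10)            # '0b1010'
# Convert a hexadecimal string to decimal
int('0xec', 16)    # 236
# Convert to hexadecimal
hex(10)            # '0xa'
# Get the Unicode code point of a character
ord('你')          # 20320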
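
The following sketch ties these built-ins back to the encoding discussion: a character's code point (what ord returns) is not the same thing as its UTF-8 bytes. Only built-in functions are used.

```python
code_point = ord("你")               # Unicode code point as an int
print(code_point, hex(code_point))   # 20320 0x4f60
print(chr(code_point))               # chr() is the inverse of ord()

# int() parses a string in the given base; format() controls padding and prefixes
print(int("ec", 16))                 # 236
print(int("1010", 2))                # 10
print(format(10, "08b"))             # 00001010
print(format(236, "#x"))             # 0xec

# The UTF-8 encoding of the same character is three bytes, not the code point itself
utf8_hex = "你".encode("utf-8").hex()
print(utf8_hex)                      # e4bda0
```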

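5. HTTP review

What is HTTP

HTTP is the Hypertext Transfer Protocol; see the HTTP specification for the details.

It is mainly used for communication between browsers and web servers.
Main features of HTTP

1. Stateless: the server does not remember earlier requests

2. Connectionless: in its original form, each connection handles one request and is then closed

3. Supports multiple media types
HTTP message format
- Requests and responses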
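
To show what a request message looks like on the wire, here is a rough sketch that assembles the text of a form POST by hand. The helper build_post_request is hypothetical; libraries such as requests build this for you and add further headers. The host and path are taken from the translation example in this post.

```python
from urllib.parse import urlencode


def build_post_request(host, path, form):
    """Assemble the text of a minimal HTTP/1.1 form POST (illustration only)."""
    body = urlencode(form)  # e.g. {"a": "1", "b": "2"} -> "a=1&b=2"
    lines = [
        f"POST {path} HTTP/1.1",                            # request line
        f"Host: {host}",                                    # headers
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body.encode('utf-8'))}",
        "",                                                 # blank line ends the headers
        body,                                               # message body
    ]
    return "\r\n".join(lines)


request_text = build_post_request(
    "fanyi.youdao.com", "/translate", {"i": "hello", "doctype": "json"}
)
print(request_text)
```

A response has the same layout, except that the first line is a status line such as HTTP/1.1 200 OK.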

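Using the POST method to call Youdao Translate

1. Open Youdao Translate in Firefox

2. Right-click -> Inspect Element

3. In the developer tools panel that opens, select Network, as shown in the figure below

4. Find the request whose response type is JSON; this is usually the API endpoint we want

5. Select that endpoint, as shown in the figure above, and click Params; the form data listed there is what we need to submit

6. After analyzing the endpoint data, add the following code: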

import requests

while True:
    # Build the POST form body
    text = input('Youdao Translate: ')
    if not text:
        break  # an empty line ends the loop
    post_data = {'i': text, 'doctype': 'json'}
    # Send the POST request
    response = requests.post('http://fanyi.youdao.com/translate', data=post_data)
    # Parse the JSON result
    result = response.json()
    # translateResult is a nested list; each entry holds src (source text) and tgt (translation)
    print(result['translateResult'][0][0]['tgt'])
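
The parsing step can be tried without touching the network. The sketch below assumes only the response shape that the code above reads (translateResult as a list of lists of objects with src and tgt); the helper name extract_translations and the sample payload are hypothetical.

```python
import json


def extract_translations(payload):
    """Collect (src, tgt) pairs from a decoded translate response."""
    pairs = []
    for paragraph in payload.get("translateResult", []):
        for segment in paragraph:
            pairs.append((segment["src"], segment["tgt"]))
    return pairs


# Hypothetical response text, shaped like the fields the code above reads
sample_text = '{"translateResult": [[{"src": "hello", "tgt": "你好"}]]}'
sample = json.loads(sample_text)
print(extract_translations(sample))
```

Looping over both list levels avoids assuming that the result always contains exactly one segment, which the [0][0] indexing does.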

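Cookies and sessions

What are cookies and sessions

1. They mainly compensate for HTTP being connectionless and stateless, so that the server can recognize a user

2. Cookies are a set of identifying data stored on the client (like a membership card); a session is data stored on the server

3. Cookies and sessions are linked through the session ID

4. After a client logs in successfully and closes the page, if it visits the site again within a certain period (before the timeout expires), the browser automatically attaches the matching cookies (which carry the session ID) and sends them to the server. The server checks whether that session ID has timed out; if it has not, it serves the request as a logged-in user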
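
The server side of point 4 can be sketched as a small in-memory table keyed by session ID. This is a simplified illustration, not how any particular web framework implements sessions; the class SessionStore, the user name and the timeout value are all hypothetical.

```python
import secrets
import time


class SessionStore:
    """Hypothetical in-memory server-side session table."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._sessions = {}  # session ID -> (username, last access time)

    def login(self, username, now=None):
        # On successful login the server creates a session and returns its ID;
        # the browser would keep this ID in a cookie.
        now = time.time() if now is None else now
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = (username, now)
        return session_id

    def lookup(self, session_id, now=None):
        # Return the username if the session exists and has not timed out.
        now = time.time() if now is None else now
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        username, last_seen = entry
        if now - last_seen > self.timeout_seconds:
            del self._sessions[session_id]  # expired: the user must log in again
            return None
        self._sessions[session_id] = (username, now)  # refresh the last access time
        return username


store = SessionStore(timeout_seconds=1800)
cookie = {"sessionid": store.login("alice", now=0)}   # what the client keeps
print(store.lookup(cookie["sessionid"], now=600))     # alice
print(store.lookup(cookie["sessionid"], now=5000))    # None
```

The explicit now argument exists only so the timeout can be demonstrated without waiting; a real server would use the current time.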
Using cookies to skip login
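
As a rough standard-library sketch of the idea: read the session ID out of a Set-Cookie header, then attach it to a later request. The header value, the cookie name sessionid and the example.com URL are assumptions made for illustration; nothing is sent over the network here. With requests, the equivalent is passing a cookies dictionary to requests.get.

```python
from http.cookies import SimpleCookie
from urllib.request import Request

# Hypothetical Set-Cookie value, as a server might send after a successful login
set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"

jar = SimpleCookie()
jar.load(set_cookie_header)
session_id = jar["sessionid"].value

# Attach the saved cookie to a later request so the server can match it to the session
request = Request(
    "http://example.com/profile",
    headers={"Cookie": f"sessionid={session_id}"},
)
print(request.get_header("Cookie"))
```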

Exercise: fetch the weather forecast API data described at https://www.cnblogs.com/wangjingblogs/p/3192953.html, then parse and display it

import json
import requests
response = requests.get('http://www.weather.com.cn/data/sk/101180101.html')
# The body is UTF-8 encoded JSON; decode it explicitly to avoid garbled text
data = json.loads(response.content.decode('utf-8'))
print(data)
info = data['weatherinfo']
print(info['city'], info['WD'], info['njd'], info['WS'], info['WSE'], info['time'])

Source: https://blog.csdn.net/weixin_43788061/article/details/87939570