Hi everyone, I'm 代码小小顾 from 考100. Good luck with your studies and your next raise. Today's topic is Python crawler code; I hope it takes your programming a step further.
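Python crawlers
What is a crawler
- A crawler imitates a browser: it sends requests and receives the responses
Crawler workflow
- Send the request
- Receive the response
- Extract the data
- Save it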
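The four steps above can be sketched as a tiny pipeline. To keep it runnable offline, `fetch` below returns a canned HTML string instead of sending a real request; the sample page and the names `fetch`, `extract_title`, `save`, and `run_pipeline` are illustrative assumptions, not part of the original post.

```python
import json
import re

# Hypothetical canned page so the sketch runs without network access.
SAMPLE_PAGES = {
    "http://example.com/": "<html><head><title>Example Domain</title></head><body>hi</body></html>",
}

def fetch(url):
    """Steps 1-2: 'send the request' and return the response body (stubbed here)."""
    return SAMPLE_PAGES[url]

def extract_title(html):
    """Step 3: pull the <title> text out of the HTML."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

def save(records):
    """Step 4: serialize the results (a real crawler would write to a file or database)."""
    return json.dumps(records, ensure_ascii=False)

def run_pipeline(url):
    html = fetch(url)
    record = {"url": url, "title": extract_title(html)}
    return save([record])

print(run_pipeline("http://example.com/"))
```

In a real crawler, `fetch` would be replaced by a call such as `requests.get(url).text`.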
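Request headers
Request headers let a crawler imitate a browser
- Host: host name and port
- Connection: connection type (for example keep-alive)
- Upgrade-Insecure-Requests: tells the server the client prefers to be upgraded to HTTPS
- Accept: content types the client can handle
- Referer: the page the request was made from
- Accept-Encoding: content encodings (compression formats) the client accepts
- Cookie: cookies sent back to the server
- x-requested-with: set to XMLHttpRequest by Ajax (asynchronous) requests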
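A common way to get these headers is to copy them from the browser's developer tools. The helper below is a minimal sketch that turns such copied text into a dict that can be passed as `headers=`; the function name `parse_raw_headers` and the sample values are assumptions made for illustration.

```python
def parse_raw_headers(raw):
    """Turn 'Name: value' lines into a dict suitable for a headers= argument."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split(":", 1)  # split once: values may contain ':'
        headers[name.strip()] = value.strip()
    return headers

# Sample header text (illustrative values only).
RAW = """
Host: www.baidu.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Accept: text/html,application/xhtml+xml
Referer: http://www.baidu.com/
Accept-Encoding: gzip, deflate
X-Requested-With: XMLHttpRequest
"""

headers = parse_raw_headers(RAW)
for name, value in headers.items():
    print(f"{name} -> {value}")
```

Splitting on the first colon only matters for values such as `Referer`, which contain a colon themselves.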
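Response status codes
- 200: success
- 302: temporarily moved to a new URL
- 307: temporarily moved to a new URL; the request method is kept unchanged
- 404: not found
- 500: internal server error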
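The codes listed above can be looked up with the standard library's `http.HTTPStatus`. The sketch below maps a code to its reason phrase and a broad class; the function name `describe` and the class labels are assumptions chosen for this example.

```python
from http import HTTPStatus

def describe(code):
    """Return (reason phrase, broad class) for an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return phrase, kind

for code in (200, 302, 307, 404, 500):
    print(code, *describe(code))
```

A crawler might use a check like this to decide whether to parse the page, follow a redirect, or retry later.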
Types of crawlers
- General-purpose crawler: usually a search engine's crawler
- Focused crawler: a crawler written for specific websites
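How crawlers work
Search engine workflow
- Fetch web pages
- Store the data
- Preprocess
- Provide search services and rank sites
Focused crawler workflow
- URL list
- Response content
- Extract data, and extract new URLs that go back into the URL list
- Store in the database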
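The focused crawler loop above (URL list, response, extraction, new URLs fed back) can be sketched as follows. To stay runnable offline it crawls a small in-memory "site" instead of the network; `FAKE_SITE`, `LinkParser`, and `crawl` are hypothetical names for this illustration, and the "database" is just a dict.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" standing in for real HTTP responses.
FAKE_SITE = {
    "http://site.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://site.test/a": '<a href="/b">B</a> <p>page a</p>',
    "http://site.test/b": '<a href="/">home</a> <p>page b</p>',
}

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(start_url, fetch=FAKE_SITE.get):
    queue = deque([start_url])   # the URL list
    seen = {start_url}           # avoids fetching the same URL twice
    store = {}                   # stands in for the database
    while queue:
        url = queue.popleft()
        html = fetch(url)        # response content
        if html is None:
            continue
        store[url] = html        # "extracted data" is just the raw HTML here
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:  # extracted URLs go back into the URL list
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return store

pages = crawl("http://site.test/")
print(sorted(pages))
```

The `seen` set is what keeps the loop from running forever when pages link back to each other.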
Sending a simple request
- Install the requests library
pip3 install requests
- Send a simple GET request
import requests

response = requests.get("http://www.baidu.com")
response.encoding = "utf-8"
text = response.text  # response body
status = response.status_code
print(text)
print(status)
- Send a request with custom headers
import requests

header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"}
response = requests.get("http://www.baidu.com", headers=header)
response.encoding = "utf-8"
text = response.text  # response body
status = response.status_code
print(text)
print(status)
print(response.request.headers)  # headers that were actually sent
- Send a request with query parameters
import requests

header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"}
par = {"wd": "张三"}
response = requests.get("http://www.baidu.com/s", headers=header, params=par)
response.encoding = "utf-8"
text = response.text  # response body
status = response.status_code
print(text)
print(status)
print(response.request.headers)  # headers that were actually sent
print(response.request.url)  # final URL, including the encoded query string
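To see what passing `params` does to the URL, the encoding step can be reproduced with the standard library alone. This is a minimal sketch, not how requests is implemented internally; `build_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def build_url(base, params):
    """Append percent-encoded query parameters to a base URL."""
    query = urlencode(params)  # non-ASCII text is encoded as UTF-8 by default
    separator = "&" if "?" in base else "?"
    return f"{base}{separator}{query}"

url = build_url("http://www.baidu.com/s", {"wd": "张三"})
print(url)  # http://www.baidu.com/s?wd=%E5%BC%A0%E4%B8%89
```

The value printed by `response.request.url` in the example above should have the same percent-encoded form.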
Sending a POST request
- Send a POST request with form data
import requests
from urllib.parse import urlencode

def get_content_length(data):
    # Length of the form body once it is URL-encoded
    return len(urlencode(data))

url = "http://localhost/"
par = {"txt_userName": "admin",
       "txt_userPwd": "123",
       "but_sigin": "登录",
       "__EVENTTARGET": "",
       "__EVENTARGUMENT": "",
       "__VIEWSTATE": "jvSnUXEL/VE5n7y6wjx8T+if6kxOtL5RZwOROEHiWmIseLfbsua+mkFpteAxZMTrtRVgaO7cQYj90Ziw1hvSv7KeCChJga9R4DYPeP77Ypw=",
       "__VIEWSTATEGENERATOR": "9005994241",
       "__EVENTVALIDATION": "s/6wUJU3A8q2IrZV4ockZ4bKJm1jj4l1IEJm/C1OQSyauSTIxHbtAXVI9DP8ARz9X0iFrjeted/manpeRySaa7fU+T1ssbkfEfNB0MkmEE347WV9/jow73gaNCnKWVg2REhDPfYJ/LR+oLQrqBqDawKEly5WTksOlKgVmxF7+Gc="}
print(get_content_length(par))
# requests sets Content-Length on its own as well, so that entry is optional
header = {"Referer": "http://localhost/",
          "Origin": "http://localhost",
          "Content-Type": "application/x-www-form-urlencoded",
          "Connection": "keep-alive",
          "Content-Length": str(get_content_length(par)),
          "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"}
response = requests.post(url, data=par, headers=header)
text = response.text
status = response.status_code
print(text)
print(status)
print(response.request.headers)  # headers that were sent
print(response.headers)  # headers the server returned
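The `__VIEWSTATE` and `__EVENTVALIDATION` values in the form above are hard-coded, and such hidden fields are normally generated by the server for each page load. One possible approach, sketched here as an assumption rather than the author's method, is to read them from the login page's HTML before posting. `HiddenFieldParser`, `hidden_fields`, and the sample form are hypothetical.

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> tags."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""

def hidden_fields(html):
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields

# Hypothetical login page fragment; real values would come from a GET of the form page.
SAMPLE_FORM = """
<form method="post" action="./">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123=" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz/789+" />
  <input type="text" name="txt_userName" />
</form>
"""

par = hidden_fields(SAMPLE_FORM)
par.update({"txt_userName": "admin", "txt_userPwd": "123"})
print(par)
```

The resulting dict can then be passed as `data=` to `requests.post`.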
Proxies
import requests

header = {"Connection": "keep-alive",
          "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"}
# Set the proxy IP; the "https" entry is needed because the target URL is HTTPS
proxies = {"http": "http://118.163.120.181:52458",
           "https": "http://118.163.120.181:52458"}
response = requests.get("https://www.baidu.com", headers=header, proxies=proxies)
response.encoding = "utf-8"
print(response.text)
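requests chooses a proxy by the scheme of the target URL, so a mapping that only has an `"http"` key is not applied to `https://` URLs. The helper below is a small sketch that builds a mapping covering both schemes; `make_proxies` is a hypothetical name, and the address is the one used in the example above.

```python
def make_proxies(host, port, scheme="http"):
    """Build a requests-style proxies mapping that covers both http and https URLs."""
    proxy_url = f"{scheme}://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}

proxies = make_proxies("118.163.120.181", 52458)
print(proxies)
```

The returned dict can be passed directly as `proxies=` to `requests.get` or `requests.post`.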
Keeping a session
import requests
from urllib.parse import urlencode

def get_content_length(data):
    # Length of the form body once it is URL-encoded
    return len(urlencode(data))

url = "http://localhost/"
par = {"txt_userName": "admin",
       "txt_userPwd": "123",
       "but_sigin": "登录",
       "__EVENTTARGET": "",
       "__EVENTARGUMENT": "",
       "__VIEWSTATE": "jvSnUXEL/VE5n7y6wjx8T+if6kxOtL5RZwOROEHiWmIseLfbsua+mkFpteAxZMTrtRVgaO7cQYj90Ziw1hvSv7KeCChJga9R4DYPeP77Ypw=",
       "__VIEWSTATEGENERATOR": "9005994241",
       "__EVENTVALIDATION": "s/6wUJU3A8q2IrZV4ockZ4bKJm1jj4l1IEJm/C1OQSyauSTIxHbtAXVI9DP8ARz9X0iFrjeted/manpeRySaa7fU+T1ssbkfEfNB0MkmEE347WV9/jow73gaNCnKWVg2REhDPfYJ/LR+oLQrqBqDawKEly5WTksOlKgVmxF7+Gc="}
print(get_content_length(par))
header = {"Referer": "http://localhost/",
          "Origin": "http://localhost",
          "Content-Type": "application/x-www-form-urlencoded",
          "Connection": "keep-alive",
          "Content-Length": str(get_content_length(par)),
          "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"}
# The session stores the cookies set by the login response
session = requests.Session()
session.post(url, data=par, headers=header)
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"}
# Later requests on the same session send those cookies automatically
response = session.get("http://localhost/ChooseFunc.aspx", headers=header)
print(response.text)
Logging in with cookies
import requests
from urllib.parse import urlencode

def get_content_length(data):
    # Length of the form body once it is URL-encoded
    return len(urlencode(data))

url = "http://localhost/"
par = {"txt_userName": "admin",
       "txt_userPwd": "123",
       "but_sigin": "登录",
       "__EVENTTARGET": "",
       "__EVENTARGUMENT": "",
       "__VIEWSTATE": "jvSnUXEL/VE5n7y6wjx8T+if6kxOtL5RZwOROEHiWmIseLfbsua+mkFpteAxZMTrtRVgaO7cQYj90Ziw1hvSv7KeCChJga9R4DYPeP77Ypw=",
       "__VIEWSTATEGENERATOR": "9005994241",
       "__EVENTVALIDATION": "s/6wUJU3A8q2IrZV4ockZ4bKJm1jj4l1IEJm/C1OQSyauSTIxHbtAXVI9DP8ARz9X0iFrjeted/manpeRySaa7fU+T1ssbkfEfNB0MkmEE347WV9/jow73gaNCnKWVg2REhDPfYJ/LR+oLQrqBqDawKEly5WTksOlKgVmxF7+Gc="}
print(get_content_length(par))
header = {"Referer": "http://localhost/",
          "Origin": "http://localhost",
          "Content-Type": "application/x-www-form-urlencoded",
          "Connection": "keep-alive",
          "Content-Length": str(get_content_length(par)),
          "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"}
Cookie = "BAIDUID=FE80207B022A6E1E4D7BB1367A17F521:FG=1; BIDUPSID=FE80207B022A6E1E4D7BB1367A17F521; PSTM=1530773661; MCITY=-227%3A; BDUSS=U94WDlNWE9UNXdPQ2hLZUFJb3IzT1d6MVBra1J1MXBNTGxleDg5cElDLWx6dkpiQVFBQUFBJCQAAAAAAAAAAAEAAABjihIwxKu05unkAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAKVBy1ulQctbW; shifen[72403812623_78176]=1540113943; BCLID=14310731544559705912; BDSFRCVID=XaPsJeC629x9PMR7z-PHuUya6eMiOacTH6aomc6Zd6LmrCfkf5B3EG0PDf8g0KAbHA5togKKWeOTHxRP; H_BDCLCKID_SF=tRKDoC02tCI3fP36q4nSb-De2fob-C62aKDshnjx-hcqEIL4eJAB0MuwjpCLQUvtQ65K_J5z2l7HHUbSj4QohtJBWx8qWPrH0KcW0D5Pth5nhMJeb67JDMPF-47CtR3y523i-b5vQpnWVxtu-n5jHjQXDGKH3J; H_PS_645EC=008cULc%2Bp%2Fdi4lUhaDQ3S%2Fa27VwOr8z2caFujxq6eaPb1yA6qq8yWL76nYy%2F0mIg4K%2FPyQ; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; delPer=0; BD_CK_SAM=1; PSINO=6; BDRCVFR[4Zjqyl1bxbt]=aeXf-1x8UdYcs; BD_HOME=1; H_PS_PSSID=1448_21079_27244; BD_UPN=16314753"
# Split the cookie string into a dict; split on the first "=" only, since values may contain "="
cookies = {item.split("=", 1)[0]: item.split("=", 1)[1] for item in Cookie.split("; ")}
print(cookies)
response = requests.post(url, data=par, headers=header, cookies=cookies)
text = response.text
status = response.status_code
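Cookie values can themselves contain `=` (the `BAIDUID` value above ends in `:FG=1`), so the string has to be split on the first `=` of each item only. A small standalone sketch of that parsing step follows; `parse_cookie_header` and the shortened sample string are illustrative assumptions.

```python
def parse_cookie_header(cookie_string):
    """Parse a 'name=value; name2=value2' cookie string into a dict."""
    cookies = {}
    for item in cookie_string.split(";"):
        item = item.strip()
        if not item or "=" not in item:
            continue  # ignore empty or malformed fragments
        name, value = item.split("=", 1)  # split once: values may contain '='
        cookies[name] = value
    return cookies

sample = "BAIDUID=ABC123:FG=1; delPer=0; BD_HOME=1"
print(parse_cookie_header(sample))
```

The resulting dict can be passed as `cookies=` to `requests.get` or `requests.post`.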
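SSL certificates
- Pass verify=False to skip certificate verification, for example when the site uses a self-signed certificate
requests.get(login_url, verify=False)  # login_url: the HTTPS URL being requested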
Please credit the source when reposting: https://daima100.com/4216.html