Python Web Scraping (Part 6)
Published: 2019-06-13


Source code:

import re

import requests
from my_mysql import MysqlConnect  # local helper module, not shown in this post


# Fetch one page of questions and answer excerpts
def get_contents(page, headers):
    url = 'https://www.zhihu.com/api/v4/members/chen-lu-ya-26/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Cmark_infos%2Ccreated_time%2Cupdated_time%2Creview_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cvoting%2Cis_author%2Cis_thanked%2Cis_nothelp%3Bdata%5B*%5D.author.badge%5B%3F(type%3Dbest_answerer)%5D.topics&offset={}&limit=20&sort_by=created'.format(page)
    req = requests.get(url, headers=headers)
    html_json_dict = req.json()
    # print(html_json_dict)
    data_list = html_json_dict['data']
    contents = []
    for item in data_list:
        question = item['question']['title']
        excerpt = item['excerpt']
        if '<' in excerpt:
            # Text before the first tag and after the last tag
            pat = r'(.*?)<.*>(.*)'
            res = re.search(pat, excerpt)
            if res:  # skip excerpts with a bare '<' and no closing '>'
                front = res.group(1)
                back = res.group(2)
                # Text enclosed between pairs of tags
                pat = r'<.*?>(.*?)<.*?>'
                res = re.findall(pat, excerpt)
                middle = ' '.join(res)
                excerpt = front + middle + back
        contents.append((question, excerpt))
    return contents


if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    mc = MysqlConnect('127.0.0.1', 'root', '123456', 'homework')
    # 8 pages of 20 answers each: offsets 0, 20, ..., 140
    for page in range(0, 20 * 8, 20):
        contents = get_contents(page, headers)
        # print(contents)
        for content in contents:
            sql = 'insert into zhihu values(null,%s,%s)'
            mc.exec_data(sql, content)
            print(content)
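
The excerpt cleanup in get_contents rebuilds the text from three regex captures (front, middle, back). As a possible alternative, and not what the original post does, the tags could be removed in a single substitution. The sketch below assumes the excerpts only contain simple tags such as <b> or <br>; the names TAG_RE and strip_tags and the sep parameter are made up for this illustration.

```python
import re

# Matches anything that looks like a tag: '<', one or more non-'>' characters, '>'.
TAG_RE = re.compile(r'<[^>]+>')


def strip_tags(excerpt, sep=''):
    """Replace every tag in excerpt with sep and trim surrounding whitespace.

    Hypothetical helper: a one-pass alternative to the front/middle/back
    approach, under the assumption that tags are simple and well formed.
    """
    return TAG_RE.sub(sep, excerpt).strip()


if __name__ == '__main__':
    print(strip_tags('Python<b>爬虫</b>入门'))  # Python爬虫入门
    print(strip_tags('line one<br>line two', sep=' '))  # line one line two
```

Unlike the original logic, this version does not insert a space between tagged fragments unless sep=' ' is passed, and it leaves a lone '<' with no matching '>' untouched. For anything beyond simple excerpts, a real HTML parser would be a safer choice than regular expressions.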

 

Reposted from: https://www.cnblogs.com/zhxd-python/p/9501313.html
