抓取乐视网视频评论信息
分析
仔细看timeline
第一个请求
url
http://v.stat.le.com/vplay/queryMmsTotalPCount?callback=jQuery19108667127834739425_1568040457178&vid=1333316&_=1568040457189
请求类型
get请求
请求参数
callback=jQuery19108667127834739425_1568040457178&vid=1333316&_=1568040457189
返回信息
jQuery19108667127834739425_1568040457178({ "down": 4, "is_comm": 1, "is_dm": 1, "media_play_count": 1886073, "up": 7, "vcomm_count": 1207, "vdm_count": 4399, "vnpcomm": 1144, "vreply": 128 })
第二个请求
- url
http://api.my.le.com/vcm/api/list?jsonp=jQuery19108667127834739425_1568040457178&type=video&rows=20&page=2&sort=&cid=2&source=1&xid=1333316&pid=47494&ctype=cmt%2Cimg%2Cvote&listType=1&_=1568040457190`请求中获取到的。
请求类型
get请求
请求参数
jsonp: jQuery19108667127834739425_1568040457178
type: video
rows: 20
page: 2
sort:
cid: 2
source: 1
xid: 1333316
pid: 47494
ctype: cmt,img,vote
listType: 1
_: 1568040457190
jsonp=jQuery19108667127834739425_1568040457178&type=video&rows=20&page=2&sort=&cid=2&source=1&xid=1333316&pid=47494&ctype=cmt%2Cimg%2Cvote&listType=1&_=1568040457190
- 返回信息
jQuery19108667127834739425_1568040457178({result: "200", total: 1207, replyTotal: 128,…}) announcementData: [] authData: [] data: [{_id: "5348635461928101094", content: "从第一遍看,哭的稀里哗啦,到现在还是很感动!军人是最辛苦的!保家卫国!青春全部献给国家!值得尊敬!",…},…] godData: [] replyTotal: 128 result: "200" rule: 1 topData: [] total: 1207 version_time: "2019-09-06 12:34:00"
猜想请求参数
- 从参数名称和前后请求关系
jsonp: jQuery19108667127834739425_1568040457178 ?
type: video
rows: 20 # 评论的条数
page: 2 # 页面
sort: # 排序
cid: 2 # ?
source: 1 #?
xid: 1333316 #?
pid: 47494 #?
ctype: cmt,img,vote#?
listType: 1#?
_: 1568040457190#?
- 从页面源代码
http://www.le.com/ptv/vplay/1333316.html
jsonp: jQuery19108667127834739425_1568040457178 ?
type: video
rows: 20 # 评论的条数
page: 2 # 页面
sort: # 排序
cid: 2 # 源代码中的 __INFO.cid
source: 1 #?
xid: 1333316 #源代码中的 __INFO.vid
pid: 47494 # 源代码中的 __INFO.pid
ctype: cmt,img,vote#?
listType: 1 #?
_: 1568040457190 #?
代码
import re
import json
import requests
class LetvSpider(object):
COMMENT_URL = "http://api.my.le.com/vcm/api/list?jsonp=jQuery19108667127834739425_1568040457178&type=video&rows=20&page=2&sort=&cid={cid}&source=1&xid={xid}&pid={pid}&ctype=cmt%2Cimg%2Cvote&listType=1&_=1568040457190"
HEADERS = {
"Referer": "http: // www.le.com/ptv/vplay/1333316.html",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
}
def __init__(self, url):
self.necessary_info = {}
self.url = url
self.get_necessary_id()
self.get_comment()
def get_request(self, url, headers):
return requests.get(url, headers).content.decode()
def get_necessary_id(self):
source = self.get_request(self.url, self.HEADERS)
cid = re.search('cid: (\d+)', source).group(1)
vid = re.search('vid: (\d+)', source).group(1)
pid = re.search('pid: (\d+)', source).group(1)
self.necessary_info['cid'] = cid
self.necessary_info['vid'] = vid
self.necessary_info['pid'] = pid
def get_comment(self):
url = self.COMMENT_URL.format(
cid=self.necessary_info['cid'], xid=self.necessary_info['vid'], pid=self.necessary_info['pid'])
source = self.get_request(url, self.HEADERS)
source_json = source[source.find('{"'):-1]
comment_dict = json.loads(source_json)
comments = comment_dict['data']
for comment in comments:
print(
f'发帖人:{comment["user"]["username"]}, 评论内容:{comment["content"]}')
if __name__ == "__main__":
spider = LetvSpider('http://www.le.com/ptv/vplay/1333316.html')