抓取乐视网视频评论信息

分析

仔细看timeline

第一个请求

url http://v.stat.le.com/vplay/queryMmsTotalPCount?callback=jQuery19108667127834739425_1568040457178&vid=1333316&_=1568040457189
请求类型

get请求

请求参数

callback=jQuery19108667127834739425_1568040457178&vid=1333316&_=1568040457189

返回信息

jQuery19108667127834739425_1568040457178({
  "down": 4,
  "is_comm": 1,
  "is_dm": 1,
  "media_play_count": 1886073,
  "up": 7,
  "vcomm_count": 1207,
  "vdm_count": 4399,
  "vnpcomm": 1144,
  "vreply": 128
})

第二个请求

http://api.my.le.com/vcm/api/list?jsonp=jQuery19108667127834739425_1568040457178&type=video&rows=20&page=2&sort=&cid=2&source=1&xid=1333316&pid=47494&ctype=cmt%2Cimg%2Cvote&listType=1&_=1568040457190`请求中获取到的。

请求类型

get请求
请求参数

jsonp: jQuery19108667127834739425_1568040457178
type: video
rows: 20
page: 2
sort: 
cid: 2
source: 1
xid: 1333316
pid: 47494
ctype: cmt,img,vote
listType: 1
_: 1568040457190

jsonp=jQuery19108667127834739425_1568040457178&type=video&rows=20&page=2&sort=&cid=2&source=1&xid=1333316&pid=47494&ctype=cmt%2Cimg%2Cvote&listType=1&_=1568040457190

返回信息

jQuery19108667127834739425_1568040457178({result: "200", total: 1207, replyTotal: 128,…})
announcementData: []
authData: []
data: [{_id: "5348635461928101094", content: "从第一遍看，哭的稀里哗啦，到现在还是很感动！军人是最辛苦的！保家卫国！青春全部献给国家！值得尊敬！",…},…]
godData: []
replyTotal: 128
result: "200"
rule: 1
topData: []
total: 1207
version_time: "2019-09-06 12:34:00"

猜想请求参数

从参数名称和前后请求关系

jsonp: jQuery19108667127834739425_1568040457178 ?
type: video
rows: 20 # 评论的条数
page: 2 #  页面
sort:  # 排序
cid: 2 # ?
source: 1 #?
xid: 1333316 #?
pid: 47494 #?
ctype: cmt,img,vote#?
listType: 1#?
_: 1568040457190#?

从页面源代码

http://www.le.com/ptv/vplay/1333316.html

jsonp: jQuery19108667127834739425_1568040457178 ?
type: video
rows: 20 # 评论的条数
page: 2 #  页面
sort:  # 排序
cid: 2 # 源代码中的 __INFO.cid
source: 1 #?
xid: 1333316 #源代码中的 __INFO.vid
pid: 47494 # 源代码中的 __INFO.pid
ctype: cmt,img,vote#?
listType: 1 #?
_: 1568040457190 #?

代码

import re
import json
import requests


class LetvSpider(object):
    COMMENT_URL = "http://api.my.le.com/vcm/api/list?jsonp=jQuery19108667127834739425_1568040457178&type=video&rows=20&page=2&sort=&cid={cid}&source=1&xid={xid}&pid={pid}&ctype=cmt%2Cimg%2Cvote&listType=1&_=1568040457190"

    HEADERS = {
        "Referer": "http: // www.le.com/ptv/vplay/1333316.html",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
    }

    def __init__(self, url):
        self.necessary_info = {}
        self.url = url
        self.get_necessary_id()
        self.get_comment()

    def get_request(self, url, headers):
        return requests.get(url, headers).content.decode()

    def get_necessary_id(self):
        source = self.get_request(self.url, self.HEADERS)
        cid = re.search('cid: (\d+)', source).group(1)
        vid = re.search('vid: (\d+)', source).group(1)
        pid = re.search('pid: (\d+)', source).group(1)
        self.necessary_info['cid'] = cid
        self.necessary_info['vid'] = vid
        self.necessary_info['pid'] = pid

    def get_comment(self):
        url = self.COMMENT_URL.format(
            cid=self.necessary_info['cid'], xid=self.necessary_info['vid'], pid=self.necessary_info['pid'])
        source = self.get_request(url, self.HEADERS)
        source_json = source[source.find('{"'):-1]
        comment_dict = json.loads(source_json)
        comments = comment_dict['data']
        for comment in comments:
            print(
                f'发帖人:{comment["user"]["username"]}, 评论内容:{comment["content"]}')


if __name__ == "__main__":
    spider = LetvSpider('http://www.le.com/ptv/vplay/1333316.html')