抓取乐视网视频评论信息

分析

仔细看timeline

第一个请求

  • url http://v.stat.le.com/vplay/queryMmsTotalPCount?callback=jQuery19108667127834739425_1568040457178&vid=1333316&_=1568040457189

  • 请求类型

    get请求

  • 请求参数

    callback=jQuery19108667127834739425_1568040457178&vid=1333316&_=1568040457189
    
  • 返回信息

    jQuery19108667127834739425_1568040457178({
      "down": 4,
      "is_comm": 1,
      "is_dm": 1,
      "media_play_count": 1886073,
      "up": 7,
      "vcomm_count": 1207,
      "vdm_count": 4399,
      "vnpcomm": 1144,
      "vreply": 128
    })
    

第二个请求

  • url
http://api.my.le.com/vcm/api/list?jsonp=jQuery19108667127834739425_1568040457178&type=video&rows=20&page=2&sort=&cid=2&source=1&xid=1333316&pid=47494&ctype=cmt%2Cimg%2Cvote&listType=1&_=1568040457190`请求中获取到的。
  • 请求类型

    get请求

  • 请求参数

jsonp: jQuery19108667127834739425_1568040457178
type: video
rows: 20
page: 2
sort: 
cid: 2
source: 1
xid: 1333316
pid: 47494
ctype: cmt,img,vote
listType: 1
_: 1568040457190

jsonp=jQuery19108667127834739425_1568040457178&type=video&rows=20&page=2&sort=&cid=2&source=1&xid=1333316&pid=47494&ctype=cmt%2Cimg%2Cvote&listType=1&_=1568040457190
  • 返回信息
    jQuery19108667127834739425_1568040457178({result: "200", total: 1207, replyTotal: 128,…})
    announcementData: []
    authData: []
    data: [{_id: "5348635461928101094", content: "从第一遍看,哭的稀里哗啦,到现在还是很感动!军人是最辛苦的!保家卫国!青春全部献给国家!值得尊敬!",…},…]
    godData: []
    replyTotal: 128
    result: "200"
    rule: 1
    topData: []
    total: 1207
    version_time: "2019-09-06 12:34:00"
    

猜想请求参数

  1. 从参数名称和前后请求关系
jsonp: jQuery19108667127834739425_1568040457178 ?
type: video
rows: 20 # 评论的条数
page: 2 #  页面
sort:  # 排序
cid: 2 # ?
source: 1 #?
xid: 1333316 #?
pid: 47494 #?
ctype: cmt,img,vote#?
listType: 1#?
_: 1568040457190#?
  1. 从页面源代码
http://www.le.com/ptv/vplay/1333316.html
jsonp: jQuery19108667127834739425_1568040457178 ?
type: video
rows: 20 # 评论的条数
page: 2 #  页面
sort:  # 排序
cid: 2 # 源代码中的 __INFO.cid
source: 1 #?
xid: 1333316 #源代码中的 __INFO.vid
pid: 47494 # 源代码中的 __INFO.pid
ctype: cmt,img,vote#?
listType: 1 #?
_: 1568040457190 #?

代码

import re
import json
import requests


class LetvSpider(object):
    COMMENT_URL = "http://api.my.le.com/vcm/api/list?jsonp=jQuery19108667127834739425_1568040457178&type=video&rows=20&page=2&sort=&cid={cid}&source=1&xid={xid}&pid={pid}&ctype=cmt%2Cimg%2Cvote&listType=1&_=1568040457190"

    HEADERS = {
        "Referer": "http: // www.le.com/ptv/vplay/1333316.html",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
    }

    def __init__(self, url):
        self.necessary_info = {}
        self.url = url
        self.get_necessary_id()
        self.get_comment()

    def get_request(self, url, headers):
        return requests.get(url, headers).content.decode()

    def get_necessary_id(self):
        source = self.get_request(self.url, self.HEADERS)
        cid = re.search('cid: (\d+)', source).group(1)
        vid = re.search('vid: (\d+)', source).group(1)
        pid = re.search('pid: (\d+)', source).group(1)
        self.necessary_info['cid'] = cid
        self.necessary_info['vid'] = vid
        self.necessary_info['pid'] = pid

    def get_comment(self):
        url = self.COMMENT_URL.format(
            cid=self.necessary_info['cid'], xid=self.necessary_info['vid'], pid=self.necessary_info['pid'])
        source = self.get_request(url, self.HEADERS)
        source_json = source[source.find('{"'):-1]
        comment_dict = json.loads(source_json)
        comments = comment_dict['data']
        for comment in comments:
            print(
                f'发帖人:{comment["user"]["username"]}, 评论内容:{comment["content"]}')


if __name__ == "__main__":
    spider = LetvSpider('http://www.le.com/ptv/vplay/1333316.html')