异步加载技术与请求头

异步加载技术 ajax

判断是否使用

在界面上显示的东西，在源代码中不可见，数据是经过请求之后以某种形式回传过来的，一般是json。这种情况下八成采用了异步加载技术。

`json`与列表或字典之间的相互转换

列表/字典转json

import json
json_1 = json.dumps(list_1, indent=4)
json_2 = json.dumps(dict_2, indent=4)

python中的None,在json中会变成null
python中的True和False，在json中会变成true和false
json的字符串总会使用双引号
中文在json中会变成Unicode码
为了方便阅读，可以采用缩进indent=4

json转为列表/字典

dict_1 =  json.loads(json_1)

模拟get

如果在控制台工具中显示为get请求，那么我只需要访问Request URL即可得到数据

# 获取大麦网第一页数据
def query(url):
    json_data = requests.get(url).content.decode()
    return json_data
content = query('https://search.damai.cn/searchajax.html?keyword=&cty=&ctl=&sctl=&tsg=0&st=&et=&order=1&pageSize=30&currPage=1&tn=')

模拟post

def query(url,params):
    json_data = requests.post(url, json=params).content.decode()
    return json_data

特殊的异步加载

加载对象不是数据，而是页面本身。

在network中并不是通过post或者get的方式获取到页面。

伪装成异步加载的后端渲染。

secret = '{"code": "\u884c\u52a8\u4ee3\u53f7\uff1a\u5929\u738b\u76d6\u5730\u864e"}'
secret_json = json.loads(secret)
print(secret_json)
# {'code': '行动代号：天王盖地虎'}

处理方法：首先找到数据，之后用正则表达式从页面中把数据提取出来，然后直接解析。

import json
import requests
import re


def query(url):
    html = requests.get(url).content.decode()
    return html


page_html = query('http://exercise.kingname.info/exercise_ajax_2.html')
code_json = re.search("secret = '(.*?)'", page_html, re.S).group(1)
code_dict = json.loads(code_json)
print(code_dict['code'])

多次异步加载

下一次的异步加载的请求数据时上一次异步加载请求到的数据

解决方法，注意看timeline图，看请求的先后顺序，构造请求，上一次请求完成之后再请求另一条。

分析页面http://exercise.kingname.info/exercise_ajax_3.html的timeline我们可以发现，请求顺序如下:

exercise_ajax_3.html ==> loaddata_3.js==>ajax_backend=>ajax_3_postbackend

按照这个顺序去找线索

分析exercise_ajax_3.html中我们可以发现两条线索。
它是一个get请求
在返回源代码中有特殊的信息

 <script> var secret_2 = 'kingname';</script>

分析loaddata_3.js
它是一个get请求.
请求的参数如下:其中去获取了一个code
返回了一个ajax请求

{name: "xx", age: 24, secret1: JSON.parse(get_data)['code'], secret2: secret_2}

分析ajax_backend
get请求
返回信息

{"code": "kingname is genius.", "success": true}

分析ajax_3_postbackend
post请求
请求参数

{"name":"xx","age":24,"secret1":"kingname is genius.","secret2":"kingname"}

返回信息:最终信息

{"code": "\u884c\u52a8\u4ee3\u53f7\uff1a\u54ce\u54df\u4e0d\u9519\u54e6", "success": true}

最终推出： exercise_ajax_3.html + ajax_backend =>ajax_3_postbackend => 最终信息

import requests
import re
import json


def query(url, action, params):
    if action == "get":
        html = requests.get(url).content.decode()

    if action == "post":
        html = requests.post(url, json=params).content.decode()

    return html


html_content1 = query('http://exercise.kingname.info/exercise_ajax_3.html', "get", '')
secret2 = re.search("secret_2 = '(.*?)'", html_content1, re.S).group(1)

html_content2 = query('http://exercise.kingname.info/ajax_3_backend', 'get', '')
secret1 = json.loads(html_content2)['code']

query_params = {"name": "xx", "age": 24, "secret1": secret1, "secret2": secret2}
html_content3 = query('http://exercise.kingname.info/ajax_3_postbackend', 'post', query_params)
code = json.loads(html_content3)['code']
print(code)
# 行动代号：哎哟不错哦

请求头 Headers

Headers称为请求头，浏览器可以将一些信息通过Headers传递给服务器，服务器也可以将一些信息通过Headers传递给浏览器。比如Cookies

将Request Headers复制过来,作为headers参数

import requests
import json


def query(url, action, params={}, query_header={}):
    if action == "get":
        html = requests(url, json=params, headers=query_header).content.decode()

    if action == "post":
        html = requests.post(url, json=params, headers=query_header).content.decode()

    return html


# 复制过来的Request Headers
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh,en-US;q=0.9,en;q=0.8,zh-CN;q=0.7',
    'anhao': 'kingname',
    'Connection': 'keep-alive',
    'Content-Type': 'application/json; charset=utf-8',
    'Cookie': '__cfduid=d8123cbb54db1bb4ecea33258bb2fbff01567848717',
    'Host': 'exercise.kingname.info',
    'Referer': 'http://exercise.kingname.info/exercise_headers.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}
html_content = query('http://exercise.kingname.info/exercise_headers_backend', 'get', '', headers)
print(json.loads(html_content))

高级ajax请求

含有特殊字符串用于验证身份，称为Token,Token一直在变化。

相应的算法在js文件中，但是此类文件一半经过了代码混淆
使用Selenium,源代码不可见，但是控制台开发工具可见

Selenium

网页自动化测试工具，可以通过代码来操作网页上的各个元素

安装

如下，pycharm将会自动安装

import selenium

下载ChromeDriver 浏览器驱动程序

根据自己google浏览器的版本号去查找对应的chromedriver下载

chrome://version/

http://chromedriver.storage.googleapis.com/index.html

将chromedriver与要模拟浏览器的代码放在一起

保证有Google Chrome浏览器

执行以下代码

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


# 指定用chromedriver来解析网页
driver = webdriver.Chrome('./chromedriver_win32/chromedriver.exe')
driver.get('http://exercise.kingname.info/exercise_advanced_ajax.html')

try:
    WebDriverWait(driver, 30).until(EC.text_to_be_present_in_element((By.CLASS_NAME, 'content'), '通关'))
except Exception as _:
    print("请求超时")

element = driver.find_element_by_xpath('//div[@class="content"]')
print(f"异步加载的数据=== {element.text}")
driver.quit()

语法

等待时长

WebDriverWait(driver, 30)

等待30s，每0.5s检查一次

定位

presence_of_element_located 通过某个元素定位

参数：元祖，元祖的第0项为By.XX, 第1项为具体内容
text_to_be_present_in_element html元素的text文本中含使用某个文字

参数：第一个参数为元祖，元祖的第0项是By.XX, 第1项是具体标签内容；第2个参数为部分或全部文本，又或者是一段正则表达式

标志符

By.CLASS_NAME 通过某个class匹配
By.ID 通过id匹配
By.NAME 通过name匹配
By.XPath 通过XPath匹配

查找元素

使用BS4语法

element = driver.find_element_by_id("passwd_id") #如果有多个符合条件的，返回第一个
element = driver.find_element_by_name("passwd") # 如果有多个符合条件的，返回第一个
element_list = driver.find_elements_by_id("passwd-id") #以列表的形式返回所有的符合条件elements
element_list = driver.find_elements_by_name("passed") # 以列表的形式返回所有的符合条件element

使用XPath语法

element = driver.find_elment_by_xpath("//input[@id='passwd-id']")
# 如果有多个符合条件的，返回第一个
elements = driver.find_elements_by_xpath("//div[@id='passwd-id']")
# 以列表形式返回所有的符合条件的element

其他

find_element找到的是一个Elements对象，或者由一个Elements对象组成的列表
find_element 找不到元素将会抛出一个异常，所以，如果不确定元素是否存在，那么必须使用"try...except Exception:" 把find_elment_by_xx 包起来
使用by_xpath时，如果要读取标签里面的文本，不能直接使用text()访问，只能返回Element对象的.text属性

comment = driver.find_element_by_xpath("//div[@class='content']").text

实例：抓取乐视网视频评论信息

抓取乐视网视频评论信息