Getting Network Request Responses from the Browser with Python Selenium

This article was posted to the Python board of the Nemo community.

Published 2024/01/18 10:43

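When driving a browser with Selenium, besides what is shown on the page, you sometimes also need to look at the network requests the browser sends. The result data of some of those requests is never rendered on the page, yet it matters for the actual scraping task.

In newer versions of Selenium (4.x) you can simply enable the performance log: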

from selenium import webdriver
from selenium.webdriver.common.by import By
from urllib.parse import urlparse, parse_qs
import json

# Configure Chrome and enable the performance log.
# In Selenium 4 capabilities are set on the options object; the
# desired_capabilities argument was deprecated and later removed.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--enable-logging")
chrome_options.add_argument("--v=1")
chrome_options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
# The 'w3c': False option is dropped: Selenium 4 only speaks the W3C protocol
chrome_options.add_experimental_option('perfLoggingPrefs', {
    'enableNetwork': True,
})

# Start Chrome
driver = webdriver.Chrome(options=chrome_options)

# Visit the page
driver.get('http://example.com')

# Extract invite_id from the current URL
current_url = driver.current_url
parsed_url = urlparse(current_url)
query_params = parse_qs(parsed_url.query)
invite_id = query_params.get('invite_id', [None])[0]  # invite_id, or None if it is absent

# Wait for the page to finish loading
# ... other code omitted ...

# Read the performance log
logs = driver.get_log('performance')

# Scan the log for the response to a specific request
for entry in logs:
    log = json.loads(entry['message'])['message']
    if (
        log['method'] == 'Network.responseReceived' and
        'get_invite_info' in log['params']['response']['url'] and
        invite_id and
        f'invite_id={invite_id}' in log['params']['response']['url']
    ):
        # Fetch the response body (requires a CDP command)
        request_id = log['params']['requestId']
        response_body = driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': request_id})
        # The body is base64-encoded when response_body['base64Encoded'] is true
        print(response_body['body'])  # Print the response body

# Clean up
driver.quit()
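
The filtering step in the loop above can be pulled out into a small pure function, which makes it possible to exercise the matching logic without a browser. The following is only a minimal sketch: the helper name `find_request_ids` and the sample entries are hypothetical, and it assumes each entry has the shape returned by `driver.get_log('performance')`, that is, a dict whose `message` field is a JSON string.

```python
import json


def find_request_ids(entries, url_substring):
    """Return the requestId of every Network.responseReceived event
    whose response URL contains url_substring."""
    request_ids = []
    for entry in entries:
        message = json.loads(entry['message'])['message']
        if message.get('method') != 'Network.responseReceived':
            continue
        url = message['params']['response']['url']
        if url_substring in url:
            request_ids.append(message['params']['requestId'])
    return request_ids


# Hypothetical log entries, for illustration only
sample_entries = [
    {'message': json.dumps({'message': {
        'method': 'Network.responseReceived',
        'params': {
            'requestId': '1000.1',
            'response': {'url': 'http://example.com/get_invite_info?invite_id=42'},
        },
    }})},
    {'message': json.dumps({'message': {
        'method': 'Network.requestWillBeSent',
        'params': {
            'requestId': '1000.2',
            'request': {'url': 'http://example.com/other'},
        },
    }})},
]

print(find_request_ids(sample_entries, 'get_invite_info'))
```

Each returned id can then be passed to `Network.getResponseBody` as in the loop above.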

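However, I happen to be using Selenium 3.x, where the approach above is not usable, so I tried two alternatives: one replays the request from JS, the other intercepts the native JS request functions.

With anti-scraping measures in mind, I ended up going with the second one.

The principle is simple:

Override the low-level JS functions that send network requests and inject listener code into them. When the page issues a request, the wrapper calls the original function and saves the response to the global window object, so the response can later be read straight from window.

Here is the code: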
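
The wrap-and-record idea is not specific to JS. As an analogy only (this is not the code injected into the browser), the same pattern can be sketched in plain Python. Every name here (`collected_results`, `record_results`, `fake_send`) is hypothetical.

```python
import functools

# Plays the role of the global window object that stores the responses
collected_results = {}


def record_results(func):
    """Wrap func so that every return value is also stored under its first argument."""
    @functools.wraps(func)
    def wrapper(key, *args, **kwargs):
        result = func(key, *args, **kwargs)  # call the original function
        collected_results[key] = result      # keep a copy for later inspection
        return result                        # the caller sees unchanged behavior
    return wrapper


def fake_send(url):
    """Hypothetical stand-in for a function that performs a network request."""
    return f'response for {url}'


# Replace the original with the recording wrapper, as the JS does with fetch
fake_send = record_results(fake_send)

fake_send('/api/get_invite_info')
print(collected_results)
```

The caller of `fake_send` notices nothing, while the recorded value can be read from `collected_results` afterwards.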

    from selenium.webdriver.remote.webdriver import WebDriver
    from selenium.webdriver.support.ui import WebDriverWait


    def inject_response_listener(driver: WebDriver):
        """
        Inject the response listener into the current page.
        """
        driver.execute_script("""
            // Do not wrap the functions twice if the listener is already installed
            if (window.collectedResponses) { return; }

            // Keep references to the original functions
            var originalFetch = window.fetch;
            var originalXHRSend = XMLHttpRequest.prototype.send;

            // Storage for the collected responses, keyed by URL
            window.collectedResponses = {};

            // Override fetch to capture responses
            window.fetch = function() {
                var fetchCall = originalFetch.apply(this, arguments);
                fetchCall.then(function(response) {
                    // Read a clone so the page can still consume the original body
                    var clonedResponse = response.clone();
                    return clonedResponse.text().then(function(body) {
                        window.collectedResponses[response.url] = body;
                    });
                }).catch(function() {});  // failures are left to the page's own handlers
                return fetchCall;
            };

            // Override XMLHttpRequest.send to capture responses.
            // One load listener per request is enough, so open() is left untouched.
            XMLHttpRequest.prototype.send = function() {
                this.addEventListener('load', function() {
                    // responseText can only be read for text responses
                    if (this.responseType === '' || this.responseType === 'text') {
                        window.collectedResponses[this.responseURL] = this.responseText;
                    }
                });
                return originalXHRSend.apply(this, arguments);
            };
        """)

    def get_response(driver: WebDriver, match_url):
        """
        Return the response of a network request made by the browser.
        Must be used together with inject_response_listener.
        """
        # Wait for the page to finish loading
        WebDriverWait(driver, 10).until(
            lambda d: d.execute_script('return document.readyState') == 'complete'
        )

        # Read the stored responses
        response = driver.execute_script("""
            for (var url in window.collectedResponses) {
                if (url.includes(arguments[0])) {
                    return window.collectedResponses[url];
                }
            }
            return null;  // no URL matched
        """, match_url)
        return response
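
`document.readyState` reaching `complete` does not guarantee that an asynchronous request has already returned, so a single read of the stored responses can come back empty. One possible way around this is to poll for a short while. The sketch below is an assumption-laden example rather than part of the original code: `wait_for_response` and its `timeout` and `interval` parameters are hypothetical, and it relies only on `driver.execute_script`.

```python
import time


def wait_for_response(driver, match_url, timeout=10.0, interval=0.5):
    """Poll window.collectedResponses until a URL containing match_url shows up.

    Returns the stored body, or None if nothing matched before the timeout.
    """
    script = """
        var store = window.collectedResponses || {};
        for (var url in store) {
            if (url.includes(arguments[0])) {
                return store[url];
            }
        }
        return null;
    """
    deadline = time.monotonic() + timeout
    while True:
        body = driver.execute_script(script, match_url)
        if body is not None:
            return body
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

Calling `wait_for_response(driver, 'get_invite_info')` after the action that triggers the request gives the page up to ten seconds to produce the response.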

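Just call the injection function before the page issues the requests you care about, wait for the page to finish loading, and then call get_response to fetch the response. Note that a script injected with execute_script lives only in the current document, so it has to be injected again after every navigation.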
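
If the requests of interest fire during the initial page load, injecting before `driver.get` is not possible with `execute_script` alone. On Chromium-based browsers, one option is the CDP command `Page.addScriptToEvaluateOnNewDocument`, which registers a script to run at the start of every new document. The sketch below assumes the installed Selenium version exposes `execute_cdp_cmd` on the Chrome driver; `LISTENER_JS` (a shortened, fetch-only listener) and `inject_on_new_document` are hypothetical names.

```python
# Shortened listener for illustration: it covers fetch only
LISTENER_JS = """
    window.collectedResponses = window.collectedResponses || {};
    var originalFetch = window.fetch;
    window.fetch = function() {
        var fetchCall = originalFetch.apply(this, arguments);
        fetchCall.then(function(response) {
            return response.clone().text().then(function(body) {
                window.collectedResponses[response.url] = body;
            });
        }).catch(function() {});
        return fetchCall;
    };
"""


def inject_on_new_document(driver, source=LISTENER_JS):
    """Register source so the browser runs it before the page's own scripts
    in every document loaded from now on (Chromium-based browsers only)."""
    return driver.execute_cdp_cmd(
        'Page.addScriptToEvaluateOnNewDocument', {'source': source}
    )
```

With this in place, the call order becomes `inject_on_new_document(driver)`, then `driver.get(...)`, then reading `window.collectedResponses`.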
