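Preface

Crawlers generally work in one of two ways:

Fetching data through an API
Scraping pages

Page scrapers range from ones that can only fetch static pages to ones that can simulate a login.

selenium-java, covered here, scrapes data by driving a real browser, so it can do anything a browser can do, including simulating a login.

It works by using a driver to launch a local browser and connect to it. The browser runs in a separate process from the calling program, which sends it commands to retrieve and process data.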
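
To make the command flow concrete, the sketch below builds (but does not send) the kind of HTTP requests a client issues to the driver process under the W3C WebDriver protocol. The session id and the `WireCommandDemo` helper are illustrative assumptions, not part of selenium-java, which builds and sends these requests for you.

```java
import java.util.List;

// Illustrative only: models the requests a WebDriver client sends to the driver process.
final class WireCommandDemo {

    // Escape backslashes and double quotes so the value is safe inside a JSON string.
    static String jsonString(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // "Navigate To" in the W3C WebDriver spec: POST /session/{session id}/url
    static String navigateTo(String sessionId, String url) {
        return "POST /session/" + sessionId + "/url {\"url\":" + jsonString(url) + "}";
    }

    // "Find Elements": POST /session/{session id}/elements with a locator strategy
    static String findElementsByCss(String sessionId, String selector) {
        return "POST /session/" + sessionId + "/elements "
                + "{\"using\":\"css selector\",\"value\":" + jsonString(selector) + "}";
    }

    // "Delete Session": DELETE /session/{session id}, which ends the browser session
    static String deleteSession(String sessionId) {
        return "DELETE /session/" + sessionId;
    }

    public static void main(String[] args) {
        String sessionId = "example-session"; // assumed; the driver assigns the real id
        List<String> commands = List.of(
                navigateTo(sessionId, "https://www.psvmc.cn/"),
                findElementsByCss(sessionId, "a.post-title-link"),
                deleteSession(sessionId));
        commands.forEach(System.out::println);
    }
}
```

Each call such as `driver.get(...)` or `driver.findElements(...)` corresponds roughly to one such request, which is why the browser can live in a separate process.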

Adding dependencies

<dependencies>
    <!-- Selenium WebDriver core dependency -->
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>4.15.0</version>
    </dependency>

    <!-- WebDriverManager: manages browser drivers automatically -->
    <dependency>
        <groupId>io.github.bonigarcia</groupId>
        <artifactId>webdrivermanager</artifactId>
        <version>6.1.0</version>
    </dependency>

    <dependency>
        <groupId>ch.qos.logback</groupId>
        <artifactId>logback-classic</artifactId>
        <version>1.5.13</version>
    </dependency>
</dependencies>

Scraping

Basic example

import io.github.bonigarcia.wdm.WebDriverManager;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.chromium.ChromiumDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.net.MalformedURLException;
import java.net.URI;
import java.time.Duration;
import java.util.List;

public class ZPageProcessor {

    public static void main(String[] args) throws MalformedURLException {
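        // Download and configure ChromeDriver automatically, using a mirror repository
        WebDriverManager
                .chromedriver()
                .driverRepositoryUrl(URI.create("https://registry.npmmirror.com/-/binary/chromedriver").toURL())
                .setup();
        // Configure Chrome options (headless mode is optional)
        ChromeOptions options = new ChromeOptions();
//        options.addArguments("--headless");

        // Create the WebDriver instance
        ChromiumDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://www.psvmc.cn/");
            // Wait until the post titles are present, rather than sleeping after the lookup
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(5));
            List<WebElement> postTitleList = wait.until(
                    ExpectedConditions.presenceOfAllElementsLocatedBy(By.cssSelector("a.post-title-link")));
            for (WebElement postTitle : postTitleList) {
                String postTitleText = postTitle.getText();
                System.out.println("postTitleText: " + postTitleText);
            }

            Thread.sleep(5000); // Keep the browser open briefly so the result can be inspected
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            driver.quit(); // Close the browser even if scraping fails
        }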
    }
}

Login example

driver.get("https://schooltest.xhkjedu.com/#/login");

WebElement usernameField = driver.findElement(By.cssSelector("input[placeholder=\"请输入账号\"]"));
WebElement passwordField = driver.findElement(By.cssSelector("input[placeholder=\"请输入密码\"]"));

usernameField.sendKeys("username");
passwordField.sendKeys("userpwd");
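List<WebElement> buttonList = driver.findElements(By.cssSelector("button"));
for (WebElement button : buttonList) {
    // "登 录" is the label of the login button on this page
    if (button.getText().equals("登 录")) {
        button.click();
        break; // Stop here: the page may change after the click and invalidate the other elements
    }
}

// Wait for the page to load (adjust to the actual situation)
Thread.sleep(5000);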

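Common APIs

Downloading and configuring the driver automatically

This uses a mirror to download the driver.

Note that the driver is not the browser; it is the bridge that connects to the browser.

// Download and configure ChromeDriver automatically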
WebDriverManager
        .chromedriver()
        .useMirror()
        .setup();

Configuring and loading a page

// Configure Chrome options (headless mode is optional)
ChromeOptions options = new ChromeOptions();
//options.addArguments("--headless=new"); // New headless mode, available since Chrome 109
options.addArguments("--disable-gpu");
options.addArguments("--window-size=1280,720");
options.addArguments("--no-sandbox");
options.addArguments("--disable-dev-shm-usage");
ChromiumDriver driver = new ChromeDriver(options);
// Open the page
driver.get("https://www.psvmc.cn");

Uncommenting the options.addArguments("--headless=new"); line makes the browser run without a visible window.

During development you can keep the browser visible, then hide it for production use.
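
One way to switch between a visible and a hidden browser without editing code is to build the argument list from a flag. This is a minimal sketch: the `ChromeArgs` class and the `crawler.headless` system property are names invented for illustration. The resulting list can be passed to `ChromeOptions.addArguments(List<String>)`.

```java
import java.util.ArrayList;
import java.util.List;

// Builds the Chrome argument list used in this section; headless is optional.
final class ChromeArgs {

    // headless = false while developing (window visible), true for unattended runs.
    static List<String> build(boolean headless, int width, int height) {
        List<String> args = new ArrayList<>();
        if (headless) {
            args.add("--headless=new");
        }
        args.add("--disable-gpu");
        args.add("--window-size=" + width + "," + height);
        args.add("--no-sandbox");
        args.add("--disable-dev-shm-usage");
        return args;
    }

    public static void main(String[] args) {
        // The property name "crawler.headless" is an assumption for this sketch.
        boolean headless = Boolean.getBoolean("crawler.headless");
        System.out.println(build(headless, 1280, 720));
    }
}
```

Running with `-Dcrawler.headless=true` would then add the headless argument, while a plain run keeps the window visible.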

Getting elements

By name

WebElement usernameField = driver.findElement(By.name("username"));

By CSS class

List<WebElement> postTitleList = driver.findElements(By.cssSelector("a.post-title-link"));

By placeholder

WebElement usernameField = driver.findElement(By.cssSelector("input[placeholder=\"请输入账号\"]"));

By type

WebElement submitButton = driver.findElement(By.cssSelector("button[type=submit]"));
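
The attribute selectors above are written by hand with escaped quotes. A small helper can build them instead. This is a sketch under assumptions: `CssSelectors.byAttribute` is a hypothetical helper, and it only escapes backslashes and double quotes, which is enough for simple attribute values.

```java
// Hypothetical helper for building CSS attribute selectors such as input[placeholder="..."].
final class CssSelectors {

    // Builds tag[attribute="value"], escaping backslashes and double quotes in the value.
    static String byAttribute(String tag, String attribute, String value) {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return tag + "[" + attribute + "=\"" + escaped + "\"]";
    }

    public static void main(String[] args) {
        System.out.println(byAttribute("input", "name", "username"));
        System.out.println(byAttribute("button", "type", "submit"));
    }
}
```

The returned string can be passed to `By.cssSelector(...)` in place of a hand-written selector.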

Waiting for elements

WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(2));
List<WebElement> postTitleList = wait.until(
        ExpectedConditions.presenceOfAllElementsLocatedBy(By.cssSelector("a.post-title-link"))
);
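
The idea behind an explicit wait is to repeat the lookup until it succeeds or a timeout expires. The sketch below is a simplified stand-in written in plain Java, not Selenium's implementation; `PollingWait` and its parameters are names chosen for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

// A simplified stand-in for the polling idea behind an explicit wait.
final class PollingWait {

    // Re-runs the lookup until it returns a non-empty list or the timeout elapses.
    static <T> List<T> untilNotEmpty(Supplier<List<T>> lookup, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            List<T> result = lookup.get();
            if (result != null && !result.isEmpty()) {
                return result;
            }
            if (System.nanoTime() >= deadline) {
                throw new IllegalStateException("Timed out after " + timeout);
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        long readyAt = System.currentTimeMillis() + 300; // pretend the page renders after 300 ms
        Supplier<List<String>> lookup =
                () -> System.currentTimeMillis() >= readyAt ? List.of("First post") : List.<String>of();
        System.out.println(untilNotEmpty(lookup, Duration.ofSeconds(2), Duration.ofMillis(100)));
    }
}
```

Compared with a fixed `Thread.sleep`, this returns as soon as the content appears and fails clearly when it never does.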

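Getting content

Thread.sleep(5000); // Give the page time to render before looking up the elements
List<WebElement> postTitleList = driver.findElements(By.cssSelector("a.post-title-link"));
for (WebElement postTitle : postTitleList) {
    String postTitleText = postTitle.getText();
    System.out.println("postTitleText: " + postTitleText);
}

Note

If JavaScript animations control what the page shows, wait for the content to render before reading it.

Quitting the driver

driver.quit(); // Close the browser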

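Screenshots

// Requires org.openqa.selenium.TakesScreenshot, org.openqa.selenium.OutputType, java.io.File,
// and org.apache.commons.io.FileUtils (Apache Commons IO must be on the classpath)

// Take the screenshot
TakesScreenshot ts = (TakesScreenshot) driver;
File source = ts.getScreenshotAs(OutputType.FILE);

// Target path for the saved screenshot
File destination = new File("D:\\selenium\\screenshot.png");

// Copy the screenshot from its temporary location to the target path
FileUtils.copyFile(source, destination);

By default the screenshot size is affected by the system display scaling, so the image measures the original size multiplied by the scale factor.

To make the screenshot independent of the scale factor, set: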

ChromeOptions options = new ChromeOptions();
options.addArguments("--window-size=1280,720");
options.addArguments("--force-device-scale-factor=1");
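
As a worked example of the scaling relationship, the sketch below computes the expected screenshot size for a few scale factors. It assumes a 1280x720 viewport and that each dimension is simply multiplied by the factor; `ScreenshotSize` is a name made up for this illustration.

```java
// Worked arithmetic: screenshot pixels = CSS pixels x device scale factor.
final class ScreenshotSize {

    // Expected pixel size of one dimension for the given scale factor.
    static int scaled(int cssPixels, double scaleFactor) {
        return (int) Math.round(cssPixels * scaleFactor);
    }

    public static void main(String[] args) {
        double[] factors = {1.0, 1.25, 1.5, 2.0};
        for (double factor : factors) {
            System.out.println(factor + " -> " + scaled(1280, factor) + "x" + scaled(720, factor));
        }
    }
}
```

At 150% system scaling, for instance, a 1280x720 viewport would give a 1920x1080 image, which is why forcing the factor to 1 keeps the output at 1280x720.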

Getting an element's width

WebElement baseElement = driver.findElement(By.tagName("body"));
int width = baseElement.getSize().getWidth();
System.out.println("width:" + width);

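Getting the position

WebElement baseElement = driver.findElement(By.tagName("body"));
System.out.println("Element location: " + baseElement.getLocation());
System.out.println("Element size: " + baseElement.getSize());
System.out.println("Element displayed: " + baseElement.isDisplayed());

Here,

getLocation() returns the position of the element's top-left corner relative to the top-left corner of the page (the rendered document), not of the browser window.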
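
If a window-relative position is needed, a page coordinate can be converted by subtracting the current scroll offset. The sketch below assumes `getLocation()` yields page coordinates and that the scroll offsets are already known (for example, read from `window.scrollX` and `window.scrollY` through a script call); `ViewportCoords` is a hypothetical helper.

```java
// Converts page (document) coordinates to viewport coordinates.
final class ViewportCoords {

    // Subtracts the scroll offset from the page coordinate; returns {x, y} in the viewport.
    static int[] toViewport(int pageX, int pageY, int scrollX, int scrollY) {
        return new int[] {pageX - scrollX, pageY - scrollY};
    }

    public static void main(String[] args) {
        // Hypothetical numbers: element at (100, 1500) on the page, page scrolled down 1200 px.
        int[] p = toViewport(100, 1500, 0, 1200);
        System.out.println("viewport x=" + p[0] + ", y=" + p[1]);
    }
}
```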

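Simulating mouse movement and clicks

Relative to an element

// Locate a base element (the page's body element in this example)
WebElement baseElement = driver.findElement(By.tagName("body"));
// Create the Actions object
Actions actions = new Actions(driver);

// Offsets of the point to click, relative to the base element
int xOffset = 632;
int yOffset = 420;

// Move the mouse to that point and click
try {
    actions.moveToElement(baseElement, xOffset, yOffset).click().perform();
} catch (Exception e) {
    e.printStackTrace(); // Print the full stack trace
}

Note

When moving relative to an element, the target position must be visible for that element. If something such as a button is covering it, an error is raised.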

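Method description

moveToElement(WebElement target, int xOffset, int yOffset) moves the mouse to a specific position relative to the given element.

Its parameters are as follows:

Parameter details

| Parameter | Meaning |
| --- | --- |
| `target` | The target element (a WebElement); the mouse is positioned relative to it. |
| `xOffset` | Horizontal offset in pixels from the origin. Positive moves right, negative moves left. |
| `yOffset` | Vertical offset in pixels from the origin. Positive moves down, negative moves up. |

Coordinate system

Origin: in Selenium 4 (including the 4.15.0 used here), the element's in-view center point (0, 0); Selenium 3 measured from the element's top-left corner
Positive directions: right and down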
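
Because Selenium 4 measures `moveToElement` offsets from the element's in-view center, a coordinate measured from the top-left corner has to be shifted by half the element's size first. The following is a minimal sketch of that conversion, assuming the element is fully in view; `ElementOffsets` is a hypothetical helper, and the integer division may differ by a pixel from the browser's own center calculation.

```java
// Converts top-left based coordinates into center based offsets.
final class ElementOffsets {

    // Returns {xOffset, yOffset} measured from the element's center.
    static int[] fromTopLeft(int xFromLeft, int yFromTop, int width, int height) {
        return new int[] {xFromLeft - width / 2, yFromTop - height / 2};
    }

    public static void main(String[] args) {
        // Hypothetical element of 1280x720; target point (632, 420) measured from its top-left corner.
        int[] offset = fromTopLeft(632, 420, 1280, 720);
        System.out.println("xOffset=" + offset[0] + ", yOffset=" + offset[1]);
    }
}
```

The width and height would come from `element.getSize()`, and the result would be passed as `actions.moveToElement(element, offset[0], offset[1])`.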

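Relative to the window

// Create the Actions object
Actions actions = new Actions(driver);
// Offsets of the point to click, relative to the current mouse position
int xOffset = 632;
int yOffset = 420;

// Move the mouse by those offsets and click
try {
    actions.moveByOffset(xOffset, yOffset).click().perform();
} catch (Exception e) {
    e.printStackTrace(); // Print the full stack trace
}

This method moves the mouse by the given pixel offsets, starting from its current position. If the mouse has not been moved yet, it starts at the top-left corner of the viewport, so the first offsets act as window coordinates.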

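Method overview

Actions actions = new Actions(driver);
actions.moveByOffset(xOffset, yOffset).perform();

Parameters

| Parameter | Meaning |
| --- | --- |
| `xOffset` | Horizontal offset in pixels. Positive moves right, negative moves left. |
| `yOffset` | Vertical offset in pixels. Positive moves down, negative moves up. |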
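
Since `moveByOffset` is relative to the current pointer position, reaching an absolute window coordinate after an earlier move means passing the difference, not the coordinate itself. The sketch below tracks the pointer in plain Java to compute those differences. `PointerTracker` is a hypothetical helper, and it assumes the pointer starts at (0, 0) and is only moved through this class.

```java
// Tracks the pointer position so absolute targets can be turned into relative offsets.
final class PointerTracker {
    private int x; // current pointer position, starting at the viewport's top-left corner (0, 0)
    private int y;

    // Returns the {dx, dy} to pass to a relative move so the pointer lands on (targetX, targetY).
    int[] offsetTo(int targetX, int targetY) {
        int[] delta = {targetX - x, targetY - y};
        x = targetX;
        y = targetY;
        return delta;
    }

    public static void main(String[] args) {
        PointerTracker tracker = new PointerTracker();
        int[] first = tracker.offsetTo(632, 420);  // from (0, 0)
        int[] second = tracker.offsetTo(700, 400); // from (632, 420)
        System.out.println(first[0] + "," + first[1] + " then " + second[0] + "," + second[1]);
    }
}
```

Each returned pair would be passed as `actions.moveByOffset(delta[0], delta[1])`. Calling `moveByOffset(700, 400)` a second time would instead move the pointer to (1332, 820), likely outside a 1280x720 window.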