@zstack-robot-1
Collaborator

DBImpact

Resolves: ZSTAC-1111

Change-Id: I61797062696e7379696b6e6e70796b6367706a6f

sync from gitlab !8967

@coderabbitai

coderabbitai bot commented Dec 29, 2025

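Walkthrough

This change introduces a GPU PCI address mapping system, consisting of a new Spring service, a database schema, a JPA entity, and configuration documentation. The system manages mappings between VM and host PCI addresses, with an in-memory cache, TTL-based expiry, and support for batch database queries.

Changes

Cohort / File(s)    Summary
Database schema
conf/db/upgrade/V5.5.0__addVmGpuPciMappingTable.sql
Creates the new VmGpuPciMappingVO table with a primary key, a unique-key constraint, a foreign-key relationship, and an ON DELETE CASCADE rule. Columns: uuid, vmInstanceUuid, vmPciAddress, hostPciAddress, gpuSerial, plus timestamps.
JPA entity
header/src/main/java/org/zstack/header/vm/VmGpuPciMappingVO.java, header/src/main/java/org/zstack/header/vm/VmGpuPciMappingVO_.java
Adds the VmGpuPciMappingVO entity class and its Criteria metamodel class for type-safe queries and mapping management. Uses annotations such as @Entity, @Table, and @ForeignKey, plus a @PreUpdate lifecycle hook.
Core service
compute/src/main/java/org/zstack/compute/vm/VmGpuPciMappingService.java
Adds a Spring service that implements a concurrent mapping cache with a 5-minute TTL, batch lookup, single-entry create/delete, preloading, expired-entry cleanup, and cache statistics. Integrates database operations and lifecycle management.
Documentation and guides
.github/copilot-instructions.md, gpu-pci-address-implementation-guide.md, prometheus-integration-flow.md, zstack-gpu-pci-monitoring-data-flow.md
Adds four documentation files: instructions for the Copilot AI assistant, a GPU PCI address mapping design guide, a description of the Prometheus integration flow, and a detailed walkthrough of the ZStack GPU PCI monitoring data flow. They include architecture overviews, integration points, data-flow diagrams, and performance optimization recommendations.

Sequence Diagram(s)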

sequenceDiagram
    participant Agent as VM Agent
    participant API as API Layer
    participant Cache as VmGpuPciMappingService
    participant DB as Database
    participant PG as Prometheus/<br/>PushGateway
    
    Agent->>API: POST /zwatch/metrics (GPU metrics)
    activate API
    API->>API: Handle APIPutMetricDataMsg
    API->>Cache: getHostPciAddressesBatch(cacheKeys)
    activate Cache
    alt Cache hit
        Cache-->>API: Return cached PCI addresses
    else Cache expired / miss
        Cache->>DB: batchQueryFromDatabase()
        DB-->>Cache: Return vmPciAddress→hostPciAddress mappings
        Cache->>Cache: Update cache and timestamps
        Cache-->>API: Return mapping results
    end
    deactivate Cache
    
    API->>API: Add PciDeviceAddressOnHost label<br/>(dynamic label enrichment)
    API->>DB: DatabaseDriver.write()
    activate DB
    DB-->>PG: Push to PushGateway
    deactivate DB
    API-->>Agent: Respond with success
    deactivate API
    
    PG->>PG: Store in TSDB<br/>with enriched labels
sequenceDiagram
    participant Client as Client
    participant ZQL as ZQL Parser
    participant VmNS as VmNamespace
    participant MappingSvc as VmGpuPciMappingService
    participant TSDB as Prometheus TSDB
    
    Client->>ZQL: Query by hostPciAddress
    activate ZQL
    ZQL->>VmNS: Execute query
    activate VmNS
    VmNS->>VmNS: Identify GPU metrics
    VmNS->>MappingSvc: getHostPciAddressesBatch()
    activate MappingSvc
    alt Cache valid (<1ms)
        MappingSvc-->>VmNS: Return cached PCI mappings
    else Database query required (~5-10ms)
        MappingSvc->>TSDB: Batch-query mapping records
        TSDB-->>MappingSvc: Return results and update cache
        MappingSvc-->>VmNS: Return latest mappings
    end
    deactivate MappingSvc
    VmNS->>VmNS: Filter by PciDeviceAddressOnHost
    VmNS-->>ZQL: Return enriched data
    deactivate VmNS
    ZQL-->>Client: Query results
    deactivate ZQL

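Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Where is the GPU's address? The mapping table holds the secret!
The cache gleams for five minutes, batch lookups strike like lightning,
Prometheus watches closely, rich labels make the data shine,
PCI addresses mapped both ways, ZStack monitoring is strong! 🚀✨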

Pre-merge checks and finishing touches

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)
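Check name | Status | Explanation | Resolution
Title check | ❌ Error | The title '[root]: test gpu' does not match the changeset. The changes implement GPU PCI address mapping end to end (data model, service, database table, documentation), but the title is too broad and describes them inaccurately. | Change the title to reflect the main change more specifically, such as '[vm]: Add GPU PCI address mapping service and schema', so that it clearly conveys what changed.
Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.91%, which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.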
✅ Passed checks (1 passed)
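Check name | Status | Explanation
Description check | ✅ Passed | The PR description includes the JIRA ticket (ZSTAC-1111) and a note on the sync source, both related to the changeset. Although brief, it is enough to show where the change came from and which ticket it belongs to.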
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch sync/jin.shen/fix-80788-test

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

📜 Review details

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca54e2b and e51869f.

📒 Files selected for processing (8)
  • .github/copilot-instructions.md
  • compute/src/main/java/org/zstack/compute/vm/VmGpuPciMappingService.java
  • conf/db/upgrade/V5.5.0__addVmGpuPciMappingTable.sql
  • gpu-pci-address-implementation-guide.md
  • header/src/main/java/org/zstack/header/vm/VmGpuPciMappingVO.java
  • header/src/main/java/org/zstack/header/vm/VmGpuPciMappingVO_.java
  • prometheus-integration-flow.md
  • zstack-gpu-pci-monitoring-data-flow.md
🧰 Additional context used
📓 Path-based instructions (3)
**/*.*

⚙️ CodeRabbit configuration file

**/*.*: - Code must not contain Chinese; error messages, comments, and everything else must be written in correct English with no spelling mistakes

Files:

  • prometheus-integration-flow.md
  • header/src/main/java/org/zstack/header/vm/VmGpuPciMappingVO_.java
  • conf/db/upgrade/V5.5.0__addVmGpuPciMappingTable.sql
  • header/src/main/java/org/zstack/header/vm/VmGpuPciMappingVO.java
  • compute/src/main/java/org/zstack/compute/vm/VmGpuPciMappingService.java
  • zstack-gpu-pci-monitoring-data-flow.md
  • gpu-pci-address-implementation-guide.md
**/*.java

⚙️ CodeRabbit configuration file

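**/*.java: ## 1. API design requirements

  • API naming:
    • API names must be unique; no duplicates.
    • API message classes must extend APIMessage; their reply classes must extend APIReply or APIEvent and be documented with the @RestResponse annotation.
    • API messages must carry the @RestRequest annotation and follow these rules:
      • path:
        • Use the plural form for resources.
        • When the path references a field of the message class, use the {variableName} format.
      • HTTP method mapping:
        • Query operations → HttpMethod.GET
        • Update operations → HttpMethod.PUT
        • Create operations → HttpMethod.POST
        • Delete operations → HttpMethod.DELETE
    • API classes must implement the __example__ method so that API documentation can be generated, and the corresponding Groovy API template and API Markdown files must be generated.

2. Naming and formatting conventions

  • Class names:

    • Use UpperCamelCase.
    • Exceptions:
      • VO/AO/EO classes are exempt.
      • Abstract classes use the Abstract or Base prefix/suffix.
      • Exception classes end with Exception.
      • Test classes end with TestCase.
  • Method names, parameter names, member variables, and local variables:

    • Use lowerCamelCase.
  • Constant names:

    • All uppercase, with words separated by underscores.
    • Names must be clear; avoid vague or inaccurate names.
  • Package names:

    • Always lowercase and dot-separated; each segment should be an English word with a natural meaning (use the Spring Framework's structure as a reference).
  • Naming details:

    • Do not give members or local variables the same name in a parent and child class or within the same code block, to avoid confusion.
    • Abbreviations:
      • Unnecessary abbreviations such as AbsSchedulerJob, condi, and Fu are not allowed. Use complete words for readability.

3. Write self-explanatory code

  • Expressing intent:

    • Avoid boolean parameters whose meaning is unclear. For example:
      • Instead of stopAgent(boolean ignoreError), split it into separate methods (such as stopAgentIgnoreError()) or use an enum to express the operation type.
    • Names should use complete words to express intent and reflect the data type or purpose (for example, put the type word at the end of constant and variable names).
    • Avoid magic values:
      • Replace undefined literal numbers or strings (such as if (status == 5)) with enums or constants.

      • Example:

        // Bad: magic value
        if (user.getStatus() == 5) { ... }

        // Good: constant or enum
        public static final int STATUS_ACTIVE = 5;
        if (user.getStatus() == STATUS_ACTIVE) { ... }

        // Or use an enum
        enum UserStatus { ACTIVE, INACTIVE }

  • Comments:

    • Code should be as self-explanatory as possible; explanations shorter than two lines can be written directly in the code.
    • Longer comments must be proofread carefully and updated along with the code so they stay accurate.
    • Interface methods should not have redundant modifiers (such as public) and must have valid Javadoc comments.

4. Control flow and structure

  • Use of if...else:

    • Minimize if...else constructs. Recommendations:
      • Limit nesting to at most two levels, with no else branch at the inner level.
      • Return early: end the branch logic as soon as possible or extract it into a separate method.
      • Use Java Streams or lambda expressions instead of lengthy loops and conditionals.
  • Conditions:

    • Keep if conditions short and simple; when necessary, extract the condition into a descriptively named boolean variable.
  • Code block length: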
    ...
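The boolean-parameter and early-return rules above can be illustrated with a minimal sketch. AgentController, StopMode, and shutdown() are hypothetical names; only stopAgent comes from the guideline text. The string results exist only to make the behavior easy to check.

```java
/** Hypothetical controller showing an enum instead of a boolean flag, plus early return. */
class AgentController {
    /** Names the two behaviors that stopAgent(boolean ignoreError) would hide behind a flag. */
    enum StopMode { FAIL_ON_ERROR, IGNORE_ERROR }

    private boolean running = true;

    /** Stops the agent; the call site states its error policy explicitly via StopMode. */
    String stopAgent(StopMode mode) {
        if (!running) {
            return "already stopped"; // early return instead of wrapping the rest in an else
        }
        try {
            shutdown();
        } catch (IllegalStateException e) {
            if (mode == StopMode.FAIL_ON_ERROR) {
                throw e;
            }
            return "stop failed, ignored: " + e.getMessage();
        }
        running = false;
        return "stopped";
    }

    /** Placeholder for the real shutdown call, which may throw IllegalStateException. */
    void shutdown() {
    }
}
```

A call such as stopAgent(StopMode.IGNORE_ERROR) reads unambiguously at the call site, which stopAgent(true) does not.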

Files:

  • header/src/main/java/org/zstack/header/vm/VmGpuPciMappingVO_.java
  • header/src/main/java/org/zstack/header/vm/VmGpuPciMappingVO.java
  • compute/src/main/java/org/zstack/compute/vm/VmGpuPciMappingService.java
**/*.sql

⚙️ CodeRabbit configuration file

**/*.sql: - Review the SQL code, make sure it has no errors, and confirm that:

  • Upgrade scenarios have been carefully handled
  • Do not use DEFAULT 0000-00-00 00:00:00; use DEFAULT CURRENT_TIMESTAMP instead
  • When a NOT NULL column is involved, historical data must be processed with a stored procedure or another function; this is very, very important
  • For database columns whose string length is unbounded, do not use varchar; use the text type

Files:

  • conf/db/upgrade/V5.5.0__addVmGpuPciMappingTable.sql
🧠 Learnings (16)
📚 Learning: 2025-08-04T04:48:19.103Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 2360
File: network/src/main/java/org/zstack/network/l3/L3BasicNetwork.java:449-490
Timestamp: 2025-08-04T04:48:19.103Z
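Learning: In the ZStack project, cherry-picks strictly follow a no-extra-changes policy, even when a performance optimization opportunity is found (such as reducing memory use when batch-saving IP addresses), to preserve the integrity and consistency of the cherry-pick.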

Applied to files:

  • prometheus-integration-flow.md
📚 Learning: 2025-08-24T06:33:10.771Z
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2496
File: storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotTreeBase.java:1218-1224
Timestamp: 2025-08-24T06:33:10.771Z
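Learning: The ZStack team responds quickly to capacity-management issues. When a mismatch between allocated and released capacity was found in the Pull snapshot flow, developers fixed it promptly to keep primary storage capacity accounting accurate.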

Applied to files:

  • prometheus-integration-flow.md
📚 Learning: 2025-08-04T04:48:53.756Z
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2360
File: header/src/main/java/org/zstack/header/network/l3/APIDeleteReservedIpRangeMsg.java:59-63
Timestamp: 2025-08-04T04:48:53.756Z
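Learning: In the ZStack project, API Markdown documentation is not maintained in the main code repository but in a separate documentation repository or location. Although the coding guidelines mention generating the corresponding Groovy API template and API Markdown files, the Markdown docs are not in this repository.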

Applied to files:

  • .github/copilot-instructions.md
📚 Learning: 2025-08-04T03:19:07.454Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 2360
File: compute/src/main/java/org/zstack/compute/vm/StaticIpOperator.java:364-501
Timestamp: 2025-08-04T03:19:07.454Z
Learning: In ZStack project, developers may prefer to keep related validation logic together in a single method (like validateStaticIpTagsInApiMessage with 130+ lines) when the architecture is clear and there's no code reuse, rather than splitting into smaller methods based purely on single responsibility principle.

Applied to files:

  • .github/copilot-instructions.md
📚 Learning: 2025-08-22T05:36:40.467Z
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2489
File: storage/src/main/java/org/zstack/storage/snapshot/VolumeTree.java:471-545
Timestamp: 2025-08-22T05:36:40.467Z
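Learning: In the ZStack project, the team tends to defer refactoring code related to core logic, even when its complexity is high, prioritizing the stability of core functionality.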

Applied to files:

  • .github/copilot-instructions.md
📚 Learning: 2025-08-06T05:53:38.558Z
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2380
File: storage/src/main/java/org/zstack/storage/snapshot/VolumeSnapshotManagerImpl.java:1232-1232
Timestamp: 2025-08-06T05:53:38.558Z
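Learning: In the ZStack project, code comments should accurately reflect the actual implementation. If the code queries only a specific type of archived data (such as ArchiveVmNicBundle), the comment should not say "all device address archives"; it should state which archive type is queried.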

Applied to files:

  • .github/copilot-instructions.md
📚 Learning: 2025-06-19T09:08:57.046Z
Learnt from: MatheMatrix
Repo: MatheMatrix/zstack PR: 2217
File: header/src/main/java/org/zstack/header/network/l2/L2NetworkCandidateFilterExtensionPoint.java:7-10
Timestamp: 2025-06-19T09:08:57.046Z
Learning: In the zstack project, for simple and self-explanatory interfaces like L2NetworkCandidateFilterExtensionPoint, the team prefers to rely on clear method names and parameters rather than adding detailed Javadoc comments, even when coding guidelines suggest otherwise.

Applied to files:

  • .github/copilot-instructions.md
📚 Learning: 2025-10-20T11:50:42.825Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 2763
File: sdk/src/main/java/org/zstack/sdk/BatchCreateHostKernelInterfaceAction.java:52-54
Timestamp: 2025-10-20T11:50:42.825Z
Learning: In the MatheMatrix/zstack repository, files under sdk/src/main/java/org/zstack/sdk/ are auto-generated and must not be modified by hand. Issues in these files should be fixed in the code generator.

Applied to files:

  • .github/copilot-instructions.md
📚 Learning: 2025-07-10T08:46:14.028Z
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2288
File: header/src/main/java/org/zstack/header/network/l2/L2NetworkOwnedL3ExtensionPoint.java:1-4
Timestamp: 2025-07-10T08:46:14.028Z
Learning: In the ZStack project, Java files have no license header comment; they begin directly with the package declaration. This is standard project practice.

Applied to files:

  • .github/copilot-instructions.md
📚 Learning: 2025-08-13T02:31:40.023Z
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2419
File: plugin/sdnController/src/main/java/org/zstack/sdnController/SdnControllerBase.java:986-1067
Timestamp: 2025-08-13T02:31:40.023Z
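Learning: In the ZStack project, the team prefers not to split a single business-flow method; even long methods are kept whole to preserve the integrity of the business logic.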

Applied to files:

  • .github/copilot-instructions.md
📚 Learning: 2025-07-24T05:53:10.246Z
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2325
File: compute/src/main/java/org/zstack/compute/vm/VmMigrateCallExtensionFlow.java:29-29
Timestamp: 2025-07-24T05:53:10.246Z
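Learning: When refactoring in the ZStack project, the team prioritizes pragmatically. Refactoring that affects runtime behavior comes first, while non-functional issues such as stale references in comments can be deferred, because they do not affect how the system actually runs.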

Applied to files:

  • .github/copilot-instructions.md
📚 Learning: 2025-10-29T10:12:05.812Z
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2829
File: sdk/src/main/java/org/zstack/sdk/CandidateDecisionEntry.java:7-29
Timestamp: 2025-10-29T10:12:05.812Z
Learning: Files in the `sdk/src/main/java/org/zstack/sdk/` directory are auto-generated and should not be manually modified. Code style and encapsulation issues in these files should be ignored during reviews.

Applied to files:

  • .github/copilot-instructions.md
📚 Learning: 2025-07-22T02:30:46.123Z
Learnt from: MatheMatrix
Repo: MatheMatrix/zstack PR: 2311
File: plugin/sdnController/src/main/java/org/zstack/sdnController/hardwareVxlan/KVMRealizeHardwareVxlanNetworkBackend.java:52-53
Timestamp: 2025-07-22T02:30:46.123Z
Learning: In ZStack, DatabaseFacade.findByUuid() calls for VO objects in contexts like KVMRealizeHardwareVxlanNetworkBackend do not require null checks because the L2NetworkInventory passed to these methods is guaranteed to have a corresponding VO record in the database by the system's design and calling patterns.

Applied to files:

  • header/src/main/java/org/zstack/header/vm/VmGpuPciMappingVO_.java
📚 Learning: 2025-08-03T04:10:21.683Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 0
File: :0-0
Timestamp: 2025-08-03T04:10:21.683Z
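Learning: The UPGRADE_VM_METADATA_TABLES_DIRECT stored procedure in the ZStack database upgrade script V4.10.16__schema.sql uses a direct RENAME TABLE, which is not idempotent. It must check for the existence of tables and constraints via information_schema so that the script can be safely re-run.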

Applied to files:

  • conf/db/upgrade/V5.5.0__addVmGpuPciMappingTable.sql
📚 Learning: 2025-08-03T04:10:21.683Z
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 0
File: :0-0
Timestamp: 2025-08-03T04:10:21.683Z
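Learning: In ZStack database upgrade scripts, a bare RENAME TABLE is not idempotent. Check table existence via information_schema.tables and rename only when the source table exists and the target table does not, so that the upgrade script can be safely re-run.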

Applied to files:

  • conf/db/upgrade/V5.5.0__addVmGpuPciMappingTable.sql
📚 Learning: 2025-08-03T03:42:34.349Z
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2293
File: conf/db/upgrade/V4.10.16__schema.sql:69-82
Timestamp: 2025-08-03T03:42:34.349Z
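Learning: For database DDL upgrades in the ZStack project, DDL statements such as RENAME TABLE and ALTER TABLE cause an implicit COMMIT in MySQL/MariaDB. Reversibility should therefore be ensured with a temporary table plus a data copy, not by relying on transaction ROLLBACK.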

Applied to files:

  • conf/db/upgrade/V5.5.0__addVmGpuPciMappingTable.sql
🧬 Code graph analysis (2)
header/src/main/java/org/zstack/header/vm/VmGpuPciMappingVO.java (2)
header/src/main/java/org/zstack/header/storage/primary/PrimaryStorageHistoricalUsageBaseVO.java (1)
  • org (11-45)
core/src/main/java/org/zstack/core/upgrade/AgentVersionVO.java (1)
  • Table (10-87)
compute/src/main/java/org/zstack/compute/vm/VmGpuPciMappingService.java (2)
core/src/main/java/org/zstack/core/Platform.java (1)
  • Platform (70-1208)
core/src/main/java/org/zstack/core/db/Q.java (1)
  • Q (16-193)
🪛 LanguageTool
prometheus-integration-flow.md

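[uncategorized] ~217-~217: A verb modifier usually takes the form 'adjective (adverb) + 地 + verb'. Did you mean: 灵活"地"推
Context: ...icTemplateMetricDataHttpReceiver`配置灵活的推送策略 --- *本文档基于ZStack代码库分析生成,版本: 5.4.0... (roughly: "...configure a flexible push strategy --- *This document was generated from an analysis of the ZStack codebase, version: 5.4.0...")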

(wb4)

zstack-gpu-pci-monitoring-data-flow.md

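[grammar] ~295-~295: Remove word
Context: ...- 内存缓存:~0.1ms响应 vs ~5-10ms DB查询 - TTL过期:5分钟自动清理,防止内存泄漏 ### 扩展性设计 - 插件化架构:DatabaseDriver接口支持不同存储... (roughly: "...- In-memory cache: ~0.1ms response vs ~5-10ms DB query - TTL expiry: automatic cleanup after 5 minutes to prevent memory leaks ### Extensibility design - Plugin architecture: the DatabaseDriver interface supports different storage...")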

(QB_NEW_NL_OTHER_ERROR_IDS_UNNECESSARY_NOUN)

🪛 markdownlint-cli2 (0.18.1)
gpu-pci-address-implementation-guide.md

175-175: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


182-182: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


190-190: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


493-493: Multiple headings with the same content

(MD024, no-duplicate-heading)

🔇 Additional comments (8)
.github/copilot-instructions.md (1)

1-42: LGTM!

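This file gives AI coding assistants a clear overview of the ZStack architecture and development conventions. It is well structured and written in English, as the coding guidelines require.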

prometheus-integration-flow.md (1)

1-221: LGTM!

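This document describes the integration between ZStack and Prometheus in detail, covering the two main flows: data collection (pull mode) and data push (push mode). It is clearly structured, with architecture diagrams, sequence diagrams, and code examples that help explain the design of the monitoring system.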

conf/db/upgrade/V5.5.0__addVmGpuPciMappingTable.sql (1)

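1-48: The SQL upgrade script is well designed

The script uses a stored procedure and idempotency checks, following ZStack best practices. It correctly uses DEFAULT CURRENT_TIMESTAMP instead of DEFAULT 0000-00-00 00:00:00, and it implements cascading deletes through the foreign-key constraint.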

header/src/main/java/org/zstack/header/vm/VmGpuPciMappingVO_.java (1)

1-15: LGTM!

The JPA metamodel class is defined correctly: every field matches the VmGpuPciMappingVO entity class, as the JPA specification requires.

zstack-gpu-pci-monitoring-data-flow.md (1)

1-306: LGTM!

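This document describes the monitoring data flow for ZStack GPU PCI address mapping in detail, including complete architecture diagrams, sequence diagrams, the data transformation flow, and performance optimizations. It is clearly structured, and the mermaid diagrams make the design easy to follow.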

compute/src/main/java/org/zstack/compute/vm/VmGpuPciMappingService.java (1)

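17-32: Well-designed cache with effective performance optimization

The service uses a TTL-based caching strategy combined with @PostConstruct preloading, which effectively reduces database query load. The batch query interface also follows performance best practices.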

Also applies to: 212-228
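One caveat about this design: the service keeps values in mappingCache and load times in a separate cacheTimestamp map, so the two can drift apart under concurrent access. For example, an entry may be removed from one map between the expiry check and the reload. The following minimal sketch of an alternative uses only JDK classes. It stores each value together with its load time in a single ConcurrentHashMap entry and refreshes that entry atomically with compute(). TtlCache is a hypothetical name, not part of the PR, and the injectable clock exists only to make expiry testable.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical TTL cache that keeps each value and its load time in one map entry. */
final class TtlCache<K, V> {
    /** Immutable pair of a cached value and the time it was loaded. */
    private static final class Entry<T> {
        final T value;
        final long loadedAtMs;

        Entry(T value, long loadedAtMs) {
            this.value = value;
            this.loadedAtMs = loadedAtMs;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long expireMs;
    private final LongSupplier clock;

    TtlCache(long expireMs, LongSupplier clock) {
        this.expireMs = expireMs;
        this.clock = clock;
    }

    /** Returns the cached value, calling loader if the key is missing or older than expireMs. */
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.compute(key, (k, old) -> {
            if (old != null && now - old.loadedAtMs <= expireMs) {
                return old; // still fresh: keep the existing entry
            }
            V loaded = loader.apply(k);
            if (loaded == null) {
                return null; // compute() removes the key on null, so misses are not cached
            }
            return new Entry<>(loaded, now);
        });
        return entry == null ? null : entry.value;
    }

    /** Drops every entry older than expireMs; suitable for a periodic cleanup task. */
    void evictExpired() {
        long now = clock.getAsLong();
        entries.values().removeIf(e -> now - e.loadedAtMs > expireMs);
    }

    int size() {
        return entries.size();
    }
}
```

In production the clock would simply be System::currentTimeMillis. One trade-off to weigh: the loader runs inside compute(), and the ConcurrentHashMap documentation warns that other threads' updates may block while such a computation runs. Whether a ~5-10ms database lookup is acceptable there is a judgment call.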

gpu-pci-address-implementation-guide.md (1)

1-839: LGTM!

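This implementation guide is very thorough and covers every aspect of the GPU PCI address mapping design:

  • Data model design and VO class definitions
  • Service-layer interface and cache optimization
  • Dynamic label enrichment in the query layer
  • Prometheus integration design
  • Performance optimizations (batch queries, in-memory cache, cache warm-up)
  • ZQL query support and extensibility design

The document is clearly structured and includes complete code examples and best-practice recommendations. It will be very helpful for development and maintenance.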

header/src/main/java/org/zstack/header/vm/VmGpuPciMappingVO.java (1)

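32-33: createDate is never initialized

The createDate field has a @Column annotation but no default value, and it is not initialized in a @PrePersist callback. By contrast, lastOpDate is updated in @PreUpdate. Add a @PrePersist method to initialize createDate.

🔎 Suggested fix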
+    @PrePersist
+    private void prePersist() {
+        createDate = new Timestamp(System.currentTimeMillis());
+        lastOpDate = new Timestamp(System.currentTimeMillis());
+    }
+
     @PreUpdate
     private void preUpdate() {
         lastOpDate = new Timestamp(System.currentTimeMillis());
     }
⛔ Skipped due to learnings
Learnt from: zstack-robot-1
Repo: MatheMatrix/zstack PR: 2419
File: conf/db/upgrade/V5.3.46__schema.sql:105-116
Timestamp: 2025-08-12T03:27:18.648Z
Learning: In ZStack project, the timestamp pattern `createDate` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' combined with `lastOpDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP is the standard database schema usage and should not be flagged as an issue. This is an established project convention confirmed by project maintainers.
Learnt from: MatheMatrix
Repo: MatheMatrix/zstack PR: 2250
File: header/src/main/java/org/zstack/header/network/l3/ReservedIpRangeVO.java:50-53
Timestamp: 2025-07-07T02:24:14.223Z
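Learning: In the ZStack codebase, setting lastOpDate to null in a PreUpdate callback is an established design pattern and does not need to be changed to the current timestamp. The pattern is used in other VO classes and should be followed.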
Learnt from: MatheMatrix
Repo: MatheMatrix/zstack PR: 2419
File: conf/db/upgrade/V5.3.46__schema.sql:90-104
Timestamp: 2025-08-12T03:27:09.968Z
Learning: In ZStack database schema files, the pattern `createDate` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' is the established project standard and should not be changed to DEFAULT CURRENT_TIMESTAMP, despite general SQL guidelines suggesting otherwise. This is confirmed project convention.
Learnt from: MatheMatrix
Repo: MatheMatrix/zstack PR: 2428
File: conf/db/upgrade/V5.3.40__schema.sql:195-204
Timestamp: 2025-09-02T05:45:41.896Z
Learning: User MatheMatrix (ye.zou) confirmed that the createDate timestamp pattern with DEFAULT '0000-00-00 00:00:00' should not be changed in ZStack database schema files, reinforcing that this is the established project convention that takes precedence over general SQL guidelines.
Learnt from: ZStack-Robot
Repo: MatheMatrix/zstack PR: 2307
File: conf/db/upgrade/V5.4.0__schema.sql:25-25
Timestamp: 2025-07-21T01:53:16.584Z
Learning: In the ZStack project, all `createDate` timestamp columns in SQL schema files consistently use `DEFAULT '0000-00-00 00:00:00'` as the default value, not `DEFAULT CURRENT_TIMESTAMP`. This is a project-wide convention and should be followed for consistency.

@@ -0,0 +1,267 @@
package org.zstack.vm;
⚠️ Potential issue | 🔴 Critical

Incorrect package name

Given the file path compute/src/main/java/org/zstack/compute/vm/VmGpuPciMappingService.java, the package should be org.zstack.compute.vm, not org.zstack.vm.

🔎 Suggested fix
-package org.zstack.vm;
+package org.zstack.compute.vm;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
package org.zstack.vm;
package org.zstack.compute.vm;
🤖 Prompt for AI Agents
In compute/src/main/java/org/zstack/compute/vm/VmGpuPciMappingService.java at
line 1 the package declaration is incorrect (package org.zstack.vm); change it
to package org.zstack.compute.vm to match the file path and Java package
conventions, update any relative imports or references in the file if needed,
and rebuild to ensure no package-related compilation errors.

Comment on lines +24 to +31
// 缓存配置
private static final long CACHE_EXPIRE_MS = 5 * 60 * 1000; // 5分钟缓存过期时间

// 缓存结构:key = vmUuid + ":" + vmPciAddress, value = hostPciAddress
private final Map<String, String> mappingCache = new ConcurrentHashMap<>();

// 缓存时间戳:key = cacheKey, value = timestamp
private final Map<String, Long> cacheTimestamp = new ConcurrentHashMap<>();
⚠️ Potential issue | 🔴 Critical

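Chinese comments in the code violate the coding guidelines

The coding guidelines state: "Code must not contain Chinese; error messages, comments, and everything else must be written in correct English with no spelling mistakes." All comments should be in English.

🔎 Suggested fix (example)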
-    // 缓存配置
-    private static final long CACHE_EXPIRE_MS = 5 * 60 * 1000; // 5分钟缓存过期时间
+    // Cache configuration
+    private static final long CACHE_EXPIRE_MS = 5 * 60 * 1000; // Cache expires after 5 minutes

-    // 缓存结构:key = vmUuid + ":" + vmPciAddress, value = hostPciAddress
+    // Cache structure: key = vmUuid + ":" + vmPciAddress, value = hostPciAddress
     private final Map<String, String> mappingCache = new ConcurrentHashMap<>();

-    // 缓存时间戳:key = cacheKey, value = timestamp
+    // Cache timestamp: key = cacheKey, value = timestamp
     private final Map<String, Long> cacheTimestamp = new ConcurrentHashMap<>();

Replace every Chinese comment in the file with English (lines 24, 25, 27, 30, 34, 39, 43, 49, 60, 66, 72, 85, 89, 104, 131, 140, 145, 156, 160, 166, 178, 182, 188, 199, 203, 209, 230, 248, 263, and so on).

🤖 Prompt for AI Agents
In compute/src/main/java/org/zstack/compute/vm/VmGpuPciMappingService.java
around lines 24-31 (and other listed lines:
34,39,43,49,60,66,72,85,89,104,131,140,145,156,160,166,178,182,188,199,203,209,230,248,263),
replace all Chinese comments with clear, correctly spelled English comments that
preserve the original meaning (e.g., "缓存配置" -> "Cache configuration", "缓存结构:key
= vmUuid + \":\" + vmPciAddress, value = hostPciAddress" -> "Cache structure:
key = vmUuid + \":\" + vmPciAddress, value = hostPciAddress", etc.); ensure no
Chinese characters remain in comments or error messages, keep comment formatting
and locations unchanged, and run a quick grep/IDE search to verify all
occurrences are converted before committing.
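For the verification step mentioned in the prompt, a small JDK-only check can flag remaining Chinese text. HanCharacterScanner is a hypothetical helper. It reports the lines that contain code points whose Unicode script is HAN, using Character.UnicodeScript.of. Full-width punctuation such as ':' belongs to the COMMON script and is not flagged on its own, so treat this as a coarse check rather than a complete lint.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that finds source lines containing Han (Chinese) characters. */
final class HanCharacterScanner {
    private HanCharacterScanner() {
    }

    /** Returns the 1-based numbers of lines that contain at least one Han code point. */
    static List<Integer> findLinesWithHan(String source) {
        List<Integer> lineNumbers = new ArrayList<>();
        String[] lines = source.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            boolean hasHan = lines[i].codePoints()
                    .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
            if (hasHan) {
                lineNumbers.add(i + 1);
            }
        }
        return lineNumbers;
    }
}
```

To scan a real file, pass in new String(Files.readAllBytes(path), StandardCharsets.UTF_8).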

Comment on lines +36 to +57
public String getHostPciAddress(String vmUuid, String vmPciAddress) {
String cacheKey = vmUuid + ":" + vmPciAddress;

// 检查缓存是否过期
Long timestamp = cacheTimestamp.get(cacheKey);
if (timestamp != null &&
System.currentTimeMillis() - timestamp > CACHE_EXPIRE_MS) {
// 缓存过期,清理
mappingCache.remove(cacheKey);
cacheTimestamp.remove(cacheKey);
timestamp = null;
}

// 从缓存获取或查询数据库
return mappingCache.computeIfAbsent(cacheKey, key -> {
String hostAddress = queryFromDatabase(vmUuid, vmPciAddress);
if (hostAddress != null) {
cacheTimestamp.put(cacheKey, System.currentTimeMillis());
}
return hostAddress;
});
}
⚠️ Potential issue | 🔴 Critical

Missing queryFromDatabase method definition

Line 51 calls queryFromDatabase(vmUuid, vmPciAddress), but the method is not defined in the class, so the file will not compile.

🔎 Suggested missing method
/**
 * Query mapping from database
 */
private String queryFromDatabase(String vmUuid, String vmPciAddress) {
    VmGpuPciMappingVO mapping = Q.New(VmGpuPciMappingVO.class)
        .eq(VmGpuPciMappingVO_.vmInstanceUuid, vmUuid)
        .eq(VmGpuPciMappingVO_.vmPciAddress, vmPciAddress)
        .find();
    
    return mapping != null ? mapping.getHostPciAddress() : null;
}
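A related detail is worth checking when adding this method. In getHostPciAddress, the loader runs inside ConcurrentHashMap.computeIfAbsent, and when the mapping function returns null no mapping is recorded (documented JDK behavior). A VM PCI address with no database row is therefore re-queried on every call rather than cached as a miss. The small demo below uses a hypothetical class name and shows this with plain JDK classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo: ConcurrentHashMap.computeIfAbsent does not store a null result. */
final class ComputeIfAbsentNullDemo {
    private ComputeIfAbsentNullDemo() {
    }

    /** Calls computeIfAbsent twice with a loader that always returns null; returns the loader call count. */
    static int countLoaderCallsForMissingKey() {
        Map<String, String> cache = new ConcurrentHashMap<>();
        AtomicInteger loaderCalls = new AtomicInteger();
        for (int i = 0; i < 2; i++) {
            cache.computeIfAbsent("vm1:0000:00:05.0", key -> {
                loaderCalls.incrementAndGet();
                return null; // simulates queryFromDatabase() finding no row
            });
        }
        return loaderCalls.get();
    }
}
```

If repeated lookups of unmapped addresses turn out to be frequent, a sentinel value for "no mapping" is one way to cache misses. Whether that is worth it depends on the metric traffic.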

Comment on lines +107 to +129
private Map<String, String> batchQueryFromDatabase(List<String> cacheKeys) {
// 解析cacheKeys为vmUuid和pciAddress
Set<String> vmUuids = cacheKeys.stream()
.map(key -> key.split(":")[0])
.collect(Collectors.toSet());

Set<String> pciAddresses = cacheKeys.stream()
.map(key -> key.split(":")[1])
.collect(Collectors.toSet());

// 批量查询数据库
List<VmGpuPciMappingVO> mappings = Q.New(VmGpuPciMappingVO.class)
.in(VmGpuPciMappingVO_.vmInstanceUuid, vmUuids)
.in(VmGpuPciMappingVO_.vmPciAddress, pciAddresses)
.list();

// 转换为Map
return mappings.stream()
.collect(Collectors.toMap(
vo -> vo.getVmInstanceUuid() + ":" + vo.getVmPciAddress(),
VmGpuPciMappingVO::getHostPciAddress
));
}
⚠️ Potential issue | 🔴 Critical

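The batch query can return mismatched pairs

The batch query at lines 118-121 uses two independent IN clauses, one on vmUuids and one on pciAddresses. It returns every record whose vmUuid is in the first set and whose pciAddress is in the second set, without guaranteeing that the vmUuid and pciAddress belong to the same requested pair.

For example, if cacheKeys contains ["vm1:pci1", "vm2:pci2"], the query also returns the vm1+pci2 and vm2+pci1 combinations (if they exist), which is incorrect.

🔎 Suggested fixes

Option 1: issue one query per key (suitable for small batches)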

private Map<String, String> batchQueryFromDatabase(List<String> cacheKeys) {
    Map<String, String> result = new HashMap<>();
    
    for (String cacheKey : cacheKeys) {
        // Limit 2: the PCI address part (e.g. 0000:00:05.0) itself contains ':'
        String[] parts = cacheKey.split(":", 2);
        String vmUuid = parts[0];
        String vmPciAddress = parts[1];
        
        VmGpuPciMappingVO mapping = Q.New(VmGpuPciMappingVO.class)
            .eq(VmGpuPciMappingVO_.vmInstanceUuid, vmUuid)
            .eq(VmGpuPciMappingVO_.vmPciAddress, vmPciAddress)
            .find();
        
        if (mapping != null) {
            result.put(cacheKey, mapping.getHostPciAddress());
        }
    }
    
    return result;
}

Option 2: query all related records, then filter in memory (suitable for large batches)

private Map<String, String> batchQueryFromDatabase(List<String> cacheKeys) {
    Set<String> vmUuids = cacheKeys.stream()
        .map(key -> key.split(":")[0])
        .collect(Collectors.toSet());
    
    List<VmGpuPciMappingVO> mappings = Q.New(VmGpuPciMappingVO.class)
        .in(VmGpuPciMappingVO_.vmInstanceUuid, vmUuids)
        .list();
    
    // Build a set of requested cache keys for filtering
    Set<String> requestedKeys = new HashSet<>(cacheKeys);
    
    return mappings.stream()
        .map(vo -> {
            String key = vo.getVmInstanceUuid() + ":" + vo.getVmPciAddress();
            return new AbstractMap.SimpleEntry<>(key, vo.getHostPciAddress());
        })
        .filter(entry -> requestedKeys.contains(entry.getKey()))
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
}
🤖 Prompt for AI Agents
In compute/src/main/java/org/zstack/compute/vm/VmGpuPciMappingService.java
around lines 107-129, the current batch query uses two IN filters which can
return mismatched vmUuid+pciAddress combinations; change to either (A) for small
batches loop per cacheKey and query by both vmInstanceUuid and vmPciAddress to
get exact matches, or (B, preferred for larger batches) query once by the set of
vmUuids only, build a Set of requested "vmUuid:vmPciAddress" cache keys, then
filter the returned mappings in-memory to keep only entries whose
vmUuid:vmPciAddress pair exists in the requested set and collect those into the
result map.
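To make the mismatch concrete, the sketch below models both strategies in memory with plain Java. PairFilterDemo and Row are hypothetical stand-ins for the query and VmGpuPciMappingVO, not ZStack APIs. Given the requested pairs vm1:0000:00:05.0 and vm2:0000:00:06.0, the two-IN filter also admits the two cross pairs, while the exact-pair filter keeps only the requested rows. Each key is split with a limit of 2. This assumes vmPciAddress uses the usual domain:bus:slot.function form (for example 0000:00:05.0), which itself contains colons, so a plain split(":") would truncate it.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical in-memory model contrasting two IN filters with exact pair matching. */
final class PairFilterDemo {
    /** Stand-in for a VmGpuPciMappingVO row. */
    static final class Row {
        final String vmUuid;
        final String vmPciAddress;

        Row(String vmUuid, String vmPciAddress) {
            this.vmUuid = vmUuid;
            this.vmPciAddress = vmPciAddress;
        }

        String cacheKey() {
            return vmUuid + ":" + vmPciAddress;
        }
    }

    private PairFilterDemo() {
    }

    /** Mimics "vmInstanceUuid IN (...) AND vmPciAddress IN (...)": each column is matched independently. */
    static List<Row> filterByTwoInClauses(List<Row> rows, List<String> cacheKeys) {
        Set<String> vmUuids = new HashSet<>();
        Set<String> pciAddresses = new HashSet<>();
        for (String key : cacheKeys) {
            String[] parts = key.split(":", 2); // limit 2: the PCI address itself contains ':'
            vmUuids.add(parts[0]);
            pciAddresses.add(parts[1]);
        }
        return rows.stream()
                .filter(row -> vmUuids.contains(row.vmUuid) && pciAddresses.contains(row.vmPciAddress))
                .collect(Collectors.toList());
    }

    /** Keeps only rows whose (vmUuid, vmPciAddress) pair was actually requested. */
    static List<Row> filterByExactPairs(List<Row> rows, List<String> cacheKeys) {
        Set<String> requestedKeys = new HashSet<>(cacheKeys);
        return rows.stream()
                .filter(row -> requestedKeys.contains(row.cacheKey()))
                .collect(Collectors.toList());
    }
}
```

filterByExactPairs applies the same in-memory filtering idea as Option 2, so the cross pairs never reach the result map.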

Comment on lines +23 to +26
`vmInstanceUuid` varchar(32) NOT NULL COMMENT 'VM实例UUID',
`vmPciAddress` varchar(32) NOT NULL COMMENT 'VM内部看到的PCI地址',
`hostPciAddress` varchar(32) NOT NULL COMMENT 'Host上真实的PCI地址',
`gpuSerial` varchar(128) DEFAULT NULL COMMENT 'GPU序列号',
⚠️ Potential issue | 🔴 Critical

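Chinese comments in the SQL script violate the coding guidelines

The coding guidelines require: "Code must not contain Chinese; error messages, comments, and everything else must be written in correct English with no spelling mistakes." Table and column COMMENTs are part of the code and should also be in English.

🔎 Suggested fix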
         CREATE TABLE `zstack`.`VmGpuPciMappingVO` (
             `uuid` varchar(32) NOT NULL,
-            `vmInstanceUuid` varchar(32) NOT NULL COMMENT 'VM实例UUID',
-            `vmPciAddress` varchar(32) NOT NULL COMMENT 'VM内部看到的PCI地址',
-            `hostPciAddress` varchar(32) NOT NULL COMMENT 'Host上真实的PCI地址',
-            `gpuSerial` varchar(128) DEFAULT NULL COMMENT 'GPU序列号',
+            `vmInstanceUuid` varchar(32) NOT NULL COMMENT 'VM instance UUID',
+            `vmPciAddress` varchar(32) NOT NULL COMMENT 'PCI address seen inside the VM',
+            `hostPciAddress` varchar(32) NOT NULL COMMENT 'Real PCI address on the host',
+            `gpuSerial` varchar(128) DEFAULT NULL COMMENT 'GPU serial number',
             `createDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
             `lastOpDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
             PRIMARY KEY (`uuid`),
             UNIQUE KEY `ukVmGpuPciMappingVO` (`vmInstanceUuid`, `vmPciAddress`),
             KEY `fkVmGpuPciMappingVOVmInstanceVO` (`vmInstanceUuid`),
             CONSTRAINT `fkVmGpuPciMappingVOVmInstanceVO` FOREIGN KEY (`vmInstanceUuid`) REFERENCES `zstack`.`VmInstanceVO` (`uuid`) ON DELETE CASCADE
-        ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='VM GPU PCI地址映射表';
+        ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='VM GPU PCI address mapping table';

Also applies to: 33-33

🤖 Prompt for AI Agents
In conf/db/upgrade/V5.5.0__addVmGpuPciMappingTable.sql around lines 23-26 (and
also line 33), the column COMMENTs are written in Chinese which violates the
codebase rule to use English; update each column COMMENT to clear, concise
English equivalents (e.g., "VM instance UUID", "PCI address seen inside VM",
"Host PCI address", "GPU serial number") and ensure any other Chinese comments
in the file are replaced with proper English text.

Comment on lines +13 to +17
@EntityGraph(
friends = {
@EntityGraph.Neighbour(type = VmInstanceVO.class, myField = "vmInstanceUuid", targetField = "uuid")
}
)
⚠️ Potential issue | 🟠 Major

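The EntityGraph relationship may be declared incorrectly

Based on related code (such as PrimaryStorageHistoricalUsageBaseVO.java), a foreign-key relationship should be declared with parents rather than friends. friends is typically used for peer relationships. VmGpuPciMappingVO depends on VmInstanceVO, which makes this a parent-child relationship.

🔎 Suggested fix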
 @EntityGraph(
-    friends = {
+    parents = {
         @EntityGraph.Neighbour(type = VmInstanceVO.class, myField = "vmInstanceUuid", targetField = "uuid")
     }
 )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@EntityGraph(
friends = {
@EntityGraph.Neighbour(type = VmInstanceVO.class, myField = "vmInstanceUuid", targetField = "uuid")
}
)
@EntityGraph(
parents = {
@EntityGraph.Neighbour(type = VmInstanceVO.class, myField = "vmInstanceUuid", targetField = "uuid")
}
)
🤖 Prompt for AI Agents
In header/src/main/java/org/zstack/header/vm/VmGpuPciMappingVO.java around lines
13 to 17, the EntityGraph relationship is declared with "friends" but this
mapping should be a parent relationship to VmInstanceVO; replace the friends
attribute with parents and declare the @EntityGraph.Neighbour entry under
parents (pointing myField="vmInstanceUuid" to targetField="uuid") so the VO
correctly models VmInstanceVO as its parent.
