AI Compute Platform Architecture Overview

Version v1.1 | Document date: August 7, 2025

1. Executive Summary

To maintain our leadership in global high-tech manufacturing, this document lays out a secure, scalable, and high-performance self-hosted AI compute platform. The platform will let us apply state-of-the-art generative AI to core business needs such as improving production yield, optimizing processes, and strengthening knowledge management, without relying on external cloud services, ensuring that our most valuable process data and intellectual property receive the highest level of protection. The architecture is built on open industry standards and enterprise-grade solutions, with the goal of maximizing return on investment and accelerating the innovation cycle.

2. Full-Stack Architecture

L4: Application & Use Case Layer

Business Applications

Intelligent Knowledge Base (RAG on SOPs)
Yield Analysis & Process Optimization
AOI Defect Classification

Development Frameworks

AI/ML Frameworks (e.g., PyTorch)
LLM Application Frameworks (e.g., LangChain)
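The core flow behind the RAG-on-SOPs application above can be sketched in plain Python. This is a minimal illustration, not the production design: the bag-of-words "embedding" and the in-memory search stand in for a real embedding model and vector database from the lower layers, and the SOP snippets are invented examples.

```python
import math
from collections import Counter

# Toy embedding: bag-of-words term counts. A production deployment would
# call a real embedding model served by the inference platform instead.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Stuff the retrieved SOP excerpts into a grounded LLM prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only the SOP excerpts below.\n{context}\nQuestion: {query}"

# Hypothetical SOP snippets for illustration only.
sops = [
    "SOP-101: wafer cleaning requires a 3-step rinse before etching",
    "SOP-207: AOI recalibration is performed every 500 wafers",
    "SOP-318: yield reports are exported nightly to the data lake",
]
prompt = build_prompt("how often is AOI recalibration done", sops)
```

Frameworks such as LangChain wrap exactly this retrieve-then-prompt pattern, adding connectors to real embedding models, vector stores, and LLM endpoints.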

L3: AI Platform & Management Layer

Development & Management

Enterprise AI Software Suite
Optimized Container Registry
(includes the Guest OS base, CUDA Toolkit, etc.)
HPC Workload Manager (e.g., Slurm)
AI Workload Orchestration Platform
Cloud-Native Monitoring

Runtime Environment

Inference Serving Platform
Inference Microservices Framework

Data Platform Services

Vector Database
Relational Database (RDBMS)
Data Processing & ETL Engine
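To make the ETL engine's role concrete, here is a self-contained sketch of the extract-transform-load step that would feed the relational database, e.g. aggregating per-lot yield from raw equipment logs. The log format, lot IDs, and yield figures are all invented for illustration.

```python
import csv
import io
from statistics import mean

# Hypothetical raw equipment log: lot id, process step, measured yield (%).
raw = """lot,step,yield
L001,etch,98.2
L001,litho,97.5
L002,etch,91.0
L002,litho,88.4
"""

def etl(text: str) -> dict[str, float]:
    """Extract CSV rows, transform yield strings to floats,
    and load them as per-lot average yields."""
    rows = csv.DictReader(io.StringIO(text))
    by_lot: dict[str, list[float]] = {}
    for r in rows:
        by_lot.setdefault(r["lot"], []).append(float(r["yield"]))
    return {lot: round(mean(vals), 2) for lot, vals in by_lot.items()}

summary = etl(raw)  # per-lot average yield, ready to load into the RDBMS
```

A production pipeline would run this kind of transformation at scale in the ETL engine and land the aggregates in the relational database, with raw logs retained in the object-storage data lake.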

L2: Virtualization & Orchestration Layer

Container-Native Path (primary)

Container Orchestration (e.g., Kubernetes)
GPU Operator for K8s
Hardware-level GPU Partitioning
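Hardware-level partitioning lets one physical GPU be split into isolated slices (as in NVIDIA's MIG) so that small inference services and large training jobs share hardware safely. The toy allocator below illustrates the bookkeeping only; the 7-slice granularity and workload names are illustrative, not vendor-accurate, and real scheduling is done by the orchestrator and GPU operator.

```python
# Toy bookkeeping for hardware-level GPU partitioning (MIG-style).
class GpuPartitioner:
    def __init__(self, total_slices: int = 7):
        self.free = total_slices          # e.g. 7 compute slices per GPU
        self.allocations: dict[str, int] = {}

    def allocate(self, workload: str, slices: int) -> bool:
        """Grant `slices` to a workload if enough capacity remains."""
        if slices > self.free or workload in self.allocations:
            return False
        self.free -= slices
        self.allocations[workload] = slices
        return True

    def release(self, workload: str) -> None:
        """Return a workload's slices to the free pool."""
        self.free += self.allocations.pop(workload, 0)

gpu = GpuPartitioner()
gpu.allocate("inference-svc", 2)   # small partition for serving
gpu.allocate("training-job", 4)    # larger partition for training
# gpu.free is now 1, so a further request for 3 slices is refused
```

In the container-native path, the GPU operator advertises these partitions to Kubernetes as schedulable resources, so pods request slices the same way they request CPU and memory.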

Virtualization Path (optional)

Enterprise Virtualization Platform (atop Hypervisor)

L1: Host System & Hypervisor Layer

Host OS (e.g., Enterprise Linux)
Hypervisor (e.g., KVM/ESXi)
Bare-metal GPU Drivers

L0: Infrastructure Layer

Compute Servers

GPU-accelerated Nodes (Integrated & Modular)
CPU-only Nodes (Management)

Networking

Back-end Fabric (Low-Latency Interconnect)
Front-end Fabric (High-Speed Ethernet)

Storage

Parallel File System (Hot Data)
Object Storage (Data Lake)