Wednesday, February 11, 2026

ROCm 6.4.4 & pytorch 2.9.1 gfx803 / gfx900 build note

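On second thought, I'll leave the previous post as notes and write up PyTorch 2.9.1 in a separate post.

The notes below cover a combination I got working through my own experiments: Ubuntu 24.04.3, ROCm 6.4.4, PyTorch 2.9.1, targeting gfx803 and gfx900. Everything below can be compiled in a virtual machine; you don't need an actual AMD GPU to build. ROCm 7, which PyTorch supports starting with 2.10, merely compiles; on real hardware it cannot drive an RX580 (gfx803). There may be a way around this limitation by installing a different amdgpu-install version, but I haven't found one so far, so this post sticks to ROCm 6.4.4 (the last 6.x release) and PyTorch 2.9.1.

(You still have to pay attention to which PyTorch version goes with which ROCm version.)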

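# First, add boot parameters in /etc/default/grub: append "intel_iommu=on iommu=pt" (space-separated) to GRUB_CMDLINE_LINUX_DEFAULT. intel_iommu is for Intel; on AMD Ryzen platforms amd_iommu is on by default, and for pre-Ryzen systems it depends on the motherboard.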
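
A minimal sketch of making that edit from the shell instead of by hand. `add_grub_params` is a hypothetical helper name, and it assumes the file has a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, as a stock Ubuntu install does. It skips parameters that are already present, so re-running it is harmless.

```shell
# add_grub_params FILE PARAM...
# Hypothetical helper: append each kernel parameter to the
# GRUB_CMDLINE_LINUX_DEFAULT line of FILE unless it is already there.
add_grub_params() {
    file="$1"
    shift
    for param in "$@"; do
        # Read the current value between the double quotes.
        current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
        case " $current " in
            *" $param "*)
                # Already present as a whole word: nothing to do.
                ;;
            *)
                sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"|" "$file"
                ;;
        esac
    done
}

# Example, in a root shell, after backing up /etc/default/grub:
#   add_grub_params /etc/default/grub intel_iommu=on iommu=pt
#   update-grub
```

The change only reaches the kernel command line after `sudo update-grub` and a reboot.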

# Add these to the environment variables; they take effect at the next login. PS: putting them in /etc/environment.d/ had no effect, for reasons unknown.

# Add GFX803 related variables
echo "ROC_ENABLE_PRE_VEGA=1" | sudo tee -a /etc/environment
echo "HSA_OVERRIDE_GFX_VERSION=8.0.3" | sudo tee -a /etc/environment
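
Since these values only apply from the next login, it can help to confirm they actually reached the session before starting a long build. A small sketch; `check_vars` is a hypothetical helper name, not part of ROCm.

```shell
# check_vars NAME...
# Hypothetical helper: print NAME=value for each variable that is set and
# non-empty, complain on stderr otherwise, and return 1 if any is missing.
check_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX sh compatible).
        eval "value=\${$name-}"
        if [ -n "$value" ]; then
            echo "$name=$value"
        else
            echo "$name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example, after logging back in:
#   check_vars ROC_ENABLE_PRE_VEGA HSA_OVERRIDE_GFX_VERSION
```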

Install Miniconda 3. conda init is run as $LOGUSER, i.e. your login user. Take careful note of the name rocm-build; it is used a lot below.

Miniconda 3

# Download and Install Miniconda 3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -P /tmp/
sudo bash /tmp/Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda
sudo sed -i 's|^PATH="|PATH="/opt/conda/bin:|' /etc/environment

# Logout to apply environment
exit

# Logged in
conda init
source ~/.bashrc
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
conda create -n rocm-build -y python=3.12
conda activate rocm-build
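
The sed edit of /etc/environment in the steps above prepends to PATH unconditionally, so running it a second time leaves a duplicate entry. A sketch of an idempotent variant follows; `prepend_env_path` is a hypothetical helper name, and it assumes the file has a double-quoted `PATH="..."` line, as Ubuntu's default /etc/environment does.

```shell
# prepend_env_path FILE DIR
# Hypothetical helper: put DIR at the front of the PATH="..." line in FILE,
# unless DIR is already one of the entries.
prepend_env_path() {
    file="$1"
    dir="$2"
    # Match DIR as a complete entry: at the start, middle, or end of the list.
    if grep -Eq "^PATH=\"([^\"]*:)?${dir}(:[^\"]*)?\"" "$file"; then
        return 0
    fi
    sed -i "s|^PATH=\"|PATH=\"${dir}:|" "$file"
}

# Example, in a root shell:
#   prepend_env_path /etc/environment /opt/conda/bin
```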

Install Prerequisites

sudo apt install -y build-essential ccache git libjpeg-dev \
libjpeg-turbo8-dev libpng-dev libmsgpack-dev libssl-dev \
python3-virtualenv libboost-dev libboost1.83-dev libmsgpack-cxx-dev \
ninja-build

ROCm 6.4.4

# Setup AMD repository
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/6.4.4 noble main" | sudo tee --append /etc/apt/sources.list.d/rocm.list
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600
sudo apt update

# Install ROCm 6.4.4
sudo apt install rocm rocm-developer-tools rocm-ml-sdk rocm-ml-libraries rocm-hip-sdk rocm-hip-libraries
sudo sed -i 's|^PATH="|PATH="/opt/rocm/bin:|' /etc/environment
echo "ROCM_PATH=/opt/rocm" | sudo tee -a /etc/environment

# Logout to apply environment
exit

If you want to change which ROCm version gets installed at this point, edit the version number and the Ubuntu release name (noble here) in /etc/apt/sources.list.d/rocm.list.
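
As a sketch, the same change can be scripted with sed rather than an editor. `set_rocm_repo` is a hypothetical helper name, and it assumes the file contains a line in exactly the form written by the setup step above.

```shell
# set_rocm_repo FILE VERSION CODENAME
# Hypothetical helper: rewrite the ROCm version and Ubuntu codename in an
# apt source line of the form ".../rocm/apt/<version> <codename> main".
set_rocm_repo() {
    file="$1"
    version="$2"
    codename="$3"
    sed -i -E "s|(https://repo\.radeon\.com/rocm/apt/)[^ ]+ [a-z]+ main|\1${version} ${codename} main|" "$file"
}

# Example, in a root shell, followed by an index refresh:
#   set_rocm_repo /etc/apt/sources.list.d/rocm.list 6.4.4 noble
#   apt update
```

Whether a given version and codename pair actually exists on repo.radeon.com has to be checked separately; the helper only rewrites the text.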

#sudo nano /etc/apt/sources.list.d/rocm.list 

PS: as of writing, I have not seen support for Ubuntu 26.04 yet.

rocBLAS 

# Re-enable conda environment
conda activate rocm-build

# Download rocBLAS
sudo mkdir /opt/rocBLAS
sudo chown $LOGNAME: /opt/rocBLAS
git clone --recursive https://github.com/ROCm/rocBLAS.git -b rocm-6.4.4 /opt/rocBLAS
cd /opt/rocBLAS

# Install required packages
pip install "cmake<4.0" joblib pyyaml virtualenv typing-extensions

# Build rocBLAS
time ./install.sh -a "gfx803;gfx900;gfx90a;gfx942;gfx1100" -b rocm-6.4.4

# Copy compiled library
sudo rsync -vrh /opt/rocBLAS/build/release/rocblas-install/lib/rocblas/library/ /opt/rocm/lib/rocblas/library/
sudo cp -f $(find /opt/rocBLAS/build/release/rocblas-install/lib/ -type f -name "librocblas.so.*") $(find /opt/rocm/ -type f -name "librocblas.so*")
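
A quick sanity check after the copy, as a sketch: count the files in the library directory whose names mention a given architecture. `count_arch_files` is a hypothetical helper name, and it assumes the per-architecture files carry the gfx name in their filenames; a count of 0 for gfx803 would suggest the copy or the build did not include that target.

```shell
# count_arch_files DIR ARCH
# Hypothetical helper: print how many regular files under DIR have ARCH
# somewhere in their filename.
count_arch_files() {
    find "$1" -type f -name "*$2*" | wc -l | tr -d ' '
}

# Example:
#   count_arch_files /opt/rocm/lib/rocblas/library gfx803
#   count_arch_files /opt/rocm/lib/rocblas/library gfx900
```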

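A very important new parameter here for rocBLAS is -b, which specifies the Tensile branch. It is best to point it at the same version as the rocBLAS branch; otherwise the default is to fetch the latest version, which often either does not compile at all or does not exist at all. When I wrote this post, the build pointed Tensile at 4.45 while GitHub had only released 4.43. This only makes me more certain that the people producing this never tested whether it installs properly before shipping it.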
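
Because a missing Tensile ref only shows up partway into the build, it may be worth checking that the ref exists before starting. A sketch using `git ls-remote`; `ref_exists` is a hypothetical helper name, and the Tensile repository URL in the example is an assumption, not something taken from the rocBLAS build scripts.

```shell
# ref_exists REPO NAME
# Hypothetical helper: succeed only if REPO has a branch or tag called NAME.
# --exit-code makes git return non-zero when nothing matches.
ref_exists() {
    git ls-remote --exit-code --heads --tags "$1" "$2" >/dev/null 2>&1
}

# Example (repository URL is an assumption):
#   if ref_exists https://github.com/ROCm/Tensile.git rocm-6.4.4; then
#       ./install.sh -a "gfx803;gfx900" -b rocm-6.4.4
#   else
#       echo "Tensile ref rocm-6.4.4 not found" >&2
#   fi
```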

rocSOLVER

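PS: rocSOLVER can only be given a Tensile version starting with ROCm 7.0. For 7.0 and later, I recommend passing -b to install.sh with the Tensile version, the same as for rocBLAS.

(Update: specific to ROCm 7.0 and above) rocSOLVER builds in an odd way; it is best to connect to an SSH console and run install.sh from there, otherwise it does nothing.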

# Download rocSOLVER
sudo mkdir /opt/rocSOLVER
sudo chown $LOGNAME: /opt/rocSOLVER
git clone --recursive https://github.com/ROCm/rocSOLVER.git -b rocm-6.4.4 /opt/rocSOLVER
cd /opt/rocSOLVER

# Build rocSOLVER
time ./install.sh -a "gfx803;gfx900;gfx90a;gfx942;gfx1100"


# Copy compiled library
sudo cp -f $(find ./build/release/rocsolver-install/lib/ -type f -name "librocsolver.so.*") $(find /opt/rocm/ -type f -name "librocsolver.so*")

# Reboot 

You only need to reboot if you built on the target machine.

PyTorch 2.9.1

# Re-enable conda environment
conda activate rocm-build

# Download PyTorch
sudo mkdir /opt/pytorch
sudo chown $LOGNAME: /opt/pytorch
git clone --recursive https://github.com/pytorch/pytorch.git -b v2.9.1 /opt/pytorch
cd /opt/pytorch

# Install required packages
pip install mkl-static mkl-include -r requirements.txt

# Build PyTorch
export PYTORCH_ROCM_ARCH="gfx803;gfx900"
export PYTORCH_BUILD_VERSION=2.9.1 PYTORCH_BUILD_NUMBER=1
python tools/amd_build/build_amd.py
time python setup.py bdist_wheel

# Install PyTorch
pip install /opt/pytorch/dist/torch-2.9.1-cp312-cp312-linux_x86_64.whl
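
The wheel filename depends on the build version and the Python version, so a sketch that picks up whatever the build produced may be less brittle than the hard-coded path. `latest_wheel` is a hypothetical helper name. The commented check afterwards uses `torch.version.hip` and `torch.cuda.is_available()`, which on a ROCm build report the HIP version and whether a GPU is usable.

```shell
# latest_wheel DIR
# Hypothetical helper: print the most recently modified torch wheel in DIR,
# or return non-zero if there is none.
latest_wheel() {
    wheel=$(ls -t "$1"/torch-*.whl 2>/dev/null | head -n 1)
    [ -n "$wheel" ] && echo "$wheel"
}

# Example:
#   pip install "$(latest_wheel /opt/pytorch/dist)"
#   python -c 'import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())'
```

In a virtual machine without a GPU the last value is expected to be False; it only becomes meaningful on the target machine.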

PS1. The torchvision version to build is 0.24.1; like PyTorch, it has a matching version for each release.
https://github.com/pytorch/vision

PS2. The corresponding torchaudio version is 2.9.1.

After that, follow the notes at https://github.com/NULL0xFF/rocm-gfx803?tab=readme-ov-file.
I will post further versions here as I get them working.

Update: damn it, after three weeks of trying a pile of combinations, it finally runs....

MNIST PyTorch example

  1. Clone the PyTorch examples repository.

    git clone https://github.com/pytorch/examples.git
    
  2. Go to the MNIST example folder.

    cd examples/mnist

The above is quoted from https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html


