Saturday, February 7, 2026

ROCm 7.1.1 & pytorch 2.10.0 gfx803 / gfx900 build note

On second thought, I am keeping the previous post as a notebook and starting a new post for pytorch 2.10.0.

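These notes record a combination I got working by experiment: ubuntu 24.04.3, ROCm 7.1.1, pytorch 2.10.0, targeting gfx803 and gfx900. Everything below can be compiled inside a virtual machine; you do not need an actual AMD GPU. I picked pytorch 2.10.0 because that is where upstream says ROCm 7.0 (7.1) support begins, while 2.9.0 is officially paired with ROCm 6.4. As for choosing ROCm 7.1.1 over 7.2.0, I think the problem is on the pytorch side: after building rocBLAS and rocSOLVER against ROCm 7.2.0, neither pytorch 2.9.0 nor 2.10.0 would ever compile, but after dropping back to 7.1.1 it worked on the first try.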

(So the advice from the previous post still stands: check carefully which pytorch version goes with which ROCm version.)

# Add variables to the environment; they take effect at the next login. ps. Adding them under /etc/environment.d/ had no effect, reason unknown.

# Add GFX803 related variables
echo "ROC_ENABLE_PRE_VEGA=1" | sudo tee -a /etc/environment
echo "HSA_OVERRIDE_GFX_VERSION=8.0.3" | sudo tee -a /etc/environment
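Because `tee -a` appends unconditionally, running the two lines above a second time leaves duplicate entries in /etc/environment. Here is a minimal sketch of an idempotent variant. `set_env_line` is a hypothetical helper name of my own, not part of any tool, and it takes the file as an argument so it can be tried on a scratch copy first.

```shell
# set_env_line FILE KEY VALUE  (hypothetical helper, not from any package)
# Replaces an existing KEY=... line in FILE, or appends one if KEY is absent.
set_env_line() {
    file=$1 key=$2 value=$3
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        # Key already present: rewrite that line in place (GNU sed).
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        # Key absent (or file missing): append a new line.
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example on a scratch copy:
#   cp /etc/environment /tmp/environment.test
#   set_env_line /tmp/environment.test ROC_ENABLE_PRE_VEGA 1
#   set_env_line /tmp/environment.test HSA_OVERRIDE_GFX_VERSION 8.0.3
```

The sketch assumes values contain no `|` character, which holds for the two variables used here.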

Install miniconda 3. conda init is run as $LOGUSER. Make a point of remembering the name rocm-build; it comes up a lot later.

# Download and Install Miniconda 3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -P /tmp/
sudo bash /tmp/Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda
sudo sed -i 's|^PATH="|PATH="/opt/conda/bin:|' /etc/environment

# Logout to apply environment
exit

# Logged in
conda init
source ~/.bashrc
conda create -n rocm-build -y python=3.12
conda activate rocm-build
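The `sed` line that prepends /opt/conda/bin only works when /etc/environment already has a line starting with `PATH="`, and running it twice prepends the directory twice. Here is a hedged sketch of a guarded version. `prepend_path_dir` is a hypothetical helper name, and the file is a parameter so it can be tested on a copy.

```shell
# prepend_path_dir FILE DIR  (hypothetical helper)
# Prepends DIR to the PATH="..." line of FILE unless DIR is already listed.
# Returns 1 if FILE has no line starting with PATH=" (the plain sed would
# silently change nothing in that case).
prepend_path_dir() {
    file=$1 dir=$2
    if ! grep -q '^PATH="' "$file"; then
        echo "no PATH=\" line in $file" >&2
        return 1
    fi
    # Already listed: DIR appears right after the opening quote or a colon,
    # and is followed by a colon or the closing quote.
    if grep -q "^PATH=\"\(.*:\)\{0,1\}${dir}[:\"]" "$file"; then
        return 0
    fi
    sed -i "s|^PATH=\"|PATH=\"${dir}:|" "$file"
}

# Example on a scratch copy:
#   cp /etc/environment /tmp/environment.test
#   prepend_path_dir /tmp/environment.test /opt/conda/bin
```

The same helper would cover the later /opt/rocm/bin step.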

Install Prerequisites

sudo apt install -y build-essential ccache git libjpeg-dev libjpeg-turbo8-dev libpng-dev libmsgpack-dev libssl-dev python3-virtualenv libboost-dev libboost1.83-dev libmsgpack-cxx-dev 

ps. This uses ROCm 7.1.1 rather than the latest 7.2.0, for the reason given above.

Install ROCm

ROCm 7.1.1

# Setup AMD repository
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/7.1.1 noble main" | sudo tee --append /etc/apt/sources.list.d/rocm.list
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600
sudo apt update

# Install ROCm 7.1.1
sudo apt install rocm rocm-developer-tools rocm-ml-sdk rocm-ml-libraries rocm-hip-sdk rocm-hip-libraries
sudo sed -i 's|^PATH="|PATH="/opt/rocm/bin:|' /etc/environment
echo "ROCM_PATH=/opt/rocm" | sudo tee -a /etc/environment

# Logout to apply environment
exit

To install a different ROCm version at this point, change the version number and the ubuntu release field (noble here) in /etc/apt/sources.list.d/rocm.list:

#sudo nano /etc/apt/sources.list.d/rocm.list 
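If you would rather not edit the file by hand, the same change can be sketched with `sed`. `set_rocm_repo` is a hypothetical helper of my own, and it assumes the file contains a line in the same shape as the one written in the repository setup step (`.../rocm/apt/<version> <codename> main`).

```shell
# set_rocm_repo FILE VERSION CODENAME  (hypothetical helper)
# Rewrites the ROCm version and Ubuntu codename in a repo.radeon.com apt line.
set_rocm_repo() {
    file=$1 version=$2 codename=$3
    # Match ".../rocm/apt/<old-version> <old-codename> main" and swap both fields.
    sed -i -E "s|(repo\.radeon\.com/rocm/apt/)[^ ]+ [^ ]+ main|\1${version} ${codename} main|" "$file"
}

# Example on a scratch copy; run `sudo apt update` after changing the real file:
#   cp /etc/apt/sources.list.d/rocm.list /tmp/rocm.list
#   set_rocm_repo /tmp/rocm.list 7.1.1 noble
```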

ps. As of writing, I have not seen support for ubuntu 26.04 yet.

rocBLAS 

# Re-enable conda environment
conda activate rocm-build

# Download rocBLAS
sudo mkdir /opt/rocBLAS
sudo chown $LOGNAME: /opt/rocBLAS
git clone --recursive https://github.com/ROCm/rocBLAS.git -b rocm-7.1.1 /opt/rocBLAS
cd /opt/rocBLAS

# Install required packages
pip install "cmake<4.0" joblib pyyaml

# Build rocBLAS
time ./install.sh -a "gfx803;gfx900;gfx90a;gfx942;gfx1100" -b rocm-7.1.1

# Copy compiled library
sudo rsync -vrh /opt/rocBLAS/build/release/rocblas-install/lib/rocblas/library/ /opt/rocm/lib/rocblas/library/
sudo cp -f $(find /opt/rocBLAS/build/release/rocblas-install/lib/ -type f -name "librocblas.so.*") $(find /opt/rocm/ -type f -name "librocblas.so*")
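After the rsync it is worth checking that kernel files for the old targets actually landed in the library directory. This is a rough sketch that assumes Tensile puts the architecture name (for example gfx803) in each file name. That naming is an assumption, and `count_arch_files` is a hypothetical helper.

```shell
# count_arch_files DIR ARCH  (hypothetical helper)
# Prints how many regular files directly under DIR have ARCH in their name.
# Assumption: the rocBLAS/Tensile library files carry the arch in the file name.
count_arch_files() {
    find "$1" -maxdepth 1 -type f -name "*$2*" | wc -l | tr -d ' '
}

# Example:
#   for arch in gfx803 gfx900; do
#       echo "$arch: $(count_arch_files /opt/rocm/lib/rocblas/library "$arch") files"
#   done
```

A count of 0 for gfx803 or gfx900 would suggest the copy did not pick up the freshly built files.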

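A very important new parameter for rocBLAS here is -b. It specifies the Tensile branch, and it is best to point it at the same version as the rocBLAS branch. Otherwise the default is to fetch the latest version, which often either does not build at all or does not even exist. When I built this while writing the post, Tensile pointed to 4.45, but github only had releases up to 4.43. This only makes me more certain that whoever puts these things out never tested whether they install cleanly.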
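Since a missing Tensile version is exactly what bit me, it can help to confirm that the tag exists before starting a long build. This is a small sketch using `git ls-remote`. `remote_tag_exists` is a hypothetical helper, and the Tensile URL in the example is my assumption about where install.sh fetches from.

```shell
# remote_tag_exists REPO TAG  (hypothetical helper)
# Succeeds only if REPO (a URL or a local path) has a tag named TAG.
# --exit-code makes git return non-zero when no matching ref is found.
remote_tag_exists() {
    git ls-remote --exit-code --tags "$1" "refs/tags/$2" >/dev/null 2>&1
}

# Example (URL is an assumption):
#   remote_tag_exists https://github.com/ROCm/Tensile.git rocm-7.1.1 \
#       || echo "Tensile tag rocm-7.1.1 not found"
```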

rocSOLVER is the same as rocBLAS: install.sh needs the extra -b to specify the Tensile version.

The rocSOLVER build behaves oddly. It is best to connect over an ssh console specifically and run install.sh from there; otherwise it does nothing.

rocSOLVER

# Download rocSOLVER
sudo mkdir /opt/rocSOLVER
sudo chown $LOGNAME: /opt/rocSOLVER
git clone --recursive https://github.com/ROCm/rocSOLVER.git -b rocm-7.1.1 /opt/rocSOLVER
cd /opt/rocSOLVER

# Build rocSOLVER
time ./install.sh -a "gfx803;gfx900;gfx90a;gfx942;gfx1100" -b rocm-7.1.1

# Copy compiled library
sudo cp -f $(find /opt/rocBLAS/build/release/rocblas-install/lib/ -type f -name "librocblas.so.*") $(find /opt/rocm/ -type f -name "librocblas.so*")
echo $(find ./build/release/rocsolver-install/lib/ -type f -name "librocsolver.so.*") $(find /opt/rocm/ -type f -name "librocsolver.so*")

# Reboot

You only need to reboot if you built on the target machine.
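The rocSOLVER listing above only echoes the librocsolver paths. If the intent mirrors the rocBLAS step (an assumption on my part), the built librocsolver file would be copied over the installed one in the same way. The `cp -f $(find ...) $(find ...)` pattern also relies on each `find` returning exactly one file, so here is a hedged sketch that checks this before copying. `replace_single_lib` and the `SUDO` variable are hypothetical names.

```shell
# replace_single_lib SRC_DIR SRC_PATTERN DST_DIR DST_PATTERN  (hypothetical helper)
# Copies the one regular file matching SRC_PATTERN under SRC_DIR over the one
# regular file matching DST_PATTERN under DST_DIR. Fails without copying if
# either side does not match exactly one file.
replace_single_lib() {
    src=$(find "$1" -type f -name "$2")
    dst=$(find "$3" -type f -name "$4")
    for match in "$src" "$dst"; do
        # grep -c . counts non-empty lines, i.e. the number of paths found.
        if [ "$(printf '%s\n' "$match" | grep -c .)" -ne 1 ]; then
            echo "expected exactly one match, got: ${match:-none}" >&2
            return 1
        fi
    done
    # SUDO is optional; set SUDO=sudo when the destination needs root.
    ${SUDO:-} cp -f "$src" "$dst"
}

# Example, run from /opt/rocSOLVER:
#   SUDO=sudo replace_single_lib ./build/release/rocsolver-install/lib/ "librocsolver.so.*" \
#       /opt/rocm/ "librocsolver.so*"
```

If either side matches several files, the helper stops and prints what it found, which is easier to debug than a `cp` with an unexpected argument list.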

Install PyTorch

PyTorch 2.10.0

# Re-enable conda environment
conda activate rocm-build

# Download PyTorch
sudo mkdir /opt/pytorch
sudo chown $LOGNAME: /opt/pytorch
git clone --recursive https://github.com/pytorch/pytorch.git -b v2.10.0 /opt/pytorch
cd /opt/pytorch

# Install required packages
pip install mkl-static mkl-include ninja -r requirements.txt

# Build PyTorch
export PYTORCH_ROCM_ARCH="gfx803;gfx900"
export PYTORCH_BUILD_VERSION=2.10.0 PYTORCH_BUILD_NUMBER=1
python tools/amd_build/build_amd.py
time python setup.py bdist_wheel

# Install PyTorch
pip install /opt/pytorch/dist/torch-2.10.0-cp312-cp312-linux_x86_64.whl
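The wheel file name depends on the build version and the Python version, so a hard-coded path breaks as soon as either changes. Here is a minimal sketch that picks the newest wheel from the dist directory instead. `newest_wheel` is a hypothetical helper. The check at the end uses `torch.version.hip` and `torch.cuda.is_available()`; on a VM without a GPU the latter is expected to print False.

```shell
# newest_wheel DIR  (hypothetical helper)
# Prints the most recently modified torch-*.whl in DIR; returns 1 if none exists.
newest_wheel() {
    wheel=$(ls -t "$1"/torch-*.whl 2>/dev/null | head -n 1)
    [ -n "$wheel" ] && printf '%s\n' "$wheel"
}

# Example:
#   pip install "$(newest_wheel /opt/pytorch/dist)"
#   python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"
```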

ps1. pytorch 2.4.1 & 2.7.1 compile successfully on ROCm 6.3.3, but not on ROCm 7.1.1.

ps2. torchvision must be built at version 0.25; its version numbers are also paired with pytorch's.
https://github.com/pytorch/vision

After that, follow the notes at https://github.com/NULL0xFF/rocm-gfx803?tab=readme-ov-file.

I will post further versions here once I get them working.


