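These are just notes, mostly quoting the build guide at NULL0xFF / rocm-gfx803. That guide targets Ubuntu 22.04 with ROCm 6.1.5, for which the best-matching PyTorch version is 2.4.1. PyTorch and ROCm versions generally cannot be paired freely (see here), and every PyTorch version builds differently, so it is usually safer to pick one specific version and work from there.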
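The notes below record a combination I verified myself: Ubuntu 24.04.3, ROCm 6.3.3, PyTorch 2.7.1, targeting gfx803 and gfx900. Everything below can be built inside a virtual machine; a real AMD GPU is not required.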
# Add entries to the environment variables; they take effect at the next login. P.S. Putting them under /etc/environment.d/ had no effect, for reasons unknown.
# Add GFX803 related variables
echo "ROC_ENABLE_PRE_VEGA=1" | sudo tee -a /etc/environment
echo "HSA_OVERRIDE_GFX_VERSION=8.0.3" | sudo tee -a /etc/environment
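Because `tee -a` appends every time it runs, repeating the step above leaves duplicate lines in /etc/environment. A minimal sketch of an append-once helper follows; `append_env_once` is a hypothetical name introduced here for illustration, and the example runs against a scratch file so nothing under /etc is touched.

```shell
#!/bin/sh
# append_env_once FILE KEY VALUE
# Appends KEY=VALUE to FILE only when no line starting with "KEY=" exists yet.
# (Hypothetical helper for illustration; not part of the original guide.)
append_env_once() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        echo "skip: ${key} already set in ${file}"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
        echo "added: ${key}=${value}"
    fi
}

# Example against a scratch file instead of /etc/environment
env_file=$(mktemp)
append_env_once "$env_file" ROC_ENABLE_PRE_VEGA 1
append_env_once "$env_file" HSA_OVERRIDE_GFX_VERSION 8.0.3
append_env_once "$env_file" ROC_ENABLE_PRE_VEGA 1   # second call is skipped
cat "$env_file"
```

To use it on the real file, the function would have to run with root privileges, since it writes to the file directly instead of going through `sudo tee`.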
Install Miniconda 3. Run conda init as your own login user ($LOGNAME). Take careful note of the name rocm-build; it is used often later.
# Download and Install Miniconda 3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -P /tmp/
sudo bash /tmp/Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda
sudo sed -i 's|^PATH="|PATH="/opt/conda/bin:|' /etc/environment
# Logout to apply environment
exit
# Logged in
conda init
source ~/.bashrc
conda create -n rocm-build -y python=3.12
conda activate rocm-build
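Since every later build step depends on the rocm-build environment being active, a small guard at the top of a build script can catch a forgotten `conda activate`. This is a sketch only: `require_conda_env` is a hypothetical helper name, and it relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets.

```shell
#!/bin/sh
# require_conda_env NAME
# Returns non-zero (with a message on stderr) unless the conda environment
# NAME is the active one. Hypothetical helper, for illustration only.
require_conda_env() {
    if [ "${CONDA_DEFAULT_ENV:-}" != "$1" ]; then
        echo "conda environment '$1' is not active (current: ${CONDA_DEFAULT_ENV:-none})" >&2
        return 1
    fi
}

# Typical use at the top of a build script:
# require_conda_env rocm-build || exit 1
```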
# Install build dependencies
sudo apt install -y build-essential ccache git libjpeg-dev libjpeg-turbo8-dev libpng-dev libmsgpack-dev libssl-dev python3-virtualenv libboost-dev libboost1.83-dev libmsgpack-cxx-dev
P.S. The last release in the ROCm 6.3 series is 6.3.4, but the rocBLAS releases associated with ROCm 6.3 only go up to 6.3.3, so ROCm 6.3.3 is used below.
Install ROCm (the steps are basically the same up to 7.2.0)
# Setup AMD repository
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/6.3.3 noble main" | sudo tee --append /etc/apt/sources.list.d/rocm.list
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600
sudo apt update
# Install ROCm 6.3.3
sudo apt install rocm rocm-developer-tools rocm-ml-sdk rocm-ml-libraries rocm-hip-sdk rocm-hip-libraries
sudo sed -i 's|^PATH="|PATH="/opt/rocm/bin:|' /etc/environment
echo "ROCM_PATH=/opt/rocm" | sudo tee -a /etc/environment
# Logout to apply environment
exit
If you want to change the installed ROCm version at this point, edit the version number and the Ubuntu release name in /etc/apt/sources.list.d/rocm.list:
#sudo nano /etc/apt/sources.list.d/rocm.list
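Instead of editing the file by hand, the same change can be sketched with sed. `set_rocm_repo` is a hypothetical helper, and the pattern assumes the `deb` line has the form written by the setup step above (`.../rocm/apt/<version> <release> main`) and that GNU sed is available, as on Ubuntu. The example works on a scratch copy; 6.1.5 and jammy (Ubuntu 22.04) are used only as sample values.

```shell
#!/bin/sh
# set_rocm_repo FILE VERSION CODENAME
# Rewrites the ROCm version and Ubuntu release name in an apt source line of
# the form ".../rocm/apt/<version> <release> main".
# Hypothetical helper; uses GNU sed options (-i, -E).
set_rocm_repo() {
    file=$1
    version=$2
    codename=$3
    sed -i -E "s|(repo\.radeon\.com/rocm/apt/)[^ ]+ +[a-z]+ +main|\1${version} ${codename} main|" "$file"
}

# Example on a scratch copy of the source line used in this guide
list_file=$(mktemp)
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/6.3.3 noble main" > "$list_file"
set_rocm_repo "$list_file" 6.1.5 jammy
cat "$list_file"
```

On the real file the call would need root privileges, followed by `sudo apt update` so that apt picks up the new repository.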
rocBLAS
# Re-enable conda environment
conda activate rocm-build
# Download rocBLAS
sudo mkdir /opt/rocBLAS
sudo chown $LOGNAME: /opt/rocBLAS
git clone --recursive https://github.com/ROCm/rocBLAS.git -b rocm-6.3.3 /opt/rocBLAS
cd /opt/rocBLAS
# Install required packages
pip install "cmake<4.0" joblib pyyaml
# Build rocBLAS
time ./install.sh -a "gfx803;gfx900;gfx906;gfx942;gfx1100"
# Copy compiled library
sudo rsync -vrh /opt/rocBLAS/build/release/rocblas-install/lib/rocblas/library/ /opt/rocm/lib/rocblas/library/
sudo cp -f $(find /opt/rocBLAS/build/release/rocblas-install/lib/ -type f -name "librocblas.so.*") $(find /opt/rocm/ -type f -name "librocblas.so*")
This rocBLAS build compiles the gfx803 and gfx900 kernel files (hsaco and so on) that were removed starting with ROCm 6.0, and puts the references to those files into librocblas.so (from ROCm 5.0 on, the gfx803 and gfx900 files were still shipped but no longer linked in). Do not build every GPU target: with all of them included, the .so file becomes so large that linking fails later when building PyTorch, so pick only the targets you actually want. gfx906, gfx942 and gfx1100 are there because PyTorch requires them, not because I wanted to add them.
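Before copying the kernel library into /opt/rocm, it can help to confirm that the build really produced files for each target you asked for. The sketch below assumes that the generated kernel and library files carry the architecture string (for example gfx803) somewhere in their file names; that naming is an assumption on my part, and `count_arch_files` and `check_arches` are hypothetical helper names.

```shell
#!/bin/sh
# count_arch_files DIR ARCH
# Prints how many regular files under DIR have ARCH in their file name.
count_arch_files() {
    find "$1" -type f -name "*$2*" | wc -l | tr -d ' '
}

# check_arches DIR ARCH...
# Prints a per-architecture file count and returns non-zero if any ARCH has
# no matching file. Hypothetical helper, for illustration only.
check_arches() {
    dir=$1
    shift
    missing=0
    for arch in "$@"; do
        n=$(count_arch_files "$dir" "$arch")
        echo "${arch}: ${n} file(s)"
        [ "$n" -gt 0 ] || missing=1
    done
    return "$missing"
}

# Typical use after ./install.sh has finished:
# check_arches /opt/rocBLAS/build/release/rocblas-install/lib/rocblas/library/ gfx803 gfx900
```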
rocSOLVER builds in an odd way: it is best to log in through an SSH console specifically and run install.sh from there, otherwise it does nothing.
rocSOLVER
# Download rocSOLVER
sudo mkdir /opt/rocSOLVER
sudo chown $LOGNAME: /opt/rocSOLVER
git clone --recursive https://github.com/ROCm/rocSOLVER.git -b rocm-6.3.3 /opt/rocSOLVER
cd /opt/rocSOLVER
# Build rocSOLVER
time ./install.sh -a "gfx803;gfx900;gfx906;gfx942;gfx1100"
# Copy compiled library
# Preview the source and destination paths first
echo $(find ./build/release/rocsolver-install/lib/ -type f -name "librocsolver.so.*") $(find /opt/rocm/ -type f -name "librocsolver.so*")
sudo cp -f $(find ./build/release/rocsolver-install/lib/ -type f -name "librocsolver.so.*") $(find /opt/rocm/ -type f -name "librocsolver.so*")
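The `cp -f $(find ...) $(find ...)` pattern used for both libraries only does the right thing when each `find` returns exactly one file; with zero or several matches, `cp` either fails or copies to an unintended destination. A minimal guard could look like the sketch below. `single_match` is a hypothetical helper, and it does not handle file names containing newlines.

```shell
#!/bin/sh
# single_match DIR PATTERN
# Prints the path of the one regular file under DIR whose name matches
# PATTERN; returns non-zero if there are zero or several matches.
# Hypothetical helper, for illustration only.
single_match() {
    matches=$(find "$1" -type f -name "$2")
    # grep -c prints 0 but exits non-zero on no match, hence "|| true"
    count=$(printf '%s\n' "$matches" | grep -c . || true)
    if [ "$count" -ne 1 ]; then
        echo "expected exactly one match for $2 under $1, found $count" >&2
        return 1
    fi
    printf '%s\n' "$matches"
}

# Typical use for the rocSOLVER copy step:
# src=$(single_match ./build/release/rocsolver-install/lib/ "librocsolver.so.*") &&
# dst=$(single_match /opt/rocm/ "librocsolver.so*") &&
# sudo cp -f "$src" "$dst"
```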
# Reboot
Rebooting is only needed if you are building on the target machine.
# Re-enable conda environment
conda activate rocm-build
# Download PyTorch
sudo mkdir /opt/pytorch
sudo chown $LOGNAME: /opt/pytorch
git clone --recursive https://github.com/pytorch/pytorch.git -b v2.7.1 /opt/pytorch
cd /opt/pytorch
# Install required packages
pip install mkl-static mkl-include ninja -r requirements.txt
# Build PyTorch
export PYTORCH_ROCM_ARCH="gfx803;gfx900"
export PYTORCH_BUILD_VERSION=2.7.1 PYTORCH_BUILD_NUMBER=1
python tools/amd_build/build_amd.py
time python setup.py bdist_wheel
# Install PyTorch
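pip install /opt/pytorch/dist/torch-2.7.1-cp312-cp312-linux_x86_64.whl
P.S. 1: I tried building 2.5.0 and 2.5.1 and both failed. I have not tried 2.6. 2.7.1 builds cleanly. 2.8.0 fails with a SHA256 problem on a downloaded file, and I could not find where to change it. 2.9.0 and 2.10.0 have a list of accepted GPUs specified in the aotriton code, which of course does not include gfx803 or gfx900.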
P.S. 2: After building 2.4.x or 2.5.x, remember to clear out ~/.triton/cache and ~/.triton/dump.
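The cleanup in the note above can be sketched as a small function. `clean_triton_cache` is a hypothetical helper name; it removes only the two directories named in the note and takes an optional base directory so it can be tried on a scratch location first.

```shell
#!/bin/sh
# clean_triton_cache [HOME_DIR]
# Removes the .triton/cache and .triton/dump directories under HOME_DIR
# (default: $HOME). Anything else under .triton is left alone.
# Hypothetical helper, for illustration only.
clean_triton_cache() {
    base=${1:-$HOME}
    for sub in cache dump; do
        target="$base/.triton/$sub"
        if [ -d "$target" ]; then
            rm -rf "$target"
            echo "removed $target"
        fi
    done
}

# Typical use after a build:
# clean_triton_cache
```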
After that, follow the notes at https://github.com/NULL0xFF/rocm-gfx803?tab=readme-ov-file.
I will post further versions here as I get them working.