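On second thought, I'm keeping the previous post as notes and writing a separate post for PyTorch 2.10.0.
These notes describe the combination I got working through my own experiments: Ubuntu 24.04.3, ROCm 7.1.1, PyTorch 2.10.0, targeting gfx803 and gfx900. Everything below can be compiled inside a virtual machine; you don't need an actual AMD GPU. I picked PyTorch 2.10.0 because that is the release where ROCm 7.0 (7.1) is officially supported, while 2.9.0 officially targets ROCm 6.4. As for choosing ROCm 7.1.1 over 7.2.0, I think the problem is on PyTorch's side: after building rocBLAS and rocSOLVER with ROCm 7.2.0, neither PyTorch 2.9.0 nor 2.10.0 would compile, and after dropping back to 7.1.1 it worked on the first try.
(So the advice from the previous post still stands: check carefully which PyTorch version goes with which ROCm version.)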
# Add entries to the environment variables; they take effect at the next login. P.S. Putting them under /etc/environment.d/ has no effect, for reasons unknown.
# Add GFX803 related variables
echo "ROC_ENABLE_PRE_VEGA=1" | sudo tee -a /etc/environment
echo "HSA_OVERRIDE_GFX_VERSION=8.0.3" | sudo tee -a /etc/environment
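Since these variables only appear after the next login, it can be worth checking that they actually arrived. The following is a minimal sketch, not part of my original steps; the helper name env_file_has is made up for illustration, and it only does an exact KEY=VALUE line match.

```shell
# Hypothetical helper: succeed if FILE contains the exact line KEY=VALUE.
# $1 = file to inspect, $2 = variable name, $3 = expected value
env_file_has() {
    grep -qxF -- "$2=$3" "$1"
}

# Usage after logging back in (values are the ones written above):
#   env_file_has /etc/environment ROC_ENABLE_PRE_VEGA 1 && echo "file ok"
#   [ "${HSA_OVERRIDE_GFX_VERSION:-}" = "8.0.3" ] && echo "session ok"
```

The first check looks at the file, the second at the running session, so a mismatch between the two usually just means you have not logged out and back in yet.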
Install Miniconda 3. Run conda init as your own login user ($LOGNAME), not as root. Take careful note of the name rocm-build; it comes up a lot later on.
# Download and Install Miniconda 3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -P /tmp/
sudo bash /tmp/Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda
sudo sed -i 's|^PATH="|PATH="/opt/conda/bin:|' /etc/environment
# Logout to apply environment
exit
# Logged in
conda init
source ~/.bashrc
conda create -n rocm-build -y python=3.12
conda activate rocm-build
# Install build dependencies
sudo apt install -y build-essential ccache git libjpeg-dev libjpeg-turbo8-dev libpng-dev libmsgpack-dev libssl-dev python3-virtualenv libboost-dev libboost1.83-dev libmsgpack-cxx-dev
# Setup AMD repository
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/7.1.1 noble main" | sudo tee --append /etc/apt/sources.list.d/rocm.list
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600
sudo apt update
# Install ROCm 7.1.1
sudo apt install rocm rocm-developer-tools rocm-ml-sdk rocm-ml-libraries rocm-hip-sdk rocm-hip-libraries
sudo sed -i 's|^PATH="|PATH="/opt/rocm/bin:|' /etc/environment
echo "ROCM_PATH=/opt/rocm" | sudo tee -a /etc/environment
# Logout to apply environment
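exit
If you want to change the installed ROCm version at this point, edit the version number and the Ubuntu release name (noble in the line above) in /etc/apt/sources.list.d/rocm.list: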
#sudo nano /etc/apt/sources.list.d/rocm.list
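If you would rather not edit the file by hand, the same change can be sketched with sed. This is only an illustration under my own assumptions: the helper name rewrite_rocm_repo is made up, it expects the single deb line format written earlier, and whether a given version/release combination actually exists on repo.radeon.com is something you have to check yourself.

```shell
# Hypothetical helper: rewrite the ROCm version and Ubuntu release name in a
# rocm.list-style "deb" line. Reads stdin, writes stdout, touches nothing.
# $1 = ROCm version (e.g. 7.1.1), $2 = Ubuntu release name (e.g. noble)
rewrite_rocm_repo() {
    sed -E "s|(https://repo\.radeon\.com/rocm/apt/)[^ ]+ +[^ ]+ |\1$1 $2 |"
}

# Usage: write the result to a scratch file, inspect it, then copy it back.
#   rewrite_rocm_repo 7.1.1 noble < /etc/apt/sources.list.d/rocm.list > /tmp/rocm.list
#   cat /tmp/rocm.list
#   sudo cp /tmp/rocm.list /etc/apt/sources.list.d/rocm.list && sudo apt update
```

Going through a scratch file keeps the original list intact until you have looked at the result.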
rocBLAS
# Re-enable conda environment
conda activate rocm-build
# Download rocBLAS
sudo mkdir /opt/rocBLAS
sudo chown $LOGNAME: /opt/rocBLAS
git clone --recursive https://github.com/ROCm/rocBLAS.git -b rocm-7.1.1 /opt/rocBLAS
cd /opt/rocBLAS
# Install required packages
pip install "cmake<4.0" joblib pyyaml
# Build rocBLAS
time ./install.sh -a "gfx803;gfx900;gfx90a;gfx942;gfx1100" -b rocm-7.1.1
# Copy compiled library
sudo rsync -vrh /opt/rocBLAS/build/release/rocblas-install/lib/rocblas/library/ /opt/rocm/lib/rocblas/library/
sudo cp -f $(find /opt/rocBLAS/build/release/rocblas-install/lib/ -type f -name "librocblas.so.*") $(find /opt/rocm/ -type f -name "librocblas.so*")
rocBLAS gets a very important new parameter here: -b. The -b parameter selects the Tensile branch, and it is best to point it at the same version as the rocBLAS branch you checked out. Otherwise the default is to fetch the latest version, and the latest version often either fails to build at all or does not exist at all. When I wrote this post, the build pointed Tensile at 4.45, while GitHub only had releases up to 4.43. This only makes me more convinced that whoever ships these things threw them out without ever testing whether they install properly.
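Because a missing branch or tag is exactly what bit me here, checking that a ref exists before starting a long build can save some time. Below is a rough sketch using git ls-remote --exit-code; the helper name ref_exists is made up, and I am only showing it against the rocBLAS URL from this post. The Tensile repository can be checked the same way once you know its URL.

```shell
# Hypothetical helper: succeed if REPO has a tag or branch named REF.
# $1 = repository URL or local path, $2 = tag or branch name
ref_exists() {
    # --exit-code makes git return non-zero when no matching ref is found.
    git ls-remote --exit-code "$1" "refs/tags/$2" "refs/heads/$2" > /dev/null 2>&1
}

# Usage:
#   ref_exists https://github.com/ROCm/rocBLAS.git rocm-7.1.1 && echo "ref found"
```

Note that this only tells you the ref exists, not that it builds.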
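As with rocBLAS, rocSOLVER's install.sh needs the extra -b option to specify the Tensile version.
rocSOLVER builds in an odd way: it is best to connect to an SSH console and run install.sh from there, otherwise it does nothing.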
rocSOLVER
# Download rocSOLVER
sudo mkdir /opt/rocSOLVER
sudo chown $LOGNAME: /opt/rocSOLVER
git clone --recursive https://github.com/ROCm/rocSOLVER.git -b rocm-7.1.1 /opt/rocSOLVER
cd /opt/rocSOLVER
# Build rocSOLVER
time ./install.sh -a "gfx803;gfx900;gfx90a;gfx942;gfx1100" -b rocm-7.1.1
# Copy compiled library
sudo cp -f $(find ./build/release/rocsolver-install/lib/ -type f -name "librocsolver.so.*") $(find /opt/rocm/ -type f -name "librocsolver.so*")
# Reboot
You only need to reboot if you built on the target machine itself.
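The copy steps for rocBLAS and rocSOLVER pass two find results straight to cp, which only does the right thing when each find returns exactly one file. Here is a minimal sketch of a guard for that; the helper name copy_single_match is made up, and it is not what I actually ran.

```shell
# Hypothetical helper: copy the single file matching SRC_PATTERN under SRC_DIR
# over the single file matching DST_PATTERN under DST_DIR; refuse otherwise.
# $1 = SRC_DIR, $2 = SRC_PATTERN, $3 = DST_DIR, $4 = DST_PATTERN
copy_single_match() {
    src=$(find "$1" -type f -name "$2")
    dst=$(find "$3" -type f -name "$4")
    # Count non-empty lines in each result; grep -c prints 0 on no match.
    n_src=$(printf '%s' "$src" | grep -c . || true)
    n_dst=$(printf '%s' "$dst" | grep -c . || true)
    if [ "$n_src" -ne 1 ] || [ "$n_dst" -ne 1 ]; then
        echo "expected 1 source and 1 target, found $n_src and $n_dst" >&2
        return 1
    fi
    cp -f "$src" "$dst"
}

# Usage (the target under /opt/rocm needs root, so run this from a root shell):
#   copy_single_match ./build/release/rocsolver-install/lib/ "librocsolver.so.*" \
#                     /opt/rocm/ "librocsolver.so*"
```

If either side matches zero or several files, nothing is copied and the counts are printed, which is easier to debug than a confusing cp error.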
# Re-enable conda environment
conda activate rocm-build
# Download PyTorch
sudo mkdir /opt/pytorch
sudo chown $LOGNAME: /opt/pytorch
git clone --recursive https://github.com/pytorch/pytorch.git -b v2.10.0 /opt/pytorch
cd /opt/pytorch
# Install required packages
pip install mkl-static mkl-include ninja -r requirements.txt
# Build PyTorch
export PYTORCH_ROCM_ARCH="gfx803;gfx900"
export PYTORCH_BUILD_VERSION=2.10.0 PYTORCH_BUILD_NUMBER=1
python tools/amd_build/build_amd.py
time python setup.py bdist_wheel
# Install PyTorch
pip install /opt/pytorch/dist/torch-2.10.0-cp312-cp312-linux_x86_64.whl
P.S. 1: PyTorch 2.4.1 and 2.7.1 compile successfully on ROCm 6.3.3, but not on ROCm 7.1.1.
P.S. 2: torchvision has to be built at version 0.25; its versions are paired with PyTorch versions too.
https://github.com/pytorch/vision
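For the PyTorch 2.x releases I know of, the pairing has followed a simple pattern: torch 2.N goes with torchvision 0.(N+15), which matches 2.10 and 0.25 here. The sketch below just encodes that observed pattern; the helper name torchvision_for_torch is made up, this is not an official rule, and the compatibility table in the torchvision repository is the thing to trust if the two disagree.

```shell
# Hypothetical helper: print the torchvision release line that pairs with a
# PyTorch 2.x version, assuming the observed "2.N <-> 0.(N+15)" pattern holds.
# $1 = PyTorch version, e.g. 2.10.0
torchvision_for_torch() {
    case "$1" in
        2.*) ;;
        *) echo "only PyTorch 2.x is covered" >&2; return 1 ;;
    esac
    minor=${1#2.}        # strip the leading "2."
    minor=${minor%%.*}   # drop the patch component, if any
    echo "0.$((minor + 15))"
}

# Usage:
#   torchvision_for_torch 2.10.0
```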
After that, follow the notes at https://github.com/NULL0xFF/rocm-gfx803?tab=readme-ov-file.
I'll post further versions here once I get them working.