tldr: The PSU blew (the 5V rail failed after PSU trips). Meanwhile, the previous platform could not finetune with the full dataset, so half of the parts have been changed.
v3 is being explored. CPU internal (CCD) latency is high on EPYC, which causes stalling when performing NCCL "all_reduce".
Some content is repeated from the previous version, which indicates that those parts remain unchanged.
How the parts are gathered
Major parts are from Taobao / AliExpress. The others are mostly from Carousell / DCFever and offline deals. The listings shown for the parts are taken from eBay / Amazon only as "proof of existence".
Part List with Details
CPU:
AMD EPYC 7282. One of the cheapest CPUs in the 7002 / 7003 series. Low core count but moderate-to-high frequency. Followed the official "low latency" guide for maximum training speed (still slower than X299).
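To get a feel for why per-hop latency hurts NCCL "all_reduce", here is a rough cost model of a ring all-reduce. It is only a sketch: the ring algorithm, the link bandwidth, the latency values and the message size are all assumptions for illustration, not measurements from this rig, and NCCL may pick a different algorithm.

```python
def ring_allreduce_seconds(num_gpus, message_bytes, link_bytes_per_s, hop_latency_s):
    """Textbook latency + bandwidth estimate for one ring all-reduce."""
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    steps = 2 * (num_gpus - 1)        # reduce-scatter phase + all-gather phase
    chunk = message_bytes / num_gpus  # each step moves one chunk per GPU
    return steps * (hop_latency_s + chunk / link_bytes_per_s)


# Hypothetical numbers: 4 GPUs, 1 MiB gradient bucket, 10 GB/s per link.
BUCKET_BYTES = 1024 * 1024
for latency_us in (5, 50, 500):
    t = ring_allreduce_seconds(4, BUCKET_BYTES, 10e9, latency_us * 1e-6)
    print(f"{latency_us:>4} us per hop -> {t * 1e3:.3f} ms per all_reduce")
```

With small buckets the latency term dominates quickly, which is consistent with a platform that has higher internal latency training slower even when the PCIE bandwidth is the same.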
Motherboard:
TYAN S8030. Picked the S8030GM2NE / "OOY" version due to the lower cost. See the chiphell post for a detailed review.
The dedicated PCIE power supply is the major reason for picking this board. PCIE lane signal interference is considered the culprit of the stalling training speed.
RAM:
8x DDR4 2933MHz 2Rx4 64GB RDIMM. 512GB in total.
A lot more expensive than the 2400MHz 4Rx4 64GB LRDIMM. Brands / part numbers are mixed: 2x Samsung M393A8G40MB2-CVF, 4x Hynix HMAA8GR7AJR4N-WM, 2x Micron 36ASF8G72PZ-2G9E1. The training process requires at least 288GB to operate; with *memory leaking* and *data workers* included, it consumes up to 480GB of memory. Swap files can be used, but there will be a performance impact.
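The memory numbers above can be checked with a few lines of arithmetic. This is just a sketch using the figures quoted in this section; the helper name `memory_budget` is made up for illustration.

```python
DIMM_COUNT = 8
DIMM_GB = 64
MIN_TRAINING_GB = 288   # minimum quoted above
PEAK_TRAINING_GB = 480  # peak quoted above, memory leak and data workers included


def memory_budget(dimm_count, dimm_gb, peak_gb):
    """Return (total GB, headroom GB at peak, headroom as a fraction of total)."""
    total = dimm_count * dimm_gb
    headroom = total - peak_gb
    return total, headroom, headroom / total


total, headroom, ratio = memory_budget(DIMM_COUNT, DIMM_GB, PEAK_TRAINING_GB)
print(f"total {total} GB, headroom at peak {headroom} GB ({ratio:.1%})")
print(f"leak + data workers: up to {PEAK_TRAINING_GB - MIN_TRAINING_GB} GB")
```

The headroom at peak is small, which is why a swap file is still useful as a safety net.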
GPU1:
Gigabyte RTX 3090 Gaming OC, modified with a 3080 Turbo heatsink to become a
Gigabyte 3090 Turbo. The mod took me many hours: most capacitors needed to be swapped, many connectors needed to be removed, even the BIOS switch needed to be shaved off, and finally the BIOS was swapped.
GPU2:
Colorful RTX 3090 OEM Blower. Obtained from a local deal. Close to brand new. Works fine. However, the factory BIOS mod makes the fan spin at around 60% when idle, which is loud. The others idle at 30%.
GPU3:
Zotac RTX 3090 OEM Blower "ZT-A30900A-10B". Originally listed on Goofish / Xianyu, but I eventually obtained it from a local deal. Condition was "looks good but overheats". A thermal paste / thermal pad refresh solved the problem.
GPU4:
Manli RTX 3090 OEM Blower. Same deal as the Zotac; even the condition / PCB / BIOS mod are identical. A thermal paste / thermal pad refresh solved the problem.
Parent company of both brands is PC Partner?
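To compare idle fan speeds and temperatures across the four cards, something like the following can be used. It is a minimal sketch: `nvidia-smi --query-gpu` with `fan.speed` and `temperature.gpu` is the real CLI, but the helper names, the 50% threshold and the sample text are made up for illustration (only the 30% / 60% fan values come from the notes above).

```python
import csv
import io
import subprocess

QUERY_FIELDS = "index,name,fan.speed,temperature.gpu"


def to_int(field):
    """Return an int, or None for values such as "[N/A]"."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    gpus = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != 4:
            continue  # skip blank or unexpected lines
        index, name, fan, temp = (field.strip() for field in record)
        gpus.append({"index": int(index), "name": name,
                     "fan_pct": to_int(fan), "temp_c": to_int(temp)})
    return gpus


def read_gpus():
    """Query the local driver; needs nvidia-smi on PATH."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


def loud_at_idle(gpus, threshold_pct=50):
    """Hypothetical helper: indices of GPUs whose fan is above the threshold."""
    return [g["index"] for g in gpus
            if g["fan_pct"] is not None and g["fan_pct"] > threshold_pct]


# Made-up sample in the nvidia-smi CSV layout (temperatures are placeholders).
SAMPLE = """0, NVIDIA GeForce RTX 3090, 30, 40
1, NVIDIA GeForce RTX 3090, 60, 42
"""
print(loud_at_idle(parse_gpu_csv(SAMPLE)))
```

On the actual machine, replace `parse_gpu_csv(SAMPLE)` with `read_gpus()`.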
SSD1:
Intel / Solidigm DC P4510 3.84TB. Storing dataset and training results.
SSD2:
Samsung PM863a 1.92TB. Storing WebUI, code, and logs. Note that it is SATA instead of U.2. My SSD may be faulty: it drops the connection (may not go offline) after hours of idle.
SSD3:
Toshiba 960GB OEM SATA SSD "THNSN8960PCSE". Obtained from a nice deal. The OS / code consume less than 64GB, and I assigned 512GB to the swap file, which lets training survive a longer period of memory leaking.
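To watch how far the leak has eaten into the swap file during a long run, `/proc/meminfo` can be parsed directly. This is a sketch for Linux only; the helper names are hypothetical, and the sample text is made up (only the 512GB swap size matches the setup above).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def swap_used_gib(info):
    """Swap currently in use, in GiB."""
    return (info["SwapTotal"] - info["SwapFree"]) / (1024 ** 2)


def read_meminfo(path="/proc/meminfo"):
    """Read the live values on a Linux host."""
    with open(path) as f:
        return parse_meminfo(f.read())


# Made-up sample: a 512 GiB swap file that is half used.
SAMPLE_MEMINFO = """SwapTotal:      536870912 kB
SwapFree:       268435456 kB
"""
print(f"swap used: {swap_used_gib(parse_meminfo(SAMPLE_MEMINFO)):.1f} GiB")
```

On the training machine, call `swap_used_gib(read_meminfo())` periodically and restart the run before the swap file fills up.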
SSD4:
2x WD SN750 2TB, each with a generic M.2 to 2.5" U.2 case. They were meant for a giant 4TB swap file, but I found that PCIE signal interference stalled training speed even when they were inactive.
PSU:
Great Wall 2000W fully modular PSU "GW-EPS2000BL". A "premium mining PSU" replacing the blown PSU. It has 10x 12V rails, so there is no more single-rail overloading.
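A rough power budget shows how much margin a 2000W unit leaves. Everything except the PSU rating and the GPU count is an assumption here: 350W per card is the reference RTX 3090 board power (the modded BIOS on these cards may differ), 120W is the EPYC 7282 TDP, and 100W for the board, RAM, SSDs and fans is a pure guess. Transient spikes are not modelled.

```python
def power_budget(psu_watts, gpu_count, gpu_watts, cpu_watts, other_watts):
    """Return (estimated load W, headroom W, load as a fraction of PSU rating)."""
    load = gpu_count * gpu_watts + cpu_watts + other_watts
    return load, psu_watts - load, load / psu_watts


# Assumed figures, see the notes above; not measured on this rig.
load, headroom, fraction = power_budget(
    psu_watts=2000, gpu_count=4, gpu_watts=350, cpu_watts=120, other_watts=100)
print(f"estimated load {load} W, headroom {headroom} W ({fraction:.0%} of rating)")
```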
Frame:
Unbranded EATX aluminum open bench table. I searched through many pages on
Taobao for 4-card support, which is no longer common. Since it is sheet based instead of column based, it is still more expensive than an ordinary PC case. The common choice would be a mining rig frame with a server board, but I don't have the space.
(v2) Still 4 GPUs in the same case, but the platform changed.
(v2) Special treatment to route the front panel pins.
(v2) Temperature from thermal camera.
(v2) Board view (buttons in the middle).
(v2) Finetune in progress (btop).
(v2) Finetune in progress (nvtop).