Lead Generative AI Infrastructure Engineer (Remote Eligible)
San Jose
...work on: Deploy a thousand-node training cluster in our public cloud, optimizing the storage and networking stack, with tightly coupled training pipelines that take advantage of multiple parallelism strategies. Design and build fault-tolerant infrastructure to support long-running, large-scale training tasks reliably despite failure [...]
Category: Engineering & Architecture