When you send a message to ChatGPT, it travels over the internet to OpenAI's servers, which stream the response back to you as the answer is being generated. It's a simple concept, but under the hood there's a lot of complexity. And it's not just you sending that message -- over 300 million other people are also using ChatGPT every week. So if a single LLM instance can only process a couple hundred messages at a time, how does OpenAI do it?
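To make that round trip concrete, here's a minimal sketch of the streaming half using the OpenAI Python SDK. The model name and prompt are just placeholders; the point is that the server sends tokens back incrementally rather than waiting for the full answer.

```python
# A minimal sketch of the round trip described above: one request goes
# out, and tokens stream back over the connection as they're generated.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Explain LLM inference in one sentence."}],
    stream=True,  # ask the server to stream partial results
)

# Each chunk carries the next few tokens of the answer. Printing them
# as they arrive is what produces the "typing" effect in the ChatGPT UI.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```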
This article kicks off our mini-series on post-training AI infrastructure -- what happens after a model has been trained. It's meant for people who are interested in AI but might not have a full picture of everything that goes into deploying it. We'll explore how MLOps and post-training engineers take a trained model and squeeze every bit of performance out of it to serve millions of users simultaneously. We're starting with deployment basics today; in future articles, we'll build from these foundations up to a sophisticated inference infrastructure.