Part 5: Building the Inference API

Your LLM twin works locally. Let's make it callable from anywhere.

FastAPI in 30 Lines
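You'll need FastAPI and an ASGI server on top of what you already have from the earlier parts; something along these lines should cover it (exact packages and versions are up to your setup):

pip install fastapi "uvicorn[standard]"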

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Load the fine-tuned model once at startup, not on every request
twin = pipeline('text-generation', model='./my-llm-twin')

class GenerateRequest(BaseModel):
    prompt: str
    max_length: int = 200

@app.post("/generate")
def generate(request: GenerateRequest):
    # The pipeline returns a list of results; take the first one
    output = twin(request.prompt, max_length=request.max_length)
    return {"text": output[0]['generated_text']}

# Run: uvicorn main:app --reload

Test:

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "My thoughts on coding:"}'

Done! Your twin is now an API.

Streaming Responses (ChatGPT-style)

The pipeline call above returns the whole completion at once, so for streaming we drop down a level: a TextIteratorStreamer yields tokens as generate() produces them in a background thread, and we forward each one as a server-sent event.

from threading import Thread
from fastapi.responses import StreamingResponse
from transformers import TextIteratorStreamer

@app.post("/generate/stream")
async def generate_stream(request: GenerateRequest):
    # Yields decoded tokens as they are generated, skipping the echoed prompt
    streamer = TextIteratorStreamer(twin.tokenizer, skip_prompt=True, skip_special_tokens=True)
    inputs = twin.tokenizer(request.prompt, return_tensors="pt").to(twin.model.device)
    # generate() blocks, so run it in a background thread and read from the streamer
    Thread(target=twin.model.generate, kwargs={**inputs, "max_length": request.max_length, "streamer": streamer}).start()

    def generate_tokens():
        for token in streamer:
            yield f"data: {token}\n\n"

    return StreamingResponse(generate_tokens(), media_type="text/event-stream")
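
To watch the stream from the terminal, curl's -N flag disables output buffering so the tokens print as they arrive:

curl -N -X POST http://localhost:8000/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt": "My thoughts on coding:"}'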

Next Week

Part 6: Docker + Cloud deployment. Ship your twin to production!


Series Progress: Part 5 of 6 ✓ Next: Part 6 - Production Deployment (Aug 24)