Part 5: Building the Inference API

Your LLM twin works locally. Let's make it callable from anywhere.

FastAPI in 30 Lines
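You'll need FastAPI and an ASGI server on top of what you already have from the earlier parts; something along these lines should cover it (exact packages and versions are up to your setup):

pip install fastapi "uvicorn[standard]"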

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Load the fine-tuned model once at startup, not on every request
twin = pipeline('text-generation', model='./my-llm-twin')

class GenerateRequest(BaseModel):
    prompt: str
    max_length: int = 200

@app.post("/generate")
def generate(request: GenerateRequest):
    # The pipeline returns a list of results; take the first one
    output = twin(request.prompt, max_length=request.max_length)
    return {"text": output[0]['generated_text']}

# Run: uvicorn main:app --reload

Test:

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "My thoughts on coding:"}'

Done! Your twin is now an API.

Streaming Responses (ChatGPT-style)

The pipeline call above returns the whole completion at once, so for streaming we drop down a level: a TextIteratorStreamer yields tokens as generate() produces them in a background thread, and we forward each one as a server-sent event.

from threading import Thread
from fastapi.responses import StreamingResponse
from transformers import TextIteratorStreamer

@app.post("/generate/stream")
async def generate_stream(request: GenerateRequest):
    # Yields decoded tokens as they are generated, skipping the echoed prompt
    streamer = TextIteratorStreamer(twin.tokenizer, skip_prompt=True, skip_special_tokens=True)
    inputs = twin.tokenizer(request.prompt, return_tensors="pt").to(twin.model.device)
    # generate() blocks, so run it in a background thread and read from the streamer
    Thread(target=twin.model.generate, kwargs={**inputs, "max_length": request.max_length, "streamer": streamer}).start()

    def generate_tokens():
        for token in streamer:
            yield f"data: {token}\n\n"

    return StreamingResponse(generate_tokens(), media_type="text/event-stream")
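
To watch the stream from the terminal, curl's -N flag disables output buffering so the tokens print as they arrive:

curl -N -X POST http://localhost:8000/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt": "My thoughts on coding:"}'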

Next Week

Part 6: Docker + Cloud deployment. Ship your twin to production!


Series Progress: Part 5 of 6 ✓ Next: Part 6 - Production Deployment (Aug 24)