Build an LLM Twin: Part 5 - Building the Inference API
Your LLM twin works locally. Let's make it callable from anywhere.
FastAPI in 30 Lines
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load the fine-tuned twin once at startup, not on every request
twin = pipeline('text-generation', model='./my-llm-twin')

class GenerateRequest(BaseModel):
    prompt: str
    max_length: int = 200

@app.post("/generate")
def generate(request: GenerateRequest):
    output = twin(request.prompt, max_length=request.max_length)
    return {"text": output[0]['generated_text']}

# Run: uvicorn main:app --reload
Test:
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "My thoughts on coding:"}'
Done! Your twin is now an API.
Streaming Responses (ChatGPT-style)
Hugging Face's TextIteratorStreamer hands tokens back as generate() produces them, so we can forward each one as a Server-Sent Event:

from threading import Thread
from fastapi.responses import StreamingResponse
from transformers import TextIteratorStreamer

@app.post("/generate/stream")
async def generate_stream(request: GenerateRequest):
    # Run generation in a background thread and stream tokens as they arrive
    streamer = TextIteratorStreamer(twin.tokenizer, skip_prompt=True, skip_special_tokens=True)
    inputs = twin.tokenizer(request.prompt, return_tensors="pt")
    Thread(target=twin.model.generate, kwargs={**inputs, "max_length": request.max_length, "streamer": streamer}).start()
    def generate_tokens():
        for token in streamer:
            yield f"data: {token}\n\n"
    return StreamingResponse(generate_tokens(), media_type="text/event-stream")
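Test the stream from Python with a minimal client sketch; this assumes the server is still running locally and uses requests with stream=True:

import requests

with requests.post(
    "http://localhost:8000/generate/stream",
    json={"prompt": "My thoughts on coding:"},
    stream=True,
) as response:
    for line in response.iter_lines():
        if line.startswith(b"data: "):
            print(line[len(b"data: "):].decode(), end="", flush=True)

You can also watch it in the terminal by adding -N (disable buffering) to the earlier curl command and pointing it at /generate/stream.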
Next Week
Part 6: Docker + Cloud deployment. Ship your twin to production!
Series Progress: Part 5 of 6 ✓ Next: Part 6 - Production Deployment (Aug 24)