Building Production-Grade RAG Pipelines with Node.js, OpenAI, and Pinecone
Published 2026-04-13 · 9 min read
Most developers first encounter Retrieval-Augmented Generation (RAG) through a Python tutorial — a Jupyter notebook, a few lines of LangChain, and suddenly their PDF is "talking." That demo works. But the moment you try to take that same pattern into a production Node.js backend serving real users, you run into a completely different class of problems: chunking strategies that affect retrieval quality, embedding costs that spiral if not managed, namespace isolation when multiple users upload documents, cold-start latency on vector queries, and the question of how to structure your API so the whole thing is actually testable and maintainable.
This article is not a "hello world" RAG tutorial. It is a breakdown of the architecture, decisions, and tradeoffs I worked through while building a production RAG pipeline in Node.js and TypeScript — one that handles PDF ingestion, OpenAI embeddings, Pinecone vector storage, semantic retrieval, and structured JSON output as a REST API with JWT authentication and Docker deployment.
What RAG Actually Is (And What It Is Not)
Before going into implementation, it is worth being precise about what RAG solves, because a lot of developers conflate it with fine-tuning or with plain prompt stuffing.
When you pass an entire document directly into an LLM's context window, you are doing prompt stuffing. It works for small documents, but it is expensive, slow, and hits token limits fast. Fine-tuning, on the other hand, bakes knowledge directly into model weights — it is costly, time-consuming, and difficult to update when your data changes.
RAG sits between these two approaches. The core idea is simple: instead of giving the LLM everything, you first retrieve only the most semantically relevant fragments of your document, and then give the LLM just those fragments as context alongside the user's question. The LLM does not need to "know" your document — it just needs to reason over the pieces you hand it at query time.
This means your document knowledge lives in a vector database, not in the model. It is updatable, isolated per user if needed, and far cheaper to operate at scale.
The Full Pipeline — An Overview
The pipeline I built has two distinct phases, and understanding this separation is fundamental to keeping the architecture clean.
The ingestion phase runs once per document and is entirely about preparation. It takes a PDF, extracts raw text, splits that text into overlapping chunks, generates vector embeddings for each chunk using OpenAI's embedding API, and stores those vectors in Pinecone. This phase is triggered by a document upload endpoint and runs asynchronously.
The retrieval phase runs on every user query. It takes the user's question, generates an embedding for that question using the same OpenAI model, queries Pinecone for the top-K most semantically similar chunks, assembles those chunks into a structured prompt, sends that prompt to the GPT completion API, and returns a structured JSON response containing the answer, a match score, identified skill gaps, and actionable suggestions.
The reason this separation matters in production is that ingestion is expensive and slow (multiple API calls, database writes) while retrieval needs to be fast (ideally under two seconds end-to-end). If you mix them, you end up with endpoints that are unpredictably slow and hard to optimize independently.
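To make the retrieval step concrete, here is a minimal in-memory sketch of "top-K most semantically similar chunks" followed by context assembly. It is an illustration only: Pinecone runs the similarity search server-side over an index, and the names `StoredChunk`, `topK`, and `buildContext` are hypothetical, not part of any SDK.

```typescript
// Illustration only: an in-memory stand-in for the vector query step.
// StoredChunk, topK and buildContext are hypothetical names.
interface StoredChunk {
  id: string;
  text: string;
  vector: number[];
}

// Cosine similarity: dot product divided by the product of the norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query vector and keep the k best.
function topK(
  query: number[],
  chunks: StoredChunk[],
  k: number
): Array<StoredChunk & { score: number }> {
  return chunks
    .map((chunk) => ({ ...chunk, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Number the retrieved fragments so the prompt can present them as context.
function buildContext(matches: Array<{ text: string }>): string {
  return matches.map((m, i) => `[${i + 1}] ${m.text}`).join("\n\n");
}
```

A brute-force scan like this is only practical for a handful of vectors; a vector database exists precisely so that the same question can be answered quickly over a large index.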
Chunking Strategy — Where Most RAG Implementations Go Wrong
Text chunking is the step that most tutorials either skip or handle naively, and it is the single biggest factor affecting retrieval quality. If your chunks are too large, you waste tokens and reduce retrieval precision. If they are too small, you lose semantic context and the retrieved fragments do not make sense in isolation.
The approach that worked best in practice is fixed-size chunking with overlap. I used chunks of 500 tokens with a 50-token overlap between consecutive chunks. The overlap is critical — it ensures that a sentence or concept that straddles a chunk boundary is not lost. Without overlap, you can end up in a situation where the most relevant sentence in the document falls exactly at a boundary and gets split between two chunks, neither of which retrieves as the top result.
In TypeScript, the chunking logic looks conceptually like this (it counts whitespace-separated words as a rough stand-in for tokens):
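// Split text into overlapping chunks for better semantic coverage.
// chunkSize controls how much text each vector represents.
// overlap ensures boundary sentences are captured in both adjacent chunks.
// Sizes are counted in words, which only approximates the token count.
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  // Guard against a step of zero or less, which would loop forever
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError('Require chunkSize > 0 and 0 <= overlap < chunkSize');
  }
  // Split on any run of whitespace so newlines and tabs left over from
  // PDF extraction do not produce empty or fused "words"
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + chunkSize, words.length);
    // Join the slice back into a readable string for embedding
    chunks.push(words.slice(start, end).join(' '));
    // Stop once the last word is included; otherwise the loop would emit
    // a trailing chunk that is fully contained in the previous one
    if (end === words.length) break;
    // Move forward by (chunkSize - overlap) so the next chunk
    // re-includes the last `overlap` words of this chunk
    start += chunkSize - overlap;
  }
  return chunks;
}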
One production concern here: OpenAI's embedding endpoint (text-embedding-3-small in this pipeline) caps the number of tokens per input and enforces rate limits per account. If your documents are dense and produce many chunks, batch your embedding requests rather than firing them all simultaneously. Otherwise you will hit rate limits, and because ingestion runs in the background, the failure is easy to miss unless you log it and record a failed status.
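A minimal sketch of that batching, assuming the embedding call is injected as a function. `EmbedBatch`, `toBatches`, and `embedInBatches` are hypothetical names, and the batch size, retry count, and delay are placeholder values rather than recommendations from OpenAI.

```typescript
// Hypothetical signature: wraps whatever call returns one embedding per
// input string (for example a request to an embeddings endpoint).
type EmbedBatch = (inputs: string[]) => Promise<number[][]>;

// Split a list into consecutive groups of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Embed batches one after another, retrying a failed batch with
// exponential backoff and rethrowing once the retries are exhausted.
async function embedInBatches(
  chunks: string[],
  embedBatch: EmbedBatch,
  batchSize = 64, // assumed value, tune to your rate limits
  maxRetries = 3,
  baseDelayMs = 500
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    for (let attempt = 0; ; attempt++) {
      try {
        vectors.push(...(await embedBatch(batch)));
        break;
      } catch (err) {
        // Surface the error instead of swallowing it, so the caller
        // can mark the document as failed.
        if (attempt >= maxRetries) throw err;
        const delay = baseDelayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  return vectors;
}
```

Processing batches sequentially trades some ingestion speed for predictable request rates; because ingestion already runs asynchronously, that trade is usually acceptable.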
Pinecone Namespace Isolation — Critical for Multi-User Systems
If you are building a system where multiple users upload their own documents, namespace isolation in Pinecone is not optional — it is the difference between a multi-tenant system and a security incident.
Pinecone namespaces allow you to logically partition your vector index so that a query in one namespace never touches vectors in another. The implementation is straightforward but easy to forget:
// Each document gets its own namespace — derived from the document's
// unique ID in MongoDB. This guarantees one user's resume vectors
// can never surface in another user's semantic query.
const namespace = `doc_${documentId}`;
await pineconeIndex.namespace(namespace).upsert(vectors);
When querying, you pass the same namespace derived from the document ID the user is currently working with. This gives you strong isolation with zero cross-contamination between users' data. In an enterprise context, this is also your primary mechanism for achieving data tenancy — one namespace per organisation or per document, depending on your access model.
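On the query side, the namespace only isolates users if the server also checks that the requested document belongs to the caller. The sketch below shows that check under stated assumptions: `DocumentRecord` stands in for the MongoDB document metadata, and `NamespacedIndex` is a hypothetical thin wrapper around the vector client, not the Pinecone SDK's own interface.

```typescript
// Hypothetical shapes used only for this sketch.
interface DocumentRecord {
  id: string;
  ownerId: string;
}

interface QueryMatch {
  id: string;
  score: number;
  text: string;
}

// Assumed wrapper around the vector store: one query scoped to a namespace.
interface NamespacedIndex {
  query(namespace: string, vector: number[], topK: number): Promise<QueryMatch[]>;
}

// Same naming rule as at ingestion time, kept in one place so that
// upserts and queries cannot drift apart.
function namespaceFor(documentId: string): string {
  return `doc_${documentId}`;
}

// Refuse to touch the vector store unless the caller owns the document.
async function queryOwnedDocument(
  userId: string,
  doc: DocumentRecord,
  index: NamespacedIndex,
  vector: number[],
  topK = 5
): Promise<QueryMatch[]> {
  if (doc.ownerId !== userId) {
    throw new Error("document does not belong to this user");
  }
  return index.query(namespaceFor(doc.id), vector, topK);
}
```

Keeping the namespace derivation in a single function means a change to the naming scheme cannot silently separate stored vectors from the queries that look for them.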
The Retrieval Prompt — Structured Output Is Non-Negotiable
One of the biggest mistakes in production RAG systems is treating the LLM's output as free-form text that you then try to parse with regex or string manipulation. This is fragile, unpredictable, and breaks the moment the model's phrasing changes slightly.
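The correct approach is to instruct the model explicitly to return structured JSON, and to define the exact shape of that JSON in your system prompt. Your application then parses the response and validates it at runtime against a matching TypeScript interface. TypeScript types are erased at compile time and JSON.parse returns an untyped value, so the compiler cannot check the model's output on its own; once the response has passed validation, the rest of your code can rely on the typed object.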
const systemPrompt = `
You are an expert resume analyst. You will receive resume content as context
and a job description as the query. Analyse the match and respond ONLY with
a valid JSON object in this exact shape — no preamble, no markdown fences:
{
"matchScore": number between 0 and 100,
"summary": "two sentence overview of the candidate fit",
"skillGaps": ["array", "of", "missing", "skills"],
"suggestions": ["array", "of", "actionable", "improvement", "steps"]
}
`;
By making "respond ONLY with a valid JSON object" an explicit instruction and defining the exact schema inline, you eliminate the parsing guesswork. In practice I also wrap the JSON.parse call in a try-catch with a structured fallback, because LLMs do occasionally produce malformed or truncated JSON, and a 500 error on a parsing failure is not an acceptable user experience.
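A minimal sketch of that parse-and-fallback step, assuming the response shape from the system prompt above. The names `ResumeAnalysis`, `ParseResult`, and `parseAnalysis` are hypothetical; a schema library could replace the hand-written checks.

```typescript
// Mirrors the JSON shape requested in the system prompt.
interface ResumeAnalysis {
  matchScore: number;
  summary: string;
  skillGaps: string[];
  suggestions: string[];
}

// Structured result so callers never have to catch a parsing exception.
type ParseResult =
  | { ok: true; value: ResumeAnalysis }
  | { ok: false; error: string };

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

function parseAnalysis(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "response was not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, error: "response was not a JSON object" };
  }
  const { matchScore, summary, skillGaps, suggestions } = data as Record<string, unknown>;
  if (typeof matchScore !== "number" || matchScore < 0 || matchScore > 100) {
    return { ok: false, error: "matchScore must be a number between 0 and 100" };
  }
  if (typeof summary !== "string") {
    return { ok: false, error: "summary must be a string" };
  }
  if (!isStringArray(skillGaps) || !isStringArray(suggestions)) {
    return { ok: false, error: "skillGaps and suggestions must be string arrays" };
  }
  return { ok: true, value: { matchScore, summary, skillGaps, suggestions } };
}
```

When `ok` is false, the endpoint can return a controlled error payload or retry the completion once, rather than letting an exception surface as a 500.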
API Architecture — Three Layers, Clean Boundaries
The REST API that exposes this pipeline follows the same three-layer architecture I use for all backend systems: controllers handle HTTP, services contain business logic, and repositories manage external I/O (database, vector store, third-party APIs). This separation makes the codebase testable — you can unit test the service layer by mocking the repository, without spinning up Pinecone or making real OpenAI calls in your test suite.
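The testability claim is easiest to see in code. The following is a sketch only: `AnalysisRepository` and `AnalysisService` are hypothetical names, and the prompt layout is an assumption, not the exact prompt used in the project.

```typescript
// Repository layer: the only place that talks to external systems.
// In production this would wrap the embedding, vector store and
// completion clients; in tests it is replaced with a fake.
interface AnalysisRepository {
  embed(text: string): Promise<number[]>;
  findSimilarChunks(documentId: string, vector: number[], topK: number): Promise<string[]>;
  complete(systemPrompt: string, userPrompt: string): Promise<string>;
}

// Service layer: orchestrates the retrieval pipeline without knowing
// which vendors sit behind the repository.
class AnalysisService {
  private readonly repo: AnalysisRepository;
  private readonly systemPrompt: string;

  constructor(repo: AnalysisRepository, systemPrompt: string) {
    this.repo = repo;
    this.systemPrompt = systemPrompt;
  }

  async analyze(documentId: string, jobDescription: string, topK = 5): Promise<string> {
    const queryVector = await this.repo.embed(jobDescription);
    const chunks = await this.repo.findSimilarChunks(documentId, queryVector, topK);
    const userPrompt =
      `Resume context:\n${chunks.join("\n---\n")}\n\n` +
      `Job description:\n${jobDescription}`;
    return this.repo.complete(this.systemPrompt, userPrompt);
  }
}
```

A unit test passes an object literal that implements `AnalysisRepository` with canned return values, so the service logic runs with no network access. The controller then only maps HTTP requests to `analyze` calls.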
The two primary endpoints are clean and purposeful.
POST /api/documents/upload accepts a PDF multipart upload, runs the ingestion pipeline asynchronously (extract → chunk → embed → store), and returns a document ID immediately so the client does not wait.
POST /api/documents/:id/analyze accepts a job description in the request body, runs the retrieval pipeline synchronously, and returns the structured JSON analysis.
Both endpoints are protected by JWT middleware that validates the token and extracts the user ID, which is then used to confirm that the requested document belongs to that user before its Pinecone namespace is queried.
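The "return a document ID immediately" behaviour of the upload endpoint can be sketched without any web framework. Everything here is hypothetical (`createIngestionTracker`, the status values, the in-memory maps); a real deployment would persist status in the database and would more likely hand the work to a job queue.

```typescript
type IngestionStatus = "processing" | "ready" | "failed";

// ingest: runs extract, chunk, embed and store for one document.
// newId: produces a fresh document ID (for example a database ID).
function createIngestionTracker(
  ingest: (documentId: string, pdf: Uint8Array) => Promise<void>,
  newId: () => string
) {
  // In-memory stand-ins for persisted status, for illustration only.
  const statuses = new Map<string, IngestionStatus>();
  const pending = new Map<string, Promise<void>>();

  // Kick off ingestion in the background and return the ID right away.
  function start(pdf: Uint8Array): string {
    const documentId = newId();
    statuses.set(documentId, "processing");
    const job = Promise.resolve()
      .then(() => ingest(documentId, pdf))
      .then(() => {
        statuses.set(documentId, "ready");
      })
      .catch(() => {
        // Record the failure so it is visible to the client.
        statuses.set(documentId, "failed");
      });
    pending.set(documentId, job);
    return documentId;
  }

  function status(documentId: string): IngestionStatus | undefined {
    return statuses.get(documentId);
  }

  // Mainly useful in tests: resolves once the background job has settled.
  function whenSettled(documentId: string): Promise<void> {
    return pending.get(documentId) ?? Promise.resolve();
  }

  return { start, status, whenSettled };
}
```

The upload controller would call `start` and respond with the returned ID, while the analyze endpoint can refuse to run until the status is "ready".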
Docker Deployment — The Part Tutorials Always Skip
A RAG pipeline that only runs on your laptop is not a RAG pipeline — it is a prototype. For production deployment, the entire application needs to be containerised so it behaves identically across environments.
The key consideration in the Dockerfile for a Node.js AI application is that your node_modules will be large (OpenAI SDK, Pinecone SDK, pdf-parse, and their dependencies add up), so a multi-stage build is worth the effort to keep your final image lean. You also want to ensure that the OPENAI_API_KEY, PINECONE_API_KEY, and PINECONE_INDEX environment variables are never baked into the image. They come in at runtime via Docker environment variables or AWS Secrets Manager in a cloud deployment.
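On the application side, runtime-injected secrets pair well with a fail-fast check at startup. A minimal sketch, assuming the three variable names above; `AppConfig` and `loadConfig` are hypothetical names.

```typescript
interface AppConfig {
  openaiApiKey: string;
  pineconeApiKey: string;
  pineconeIndex: string;
}

// Reads configuration from an environment-like object and fails fast,
// listing every missing variable, instead of failing later mid-request.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const required = ["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_INDEX"] as const;
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    openaiApiKey: env.OPENAI_API_KEY as string,
    pineconeApiKey: env.PINECONE_API_KEY as string,
    pineconeIndex: env.PINECONE_INDEX as string,
  };
}
```

At startup the application would call `loadConfig(process.env)` once, so a container started without its secrets exits immediately with a readable message.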
Lessons From Running This in Production
Three things surprised me that no tutorial mentioned. First, PDF extraction quality varies enormously by PDF type. Text-based PDFs parse cleanly with pdf-parse. Scanned PDFs — which are essentially images — return garbage without an OCR pre-processing step. Know your input type before choosing your extraction library.
Second, embedding costs are predictable if you track them, but invisible if you do not. OpenAI charges per token for embeddings. On a small scale this is negligible, but if you are embedding large documents for many users, you need to implement a caching layer (Redis works well) so that a document that has already been ingested does not get re-embedded on every upload. Store the embedding status in MongoDB and skip ingestion if the document is already processed.
Third, Pinecone query latency on a starter plan can spike under load. For a production system handling concurrent users, pre-warming your Pinecone index and, if latency is still a problem, moving from a serverless index to a dedicated pod can noticeably improve the p95 latency of your retrieval endpoint.
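For the second lesson, one way to avoid re-embedding is to key the ingestion status on a hash of the extracted text. This is a sketch under assumptions: `EmbeddingStatusStore` is a hypothetical interface that Redis or a MongoDB collection could implement, and hashing the full text is one possible key choice among several.

```typescript
import { createHash } from "node:crypto";

// Hypothetical store for ingestion status, keyed by content hash.
interface EmbeddingStatusStore {
  get(key: string): Promise<string | undefined>;
  set(key: string, status: string): Promise<void>;
}

// SHA-256 of the extracted text, as a hex string.
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Runs the expensive ingestion only when this exact content has not
// already been embedded.
async function ingestOnce(
  text: string,
  store: EmbeddingStatusStore,
  ingest: (text: string) => Promise<void>
): Promise<"skipped" | "ingested"> {
  const key = contentHash(text);
  if ((await store.get(key)) === "ready") {
    return "skipped";
  }
  await ingest(text);
  // Mark as ready only after ingestion succeeds, so a failed run is retried.
  await store.set(key, "ready");
  return "ingested";
}
```

With per-document namespaces, a hash hit still needs a decision about where the existing vectors live, for example by storing the original document ID next to the status or by prefixing the key with the user ID.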
Final Thoughts
RAG is not a feature you bolt onto a backend — it is an architectural pattern with its own pipeline, its own failure modes, and its own cost model. The implementation described here is production-ready in the sense that it handles multi-user isolation, structured output, testable service layers, and containerised deployment — not just a happy-path demo.
If you are a backend engineer looking to add AI/LLM integration to your skill set, RAG is the highest-leverage starting point because it draws directly on skills you already have: REST API design, database management, authentication, and cloud deployment. The AI layer is just one more external service you are integrating — and that is exactly the framing that makes it tractable.
The full project is available on my portfolio at mfhassan.work.
Md Faizan Hassan is a Senior Backend Engineer with 5+ years of experience in Node.js, TypeScript, AWS Microservices, and AI/LLM backend systems. Open to Senior Backend and AI-Integrated Backend roles.