Case Studies
Engineering writeups from production AI systems.
Building a Hybrid Resume Search Engine
Apolis · 2025–2026 · Enterprise Search · LLM Fine-Tuning
Recruiters needed to search 2.7M+ resumes to find the right candidates for job openings. Boolean keyword search — the standard approach — missed semantic matches. A query for "machine learning engineer" would skip candidates who wrote "ML researcher" or "deep learning specialist." At scale, this meant real candidates were invisible to the platform.
Scale added a second constraint: sub-second response times with high precision across millions of documents. The system needed to be fast enough for interactive search while being semantically rich enough to surface genuinely relevant candidates.
The system is a dual-index hybrid retrieval pipeline with an ML ranking layer on top. Each resume flows through extraction, gets indexed in two forms, and retrieval merges both signals before final ranking.
Ingestion layer: Resumes arrive in multiple formats (PDF, DOCX, plain text). A preprocessing pipeline normalises formatting, strips noise, and segments documents into structured regions (skills, experience, education).
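The segmentation step can be sketched in a few lines. This is an illustrative heading-matching approach, not the production implementation (the real pipeline also handles PDF/DOCX extraction and noise stripping before this point):

```python
import re

# Hypothetical heading patterns; the production pipeline recognises many
# more variants and handles multi-column layouts.
SECTION_HEADINGS = {
    "skills": re.compile(r"^\s*(skills|technical skills)\s*:?\s*$", re.I),
    "experience": re.compile(r"^\s*(experience|work experience|employment)\s*:?\s*$", re.I),
    "education": re.compile(r"^\s*(education)\s*:?\s*$", re.I),
}

def segment_resume(text: str) -> dict[str, str]:
    """Split normalised resume text into labelled regions by heading."""
    sections: dict[str, list[str]] = {}
    current = "header"  # text before the first recognised heading
    for line in text.splitlines():
        matched = next(
            (name for name, pat in SECTION_HEADINGS.items() if pat.match(line)),
            None,
        )
        if matched:
            current = matched
            continue
        sections.setdefault(current, []).append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}
```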
LLM extraction: A fine-tuned Qwen2.5-3B model extracts structured JSON from each resume — job titles, skills, years of experience, location, seniority level. This structured data feeds both the BM25 index (as keyword-rich fields) and the embedding model.
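The shape of the structured record looks roughly like the sketch below. Field names and the validation logic are assumptions for illustration; the production schema may differ:

```python
import json
from dataclasses import dataclass

@dataclass
class ResumeRecord:
    """Illustrative structured record emitted by the extraction model."""
    job_titles: list[str]
    skills: list[str]
    years_experience: float
    location: str
    seniority: str

def parse_extraction(raw: str) -> ResumeRecord:
    """Parse and lightly coerce the model's JSON output before indexing."""
    data = json.loads(raw)
    return ResumeRecord(
        job_titles=[str(t) for t in data.get("job_titles", [])],
        skills=[str(s) for s in data.get("skills", [])],
        years_experience=float(data.get("years_experience", 0.0)),
        location=str(data.get("location", "")),
        seniority=str(data.get("seniority", "")),
    )
```

Coercing types defensively at this boundary matters because downstream code (BM25 field mapping, embedding input) assumes clean values even when the model's JSON is slightly off.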
Dual indexing: Each resume is indexed in Elasticsearch as both a BM25 document (full-text and structured fields) and as a dense embedding vector. Hybrid retrieval queries both indexes in parallel and merges results using Reciprocal Rank Fusion (RRF).
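RRF itself is simple enough to show in full. A minimal sketch (in production the fusion can also run server-side in Elasticsearch):

```python
def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists with Reciprocal Rank Fusion.

    Each document's fused score is the sum of 1 / (k + rank) over every
    list it appears in (rank is 1-based); k=60 is the common default.
    """
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The appeal of RRF is that it needs no score normalisation: BM25 scores and cosine similarities live on incompatible scales, but ranks are always comparable.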
ML ranking: A lightweight ranking model re-scores merged candidates using additional signals — skills overlap, experience match, location proximity, seniority alignment. Final results are returned ranked by composite score.
(a) Scale: indexing 2.7M documents. Processing millions of resumes synchronously is impractical, so the solution was a batched async pipeline using Celery workers with Redis as the task queue. Documents are processed in parallel across worker nodes, with checkpointing so a failed run resumes where it left off instead of reprocessing completed work. Embedding generation is the most expensive step; batching requests through vLLM and running them with GPU concurrency cut per-document latency significantly.
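The checkpointing logic reduces to a small driver loop. This sketch keeps the checkpoint in memory for clarity; in the real pipeline each batch is a Celery task and the checkpoint lives in Redis:

```python
def run_batches(doc_ids, process_batch, checkpoint, batch_size=500):
    """Drive checkpointed batch processing.

    `checkpoint` is a set of batch start offsets that already completed,
    so a crashed run resumes without reprocessing finished batches.
    """
    for start in range(0, len(doc_ids), batch_size):
        if start in checkpoint:
            continue  # done in a previous run
        process_batch(doc_ids[start:start + batch_size])
        checkpoint.add(start)
```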
(b) Extraction accuracy. Off-the-shelf LLMs (GPT-4, Claude) performed reasonably on clean resumes but degraded on noisy or non-standard formats. The solution was fine-tuning Qwen2.5-3B with QLoRA (4-bit NF4 quantization) using a teacher-student approach: GPT-4 generated high-quality structured extractions as training targets, and Qwen was fine-tuned to replicate this behavior at a fraction of the inference cost. The fine-tuned model consistently outperformed zero-shot GPT-3.5 on extraction precision.
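The QLoRA setup looks roughly like the configuration below. The quantization choices (4-bit NF4) match the writeup; the LoRA hyperparameters and target modules are illustrative placeholders, not the production values:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen Qwen2.5-3B base weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Low-rank adapters on the attention projections; r, alpha, and the
# target module list are assumptions for illustration.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```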
(c) Ranking signal design. Retrieval returns candidates who are semantically similar to the query. Ranking determines which of those candidates actually fits the job. The ranking model blends: BM25 relevance score (keyword precision), embedding similarity (semantic match), structured field overlap (skills, role title, years of experience), and location signals. Weights were calibrated against a holdout set of recruiter-labeled candidate-job pairs.
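The blend can be sketched as a weighted sum over normalised features. The weights below are hypothetical stand-ins for the values calibrated against the labeled holdout set:

```python
# Hypothetical weights; production values come from calibration against
# recruiter-labeled candidate-job pairs.
WEIGHTS = {
    "bm25": 0.30,
    "embedding": 0.35,
    "skills_overlap": 0.20,
    "experience_match": 0.10,
    "location": 0.05,
}

def skills_overlap(candidate_skills: set[str], required_skills: set[str]) -> float:
    """Fraction of required skills the candidate covers (0 if none required)."""
    if not required_skills:
        return 0.0
    return len(candidate_skills & required_skills) / len(required_skills)

def composite_score(features: dict[str, float]) -> float:
    """Weighted sum of ranking features, each normalised to [0, 1]."""
    return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
```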
- End-to-end pipeline architecture, from ingestion through API layer
- Fine-tuning pipeline: dataset curation, QLoRA training, evaluation harness
- Elasticsearch schema design: field mappings, analyzers, index settings for scale
- Hybrid retrieval implementation: BM25 + dense embedding fusion via RRF
- Ranking model: feature engineering, training, calibration
- FastAPI service layer: async endpoints, caching, rate limiting
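To give a flavour of the schema design, here is a sketch of an Elasticsearch mapping that supports both halves of the dual index. Field names, the embedding dimension, and analyzer choices are illustrative, not the production schema:

```python
# Illustrative Elasticsearch 8.x mapping: full-text and structured fields
# serve BM25, while the dense_vector field serves kNN retrieval.
RESUME_MAPPING = {
    "mappings": {
        "properties": {
            "full_text": {"type": "text", "analyzer": "english"},
            "skills": {"type": "keyword"},
            "job_titles": {"type": "text"},
            "years_experience": {"type": "float"},
            "location": {"type": "geo_point"},
            "embedding": {
                "type": "dense_vector",
                "dims": 768,       # assumed embedding dimension
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}
```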
Hybrid always beats pure dense or pure sparse. Dense embeddings excel at semantic matching but struggle with exact keyword requirements (specific technologies, certifications, locations). BM25 handles precision well but misses paraphrases. Combining both and letting the ranking layer reconcile them consistently outperforms either alone.
Fine-tuned small models beat large zero-shot models for structured extraction. A 3B parameter model fine-tuned on domain-specific data produces more reliable structured outputs than a 70B model prompted zero-shot. Extraction tasks have clear right answers — fine-tuning exploits this. The added benefit is cost: a fine-tuned small model is orders of magnitude cheaper per inference.
Ranking signal design matters more than retrieval at scale. Once you have a retrieval system that surfaces a good candidate set, the quality difference comes from ranking. Investing in careful feature engineering and calibration against labeled data yields larger gains than further retrieval optimisation beyond a threshold of recall.