Engineering · 06 May 2026 · 14 min read

Building the India company graph: dedup, names, and Devanagari

The hardest engineering problem in Indian B2B data is not ingestion. It is identity resolution.

MCA, GSTN, Udyam, and DGFT each issue their own company identifier. MCA issues the CIN. GSTN issues the GSTIN, which is state-specific: a company registered in 5 states has 5 GSTINs. Udyam issues its own registration number. DGFT issues the IEC. None of these identifiers reliably points to the others.

The first hard problem is joining them. A company's PAN appears on MCA filings, inside its GSTIN (characters 3 to 12), and on its IEC, so the PAN is the cross-source key. But the sources expose it unevenly: MCA does not always expose PAN, a GSTIN contains the PAN by construction, and Udyam shows the PAN on the certificate. Building the join requires reconciling all three.
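To make the "PAN by construction" point concrete, here is a minimal sketch of pulling the PAN out of a GSTIN and grouping state-level registrations under it. The function names and the sample GSTINs are hypothetical, and the check is a shape check only: it assumes the usual 15-character layout (2-digit state code, 10-character PAN, 3 trailing characters) and does not verify the check character.

```python
import re
from collections import defaultdict
from typing import Optional

# Shape check only (an assumption for this sketch): 2-digit state code,
# 10-character PAN (5 letters, 4 digits, 1 letter), then 3 more characters.
_GSTIN_SHAPE = re.compile(r"\d{2}([A-Z]{5}\d{4}[A-Z])[0-9A-Z]{3}")


def pan_from_gstin(gstin: str) -> Optional[str]:
    """Return characters 3 to 12 of a well-formed GSTIN, else None."""
    match = _GSTIN_SHAPE.fullmatch(gstin.strip().upper())
    return match.group(1) if match else None


def group_gstins_by_pan(gstins: list[str]) -> dict[str, list[str]]:
    """Collect state-level GSTINs under the PAN they embed."""
    groups: dict[str, list[str]] = defaultdict(list)
    for gstin in gstins:
        pan = pan_from_gstin(gstin)
        if pan is not None:  # malformed values are skipped, not guessed at
            groups[pan].append(gstin)
    return dict(groups)
```

Two GSTINs that differ only in the state prefix and trailing characters land under the same PAN, which is what lets a multi-state company collapse to one join key.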

The second hard problem is names. The same company can appear as "Madhuban Foods Private Limited" on MCA, "MADHUBAN FOODS PVT LTD" on GST, "Madhuban Foods (P) Ltd" on a trade lead, and "मधुबन फूड्स प्राइवेट लिमिटेड" on the GST certificate. Normalization needs to strip suffixes, fold case, transliterate Devanagari, and handle the "Pvt" / "Private" / "P." / "(P)" variants.
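A minimal sketch of the Latin-script part of that normalization follows. The function name and the suffix list are illustrative assumptions, not an exhaustive set of Indian legal forms, and Devanagari transliteration is deliberately left out: this sketch assumes it happens in an earlier step.

```python
import re
import unicodedata

# Illustrative legal-form tokens; a real list would be longer.
_SUFFIX_TOKENS = {"private", "pvt", "p", "limited", "ltd"}


def normalize_name(raw: str) -> str:
    """Fold case, drop common punctuation, and strip trailing legal-form tokens."""
    text = unicodedata.normalize("NFKC", raw).casefold()
    # Turn periods, commas, and parentheses into spaces so "(P)" and "P." become "p".
    text = re.sub(r"[.,()]", " ", text)
    tokens = text.split()
    # Strip from the end only, so a "P" in the middle of a name survives.
    while tokens and tokens[-1] in _SUFFIX_TOKENS:
        tokens.pop()
    return " ".join(tokens)
```

Under these assumptions the three Latin-script variants above all reduce to "madhuban foods", which is the form an exact-match stage can compare.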

We use a four-stage dedup: (1) exact match on the normalized form, (2) trigram similarity for fuzzy matches, (3) embedding similarity (sentence-transformers, 384 dim, pgvector) for semantic matches, (4) human review on the edge cases. The thresholds are tuned per source, because MCA records are clean and aggregator records are not.
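The cascade can be sketched as a single decision function. Everything named here is an assumption for illustration: the thresholds are placeholders rather than tuned values, the trigram measure is a plain Jaccard overlap over padded character trigrams, and the embedding stage is passed in as an optional callable so the sketch does not depend on any particular model or vector store.

```python
from typing import Callable, Optional


def trigrams(text: str) -> set[str]:
    """Character trigrams of the text, padded so short strings still yield some."""
    padded = f"  {text} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two trigram sets, in the range 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def classify_pair(
    a: str,
    b: str,
    *,
    fuzzy_threshold: float = 0.8,    # hypothetical; tune per source
    review_threshold: float = 0.5,   # hypothetical; tune per source
    embed_similarity: Optional[Callable[[str, str], float]] = None,
    embed_threshold: float = 0.9,    # hypothetical; tune per source
) -> str:
    """Run two normalized names through the stages and report where they land."""
    if a == b:
        return "exact"
    similarity = trigram_similarity(a, b)
    if similarity >= fuzzy_threshold:
        return "fuzzy"
    if embed_similarity is not None and embed_similarity(a, b) >= embed_threshold:
        return "semantic"
    if similarity >= review_threshold:
        return "review"  # ambiguous: queue for a human
    return "distinct"
```

Passing different threshold values per source is one way to express the point that clean registry records can be matched more aggressively than aggregator records.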

The third hard problem is directors. A DIN identifies an individual person across all companies. But director records have their own dirty data: name variants, multiple companies, prior cessations, dormant DINs. We build a person-to-company graph with valid-from and valid-to dates per appointment, and we link person records by DIN, not by name.
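One way to picture the person-to-company graph is as a set of dated appointment edges indexed by DIN and by CIN. The class and field names below are hypothetical, and treating the cessation date as exclusive (a half-open interval) is an assumption of this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Appointment:
    din: str                          # person key: DIN, never the name
    cin: str                          # company key
    valid_from: date
    valid_to: Optional[date] = None   # None means the appointment is still active

    def active_on(self, day: date) -> bool:
        # Half-open interval: active from valid_from up to, not including, valid_to.
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


class DirectorGraph:
    """Appointment edges indexed from both sides."""

    def __init__(self) -> None:
        self._by_din: dict[str, list[Appointment]] = defaultdict(list)
        self._by_cin: dict[str, list[Appointment]] = defaultdict(list)

    def add(self, appointment: Appointment) -> None:
        self._by_din[appointment.din].append(appointment)
        self._by_cin[appointment.cin].append(appointment)

    def companies_of(self, din: str, day: date) -> set[str]:
        """CINs where this DIN held an active appointment on the given day."""
        return {a.cin for a in self._by_din.get(din, []) if a.active_on(day)}

    def directors_of(self, cin: str, day: date) -> set[str]:
        """DINs with an active appointment at this company on the given day."""
        return {a.din for a in self._by_cin.get(cin, []) if a.active_on(day)}
```

Because the edges are keyed by DIN, two spellings of the same director's name never create two people, and a ceased appointment stays in the graph as history rather than being deleted.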

The output is a graph where every company has one canonical record, joined to every identifier it has, and every director appears once with all their appointments attached. That graph is the foundation everything else (signals, trade, search, API) is built on. Get the graph wrong and everything downstream produces noise.

Kestrel is the India-first GTM data engine. Search 1.89 million active companies, track 15 buying-signal types, and call the public enrichment API.
