Executed after the canonicalization pass (see
canonicalization_20260419.md) to lift the 15 real entity_ids from
their Similar-50 cosine baseline (~0.30 enrichment_score, sparse
metadata, no fresh embeddings) to TandemStride-style density on the
/companies/[slug] pages.
The run was done locally on the user's M-series Mac via llama.cpp: the Linode GPU burst cluster was unreachable at run time, and the targeted 15-entity scope is small enough to finish in minutes on the local GPU without provisioning a cluster.
Infrastructure used
- Inference: `llama-server` on `:8080`, model `gemma-4-E4B-it-Q4_K_M.gguf`, 16k context, `--n-gpu-layers 99`. Launched in a detached `screen` session.
- Embedding: `llama-server --embedding` on `:8081`, model `jina-embeddings-v5-text-small-retrieval/v5-small-retrieval-Q8_0.gguf`, 8k context, `--n-gpu-layers 99`. Produces 1024-dim vectors (matches `VECTOR_DIM` in `config.py`).
- Pipeline: `enrich_cli.py entity <id>` for core enrichment (Q0 grounding → Q1 profile → Q2 media → Q3 relationships → Q4 character stats → Q5 visual profile → taxonomy), followed by a direct `get_embeddings_batch` pass through the shared `_build_entity_embed_text` helper for a deterministic re-embed.
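The deterministic re-embed hinges on the embed text being a pure function of the row. A minimal sketch of that idea, in the spirit of the shared `_build_entity_embed_text` helper — the field names and join format here are assumptions, not the real implementation:

```python
# Hypothetical sketch of a deterministic embed-text build; the field
# names and separator are assumptions, not the real
# _build_entity_embed_text implementation.
def build_entity_embed_text(entity: dict) -> str:
    parts = [
        entity.get("name", ""),
        entity.get("industry", ""),
        entity.get("profile_summary", ""),
    ]
    # Fixed field order + stable join: the same row always yields the
    # same string, so re-embedding is reproducible.
    return " | ".join(p for p in parts if p)
```

Because the string is stable, re-running the embed pass on an unchanged row produces the same 1024-dim vector.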
Sequence
- Smoke-run on `#737992` (UnitedHealth Group) with 4k ctx: `Q1 core profile` failed with `request (4184 tokens) exceeds the available context size (4096 tokens)`. Score landed at 0.972 anyway via Q2–Q5.
- Restarted `llama-server` with 16k ctx.
- Batched all 15 entities through `enrich_cli.py entity` with an 8-second pause every 3 entities (per the user's API-rate-limit rule).
- Brought up the embedding server on `:8081`.
- Targeted re-embed of all 15 via Jina-v5 (1024-dim).
- Re-ran enrichment on the five low-scorers (≤65%) with the full stack healthy to give them every possible lift.
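The batching cadence above (8-second pause after every 3rd entity) can be sketched as a small driver. This is a hypothetical wrapper, not the actual batch script; the `python enrich_cli.py entity <id>` invocation is taken from this debrief, and `run`/`sleep` are injectable so the cadence is testable:

```python
import subprocess
import time

def enrich_batch(entity_ids, pause_every=3, pause_secs=8,
                 run=subprocess.run, sleep=time.sleep):
    """Run `enrich_cli.py entity <id>` for each id, sleeping
    pause_secs after every pause_every-th entity (the user's
    API-rate-limit rule). `run` and `sleep` are injectable for tests."""
    for i, eid in enumerate(entity_ids, start=1):
        run(["python", "enrich_cli.py", "entity", str(eid)], check=True)
        # Pause only between groups, not after the final entity.
        if i % pause_every == 0 and i < len(entity_ids):
            sleep(pause_secs)
```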
Before → After
All 15 went from Similar-50 baseline (~0.30, no fresh embedding,
cosine-peer metadata only) to the scores below. Mean enrichment
score: 0.839. Re-embed coverage: 15/15, with
`embedding_updated_at` populated on every row.
entity_id score passes re-emb Name
--------- ----- ------ ------ ---------------------------------
737992 97.7% 3 yes UnitedHealth Group Incorporated
1166349 64.0% 4 yes Optum Care
427873 89.7% 4 yes OptumRx
1166348 61.4% 4 yes Optum Bank
833848 100.0% 6 yes OptimizeRx Corporation
629634 100.0% 3 yes Universal Health Services
673584 65.2% 4 yes Sharp HealthCare
673545 65.3% 3 yes Prime Healthcare
573586 72.0% 3 yes Deloitte Canada
1317492 95.6% 3 yes Deloitte Risk & Financial Advisory
825797 100.0% 2 yes Marsh McLennan
827137 61.0% 3 yes MagnaChip Semiconductor Corp.
856177 100.0% 6 yes Skyworks Solutions
387081 93.5% 2 yes SMIC
389485 93.5% 2 yes Meta Platforms Inc.
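The reported mean can be cross-checked directly from the per-entity scores in the table:

```python
# Arithmetic check of the reported mean enrichment score (0.839)
# against the 15 per-entity scores in the table above.
scores = [97.7, 64.0, 89.7, 61.4, 100.0, 100.0, 65.2, 65.3,
          72.0, 95.6, 100.0, 61.0, 100.0, 93.5, 93.5]
mean = sum(scores) / len(scores) / 100  # percentages -> 0..1 scale
print(round(mean, 3))  # 0.839
```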
Score-ceiling finding
Five entities plateaued at 61–65% across multiple passes even with the full pipeline (16k ctx inference + Jina embed + Q0 web grounding):
- Optum Care · Optum Bank · Sharp HealthCare · Prime Healthcare · MagnaChip Semiconductor
The ceiling is public data availability, not pipeline health:
- Q0 grounding returns thin snippets for these entities because they're subsidiaries / specialty operators / less-covered niches.
- Q5 visual-profile readiness sits in the 0.60–0.64 range for each.
- Taxonomy bonus caps at 0.10–0.18.
This validates the burst runbook's Prerequisite #2 — authoring a
per-company seed_company_deep/data/<slug>.py CompanySpec with
hand-curated facts is the only way to push these five above 0.70
without misattributing generic industry data to the company.
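A seed spec of the kind Prerequisite #2 calls for might look like the following. This is a hypothetical shape — the real `CompanySpec` fields in `seed_company_deep/data/<slug>.py` are not shown in this debrief — sketched for Optum Bank as one of the five:

```python
from dataclasses import dataclass, field

# Hypothetical CompanySpec shape; the real fields in
# seed_company_deep/data/<slug>.py may differ.
@dataclass
class CompanySpec:
    entity_id: int
    name: str
    facts: list = field(default_factory=list)  # hand-curated, sourced claims

SPEC = CompanySpec(
    entity_id=1166348,
    name="Optum Bank",
    facts=["<hand-curated fact goes here>"],
)
```

Hand-curating `facts` per company is what keeps the lift from leaning on generic industry data.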
Pre-existing bugs surfaced
`enrichment_model` + `enrichment_host` columns are still NULL on every row after enrichment. Same bug called out in the 10× fleet debrief (0YFNFkQTe3hptNBSkTqski). Fix is in `enrich.py`, not scoped here.
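The eventual fix presumably just stamps the two columns at write time. A minimal sketch — column names are taken from the bug report above, but the function name and the surrounding `enrich.py` write path are assumptions:

```python
def stamp_provenance(row: dict, model: str, host: str) -> dict:
    """Fill the enrichment_model / enrichment_host columns that are
    currently left NULL after enrichment (hypothetical helper; the
    real write path in enrich.py is not shown here)."""
    row["enrichment_model"] = model
    row["enrichment_host"] = host
    return row
```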
Impact on /companies/[slug]
Every one of the 15 newly canonical /companies/[slug] pages now
renders the shared OrgStatisticsBlock with real enrichment data
(via the empty-state bypass added in PR #231), plus fresh 1024-dim
Jina embeddings driving cosine-similar lookups elsewhere in the
graph. No visible change needed to the frontend.
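The cosine-similar lookups the fresh vectors feed reduce to the standard cosine formula. A minimal pure-Python sketch for illustration — the production lookup presumably runs inside the vector store, not in application code:

```python
import math

def cosine_similarity(a, b):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```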
Infra follow-up
Local `screen` sessions `llama-server` (port 8080) and `llama-embed`
(port 8081) are still running. To stop both:
screen -S llama-server -X quit
screen -S llama-embed -X quit
pkill -f llama-server
Or leave them up if you expect to re-run enrichment soon.