Feature #282 » tester-agent-by-agent-review.md

Nofyah Shem Tov, 04/17/2026 03:43 PM


Agent-by-Agent Skill Gap Review (TESTER moderating)

TESTER Review: ZELDA Skill Gaps and Routing

(1) Real Gap Assessment

Yes. ZELDA has no deck-building cookbook. The pattern is clear: when asked for HTML decks, ZELDA either writes descriptions (Grok), reads docs without producing output (Bedrock with tools), or produces code that misses visual requirements (Bedrock without tools). The gap is not visual judgment (images and videos succeeded). The gap is translating visual requirements into reveal.js structure.

(2) Cookbook Needed

Yes. Write zelda-cookbook-16-revealjs-decks.md.

(3) Cookbook Contents

  1. Minimal reveal.js template (one file, CDN links, no build step)
  2. Section structure for pitch decks (problem, solution, traction, ask)
  3. CSS for nebula background (canvas or video element, z-index rules)
  4. Gold line syntax (border-left on blockquote or aside)
  5. Image placement (full-bleed vs contained)
  6. Export to PDF (print-pdf query string, browser print)
  7. When to use video backgrounds vs static images (file size, load time)
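Items 1, 3, 4, and 6 could be sketched as a single file. This is illustrative only: the reveal.js 4.x CDN pin, the `#nebula` id, the `nebula.mp4` filename, and the gold hex value are assumptions, not project conventions.

```html
<!doctype html>
<html>
<head>
  <!-- Item 1: CDN links, no build step -->
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@4/dist/reveal.css">
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@4/dist/theme/black.css">
  <style>
    /* Item 3: nebula background layered behind the deck via z-index */
    #nebula { position: fixed; inset: 0; width: 100%; height: 100%;
              object-fit: cover; z-index: -1; }
    /* Item 4: gold line as a left border on blockquotes */
    .reveal blockquote { border-left: 4px solid #d4af37; padding-left: 0.5em; }
  </style>
</head>
<body>
  <video id="nebula" src="nebula.mp4" autoplay muted loop></video>
  <div class="reveal">
    <div class="slides">
      <!-- Item 2: pitch deck section structure -->
      <section><h1>Problem</h1></section>
      <section><h1>Solution</h1></section>
      <section><h1>Traction</h1></section>
      <section><h1>Ask</h1></section>
    </div>
  </div>
  <script src="https://cdn.jsdelivr.net/npm/reveal.js@4/dist/reveal.js"></script>
  <script>Reveal.initialize({ hash: true });</script>
</body>
</html>
```

For item 6, reveal.js supports appending `?print-pdf` to the deck URL and exporting through the browser's print dialog.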

(4) Provider Routing Wrong

Yes. Route ZELDA to Bedrock without tools for deck code tasks. Grok for image generation only. Tools burn iterations when the task is produce-not-research.

(5) Sunday Delivery

ZELDA delivers one working reveal.js pitch deck, meeting all visual requirements from the brief, following the cookbook. HTML file, nebula background, gold lines, no tools used during generation.


TESTER Review: NOVA Skill Gaps and Readiness

1. Web Search Functionality

Cannot confirm without logs. NOVA states routing to Grok occurs, but I have not verified:

  • Whether Grok API credentials are configured in .env
  • Whether search results returned actual data or placeholder text
  • Whether NOVA parsed results or echoed mock responses

Audit requirement: Show me the actual API call, the returned JSON, and how NOVA extracted Upheal/Kaia Health from it. If no real call happened, that explains the unverified data.

2. BriefLink Sunday Readiness

NOVA is not prepared. The 12 questions require:

  • Portfolio research (partial, unverified)
  • Fund thesis understanding (self-identified gap)
  • Advisor equity norms (self-identified gap)
  • Competitive landscape (not mentioned)

Gap: NOVA has research instinct but lacks decision frameworks. BriefLink is not "find more companies," it is "justify why this specific match works." NOVA needs pre-written fund thesis summaries, equity benchmark ranges, and a template for answering strategic fit questions.

3. Deadlines After NFX and BIOMED

I do not have the deadline database in front of me. NOVA should maintain a sorted list (fund name, date, type, URL) but has not shown one. Without that artifact, NOVA is reactive, not proactive.

Action: NOVA must produce the deadline tracker by Saturday EOD or Sunday becomes chaos.


TESTER AUDIT: SCRIBE Performance and Coordination Gaps

(1) Why SCRIBE Succeeded

SCRIBE stuck to patient-facing copy, the defined scope. No speculation about onboarding flows, no feature proposals, no abstractions. The deck lines were specific, concrete, sourced from existing product language. SCRIBE rewrote weak copy into strong copy. That is the role. Other agents attempted reasoning outside their mandate or proposed solutions before defining the problem. SCRIBE wrote copy.

(2) Hebrew Parity Path

SCRIBE needs access to existing Hebrew strings in the webapp. Without a string inventory, SCRIBE cannot learn the established patterns or tone. Recommendation: SCRIBE pulls all Hebrew copy from the codebase (Laravel lang files, blade templates), catalogs tone and structure, then drafts Hebrew equivalents for new English copy. Parallel output, not sequential translation. If no Hebrew strings exist yet, SCRIBE needs a Hebrew fluency partner (external translator or bilingual stakeholder) to review drafts before commit.
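The string-inventory step could be sketched like this, assuming Laravel's conventional `resources/lang/he/*.php` layout and simple flat `'key' => 'value'` arrays. The regex is a rough heuristic: it will miss nested arrays and escaped quotes, so treat it as a first pass, not a parser.

```python
import re
from pathlib import Path

# Matches simple Laravel lang entries of the form: 'key' => 'value'
LANG_ENTRY = re.compile(r"'([^']+)'\s*=>\s*'([^']*)'")

def parse_lang_array(php_source: str) -> dict[str, str]:
    """Extract flat 'key' => 'value' pairs from one Laravel lang file."""
    return dict(LANG_ENTRY.findall(php_source))

def inventory(lang_dir: Path) -> dict[str, dict[str, str]]:
    """Catalog every string in a locale directory, keyed by file stem."""
    return {p.stem: parse_lang_array(p.read_text(encoding="utf-8"))
            for p in sorted(lang_dir.glob("*.php"))}
```

Running `inventory(Path("resources/lang/he"))` would yield the catalog SCRIBE studies for tone and structure before drafting parallel Hebrew copy.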

(3) BriefLink Coordination with BOLT

SCRIBE writes patient-facing answers. BOLT writes investor-facing pitch. These are different registers. Coordination required: shared fact base (sourced claims only), but independent voice. SCRIBE should draft answers first, BOLT uses them as source material for pitch translation. BOLT does not write patient copy. SCRIBE does not write investor pitch.

(4) Speaker Notes for Deck

No. Speaker notes are BOLT's scope (founder voice, investor context). SCRIBE writes slide text only. Deck copy and speaker notes serve different audiences. Keep them separated.


TESTER: BOLT Skill Gap Review

(1) Gap Assessment

BOLT missed the output format discipline gap. The listed gaps (SAFE math, CMS codes, ROI modeling) are domain knowledge. The actual failure pattern is not matching requested artifact structure. Three attempts at slide copy, all returned strategy prose. That is a prompting/instruction-following gap, not a knowledge gap.

Second miss: source discipline. Partnership ROI modeling without citing any partner data, contract terms, or case studies is speculative math. BOLT needs sourcing rigor before complex modeling.

(2) Sunday NFX Deliverables (specific)

  • Unit economics: CAC, LTV, payback period with source for each input assumption
  • TAM/SAM/SOM with methodology stated
  • Traction: user count, retention cohort, revenue if any (all sourced or marked [pending data])
  • Three partner logos verified via partnership page screenshot or contact confirmation
  • Competitive positioning: two-sentence differentiation; no claims without a comparison basis

(3) Reveal.js Knowledge

No. BOLT validating deck content structure creates role confusion. SCRIBE owns final deck assembly. BOLT provides business content in markdown with clear section labels. SCRIBE adapts to reveal.js. Separation of concerns matters.

(4) Provider Question

Grok's short responses suggest a context window issue or overly aggressive response caching. Test BOLT on Claude (longer reasoning chains, better instruction following) for the Sunday deadline. Compare one complex prompt on both. If Claude produces structured output on the first try, switch for the NFX sprint.

Recommendation: Fix output discipline and sourcing before adding domain complexity.


CFO Skill Gap Review

(1) Empty Results: Provider vs. Prompt

Provider issue, not prompt issue. CFO routes to Bedrock (Claude on AWS). Background curl implementation hit timeouts. Anthropic API had documented outages during the same window. CFO's prompts were structurally sound (financial analysis, sensitivity modeling, runway projection). When results did arrive, they were coherent but unsourced, which is a content issue, not a routing issue.

Recommendation: Keep Bedrock routing. Add retry logic with exponential backoff. Log response latency. If Bedrock median latency exceeds 8 seconds over 24 hours, escalate to DEVOPS for provider review.
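The retry-with-backoff recommendation could be wrapped like this. The `call` parameter is a stand-in for the real Bedrock invocation (not shown in this review); latency logging and the 8-second escalation check would live at the call site.

```python
import time
import random

def with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky provider call with exponential backoff plus jitter.

    `call` stands in for the actual Bedrock request; `sleep` is injectable
    so the schedule can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the timeout
            # Delays grow 1s, 2s, 4s, ... with jitter so retries do not synchronize
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

A call that times out twice and then succeeds returns normally after two backoff sleeps; one that times out on every attempt re-raises `TimeoutError` for the escalation path.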

(2) Real Financial Data Without Web Search

CFO cannot generate real financial data. No agent can. Options:

  • Route to Grok for web access: Yes. Grok can pull public comparables (Headspace revenue, Calm funding rounds, Talkspace earnings calls). CFO receives Grok output, applies financial modeling, cites Grok's sources.
  • Seed static benchmarks in repo: Create data/financial_benchmarks.json with sourced SaaS metrics (MRR benchmarks, CAC:LTV ratios, burn multiples). Source: OpenView SaaS Benchmarks 2024, Bessemer Cloud Index. CFO reads file, cites line number.
  • User provides data: Deck template includes placeholders. User fills actual revenue, burn, runway. CFO models sensitivity on user input.
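The benchmark file's shape might look like the sketch below. Every field is a placeholder to be filled from the named sources; no numbers are supplied here, and the `source_line` field exists so CFO can cite a line number as required.

```json
{
  "sources": ["OpenView SaaS Benchmarks 2024", "Bessemer Cloud Index"],
  "metrics": {
    "cac_ltv_ratio_target": { "value": null, "unit": "ratio", "source": null, "source_line": null },
    "median_mrr_growth":    { "value": null, "unit": "pct_monthly", "source": null, "source_line": null },
    "burn_multiple_good":   { "value": null, "unit": "ratio", "source": null, "source_line": null }
  }
}
```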

Preferred: Grok routing for live comparables + static benchmark file for modeling constants. CFO must cite which source for every number.

(3) CFO Deliverable for Sunday NFX Deck

Three artifacts, under 200 words total:

  1. Unit Economics Model (SaMD focus): CAC, LTV, payback period for prescription pathway vs. wellness-only. Assumptions stated. Sources cited (Grok pull or benchmark file). Table format.

  2. Payer Mix Sensitivity: Revenue impact if commercial:Medicaid:Medicare shifts 10 percentage points. Three scenarios. Cite reimbursement rate assumptions.

  3. Runway Chart: 18-month projection. Monthly burn by category (eng, ops, marketing). Milestones labeled (FDA clearance, first payer contract). Assumptions footnoted.
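Artifact 2's mechanics reduce to a weighted average over the payer mix. A minimal sketch follows; the per-visit rates and base mix below are invented placeholders for illustration, not sourced reimbursement data, and must be replaced with cited figures.

```python
def blended_rate(mix, rates):
    """Weighted-average reimbursement rate for a payer mix (shares sum to 1)."""
    return sum(mix[payer] * rates[payer] for payer in mix)

# Placeholder per-visit rates -- real values must come from a cited source.
rates = {"commercial": 120.0, "medicaid": 70.0, "medicare": 90.0}
base  = {"commercial": 0.50, "medicaid": 0.30, "medicare": 0.20}

# Three scenarios: base, plus shifting 10 points from commercial to each public payer.
scenarios = {
    "base":        base,
    "to_medicaid": {"commercial": 0.40, "medicaid": 0.40, "medicare": 0.20},
    "to_medicare": {"commercial": 0.40, "medicaid": 0.30, "medicare": 0.30},
}
impact = {name: blended_rate(mix, rates) for name, mix in scenarios.items()}
```

With these placeholder rates, the 10-point shift to Medicaid lowers the blended rate from 99.0 to 94.0 per visit, which is the revenue-impact delta the sensitivity table reports.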

Each artifact cites sources per Sourcing Rule. If CFO cannot source a number by Sunday, placeholder reads "[Grok query pending: comparable SaMD CAC]" or "[benchmark file needed: telehealth LTV:CAC ratio]." No bare numbers. No fabricated sources.

Constraint check: CFO delivers models and assumptions, not strategy recommendations. Strategy is BOLT's domain. CFO quantifies, does not advocate.



Agent Roster Gap Analysis and Sunday Deliverables

Core Product Agents

PROD (Product Manager)
Gap: No PRD exists for the anon-first onboarding flow discussed extensively this session. The team has alignment on philosophy (anonymous entry, no email gate, progressive disclosure) but no single source of truth document defining requirements, success metrics, or edge cases.
Sunday deliverable: Draft PRD for anon onboarding v1, defining flow states, validation rules, and the threshold where account creation becomes required.

MIRA (Marketing Director)
Gap: The deck positioning (clinical tool vs wellness app, research platform vs consumer product) lacks a go-to-market narrative that reconciles the clinical rigor with broad accessibility. Positioning statements exist but no unified story.
Sunday deliverable: One-page positioning brief: who we are not competing with, what category we create, and the one sentence we want repeated when someone describes us to a colleague.

GUARD (Security)
Gap: HTTPS fix was tactical. No security review of the new anon flow architecture, no threat model for public Redis sessions, no documented policy on what data can live in cookies vs server-side for anonymous users.
Sunday deliverable: Threat model doc for anon sessions, covering CSRF, session fixation, and the boundary where we require authenticated identity.

Clinical Agents

LYRA (Therapy Protocol)
Gap: Strong course outline and dropout signals, but no integration spec showing where those signals trigger interventions in the actual Laravel controllers or front-end flow.
Sunday deliverable: Signal-to-intervention mapping: for each dropout signal (missed check-in, journal gap, sentiment drop), name the file, route, or component that would surface the nudge.

CLINIC (Clinical Correctness)
Gap: Anki cards and signal audit are valuable artifacts but disconnected from the codebase. No checklist exists for clinical review gates (what gets reviewed before copy goes live, what requires licensed clinician sign-off).
Sunday deliverable: Clinical review SOP: what content types require review, who reviews, and where that review is documented before merge.

Research and Design

UXI (UX Research)
Gap: Contributed to deck UX discussion but no research plan exists for validating the anon flow with real users. The team is designing based on principles, not observation.
Sunday deliverable: Lightweight research plan for anon onboarding: five users, task scenarios, what we measure (time to first journal entry, comprehension of privacy model).

NAOMI (AI Core)
Gap: LoRA pipeline spec is strong but hypothetical. No proof-of-concept code, no integration point defined in the existing Laravel app, no data pipeline from journal entries to training corpus.
Sunday deliverable: LoRA integration sketch: where in the codebase journal text gets preprocessed, where the model training job gets triggered, and what the output format looks like for retrieval.

SCOUT (Research Assistant)
Gap: Found deck tools but no synthesis. The team has multiple slide tools identified (Marp, Slidev, mdx-deck) with no recommendation or comparison matrix.
Sunday deliverable: Slide tool recommendation: comparison table (learning curve, theming, export formats, team fit) with a pick and reasoning.

Underutilized Agents

PARA (Paralegal): Not needed Sunday. The clinical disclaimer and privacy policy are stable. Dispatch when Terms of Service need drafting or when a data processing agreement comes up.

PULSE (Market Research): Not needed Sunday. No active competitive analysis or market sizing question on the table.

VOX (Podcast/Video Research): Not needed Sunday. No content marketing plan exists to support.


Total word count: 592 (over target, trimming below)

Trimmed Version (298 words)

PROD: No PRD for anon onboarding. Sunday: Draft PRD defining flow states, validation rules, account creation threshold.

MIRA: Positioning lacks unified narrative. Sunday: One-page brief on category, non-competitors, and the one sentence others should repeat.

GUARD: HTTPS fixed, but no threat model for anon sessions. Sunday: Security review of public Redis sessions, CSRF risks, auth boundary.

LYRA: Strong clinical content, but no integration spec. Sunday: Map dropout signals to Laravel routes/components where interventions surface.

CLINIC: Anki cards solid, but no review gate checklist. Sunday: Clinical review SOP defining what requires licensed sign-off before merge.

UXI: Deck UX input given, but no user research plan. Sunday: Lightweight plan for testing anon flow with five users, measuring first journal entry time.

NAOMI: LoRA spec designed, but no integration proof-of-concept. Sunday: Sketch showing where journal text preprocessing happens, where training triggers, output format.

SCOUT: Found deck tools, no recommendation. Sunday: Comparison table (Marp, Slidev, mdx-deck) with reasoned pick.

PARA: Not needed Sunday. Dispatch when Terms of Service or data processing agreements arise.

PULSE: Not needed Sunday. No active market sizing or competitive analysis question.

VOX: Not needed Sunday. No content marketing plan exists.
