Ask Dashboard - Real-time Q&A Monitoring

How Source Scoring Works
Latest (2026-03-23): Dual-model comparison — Model 1 + Model 2 dropdowns for side-by-side answer comparison, batch gen, and CSV export; dynamic OpenAI model selector.
(2026-03-21): TAH Fallback gate fix; Rel Boost v2.2.
(2026-03-20): Rel Boost v2; WCR simplification.
(2026-03-17): TAH Fallback v5.
Cited
Top-8 Pool
Prod
Alt6-Qual
Alt6-Raw
Date Range:
Custom: to
Auto-refresh ON
Last updated: Never
Non-Backstop
Backstop
Backstop Breakdown
Category/Classification Breakdown Tables

Category Breakdown

Classification Breakdown

Temporal Subclassification Breakdown

Alt6 Ranking Analysis (Temporal Decay vs Prod)

Ranking Metrics Settings

Pre-1970 Cited Articles (Missing Dates)
Data Quality
Missing/Zero Component Scores
Percentage of articles with null or 0 values
Component % Null/Zero Count Total
Semantic - - -
BM25 - - -
Cross Encoder - - -
Global Exclusion Criteria
Always Excluded Classifications: Adult Content, Conspiracy Theory, Gambling
Alt Score 6 Configuration
📊 Relevance reference pool: sources (cached at page load)
Session Reference Pool
pCross, pBM25, and pSem are session-wide true percentiles computed from raw scores against the global pool (all sources across all questions, frozen at startup).

New questions arriving via polling are scored against this cached pool.

Pool size determines the granularity of RelPct rankings.
Scoring Formulas
Relevance = wCross × pCross + wBM25 × pBM25 + wSem × pSem
RelPct (rank) = percentile_rank(Relevance) over session pool
RelPct (rank) — Percentile Rank
Percentile rank of this source's Relevance vs the session reference pool.

0.90 means this source's Relevance is higher than ~90% of pooled sources.

Not a probability; not an absolute "% relevant."

Computed using an empirical CDF (midrank for ties).
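A minimal Python sketch of this percentile-rank computation (illustrative only; the function and variable names are hypothetical, not the dashboard's actual implementation):

```python
import bisect

def percentile_rank(value, sorted_pool):
    """Empirical CDF with midrank tie handling: count of pool values strictly
    below `value`, plus half the count of exact ties, over the pool size."""
    lo = bisect.bisect_left(sorted_pool, value)   # strictly below
    hi = bisect.bisect_right(sorted_pool, value)  # below or equal
    return (lo + (hi - lo) / 2) / len(sorted_pool)

pool = sorted([0.2, 0.4, 0.4, 0.6, 0.8])
# 0.4 ties with two pool entries: 1 strictly below + 2/2 midrank = 2 of 5
```

A value above every pooled score ranks 1.0; midranking keeps tied scores from all collapsing to the same low or high percentile.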
Alt6 = RelPct (rank) × DecayFactor × AnchorFactor × WindowFactor × TemporalCompat × RelBoost × EntityPresence
DecayFactor = max(floor, exp(−ln(2) × λ × ageInDays / halfLife))
EntityPresence = boost/penalty based on whether question entities appear in source text. Graduated by entity count: 1→0.60, 2→0.50, 3+→0.40 penalty. Boost: 1.20 (title) / 1.12 (desc) / 1.10 (content). Toggleable.
TAH Exception: All TAH subtypes (Event Anchored, Explicit Range, Comparison) — DecayFactor = 1.0 (recency decay disabled)
Reference-Aware Reroute: When enabled, EA/ER/Comparison/Fallback questions detected as reference-style lookups → AnchorFactor=1.0, WindowFactor=1.0, DecayFactor uses RefAware params (default t½=180d, floor=0.70). Gentle freshness preference — relevance dominates. Controlled by toggle + per-tile override.
Global Controls
(sum: 1.00 ✓)
Higher = faster decay
Lower = harsher penalty
Weight Sweep
Find optimal wCross / wBM25 / wSem weights using intrinsic metrics.
Entity Presence Boost ⓘ graduated boost/penalty by entity count
Weak-Cluster Rescue
Detects when the current top-ranked group is semantically weak relative to stronger buried alternatives. A buried source must pass 2 gates: (1) semantic > benchmark mean + lift (default 0.18), (2) semantic rank ≤ rank cap (default 5). Two hardcoded safety nets (P75 and absolute floor) provide backstop filtering. Qualifying candidates join a rescue pool with the benchmark and are reranked with semantic-led weights (75% semantic / 25% BM25 / 0% CE). Top sources from the reranked pool are sent to generation.
Trigger Gates
Rescue Behavior
Reranking Weights
(sum: 1.00 ✓)
Temporal Intent–Specific Decay: Parameters & Lab Results
Configure decay parameters and see their effects on Alt6 top-8 sources
How to read ▸
Subclass | #Q | Half-Life (days) | Floor (min decay) | Age → Floor (days) | Floor-bound Rate | Median Decay | Age@95% Floor | Spread (P75–P25)
BRT - hrs
RWR -
EvtAnch/TAH - N/A Anchor scoring (t½=120d, floor=0.27) ×0.80
ExpRng/TAH - N/A Window compliance (t½=180d, floor=0.27) ×0.80
Comp/TAH - Recency decay (t½=150d, floor=0.40)
TAH Fallback Subtypes Unresolved TAH queries (no dates extracted) — per-subtype decay
FB_Fresh -
FB_Topical -
FB_Evergreen -
FB_Comparative -
CRBN -
RefAware -
Ref-Aware reroute

UNKNW -
UNCLASSIFIED -
Temporal Compatibility ⓘ year-presence check (TAH only, tw primary)
Synthetic Window | "since" → today
Reference-Aware Reroute ⓘ reroutes reference-like temporal questions to gentle RefAware decay |
How Source Scoring Works

Alt6 Scoring Pipeline — Execution Sequence

Exact order of operations. Each numbered step completes before the next begins unless noted.

━━ PHASE A: SESSION-WIDE CACHES (startup only, locked after first run) ━━

A1. Compute altDecay raw values per source (needed by cache).
A2. buildGlobalPercentileCache()
    Collect ALL sources (first 100 per question) across all questions.
    Build sorted arrays of raw scores: cross, bm25, sem, decay, dr, altDecay1–5.
    These arrays are the reference pool for all percentile computations.
    → LOCKED: globalPercentileCacheBuilt = true (never rebuilt)
A3. buildRelevanceDistributionCache()
    For every source in the pool, compute Relevance via computeRelevance():
      Relevance = wCross × pCross + wBM25 × pBM25 + wSem × pSem
    Sort all Relevance values → used for RelevancePct percentile lookups.
    → LOCKED: relevanceDistributionCacheBuilt = true (never rebuilt)

━━ PHASE B: PER-QUESTION SETUP (runs for each question) ━━

B1. getEffectiveDecayParams() → determine temporal subclass + cascade fallback
B2. checkTahRecencyOverride() → if TAH anchor/window is recent → use BRT or RWR decay
B3. extractTemporalCues() → extract target year(s) from question text / time window
B4. checkOpenEndedRange() → detect start-only ER → compute synthetic window end
B5. checkTahFallback() → no dates or end-only → route to fallback subtype
B5b. Reference-Aware Reroute (if mode enabled):
     shouldRerouteToCrbn() → checks category, classification, anchor age, keywords.
     Applies to: EA, ER, Comparison, and Fallback (TahFbFresh/Topical only).
     If triggered → RefAware decay (no anchor/window, age-based decay t½=180d, floor=0.70).
     Investigative questions: only rerouted when anchor age > Inv threshold (default 100yr).
     Per-tile override can force standard or force reroute regardless of detection.
B6. classifyRelationshipQuery() → detect relationship entities + patterns (once per question)
B7. Resolve effective subclass: override → reroute → fallback → cascade (in priority order)
B8. computeQRanksForQuestion() → rank each source by Relevance within this question

━━ PHASE C: PER-SOURCE SCORING (runs for each source within each question) ━━

C1. Relevance signal percentiles (from global cache A2):
    pCross = getCrossPct(source) ← cross-encoder raw → global percentile
             (falls back to semantic with penalty if cross is null)
    pBM25 = getBm25Pct(source) ← BM25 raw → global percentile (0 if null)
    pSem = getSemPct(source) ← semantic raw → global percentile (0 if null)
C2. Relevance = wCross × pCross + wBM25 × pBM25 + wSem × pSem
    If pCross is null (no cross-encoder AND no semantic) → source excluded (null).
C3. RelevancePct = percentile_rank(Relevance) over session pool (from cache A3).
    Empirical CDF with midrank tie handling. Frozen at startup values.
C4. Age & Decay:
    ageInDays = (questionAskedAt − published_at) in days
    DecayFactor = max(floor, e^(−ln2 × λ × ageInDays / halfLife))
    Uses effective subclass curve from B7.
    RefAware reroute: uses RefAware curve (default t½=180d, floor=0.70) — gentle freshness preference.
    Standard TAH EA/ER: DecayFactor=1.0 (age decay disabled, anchor/window factors used instead).
C5. TAH Temporal Factors (at most one is non-1.0 per question):
    AnchorFactor = anchor-centered decay (EvtAnch only, else 1.0)
    WindowFactor = window compliance (ExpRng only, else 1.0)
    TemporalCompat = year-presence check (TAH only, else 1.0)
    RefAware reroute: AnchorFactor=1.0, WindowFactor=1.0 (both disabled — relevance dominates).
C6. RelBoost = relationship evidence multiplier (1.00 or 1.13)
C6b. EntityPresence = boost/penalty based on question entity presence in source text (1.0 if disabled/no entities)
C7. ASSEMBLY: Alt6 = RelevancePct × DecayFactor × AnchorFactor × WindowFactor × TemporalCompat × RelBoost × EntityPresence
C8. Store all computed values on the source object (pCross, pBM25, pSem, relevance, relevancePct, ageInDays, decayFactor, atFloor, relEvidence, temporalCompat, etc.)

━━ PHASE D: RANKING (per question, after all sources scored) ━━

D1. Sort all sources by Alt6 descending → assign altScore6_rank (1-based). Sources with null Alt6 get null rank.

━━ PHASE E: WEAK-CLUSTER RESCUE (per question, after D1 ranking) ━━

E1. Identify benchmark group = top-8 by Alt6 rank (from D1).
E2. Compute semantic ranks across ALL sources (by raw semantic, descending).
E3. Compute benchmark semantic stats: mean, P75, min, count.
E4. Stage 1 — Weak Cluster Detection: for each non-benchmark source, check 3 gates:
    Gate 1: semantic > benchP75 (safety net — rarely binding)
    Gate 2: semantic > semFloor (safety net — 0.50 absolute floor)
    Gate 3: semantic > benchMean + lift (primary gate — default 0.18)
    Any source passing all gates → weak cluster detected.
E5. Stage 2 — Candidate Qualification: Stage 1 passers must also have semRankAll ≤ rankCap (default 5).
    Top candidates (up to poolCap=4) by semantic → rescue pool.
E6. If rescue_active:
    Pool = benchmark(8) + qualified candidates (up to 4) = up to 12 members.
    Rescue rerank with semantic-led weights:
      wcrAlt6 = (0.75×pSemLocal + 0.25×pBM25 + 0.00×pCross) × decay × anchor × window × compat × relBoost
    Sort pool by wcrAlt6 → top sendCount (default 6) are "sent".
E7. OVERWRITE: pool members' altScore6 replaced with wcrAlt6.
    RERANK: pool members get ranks 1..poolSize by wcrAlt6.
    Non-pool members get ranks poolSize+1.. by original order.

━━ PHASE F: RECALCULATION (config change — differs from startup) ━━

Same as Phases B–E, BUT:
  • Phase A caches are NOT rebuilt (locked at startup values).
  • pCross, pBM25, pSem global percentiles do not change.
  • RelevancePct distribution does not change.
  • Only the weights (wCross/wBM25/wSem), decay params, and WCR config take effect.
  • Temporal grid is NOT recomputed — only expanded source sections re-render.

Alt6 Composite Score Formula

Alt6 = RelevancePct × DecayFactor × AnchorFactor × WindowFactor × TemporalCompat × RelBoost × EntityPresence

Only one of AnchorFactor or WindowFactor is ever non-1.0 for a given question (they apply to different TAH subclasses). RelBoost applies only to relationship questions (1.00–1.13). EntityPresence is 1.0 when the Entity Presence Boost is disabled or no entities are extracted.

Relevance (3-signal blend)

Relevance = wCross × pCross + wBM25 × pBM25 + wSem × pSem

  • pCross = cross-encoder percentile (session-pool). Falls back to semantic with penalty if cross-encoder is null.
  • pBM25 = BM25 keyword percentile (session-pool). 0 if null.
  • pSem = semantic bi-encoder percentile (session-pool). 0 if null.
  • Defaults: wCross=0.75, wBM25=0.125, wSem=0.125. Configurable in Global Controls (weights should sum to 1.0).
  • RelevancePct = percentile rank of Relevance over the session pool (empirical CDF, cached at startup).
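A Python sketch of the 3-signal blend with the documented null handling (names are illustrative; the fallback-with-penalty for pCross is assumed to happen upstream, so a null pCross here means both cross-encoder and semantic were missing):

```python
def compute_relevance(p_cross, p_bm25, p_sem,
                      w_cross=0.75, w_bm25=0.125, w_sem=0.125):
    """Relevance = wCross×pCross + wBM25×pBM25 + wSem×pSem.
    p_cross None (no cross-encoder AND no semantic) → source excluded (None).
    Null BM25 / semantic percentiles contribute 0."""
    if p_cross is None:
        return None
    return (w_cross * p_cross
            + w_bm25 * (p_bm25 if p_bm25 is not None else 0.0)
            + w_sem * (p_sem if p_sem is not None else 0.0))
```

With the default weights, a source at the 80th/60th/40th percentiles blends to 0.75·0.8 + 0.125·0.6 + 0.125·0.4 = 0.725.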

Exponential Half-Life Decay (BRT, RWR, CRBN, UNKNW)

For standard recency-based subclasses, decay measures how old a source is relative to when the question was asked:

DecayFactor = max(floor, e^(−ln2 × age_days / halfLife))

  • age_days = days between source publication and question ask time
  • halfLife = the number of days at which the factor drops to exactly 0.50 (50%)
  • floor = minimum factor — even very old sources get at least this weight
  • Example: With halfLife=7d, floor=0.20: a 7-day-old source gets factor=0.50, a 14-day-old source gets factor=0.25, a 21-day-old source gets factor=0.20 (floor)
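The half-life curve and the worked example above can be sketched in Python (illustrative only; the real dashboard computes this client-side):

```python
import math

def decay_factor(age_days, half_life, floor):
    """max(floor, e^(−ln2 · age/halfLife)): exactly 0.50 at one half-life,
    halving again each additional half-life until the floor binds."""
    return max(floor, math.exp(-math.log(2) * age_days / half_life))

# halfLife=7d, floor=0.20: 7d → 0.50, 14d → 0.25, 21d → 0.125 → floored at 0.20
```

The floor, not the exponential, determines the score of anything older than roughly halfLife · log2(1/floor) days.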

EvtAnch/TAH — Anchor-Centered Scoring

For event-anchored questions (e.g., "What happened in the 2024 election?"), sources are scored by proximity to the event date, not by recency:

AnchorFactor = max(floor, e^(−ln2 × |pub_date - anchor_date| / halfLife))

  • anchor_date = the event date (event_date preferred, fallback to tw_start)
  • |pub_date - anchor_date| = distance in days between source and event (absolute value — before or after)
  • DecayFactor is set to 1.0 (no recency decay) for TAH questions
  • Example: With halfLife=10d, floor=0.30: a source published on the event date gets 1.0, 10 days away gets 0.50, 20 days away gets 0.30 (floor)

ExpRng/TAH — Window Compliance Scoring

For explicit-range questions (e.g., "African footballers August 2025"), sources are scored by whether they fall within the requested time window:

WindowFactor = 1.0    (if pub_date is within [tw_start, tw_end])
WindowFactor = max(floor, e^(−ln2 × d_boundary / halfLife))    (if outside)

  • tw_start, tw_end = time window boundaries from classifier
  • d_boundary = distance in days to the nearest window boundary (not the midpoint)
  • Sources inside the window always get factor=1.0 — no penalty for position within the window
  • Boundary-inclusive: sources published on tw_start or tw_end are considered in-window
  • Position labels: IN = inside window, BEF = before window, AFT = after window, UNK = unknown (estimated date)
  • Example: Window [2025-08-01 to 2025-08-31], halfLife=180d, floor=0.27: a source from Aug 15 gets 1.0 (in-window), one 180d before start gets ~0.50, one 2yr out gets ~0.27 (floor)
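A Python sketch of window compliance, boundary-inclusive as documented, with the estimated-date uncertainty penalty applied multiplicatively (function and parameter names are illustrative):

```python
import math
from datetime import date

def window_factor(pub, tw_start, tw_end, half_life=180, floor=0.27,
                  estimated=False, est_penalty=0.20):
    """In-window (boundary-inclusive) → 1.0; outside → decay on distance to
    the NEAREST boundary. Estimated-date sources get ×(1 − penalty) on top."""
    if tw_start <= pub <= tw_end:
        factor = 1.0
    else:
        d = min(abs((pub - tw_start).days), abs((pub - tw_end).days))
        factor = max(floor, math.exp(-math.log(2) * d / half_life))
    return factor * (1.0 - est_penalty) if estimated else factor

# Window [2025-08-01, 2025-08-31], t½=180d: Aug 15 → 1.0; 180d before start → ~0.50
```

Position within the window never matters, only membership; outside sources fall off from whichever boundary is closer.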

Estimated-Date Penalty (EvtAnch + ExpRng)

Some sources have no real published_at date. The backend falls back to article_inserted_at (the crawl/insert date) and marks published_at_estimated = true. These crawl dates have no meaningful temporal relationship to the content.

For TAH temporal scoring (EvtAnch and ExpRng), estimated-date sources still go through normal distance-based computation (using the crawl date), then receive a multiplicative uncertainty penalty on top:

Factor = normalFactor × (1.0 - estimatedDatePenalty)

  • Default penalty: 0.20 (multiplier = ×0.80) — configurable per TAH type in the "Est. Penalty" fields above
  • Tooltips show × EstPenalty(0.80) alongside distance-based factors
  • ExpRng position badge: UNK instead of IN/BEF/AFT
  • Audit panes track unknown-date sources as a separate bucket — the "Unknown" position in ExpRng charts and "A6 Unk" / "P Unk" columns in both audit grids
  • Anchor distances and boundary distances exclude estimated-date sources (they would be meaningless)

TAH Recency Override (BRT/RWR)

When a TAH question's anchor/window is very recent, BRT or RWR decay is used instead of anchor/window scoring:

  • For EvtAnch: recency measured from anchor date (tw_start) to ask time
  • For ExpRng: recency measured from window end date (tw_end) to ask time
  • If recency ≤ BRT age-to-floor → BRT override (labels: EA→BRT, ER→BRT)
  • If recency ≤ RWR age-to-floor → RWR override (labels: EA→RWR, ER→RWR)
  • Override disables anchor/window factor — standard recency decay applies instead
  • Thresholds derived from BRT/RWR config (change half-life/floor → thresholds update automatically)
  • Purpose: protect recent-news questions while allowing more forgiving TAH baseline parameters for genuinely historical events
  • Shown as blue badges in the UI
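Since the thresholds are derived from the BRT/RWR half-life and floor, they follow from solving e^(−ln2·age/t½) = floor for age. A hedged Python sketch (the tuple parameters and labels here are illustrative, not the dashboard's config shape):

```python
import math

def age_to_floor(half_life_days, floor):
    """Age at which a decay curve first hits its floor:
    e^(−ln2·age/halfLife) = floor  →  age = halfLife · log2(1/floor)."""
    return half_life_days * math.log2(1.0 / floor)

def tah_recency_override(recency_days, brt, rwr):
    """brt/rwr are (half_life, floor) pairs. BRT is checked first since its
    threshold is the tighter one; returns the override label or None."""
    if recency_days <= age_to_floor(*brt):
        return "BRT"
    if recency_days <= age_to_floor(*rwr):
        return "RWR"
    return None
```

Changing a half-life or floor in the config automatically moves the corresponding threshold, exactly as the bullet above describes.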

Relationship Boost v2 (All Subclasses)

When a question asks about the relationship between entities (influence, involvement, evolution, comparison, causality), sources are scored based on evidence of that relationship:

  • Entity extraction (v2 — precision-first): Three extraction sources combined: (1) comparison templates ("X vs Y", "between X and Y"), (2) proper-noun phrases (up to 6 raw candidates with mixed-span splitting), (3) alias/gazetteer scan (~100-entry map of teams, universities, events, agencies). Candidates validated against reject patterns (durations, verb phrases, fragments). At most 2 entities selected.
  • Alias canonicalization: Known aliases (e.g., "Pats" → "New England Patriots") resolved via longest-prefix match. Alias hits prioritized during entity selection and always pass validation.
  • Best-pair matching: Evaluates all entity pairs per source — picks the pair with best co-occurrence + relationship evidence. Alias-aware matching: checks all aliases resolving to the same canonical name. Tail-word matching for 3+ word entities.
  • Scoring — neutral/boost only: Co-mention-only → neutral (1.00). Both entities + any relationship evidence term → boost (1.13). No penalty tier, no compatibility matrix, no weak/strong distinction.
  • Only activates for questions with 1+ extractable entity and a detected relationship pattern
  • Diagnostic: diagnoseRelationshipBoost() in browser console — reports extraction modes (pair/single/none), boost/neutral tier counts, top-8 impact

Weak-Cluster Semantic Rescue (WCR)

Two-stage rescue system that detects when the benchmark top-8 contains a semantically weak cluster, then promotes high-semantic buried candidates via rescue-state reranking.

  • Stage 1 — Weak Cluster Detection (2 active gates + safety nets): For each source NOT in the benchmark top-8, checks: (1) semantic > benchmark mean + lift (default 0.18 — primary gate), (2) semantic rank across all sources ≤ rank cap (default 5). Two hardcoded safety nets (P75 gate at 75th percentile, absolute floor at 0.50) provide backstop filtering but are set permissively and rarely bind. All checks must pass. If any buried source passes → weak cluster detected.
  • Stage 2 — Rescue Pool & Reranking: Rescue pool = benchmark top-8 + qualified candidates (up to pool cap, default 4). Pool members are re-scored with semantic-led weights: wcrWtSemantic × pSemanticQueryLocal + wcrWtBM25 × pBM25 + wcrWtCross × pCross (defaults: 0.75/0.25/0.00). Top sources by rescue score become the new ranking.
  • Three outcomes: no_weak_cluster (no buried source passes gates), weak_cluster_no_candidates (gates passed but rank cap filtered all), rescue_active (pool formed and reranked)
  • Integrated ranking: When rescue is active, pool members' altScore6 values are overwritten with rescue scores. Non-pool sources keep their base scores. This is the default view — a "base view" toggle is available for diagnostics.
  • Send count: Rescue-active questions send top-6 (configurable) to LLM instead of top-8
  • Null-safe: Sources with null semantic are excluded from benchmark stats computation (mean, P75) and never qualify as candidates. Requires minimum 4 non-null semantic values in benchmark (internal threshold).
  • Diagnostics: CSV columns wcr_state, wcr_in_benchmark, wcr_weak_cluster, wcr_is_candidate, wcr_in_pool, wcr_pool_rank, wcr_sent_to_llm, wcr_gate_result, wcr_semantic, wcr_sem_rank_all, wcr_bench_mean_sem, wcr_bench_p75_sem, wcr_bench_sem_count, wcr_sem_minus_mean, wcr_sem_minus_p75, wcr_rel_pct, wcr_alt6, wcr_base_alt6, wcr_base_rank
  • Toggle on/off via the "Weak-Cluster Rescue" panel in the Alt6 config section. Default: ON (as of 2026-03-20)
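The two-stage detection logic can be sketched in Python (illustrative; the P75 computation here is a crude nearest-rank estimate, and the data shapes are assumptions, not the dashboard's internals):

```python
import statistics

def wcr_detect(benchmark_sem, buried, lift=0.18, rank_cap=5,
               sem_floor=0.50, pool_cap=4):
    """benchmark_sem: non-null semantic scores of the top-8 benchmark.
    buried: (source_id, semantic, sem_rank_all) tuples for non-benchmark
    sources. Returns (state, candidates) mirroring the three outcomes."""
    mean = statistics.mean(benchmark_sem)
    ordered = sorted(benchmark_sem)
    p75 = ordered[int(0.75 * (len(ordered) - 1))]  # crude nearest-rank P75
    # Stage 1: all three gates must pass (P75 and floor are safety nets)
    passers = [b for b in buried
               if b[1] > p75 and b[1] > sem_floor and b[1] > mean + lift]
    if not passers:
        return "no_weak_cluster", []
    # Stage 2: rank-cap qualification, then top-by-semantic up to pool cap
    qualified = [b for b in passers if b[2] <= rank_cap]
    if not qualified:
        return "weak_cluster_no_candidates", []
    qualified.sort(key=lambda b: -b[1])
    return "rescue_active", qualified[:pool_cap]
```

A buried source well above the benchmark mean but ranked 9th by semantic still fails Stage 2, giving the weak_cluster_no_candidates outcome.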

Key Concepts

  • Half-life controls decay speed: larger = slower decay, smaller = faster. At exactly one half-life distance, the factor equals 0.50.
  • Floor prevents total suppression: even sources far from the target time contribute at minimum this weight.
  • Estimated-Date Penalty handles sources with no real publication date — they remain eligible but receive a mild fixed penalty instead of meaningless distance-based scoring.
  • Temporal Compatibility (TAH only) — checks whether target year(s) appear in article title/description. Year source priority: time window (tw_start/tw_end) is primary; query text regex is fallback. Cross-year windows (e.g. 2019–2020) produce multiple valid target years. Three tiers: any target year found → Match Boost (default 1.15); different year found → Mismatch Penalty (default 0.80); no year → 1.0 neutral. Only applies to TAH subclasses (event_anchored, explicit_range, comparison) — BRT/RWR/CRBN already have recency decay. Both values tunable in config row above.
  • TAH Fallback Subtypes — triggers when no dates are extracted (both tw_start and tw_end null), or when only tw_end is present (end-only ER). Classified into 4 subtypes with subtype-specific decay: FRESH (t½=18d), TOPICAL (t½=50d), EVERGREEN (t½=90d), COMPARATIVE (t½=120d). Labels combine prefix + subtype: EA→FB/Fresh, ER→FB/Topical, etc. Each appears as its own row in the Temporal Subclass Breakdown stats table.
  • Synthetic Window / Since — for start-only ER questions (tw_start present, tw_end missing). Two modes:
    • SynWin — synthetic end date at tw_start + (today − tw_start) × windowPct. Default 20%. Configurable. Label: ER→SynWin.
    • SinceNow — if question contains "since" + a year/date, window extends to today. Label: ER→SinceNow.
    In-window = 1.0, outside-window penalty applies. Shown as indigo badge.
  • All date comparisons use calendar-day granularity (hour/minute differences are ignored).
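The Synthetic Window / Since logic for start-only ER questions can be sketched as follows (a minimal illustration with hypothetical names; real "since" detection presumably also requires a year/date, which is simplified away here):

```python
from datetime import date, timedelta

def synthetic_window_end(tw_start, today, window_pct=0.20, question_text=""):
    """Start-only ER: SinceNow extends the window to today when the question
    says 'since'; otherwise SynWin places the end at
    tw_start + (today − tw_start) × windowPct (default 20%)."""
    if "since" in question_text.lower():
        return today                                        # ER→SinceNow
    span_days = (today - tw_start).days
    return tw_start + timedelta(days=round(span_days * window_pct))  # ER→SynWin
```

With tw_start 100 days ago and the default 20%, the synthetic window covers only the first 20 days after tw_start; everything later takes the outside-window penalty.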

Entity Presence Boost (All Subclasses)

When a question contains extractable person names (full names like "Trey Anastasio"), sources are scored based on whether those entities appear in the source text:

  • Entity extraction: Sequences of 2+ capitalized words extracted from question text. Common stop words excluded. Full name match required (not partial).
  • Boost tiers: All entities found in title → ×1.20, in description → ×1.12, in article content → ×1.10.
  • Graduated penalties by entity count:
    • 1 entity, none found → ×0.60
    • 2 entities, 1 of 2 found → ×0.90, none found → ×0.50
    • 3+ entities, majority found → ×0.95, minority → ×0.70, none → ×0.40
  • Partial matches: Some but not all entities found → graduated neutral/mild penalty.
  • Content check: Title/description checked first (fast). If no match found, article content fetched asynchronously from article_contents for degraded conditions. In prod, content check runs synchronously for all sources.
  • Toggleable: Can be enabled/disabled via the Entity Presence Boost config checkbox. Applies to both standard Alt6 and Alt6-R (WCR rescue) scoring.
  • Formula integration: Alt6 = RelevancePct × DecayFactor × ... × EntityPresence
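The tier table above can be sketched in Python (illustrative only; real matching requires full-name matches, and how a 2-entity question with both entities found in *different* fields is scored is not specified above, so treating it as the ×0.90 partial tier is an assumption):

```python
def entity_presence(entities, title, desc="", content=""):
    """Graduated boost/penalty per the documented tiers. All entities in one
    field: title ×1.20 > description ×1.12 > content ×1.10. Otherwise the
    penalty depends on entity count and how many were found anywhere."""
    if not entities:
        return 1.0
    def all_in(text):
        return all(e.lower() in text.lower() for e in entities)
    if all_in(title):
        return 1.20
    if desc and all_in(desc):
        return 1.12
    if content and all_in(content):
        return 1.10
    found = sum(any(e.lower() in t.lower() for t in (title, desc, content))
                for e in entities)
    n = len(entities)
    if n == 1:
        return 0.60                      # single entity, not found anywhere
    if n == 2:
        return 0.50 if found == 0 else 0.90   # split across fields: assumed partial
    if found == 0:                       # 3+ entities
        return 0.40
    return 0.95 if found > n / 2 else 0.70
```

The asymmetry is deliberate per the tiers above: missing a lone entity (×0.60) is punished less than missing both of a pair (×0.50) or all of a trio (×0.40).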

Complete Scoring Formula

Alt6 = RelevancePct × DecayFactor × AnchorFactor × WindowFactor × TemporalCompat × RelBoost × EntityPresence

Where exactly one temporal mechanism is active (Decay, Anchor, or Window), and EntityPresence is the entity name-match multiplier (1.0 when disabled or no entities extracted).

Relevance distribution: Not yet built (will build at startup)
Alt6 Top-8 Comparison by Temporal Subclass — Raw (all sources) vs Qualified (excluding ineligible)
Waiting for data...
Source Metrics Scatter Plot (All Questions)
Answer Generation & Export Generate Alt6 Qual/Raw answers and export comprehensive CSV
Exports prod answers and source-level scoring data for filtered questions. No answer generation needed.
Source Quality: With vs Without Backstop Sources
Initializing dashboard...

How to Use the Ask Dashboard

Last Updated: March 20, 2026

Overview

This dashboard monitors questions asked through the Ask module in real-time. It displays questions, answers, source articles, named entities, and performance metrics.

Header Controls

  • Auto-refresh: Toggle to enable/disable automatic data refresh
  • Interval: Set how often the dashboard refreshes (1-15 minutes)
  • Last updated: Shows when data was last fetched

Summary Statistics

  • Total Questions: Number of questions in the current view
  • Backstop Queries: Questions that required fallback processing
  • Unique Cited Domains: Count of distinct domains actually cited in answers. Hover over this box to see a breakdown showing each domain, how many times it was cited, and its average Domain Reliability (DR) score.
  • Last updated: Shows when the dashboard last checked for new questions

Date Range Selection:

  • Preset Buttons: Quick selection options - Today, Yesterday, Last 7/15/30/60/90 days (default: Last 15)
  • Custom Date Range: Enter specific start and end dates, then click "Apply" to filter. Click "Clear" to return to preset mode.
  • Mutual Exclusivity: Preset buttons and custom dates are mutually exclusive - selecting one disables the other.
  • Auto-refresh Indicator: Shows whether auto-refresh is enabled (green dot = ON, gray dot = OFF).

Auto-refresh Behavior:

  • Auto-refresh ON: Today, Last 7, Last 15, Last 30, Last 60, Last 90 days (live monitoring)
  • Auto-refresh OFF: Yesterday and Custom date ranges (historical data, no updates expected)

Performance metrics are split into two rows:

  • Non-Backstop row (green): Averages for queries answered without fallback
  • Backstop row (red): Averages for queries that required fallback processing

Each row shows: Total, Search, Answer, Suggest, Other, Words, W/Sec, Cites, and Avg Score (average article score of cited sources).

Note: Click section headers to expand/collapse the Category/Classification Breakdown Tables and Alt6 Ranking Analysis sections.

Category / Classification / Temporal Subclass Breakdown tables:

All three tables share the same column structure, grouped by category, classification, or temporal subclass respectively. Updated 2026-03-12.

  • Name: The grouping key (category, classification, or temporal subclass)
  • #: Number of questions in that group
  • Prod Top 8 DR: Average domain reliability of top 8 sources by prod rank
  • Prod Cited DR: Average domain reliability of sources cited in the prod answer
  • Alt6 Qual: Average altScore6 (relevance × decay) of the top 8 Alt6 Qual sources (exclusion-filtered)
  • Qual Top 8 DR: Average domain reliability of the top 8 Alt6 Qual sources
  • Alt6 Raw: Average altScore6 of the top 8 Alt6 Raw sources (null DR allowed, below-threshold DR blocked)
  • Raw Top 8 DR: Average domain reliability of the top 8 Alt6 Raw sources
  • Qual Cited: (appears when Qual answers have been generated) Average altScore6 of sources cited in the generated Qual answer
  • Raw Cited: (appears when Raw answers have been generated) Average altScore6 of sources cited in the generated Raw answer

Click the sort arrows on any column header to sort the table. Default sort is by count descending.

Source Quality: With vs Without Backstop Sources:

Compares Alt6 Qual and Alt6 Raw top-8 metrics when backstop articles (source_origin 8/9/10) are excluded vs included in the source pool. Updated 2026-03-12.

  • Excluding BS Sources: Top 8 picked after removing backstop articles from the pool, then applying Qual or Raw filtering.
  • All Sources: Top 8 picked from the full pool including backstop articles.
  • Qual columns (Score, Rel, Decay, DR): Metrics for top 8 Alt6 Qual sources (full exclusion filtering).
  • Raw columns (Score, Rel, Decay, DR): Metrics for top 8 Alt6 Raw sources (null DR allowed, below-threshold DR blocked, classification excluded).
  • Delta row: All Sources minus Excluding BS. Green = backstop improves metric, Red = backstop dilutes.

Alt6 Ranking Analysis (Temporal Decay vs Prod):

Compares how Alt Score 6 (temporal decay scoring) ranks sources differently from Production. Alt6 applies subclass-specific half-life decay to relevance scores, which can significantly change rankings for time-sensitive content.

Two-Tier Ranking System: Alt6 ranking uses the same two-tier approach as Production ranking:

  • Tier 1 (Top positions): Non-excluded sources ranked by Alt6 score (highest = rank 1)
  • Tier 2 (Bottom positions): Excluded sources ranked by Alt6 score among themselves
  • An excluded source with a high Alt6 score will always rank below a non-excluded source with a lower Alt6 score
  • Exclusion criteria: backend is_excluded flag, classification (adult/conspiracy/gambling), and user-configured thresholds (DR, Score, Semantic, etc.)

Ranking Shift Metrics:

  • Avg absolute change: Average number of slots sources moved (regardless of direction)
  • Avg improvement/worsening: Average slot change for sources that improved/worsened
  • % improved/worsened: Percentage of sources that moved up/down in ranking
  • Bottom 1/3 -> Top 1/3: Sources that jumped from bottom third to top third
  • Top X dropped D+ slots: Top-ranked sources that dropped significantly
  • Bottom 50% -> Top X: Sources from bottom half that made it to top X (only for questions with 50+ sources)
  • Jumped/Dropped K+ slots: Sources with significant rank changes

Configure thresholds (Top X, K, D) in Ranking Metrics Settings. Summary shows percentages; per-question shows raw counts.

Ranking Shift Metrics (per question):

Expand this section on any question card to see detailed ranking shift metrics for that specific question, comparing Alt6 vs Production ranking.

Master Article Decay Score Settings:

Configure alternative decay score formulas. The decay formula is: Decay Score = Base Score × e^(-(λ × (t^p)))

  • Prod Decay Score: Read-only display of production values (λ=0.03, p=1.25)
  • Alt Decay Score 1 & 2: Configure custom decay formulas

Variables (nested dependencies):

  • Base Score: Semantic/embedding score (always independently selectable)
  • e (Euler's number): Constant 2.71828 - enables exponential decay
  • ^(-(exponent)): Negative exponential (requires e to be checked)
  • λ (Decay Rate): Controls decay speed (requires ^(-(exponent)))
  • t (Days Since Publication): Time variable (requires ^(-(exponent)))
  • p (Power): Exponent for time (requires t)

Check the Save checkbox to persist your configuration across page refreshes.
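The configurable decay formula above can be sketched in Python (a simplified illustration: the dashboard lets each variable be toggled independently, which is reduced here to a single on/off for the exponential term):

```python
import math

def alt_decay_score(base_score, lam=0.03, p=1.25, t=0.0, use_exp=True):
    """Decay Score = Base Score × e^(−(λ × t^p)).
    Prod values: λ=0.03, p=1.25. With the exponential disabled, the score
    is just the base (semantic/embedding) score."""
    if not use_exp:
        return base_score
    return base_score * math.exp(-(lam * (t ** p)))
```

With p > 1, decay accelerates with age: the penalty per additional day grows as the article gets older, unlike plain exponential decay (p = 1).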

Alt Decay Columns (AD1 & AD2):

When Alt Decay Score 1 or 2 is enabled, two new columns appear in the Sources table: AD1 and AD2. These columns display:

  • Computed Value: The decay score calculated using your configured formula (e.g., 0.847)
  • Formula Breakdown: Shows the actual values used in the computation (e.g., 0.923 × e^(-0.05 × 14.5^1.5))
  • Column Header Tooltip: Hover over AD1/AD2 header to see the configured formula

Key Details:

  • Alt Decay uses the raw semantic score (not percentalized)
  • Days since publication (t) is calculated from source's published_at vs question's asked_at
  • Columns update immediately when you change configuration values while toggle is ON
  • When toggle is OFF, columns display "-" for all sources

Show PCT Mode:

  • When "Show PCT" is enabled, Alt Decay columns display percentalized values (0-1 range)
  • Percentalization uses min-max normalization across all sources within the question
  • Column headers show "(%) " suffix when PCT mode is ON
  • Highest Alt Decay value becomes 1.0, lowest becomes 0.0
  • In PCT mode, formula breakdown is hidden (only the percentile value is shown)
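The min-max percentalization described above can be sketched in Python (illustrative; how the dashboard handles an all-equal column is not documented, so mapping it to 0.0 here is an assumption):

```python
def percentalize(values):
    """Min-max normalization across all sources within one question:
    highest Alt Decay value → 1.0, lowest → 0.0, rest scaled linearly."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate column (assumption)
    return [(v - lo) / (hi - lo) for v in values]
```

Note this is per-question scaling, so PCT values are not comparable across questions, only within one question's source list.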

Filters

  • User: Filter questions by specific user
  • Classification: Filter by question type (Investigative, Temporally-Aware, etc.)
  • Category: Filter by content category (politics, business, sports, etc.)
  • Backstop: Filter by backstop status (All, Non-Backstop, or Backstop only)

All metrics and statistics update dynamically based on the current filter selection.

Question Cards

Each question is displayed as a card with expandable sections:

  • Answer: The generated response with citation numbers (tabs for Alt Answers in header)
  • Sources: Source articles used (click headers to expand)
  • Suggestions: Follow-up questions
  • Named Entities: People, organizations, locations identified
  • Performance Metrics: Timing data for the query
  • Ranking Shift Metrics: How Alt Score rankings compare to Prod rankings

Answer Section (Three-Column Layout):

The Answer section displays three columns side-by-side for easy comparison:

  • Answer (Column 1): The production answer from the original query
  • Alt Answer 1 (Column 2): Generate an alternative answer using top 8 sources ranked by Alt Score 1
  • Alt Answer 2 (Column 3): Generate an alternative answer using top 8 sources ranked by Alt Score 2

How to use the Answer Section:

  • Expand/Collapse: Click anywhere on the Answer header to toggle the section
  • Side-by-Side Comparison: All three answers are visible simultaneously when expanded
  • Vertical Dividers: Columns are separated by vertical lines for clarity
  • Equal Width: Each column takes 33% of the width

How Alt Answers work:

  • Alt Score 1 or 2 must be configured in Master Article Score Settings to enable generation
  • Click "Generate Alt Answer" to call the API with the top 8 sources by Alt Rank
  • Two-tier ranking ensures the top 8 sources are always non-excluded (excluded sources are ranked at the bottom)
  • The answer is generated using the same question but with different source articles
  • Generated answers are cached for the session (not persisted across page refreshes)
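The two-tier ranking above can be expressed as a single sort key: excluded status first, then Alt Score descending. A minimal sketch (field names `alt_score` and `excluded` are illustrative):

```python
def top8_two_tier(sources):
    """Rank sources so excluded ones always sort below non-excluded
    ones, then take the top 8.

    `sources` is a list of dicts with 'alt_score' (float) and
    'excluded' (bool) keys. Python sorts False before True, so
    non-excluded sources always come first regardless of score.
    """
    ranked = sorted(sources, key=lambda s: (s["excluded"], -s["alt_score"]))
    return ranked[:8]
```

This guarantees an excluded source can only appear in the top 8 when fewer than 8 non-excluded sources exist.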

Question Badges

CATEGORY Content category (politics, sports, etc.)
CLASSIFICATION Question type classification
BACKSTOP [codes] Question required fallback processing. Reason codes (e.g. [1000, 1003]) indicate which criteria triggered the backstop call. Hover over the badge to see all 7 criteria with triggered ones highlighted.
Avg Prod Cited Score: 0.XXX Average production score of cited sources only
Avg Prod Top 8 Score: 0.XXX Average production score of top 8 sources by rank (or all sources if fewer than 8)
Avg Alt1 Top 8 Score: 0.XXX Average Alt1 score of top 8 sources by Alt1 rank (n/a if Alt1 not configured)
Avg Alt2 Top 8 Score: 0.XXX Average Alt2 score of top 8 sources by Alt2 rank (n/a if Alt2 not configured)

Sources Section Badges

These badges appear in the Sources section for each question:

Avg Prod Cited Domain Reliability: 0.XXX Average domain reliability of cited sources only
Avg Prod Top 8 Domain Reliability: 0.XXX Average domain reliability of top 8 sources by rank
Avg Prod Cited Readability: 0.XXX Average readability of cited sources only
Avg Prod Top 8 Readability: 0.XXX Average readability of top 8 sources by rank
Avg Alt1 Top 8 Domain Reliability: 0.XXX Average DR of top 8 sources by Alt1 rank (n/a if Alt1 not configured)
Avg Alt1 Top 8 Readability: 0.XXX Average readability of top 8 sources by Alt1 rank (n/a if Alt1 not configured)
Avg Alt2 Top 8 Domain Reliability: 0.XXX Average DR of top 8 sources by Alt2 rank (n/a if Alt2 not configured)
Avg Alt2 Top 8 Readability: 0.XXX Average readability of top 8 sources by Alt2 rank (n/a if Alt2 not configured)

Sources Table

The sources table shows articles considered for the answer:

Age: Time since article was published
Rnk: Article rank (lower = higher relevance)
Score: Combined relevance score (0-1)
Sem: Semantic similarity score
BM25: Keyword matching score
Cross: Cross-encoder relevance score
DR: Domain Reliability score (0-100)
Depth: Content Depth score (0-5)
Pos: Positive sentiment score
Neg: Negative sentiment score
Sent: Sentiment summary (sum of the positive and negative sentiment scores)
Read: Readability score
Decay: Time-adjusted semantic score
Class: Article classification
Excl: Whether the content is excluded (ignored)

Source Row Colors

Pink background: Top-8 article NOT cited in the answer
White background: Article cited in the answer OR ranked below 8

Score Colors

0.600 High score (>= 0.5)
0.350 Medium score (0.2 - 0.5)
0.100 Low score (< 0.2)
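The color tiers above map to two cutoffs, 0.5 and 0.2. A minimal sketch of the classification (`score_color` is an illustrative name, not the dashboard's actual code):

```python
def score_color(score):
    """Map a 0-1 score to its display tier using the cutoffs above."""
    if score >= 0.5:
        return "high"    # e.g. 0.600
    if score >= 0.2:
        return "medium"  # e.g. 0.350
    return "low"         # e.g. 0.100
```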

Keyboard Shortcuts

  • Escape: Close this help dialog

LLM-as-Judge Evaluation

This feature uses OpenAI's GPT-4o-mini to evaluate and compare answer quality. Located between the Answer and Sources sections in each question card.

How to Use:

  • Expand the "LLM-as-Judge Evaluation" section
  • Click "Evaluate Answers" button to trigger evaluation
  • Evaluation includes the Production answer, plus Alt Answer 1 and/or Alt Answer 2 if generated

Prompt Modes:

  • Original — Standard evaluation prompt with 7 criteria
  • Alternate-1 (Core Ask Aware) — Adds a "Core Ask" reasoning step that forces the judge to identify the specific relationship/action/event before scoring. Strengthens relevance criterion (score capped at 3 if core ask not addressed) and evidence utilization (prefers evidence addressing the core ask). Comparison guidance penalizes answers that are merely broader/safer without answering the question.
  • Alternate-2 (Core Ask Gate) — 5-step evaluation scaffold: (1) Extract the Core Ask with concrete-involvement framing and minimum facts, (2) Core Ask Summary returned in JSON (max 20 words — visible as a chip in the UI), (3) Coverage Check classifying each answer as direct/partial/missed (returned in JSON — visible as color-coded badge per answer), (4) Scoring Constraint with HARD binding caps on relevance (direct → 1-5, partial → ≤4, missed → ≤2) and evidence_utilization (non-core-ask evidence → ≤4, mostly contextual → ≤3), (5) Winner Constraint requiring direct-coverage answers to beat missed-coverage answers unless factual errors. Stronger comparison guidance that prohibits preferring broader/safer answers. Designed for cases where Alt-1's soft cap is insufficient.

Select prompt mode in the batch eval controls. Switching modes counts as a settings change. Each result records which prompt mode was used. The Prompt Comparison section supports pairwise comparison across all mode pairs.

Evaluation Criteria (1-5 scale):

Relevance & Task Alignment (25%): Does the response directly address the question, cover all sub-questions, and avoid topic drift?
Faithfulness / Groundedness (20%): Are all factual claims supported by the retrieved sources without hallucination?
Claim–Evidence Mapping (15%): Can each claim be explicitly mapped to supporting source evidence?
Temporal Correctness (15%): Does the response correctly align with the temporal scope of the question and sources?
Evidence Utilization (10%): Does the response incorporate all key relevant evidence from the sources?
Internal Consistency (10%): Is the response free of logical contradictions and conflicting assertions?
Appropriate Uncertainty (5%): Does the response appropriately qualify uncertainty when evidence is insufficient?
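Combining per-criterion scores with the weights above is a straightforward weighted sum. A minimal sketch (the `WEIGHTS` keys are illustrative shorthand, not the dashboard's actual field names):

```python
# Criterion weights from the rubric above; they sum to 1.0.
WEIGHTS = {
    "relevance": 0.25,
    "faithfulness": 0.20,
    "claim_evidence": 0.15,
    "temporal": 0.15,
    "evidence_utilization": 0.10,
    "consistency": 0.10,
    "uncertainty": 0.05,
}

def weighted_total(scores):
    """Combine per-criterion scores into a single weighted total.

    `scores` maps each criterion name to its judge-assigned score.
    """
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
```

Because the weights sum to 1.0, the weighted total stays on the same scale as the individual criterion scores.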

Score Scale:

4 (VeryGood) Excellent performance on this criterion
3 (Good) Adequate performance with minor issues
2 (Bad) Noticeable problems affecting quality
1 (VeryBad) Significant failures on this criterion

Per-Criterion Reasoning:

Each criterion includes a short reasoning (2-3 sentences) explaining why that specific score was assigned. The reasoning appears below each criterion score in italic text.

General Commentary:

Each answer receives a holistic assessment with specific examples and observations. This appears at the bottom of each score card in a highlighted box.

Comparison Analysis:

When multiple answers are available (Prod + Alt1 and/or Alt2), the LLM provides a comparative analysis explaining which answer is best and why, considering both content quality and source relevance.

Source Context:

The evaluation considers the top 8 ranked sources (title, description, and fragment) for each answer type, using Production ranking for the Prod answer and Alt Score ranking for Alt answers.

Formulas

Article Score

The article score is calculated differently based on whether the content is static (e.g., Wikipedia) or has a publication date.

For Static Content:

score = 0.50 × score_semantic_pct + 0.15 × score_bm_25_pct + 0.35 × score_cross_pct
Where:
score_semantic_pct = semantic similarity score (percentile-normalized)
score_bm_25_pct = BM25 keyword matching score (percentile-normalized)
score_cross_pct = cross-encoder score (percentile-normalized)
Weights: Semantic: 50% • BM25: 15% • Cross-encoder: 35%

For Non-Static Content (with publication date):

score = 0.15 × score_semantic_pct + 0.35 × score_decay_pct + 0.15 × score_bm_25_pct + 0.35 × score_cross_pct
Where:
score_semantic_pct = semantic similarity score (percentile-normalized)
score_decay_pct = time-decayed score (percentile-normalized)
score_bm_25_pct = BM25 keyword matching score (percentile-normalized)
score_cross_pct = cross-encoder score (percentile-normalized)
Weights: Semantic: 15% • Time decay: 35% • BM25: 15% • Cross-encoder: 35%
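The two weighting schemes above differ only in whether the decay term is present. A minimal sketch (`article_score` is an illustrative name; the real pipeline works on percentile-normalized inputs as described above):

```python
def article_score(sem_pct, bm25_pct, cross_pct, decay_pct=None):
    """Blend percentile-normalized component scores into the article score.

    Static content (no publication date) omits the decay term; dated
    content shifts weight from semantic (50% -> 15%) to the
    time-decayed score (35%).
    """
    if decay_pct is None:  # static content, e.g. Wikipedia
        return 0.50 * sem_pct + 0.15 * bm25_pct + 0.35 * cross_pct
    return (0.15 * sem_pct + 0.35 * decay_pct
            + 0.15 * bm25_pct + 0.35 * cross_pct)
```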

Decay Score

Time decay adjusts the semantic score based on article age, giving more weight to recent content.

score_decay = base_score × e^(−λ × t^p)
Where:
score_decay = the adjusted score after applying time decay
base_score = the original semantic score before decay
e = Euler's number (≈ 2.71828)
λ = decay rate = 0.03 day⁻¹
t = time since publication in days
p = power = 1.25
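The decay formula translates directly to code. A minimal sketch using the parameter values given above (`decay_score` is an illustrative name):

```python
import math

def decay_score(base_score, age_days, lam=0.03, power=1.25):
    """Apply exponential time decay to a semantic score:

        score_decay = base_score * exp(-lam * age_days**power)

    With lam=0.03 and power=1.25, decay is gentle for the first few
    days and accelerates as articles age.
    """
    return base_score * math.exp(-lam * age_days ** power)
```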

Alt6 Scoring (Temporal-Aware Relevance)

Alt6 is an experimental scoring formula designed for temporally-aware questions. It combines relevance with a temporal decay factor based on the question's temporal subclass.

Formula:

Alt6 = RelPct × DecayFactor
Where:
RelPct = Percentile rank of the source's Relevance across all sources in the session pool
DecayFactor = Time decay based on temporal subclass (see below)

Relevance Calculation:

Relevance = w_cross × p_cross + (1 - w_cross) × p_bm25
Where:
w_cross = 0.85 (default, configurable in Alt6 settings)
p_cross = score_cross_pct (cross-encoder percentile)
p_bm25 = score_bm_25_pct (BM25 percentile)
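The Relevance blend and its percentile rank can be sketched as follows. The empirical-CDF-with-midrank treatment of ties matches the RelPct description in the configuration panel; function names are illustrative:

```python
def relevance(p_cross, p_bm25, w_cross=0.85):
    """Two-signal relevance blend: cross-encoder dominates at the
    default weight, BM25 contributes the remainder."""
    return w_cross * p_cross + (1 - w_cross) * p_bm25

def rel_pct(value, pool):
    """Percentile rank of `value` against the session reference pool,
    using an empirical CDF with midrank handling for ties: each tied
    source counts as half above and half below."""
    below = sum(1 for v in pool if v < value)
    ties = sum(1 for v in pool if v == value)
    return (below + 0.5 * ties) / len(pool)
```

Because RelPct is a rank against the frozen session pool, a score of 0.90 means "higher than ~90% of pooled sources", not an absolute probability of relevance.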

Temporal Subclasses & Decay Parameters:

BRT: Breaking/Recent Topics. Half-life: 1 day, Floor: 0.05
RWR: Recent Window Required. Half-life: 7 days, Floor: 0.10
EvtAnch/TAH: TAH Event Anchored. DecayFactor = 1.0; uses anchor-centered temporal factor: AnchorFactor = max(0.30, exp(-ln2 × |pub_date - anchor_date| / 10d)). Sources closer to the event date score higher.
ExpRng/TAH: TAH Explicit Range. DecayFactor = 1.0 (no decay). Future: in-window compliance scoring.
Comp/TAH: TAH Comparison. DecayFactor = 1.0 (no decay). Future: multi-window balance scoring.
CRBN: Context-Rich Background News. Half-life: 30 days, Floor: 0.20
UNKNW: Unknown. Half-life: 14 days, Floor: 0.15

Decay formula: DecayFactor = max(floor, 0.5^(age_days / half_life))

EvtAnch/TAH anchor formula: AnchorFactor = max(0.30, exp(-ln2 × |pub_date - anchor_date| / 10d)). Symmetric around anchor date.
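Both formulas above can be sketched in a few lines. This is an illustration under the parameter table given here, not the dashboard's actual implementation; the subclass keys and the `"TAH" in subclass` check are simplifying assumptions:

```python
import math

# (half_life_days, floor) per temporal subclass, from the table above.
DECAY_PARAMS = {
    "BRT": (1, 0.05),
    "RWR": (7, 0.10),
    "CRBN": (30, 0.20),
    "UNKNW": (14, 0.15),
}

def decay_factor(subclass, age_days):
    """Half-life decay with a floor; all TAH subtypes have decay
    disabled (DecayFactor = 1.0 always)."""
    if "TAH" in subclass or subclass not in DECAY_PARAMS:
        return 1.0
    half_life, floor = DECAY_PARAMS[subclass]
    return max(floor, 0.5 ** (age_days / half_life))

def anchor_factor(days_from_anchor):
    """EvtAnch/TAH: symmetric decay around the event date with a
    10-day half-life and a 0.30 floor."""
    return max(0.30, math.exp(-math.log(2) * abs(days_from_anchor) / 10))
```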

Unknown Publish Date Handling:

Sources with missing or invalid publish dates (an apparent age of more than 10 years is treated as an anomaly) use subclass-specific fallback decay factors:

BRT/RWR: Floor value (penalize unknown dates for time-sensitive questions)
TAH: 1.0 (date irrelevant for timeless content)
CRBN: 0.70 (moderate; reference content is more forgiving)
UNKNW: Floor value (conservative default)
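The fallback selection above can be sketched as a small lookup (a minimal illustration using the floor values from the subclass table; `fallback_decay` is a hypothetical name):

```python
def fallback_decay(subclass):
    """DecayFactor for sources with missing/invalid publish dates.

    Time-sensitive subclasses are penalized down to their floor;
    TAH subtypes ignore dates entirely; CRBN is more forgiving.
    """
    floors = {"BRT": 0.05, "RWR": 0.10, "UNKNW": 0.15}
    if "TAH" in subclass:
        return 1.0   # date irrelevant for timeless content
    if subclass == "CRBN":
        return 0.70  # reference content is more forgiving
    return floors.get(subclass, 0.15)  # conservative default
```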

Sources using fallback show "(no date)" indicator in purple.

UI Indicators:

  • Rel(X.XXX)→RelPct(X.XX) - Shows raw Relevance score and its percentile rank in the session pool
  • q-rank: N/M - Shows the source's rank within this question (N of M sources, sorted by Relevance descending)
  • ⓘ icons - Hover for detailed breakdown tooltips showing raw values and calculations
  • Session pool - Displayed in the Alt6 config panel, shows total sources used for percentile calculation
  • "inputs missing" - Shown when a source lacks cross_pct or bm25_pct values needed for Alt6
  • Decay badge - Shows temporal subclass and decay parameters near the question/answer

Score Deltas:

  • ScoreΔ = Alt6 - ProdScore (positive = Alt6 scores higher)
  • RankΔ = ProdRank - Alt6Rank (positive = Alt6 ranks the source higher)

Relevance Health Diagnostic

A per-question diagnostic that summarizes the absolute strength of the retrieved candidate set using RAW Relevance (not RelPct), to answer: "Do we have enough strong sources, or should we retrieve more / broaden search?"

Status Levels:

🟢 Strong: max ≥ 0.75 AND count(Relevance ≥ 0.70) ≥ 3
🟡 Thin: max ≥ 0.65 AND count(Relevance ≥ 0.60) ≥ 2
🔴 Weak (expand sources): Otherwise (candidate set may need broader search)

Detail Panel (click to expand):

  • Counts: N total sources, N valid (with pCross and pBM25)
  • Distribution: max, p75, median of Relevance
  • Strength: Count of sources ≥0.70, ≥0.60, ≥0.50

Note: This diagnostic uses raw Relevance (time-independent signal: meaning + keywords) to detect when the candidate set is weak even if ranks or percentiles look fine. It does not affect ranking or selection.
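The three status levels can be sketched as a simple classifier over the raw Relevance values (function name is illustrative; thresholds are those listed above):

```python
def relevance_health(raw_relevance):
    """Classify a question's candidate set by absolute RAW Relevance
    strength: 'strong', 'thin', or 'weak'. An empty set is 'weak'."""
    if not raw_relevance:
        return "weak"
    mx = max(raw_relevance)
    if mx >= 0.75 and sum(1 for r in raw_relevance if r >= 0.70) >= 3:
        return "strong"
    if mx >= 0.65 and sum(1 for r in raw_relevance if r >= 0.60) >= 2:
        return "thin"
    return "weak"
```

Because the thresholds apply to raw Relevance rather than RelPct, a question whose best percentile ranks look fine can still come out "weak" when every candidate is absolutely mediocre.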

Decay Lab Results

Diagnostics showing how temporal decay is behaving across each subclass. Helps tune Half-life and Floor parameters safely.

Metrics (per subclass):

  • Floor-bound rate: Percentage of sources at the decay floor ("as old as we'll treat them")
  • Median decay: Typical DecayFactor value—shows how strongly time affects results
  • Age@95% floor: Age by which 95% of floor-bound sources have reached the floor
  • Spread (P75–P25): Range of DecayFactor values—indicates decay curve diversity
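Three of these per-subclass metrics can be sketched as follows (a simplified illustration with nearest-index quartiles; the dashboard's exact percentile method is not specified here):

```python
from statistics import median

def decay_lab_metrics(decay_factors, floor):
    """Summarize decay behavior for one subclass: floor-bound rate,
    median decay, and the P75-P25 spread (nearest-index quartiles)."""
    n = len(decay_factors)
    floor_bound = sum(1 for d in decay_factors if d <= floor) / n
    s = sorted(decay_factors)
    p25 = s[int(0.25 * (n - 1))]
    p75 = s[int(0.75 * (n - 1))]
    return {
        "floor_bound_rate": floor_bound,
        "median_decay": median(decay_factors),
        "spread": p75 - p25,
    }
```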

Status Indicators:

🟢 Normal Metric is within expected range for this subclass
🟡 Warning Metric is slightly outside expected range
🔴 Concerning Metric is far outside expected range—consider adjusting parameters

Tuning Tips:

  • High floor-bound rate: Try increasing half-life or lowering floor
  • Low floor-bound rate: Try decreasing half-life or raising floor
  • Median decay too low: Try increasing half-life or raising floor
  • Median decay too high: Try decreasing half-life or lowering floor

Note: All TAH subtypes (Event Anchored, Explicit Range, Comparison) have decay disabled (DecayFactor = 1.0 always). Input fields show "Typical" ranges below them, with warnings for values outside recommended bounds.

Batch LLM Evaluation

Run LLM-as-Judge evaluation across multiple questions in one batch. Located between the analytics sections and the question list.

  • Evaluation Weights: Set per-criterion weights (must sum to 100%). These weights are applied to all questions in the batch.
  • Batch Size: Choose how many questions to evaluate (1, 5, 10, 20, or 30).
  • Source Scope: "Eligible sources only" uses DR-filtered Alt6 Qual sources; "All sources" uses unfiltered Alt6 Raw sources.
  • Start Evaluation: Picks the next N uncomputed questions from the current filtered view. For each question: generates an Alt6 answer, then runs LLM-as-Judge on both the production answer and the Alt6 answer.
  • Non-recompute: Questions already evaluated in the Last Run with the same weights and source scope are skipped.
  • Settings Change Warning: If you change weights or source scope, a confirmation dialog appears before starting.
  • Summary Grid: Shows all evaluated questions with scores, deltas, and metadata. Click a question to scroll to its card and auto-expand the LLM Judge section.
  • Breakdowns: Group results by temporal subclass, category, classification, and delta bins.
  • Charts: Grouped bar chart (avg scores by subclass) and scatter plot (Prod vs Alt6 totals with 45-degree reference line).
  • Export: Download results as CSV (flattened) or JSON (full run object). Includes run metadata, weights, and per-criterion scores.
  • Run History: Last Run and Previous Run are preserved in browser storage. On page reload, persisted runs are restored. Incomplete runs from crashes can be recovered.

Last updated: March 3, 2026

Technical Documentation (for Claude Code)

Last updated: 2026-01-31

Dashboard Startup Commands

There are two dashboards in this project:

  • Ask Dashboard (Q&A monitoring) - Port 5002
    .venv/bin/python3 dashboards/run_ask_dashboard.py
    URL: http://localhost:5002
  • Article Dashboard (streaming article monitor) - Port 8083
    .venv/bin/python3 dashboards/backend/flask_dashboard.py
    URL: http://localhost:8083

Important: Dashboard File Locations

Warning: There is a deprecated flask_dashboard.py in the project root. Do NOT use it.

  • Correct location: dashboards/backend/flask_dashboard.py - Uses templates from dashboards/templates/
  • Deprecated (DO NOT USE): flask_dashboard.py (root) - Uses old dashboard_template.html with mismatched API endpoints

Directory Structure

dashboards/
├── backend/
│   └── flask_dashboard.py    # Article Dashboard backend (port 8083)
├── templates/
│   ├── ask_dashboard.html    # Ask Dashboard template
│   └── main_dashboard.html   # Article Dashboard template
└── run_ask_dashboard.py      # Ask Dashboard backend (port 5002)

Restarting Dashboards

To kill and restart a dashboard:

# Ask Dashboard
pkill -9 -f "run_ask_dashboard" 2>/dev/null; sleep 1; .venv/bin/python3 dashboards/run_ask_dashboard.py &

# Article Dashboard
pkill -9 -f "flask_dashboard" 2>/dev/null; sleep 1; .venv/bin/python3 dashboards/backend/flask_dashboard.py &

Article Score Formula

For Static Content:
score = 0.50 × score_semantic_pct + 0.15 × score_bm_25_pct + 0.35 × score_cross_pct
For Non-Static Content:
score = 0.15 × score_semantic_pct + 0.35 × score_decay_pct + 0.15 × score_bm_25_pct + 0.35 × score_cross_pct
score_semantic_pct = semantic similarity (percentile)
score_decay_pct = time-decayed score (percentile)
score_bm_25_pct = BM25 keyword matching (percentile)
score_cross_pct = cross-encoder score (percentile)

Decay Score Formula

score_decay = base_score × e^(−λ × t^p)
score_decay = adjusted score after time decay
base_score = original semantic score
e = Euler's number (≈ 2.71828)
λ = decay rate = 0.03 day⁻¹
t = time since publication (days)
p = power = 1.25