Ask Dashboard - Real-time Q&A Monitoring

Last updated: Never
Latest (2026-04-26): Naming cleanup. Five destinations exist: BRT, RWR, TAH-EA, TAH-ER, CRBN. Reference-shaped TAH questions are rerouted to CRBN scoring via the reference reroute. The legacy "RefAware" tag has been retired everywhere: it was the same destination as CRBN with identical parameters, and keeping the tag led readers to think a separate destination still existed.

Earlier (2026-04-25): Three pipeline simplifications shipped together: TahComparison destination retired (comparison questions now route to CRBN), TAH Fallback machinery stubbed (never fired in the 800-question audit), SinceNow stubbed (the R10 audit showed no advantage over CRBN). Recency Override's ER trigger was fixed to use tw_start instead of tw_end so that long-span ER questions are not demoted.
🔀 ROUTING Reference reroute: | Cascade: ▼ show details
Cited
Top-8 Pool
Prod
Alt6-Raw
domain tally ▾ by subclass ▾
Domain Count DR
Date Range:
Custom: to
Auto-refresh ON
Non-Backstop
Backstop
Backstop Breakdown
Category/Classification Breakdown Tables

Category Breakdown

Classification Breakdown

Temporal Subclassification Breakdown — post-routing destination

Alt6 Ranking Analysis (Temporal Decay vs Prod)

Ranking Metrics Settings ()

Pre-1970 Cited Articles (Missing Dates)
Data Quality
Missing/Zero Component Scores
Percentage of articles with null or 0 values
Component % Null/Zero Count Total
Semantic - - -
BM25 - - -
Cross Encoder - - -
Global Exclusion Criteria
Always Excluded Classifications: Adult Content, Conspiracy Theory, Gambling
Alt Score 6 Configuration
📊 Relevance reference pool: sources (cached at page load)
Session Reference Pool: pCross, pBM25, and pSem are session-wide true percentiles computed from raw scores against the global pool (all sources across all questions, frozen at startup).

New questions arriving via polling are scored against this cached pool.

Pool size determines the granularity of RelPct rankings.
Scoring Formulas
Relevance = wCross × pCross + wBM25 × pBM25 + wSem × pSem
RelPct (rank) = percentile_rank(Relevance) over session pool
RelPct (rank): Percentile rank of this source's Relevance vs the session reference pool.

0.90 means this source's Relevance is higher than ~90% of pooled sources.

Not a probability; not an absolute "% relevant."

Computed using an empirical CDF (midrank for ties).
Alt6 = RelPct (rank) × DecayFactor × AnchorFactor × WindowFactor × TemporalCompat × EntityPresence × DRImpact
DecayFactor = max(floor, exp(−ln(2) × λ × ageInDays / halfLife))
EntityPresence = boost/penalty based on whether question entities appear in source text. Graduated by entity count: 1→0.60, 2→0.50, 3+→0.40 penalty. Boost: 1.20 (title) / 1.12 (desc) / 1.10 (content). Toggleable.
TAH Exception: Both TAH subtypes (Event Anchored, Explicit Range) use DecayFactor = 1.0 (recency decay disabled). Comparison questions now route to CRBN.
Reference Reroute: When enabled, TAH-EA/TAH-ER questions detected as reference-style lookups → AnchorFactor=1.0, WindowFactor=1.0, DecayFactor uses CRBN params (default t½=180d, floor=0.70). Gentle freshness preference; relevance dominates. Controlled by toggle + per-tile override.
Global Controls — all Alt6 scoring parameters
(sum: 1.00 ✓)
Higher = faster decay
Lower = harsher penalty
Entity Presence ⓘ graduated boost/penalty by entity count
DR Impact ⓘ three factors; defaults ×1.00 (disabled)
Premium (DR ≥ threshold)
Penalty (DR ≤ threshold)
Null DR
(no DR)
Weak-Cluster Rescue
Detects when the current top-ranked group is semantically weak relative to stronger buried alternatives. A buried source must pass 2 gates: (1) semantic > benchmark mean + lift (default 0.18), (2) semantic rank ≤ rank cap (default 5). Two hardcoded safety nets (P75 and absolute floor) provide backstop filtering. Qualifying candidates join a rescue pool with the benchmark and are reranked with semantic-led weights (75% semantic / 25% BM25 / 0% CE). Top sources from the reranked pool are sent to generation.
Trigger Gates
Rescue Behavior
Reranking Weights
(sum: 1.00 ✓)
Temporal Compatibility ⓘ year-presence check (TAH only, tw primary)
Reference Reroute ⓘ reroutes reference-like temporal questions to gentle CRBN decay |
Parameter Sweep: Sweep scoring parameters to find optimal values. Check boxes to activate parameters for sweeping. Unchecked params show current values (held constant).
Temporal Intent–Specific Decay: Parameters
Configure decay parameters per temporal subclass
Subclass #Q Half-Life (days) Floor (min decay) Age → Floor (days) Notes
BRT - hrs
RWR (+UNKNW, +cascade) -
CRBN (+Future-CRBN, +reference-rerouted) -
TAH Anchor & Window Scoring (Active Parameters)
TAH-EA and TAH-ER skip exponential recency decay (DecayFactor = 1.0). They use the parameters below instead. Live-tunable; values write to TAH_ANCHOR_CONFIG.
Subclass #Q Distance Half-Life (days) Floor (min factor) Est-Date Penalty Scoring Mechanism
EvtAnch/TAH - ×0.80 Anchor-distance scoring — factor = max(floor, exp(-ln2 × |pub_date − event_date| / halfLife)). Recent EA → BRT/RWR override.
ExpRng/TAH - ×0.80 Window-compliance scoring — in-window: factor = 1.0. Outside: factor = max(floor, exp(-ln2 × distance_to_nearest_boundary / halfLife)).
Note: These parameters apply only when a TAH-EA / TAH-ER question stays on the TAH scoring path. Two reroutes can pull questions off this path before scoring runs:

  • Recency override: if the anchor age ≤ BRT or RWR age-to-floor, the question is scored with BRT/RWR exponential decay instead.
  • Reference reroute: reference-shaped TAH questions are rerouted to CRBN's gentle decay.
How Source Scoring Works

1. The Big Picture

For every question, the dashboard pulls a pool of candidate articles from search and scores each one with a single number called Alt6. Higher Alt6 = better candidate. The top sources by Alt6 are what eventually get sent to the LLM to write the answer.

Alt6 is the product of seven factors. Most factors default to 1.0 (neutral), so a source's score is really driven by the two or three factors that matter for that particular question:

Alt6 = RelevancePct × DecayFactor × AnchorFactor × WindowFactor × TemporalCompat × EntityPresence × DRImpact

Last updated 2026-04-26. The formula above is the literal product computed in combineAltScoreFactors(). Every factor below explains what it is, when it's active, and what its current default is.
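
As an illustration of the product above, here is a minimal Python sketch. It is not the dashboard's combineAltScoreFactors(); the class name, field names, and the example values are assumptions chosen for readability. Every factor except RelevancePct defaults to the neutral 1.0.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Alt6Factors:
    """Hypothetical container for the seven documented factors."""
    relevance_pct: float
    decay: float = 1.0
    anchor: float = 1.0
    window: float = 1.0
    temporal_compat: float = 1.0
    entity_presence: float = 1.0
    dr_impact: float = 1.0


def combine_alt6(f: Alt6Factors) -> float:
    """Multiply the seven factors into a single Alt6 score."""
    return prod([
        f.relevance_pct, f.decay, f.anchor, f.window,
        f.temporal_compat, f.entity_presence, f.dr_impact,
    ])


# Illustrative CRBN-style source: strong relevance, decay at the 0.70 floor,
# and a title-level entity boost of 1.20. All other factors stay neutral.
example = combine_alt6(Alt6Factors(relevance_pct=0.90, decay=0.70, entity_presence=1.20))
```

Because neutral factors are exactly 1.0, only the two or three active factors change the result.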

2. The Pipeline, Step by Step

A question goes through three phases: routing (decide which scoring profile to use), scoring (compute Alt6 per source), and rescue (an optional re-rank if the top group looks weak).

2A. Routing — pick a scoring profile

The classifier labels each question as one of four shapes: BRT (breaking news), RWR (recent recap), CRBN (reference / historical / encyclopedic), or TAH (time-anchored — has either an event date or an explicit time window). TAH splits further into EA (event-anchored) and ER (explicit range). Five destinations exist after routing: BRT, RWR, TAH-EA, TAH-ER, CRBN.

Before scoring runs, several rules can reroute a question to a different scoring profile. Rules are checked in this order — first match wins:

  1. Reference Reroute — old, reference-shaped TAH questions (e.g. "what happened during the Roman Empire?", fiction lookups, very old factual entertainment questions) are pulled off TAH scoring and onto CRBN's gentle decay. Triggers depend on category, classification (Factual/Investigative), anchor age, and reference-keyword regex matches. Tunable thresholds: EA ≥ 2yr, ER ≥ 4yr, Investigative ≥ 8yr.
  2. Recency Override — TAH questions whose anchor (EA) or window-start (ER) is recent enough that BRT or RWR would suit better get rerouted to BRT or RWR scoring. The threshold is whatever age makes BRT/RWR's decay curve hit its floor — change BRT or RWR's half-life and the threshold moves automatically.
  3. Future Routing — questions about future events (the classifier emits a future flag) route to a Future-CRBN / Future-BRT / Future-RWR variant of the same scoring curves.
  4. Cascade Fallback — if a BRT or RWR question has no fresh enough sources to fill a top-8, it falls back to the next looser curve: BRT→RWR→CRBN.
  5. Direct passthrough — if none of the above fire, the classifier's label is used as-is.
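
The first-match-wins ordering of the five rules can be sketched as follows. This is a simplified illustration under stated assumptions: the `Question` fields (`reference_shaped`, `anchor_age_days`, `fresh_counts`) are hypothetical stand-ins for the real trigger logic, the future rule is applied only to BRT/RWR/CRBN labels, and the recency threshold is taken to be the age at which each decay curve reaches its floor.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Optional

# Documented decay defaults: profile -> (half-life in days, floor).
DECAY = {"BRT": (1.0, 0.10), "RWR": (14.0, 0.25), "CRBN": (180.0, 0.70)}
CASCADE = ["BRT", "RWR", "CRBN"]


@dataclass
class Question:
    label: str                                # classifier label, e.g. "TAH-EA"
    reference_shaped: bool = False            # hypothetical: reference triggers fired
    anchor_age_days: Optional[float] = None   # age of anchor (EA) or window start (ER)
    is_future: bool = False                   # classifier's future flag
    fresh_counts: Dict[str, int] = field(default_factory=dict)  # hypothetical


def age_to_floor(profile: str) -> float:
    """Age in days at which a profile's decay curve reaches its floor."""
    half_life, floor = DECAY[profile]
    return half_life * math.log2(1.0 / floor)


def route(q: Question, top_n: int = 8) -> str:
    is_tah = q.label in ("TAH-EA", "TAH-ER")
    # 1. Reference reroute
    if is_tah and q.reference_shaped:
        return "CRBN"
    # 2. Recency override
    if is_tah and q.anchor_age_days is not None:
        for profile in ("BRT", "RWR"):
            if q.anchor_age_days <= age_to_floor(profile):
                return profile
    # 3. Future routing (assumed to apply to decay profiles only)
    if q.is_future and q.label in DECAY:
        return "Future-" + q.label
    # 4. Cascade fallback: step down while the profile cannot fill a top-N
    if q.label in ("BRT", "RWR"):
        i = CASCADE.index(q.label)
        while i < len(CASCADE) - 1 and q.fresh_counts.get(CASCADE[i], top_n) < top_n:
            i += 1
        return CASCADE[i]
    # 5. Direct passthrough
    return q.label
```

Because the recency threshold is derived from DECAY, editing a half-life or floor in that table moves the threshold without any other change.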

2B. Scoring — compute Alt6 per source

For every source in the pool, the dashboard computes the seven factors below, then multiplies them. Each factor is detailed in its own section further down.

  • RelevancePct — how relevant the source is, as a percentile against the whole session pool.
  • DecayFactor — how fresh the source is (BRT/RWR/CRBN questions). 1.0 for TAH-EA/ER.
  • AnchorFactor — how close the source's publish date is to the event date (TAH-EA only). 1.0 otherwise.
  • WindowFactor — whether the source falls inside the requested window (TAH-ER only). 1.0 otherwise.
  • TemporalCompat — does the article title/description mention the right year(s)? (TAH only). 1.0 otherwise.
  • EntityPresence — do the people named in the question appear in the source? (Any subclass.) 1.0 if no entities extracted.
  • DRImpact — domain reliability adjustment. Currently a no-op (all multipliers default to 1.00; the knob exists but is not in use).

2C. Rescue — optionally re-rank if the top group looks weak

After Alt6 is computed and sources are sorted, the Weak-Cluster Rescue (WCR) system looks at the top 8. If those 8 are semantically weaker than buried alternatives further down the list, WCR re-ranks a small "rescue pool" using semantic-led weights and the resulting top-6 is what's actually sent to the LLM. WCR is on by default. Details in section 9.

3. RelevancePct — How Relevant Is This Source?

Each source has three independent relevance signals from the search backend:

  • Cross-encoder score — a transformer reading the question and the source together and producing a relevance score. Most accurate but slowest.
  • BM25 score — classical keyword-overlap score.
  • Semantic (bi-encoder) score — cosine similarity between the question's vector and the source's vector.

Each raw score is converted to a percentile against the entire session pool, then blended:

Relevance = wCross × pCross + wBM25 × pBM25 + wSem × pSem

That blended Relevance is then itself percentile-ranked against the session pool to get RelevancePct, which is what enters the Alt6 formula.

  • Default weights: wCross = 0.75, wBM25 = 0.075, wSem = 0.175. Tunable in Global Controls; weights should sum to 1.0.
  • Why double-percentile? The first percentile normalizes each raw signal so they can be averaged; the second percentile turns the blended Relevance into a 0–1 score that plays nicely as a multiplier.
  • Caching: Both percentile distributions are built once at session start from the entire pool of candidate sources and locked. Tuning the weights re-blends Relevance using the cached percentiles but does not rebuild the cache.
  • Null handling: A source with no cross-encoder and no semantic score is excluded entirely (Alt6 = null).
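
The double-percentile step can be sketched in a few lines. This is a hedged illustration, not the dashboard's code: it assumes "midrank for ties" means (count below + half the count equal) / pool size, and it builds the pools from a toy list of sources rather than a frozen session cache.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence, Tuple


def midrank_percentile(value: float, sorted_pool: Sequence[float]) -> float:
    """Empirical CDF with midrank tie handling (assumed definition)."""
    lo = bisect_left(sorted_pool, value)
    hi = bisect_right(sorted_pool, value)
    return (lo + 0.5 * (hi - lo)) / len(sorted_pool)


def blend(p_cross: float, p_bm25: float, p_sem: float,
          w: Tuple[float, float, float] = (0.75, 0.075, 0.175)) -> float:
    """Relevance = wCross * pCross + wBM25 * pBM25 + wSem * pSem."""
    return w[0] * p_cross + w[1] * p_bm25 + w[2] * p_sem


def relevance_pcts(raw: List[Tuple[float, float, float]]) -> List[float]:
    """raw holds (cross, bm25, semantic) raw scores, one tuple per source."""
    # First percentile: normalize each raw signal against its own pool.
    pools = [sorted(column) for column in zip(*raw)]
    blended = [
        blend(*(midrank_percentile(v, pool) for v, pool in zip(row, pools)))
        for row in raw
    ]
    # Second percentile: rank the blended Relevance against the pool.
    reference = sorted(blended)
    return [midrank_percentile(b, reference) for b in blended]


toy_pool = [(0.9, 5.0, 0.8), (0.1, 1.0, 0.2), (0.5, 3.0, 0.5)]
toy_result = relevance_pcts(toy_pool)
```

In the dashboard both pools are built once at startup, so re-tuning weights would correspond to calling `blend` again with the cached per-signal percentiles.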

4. DecayFactor — How Fresh Is the Source? (BRT, RWR, CRBN)

For recency-sensitive subclasses, an exponential half-life decay measures how old the source is relative to when the question was asked:

DecayFactor = max(floor, exp(−ln2 × age_in_days / halfLife))

  • halfLife = days at which the factor reaches exactly 0.50.
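  • floor = minimum factor. The curve reaches the floor at age = halfLife × log2(1 / floor); every older source gets this same value, so age stops discriminating beyond that point. With the defaults below, that is about 3.3 days for BRT, 28 days for RWR, and about 93 days for CRBN.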

Current defaults (visible and editable in the "Temporal Intent–Specific Decay" panel above):

Subclass   Half-life        Floor   Used for
BRT        1 day (24 hr)    0.10    Breaking news
RWR        14 days          0.25    Recent recap; also UNKNW & cascade fallback
CRBN       180 days         0.70    Reference / historical / encyclopedic; also reference reroutes

For TAH-EA and TAH-ER questions, DecayFactor is hard-coded to 1.0 — the AnchorFactor / WindowFactor below replace it.
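
A minimal sketch of the decay curve and of the age at which it reaches the floor, using the documented defaults. The function and dictionary names are illustrative assumptions.

```python
import math

# Documented defaults: subclass -> (half-life in days, floor).
DECAY_DEFAULTS = {"BRT": (1.0, 0.10), "RWR": (14.0, 0.25), "CRBN": (180.0, 0.70)}


def decay_factor(age_days: float, half_life: float, floor: float) -> float:
    """Exponential half-life decay, clamped from below at the floor."""
    return max(floor, math.exp(-math.log(2) * age_days / half_life))


def age_to_floor(half_life: float, floor: float) -> float:
    """Age (days) at which the unclamped curve equals the floor."""
    return half_life * math.log2(1.0 / floor)


# Age at which each default curve flattens out.
floor_ages = {name: age_to_floor(*params) for name, params in DECAY_DEFAULTS.items()}
```

Solving exp(−ln2 × age / halfLife) = floor for age gives halfLife × log2(1 / floor), which is why a higher floor flattens the curve sooner.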

5. AnchorFactor — Distance to Event Date (TAH-EA only)

Event-anchored questions like "What happened during the 2024 election?" don't care that an article is fresh — they care that an article was written around the time of the event. AnchorFactor scores each source by how close its publish date is to the event date, in either direction:

AnchorFactor = max(floor, exp(−ln2 × |pub_date − event_date| / halfLife))

  • Defaults: half-life = 120 days, floor = 0.27. Editable in the "TAH Anchor & Window Scoring" panel above.
  • The distance is the absolute value — sources written before or after the event decay equally.
  • An article published on the event date itself gets factor = 1.0.
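
A short sketch of anchor-distance scoring with the documented defaults (half-life 120 days, floor 0.27). The function name and the example event date are assumptions for illustration.

```python
import math
from datetime import date, timedelta


def anchor_factor(pub_date: date, event_date: date,
                  half_life: float = 120.0, floor: float = 0.27) -> float:
    """Symmetric decay on the absolute distance to the event date."""
    distance_days = abs((pub_date - event_date).days)
    return max(floor, math.exp(-math.log(2) * distance_days / half_life))


event = date(2024, 11, 5)  # illustrative event date
before = anchor_factor(event - timedelta(days=120), event)
after = anchor_factor(event + timedelta(days=120), event)
```

Articles 120 days before and 120 days after the event receive the same factor, since only the absolute distance enters the formula.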

6. WindowFactor — Inside or Outside the Window? (TAH-ER only)

Explicit-range questions like "African footballers in August 2025" carry a window [tw_start, tw_end]. Sources inside the window get full credit. Sources outside decay based on distance to the nearest window boundary:

WindowFactor = 1.0   (if pub_date is inside [tw_start, tw_end])
WindowFactor = max(floor, exp(−ln2 × distance_to_nearest_boundary / halfLife))   (otherwise)

  • Defaults: half-life = 180 days, floor = 0.27.
  • Boundary-inclusive: a source published exactly on tw_start or tw_end is in-window.
  • Position labels in the UI: IN = inside, BEF = before, AFT = after, UNK = estimated date.
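
The window rule can be sketched as below, returning both the factor and a position label. This is an illustrative helper with assumed names; it does not handle the UNK (estimated date) case, which depends on a backend flag.

```python
import math
from datetime import date, timedelta
from typing import Tuple


def window_factor(pub_date: date, tw_start: date, tw_end: date,
                  half_life: float = 180.0, floor: float = 0.27) -> Tuple[float, str]:
    """Full credit inside [tw_start, tw_end]; decay by distance to the nearest boundary outside."""
    if tw_start <= pub_date <= tw_end:  # boundary-inclusive
        return 1.0, "IN"
    if pub_date < tw_start:
        distance_days, label = (tw_start - pub_date).days, "BEF"
    else:
        distance_days, label = (pub_date - tw_end).days, "AFT"
    factor = max(floor, math.exp(-math.log(2) * distance_days / half_life))
    return factor, label


# Illustrative window: August 2025.
aug_start, aug_end = date(2025, 8, 1), date(2025, 8, 31)
half_life_before = window_factor(aug_start - timedelta(days=180), aug_start, aug_end)
```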

Estimated-Date Penalty (TAH-EA + TAH-ER)

Some sources don't have a real published_at; the backend falls back to article_inserted_at (the crawl date) and flags published_at_estimated = true. Crawl dates have no real temporal relationship to the content, so anchor/window scoring of those sources is unreliable. Each gets a flat penalty multiplier on top of the normal anchor/window factor:

Factor = normalFactor × (1 − estimatedDatePenalty)

Default penalty = 0.20 (multiplier ×0.80). Editable per TAH-EA / TAH-ER row above.
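
As a small worked example of the penalty formula, assuming the penalty is applied after the anchor/window factor has already been floored (the text does not say whether the floor is re-applied afterwards):

```python
def apply_estimated_date_penalty(normal_factor: float,
                                 published_at_estimated: bool,
                                 penalty: float = 0.20) -> float:
    """Factor = normalFactor * (1 - estimatedDatePenalty) for estimated dates only."""
    if not published_at_estimated:
        return normal_factor
    return normal_factor * (1.0 - penalty)


# A source at the 0.27 floor with an estimated date: 0.27 * 0.80.
penalized_floor = apply_estimated_date_penalty(0.27, True)
```

Under this reading, a penalized source can score below the configured floor.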

7. TemporalCompat — Year-Mention Check (TAH only)

For TAH questions, the title/description of each source is scanned for the question's target year(s). The result is one of three outcomes:

  • Match — target year appears → multiplier ×1.15
  • Mismatch — a different year appears (no target year) → multiplier ×0.80
  • Neutral — no year mentioned at all → multiplier ×1.00

Target year priority: classifier-provided tw_start/tw_end first, regex over the question text as fallback. Cross-year windows (e.g. 2019–2020) accept any year in the range. Both multipliers are tunable in the config row above. Only TAH subclasses use this — BRT/RWR/CRBN already have recency decay doing similar work.
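
The three outcomes can be sketched as a single function. The year regex (four-digit years from 1000 to 2099) is an assumption, as is passing the title and description as one string; the multipliers are the documented defaults.

```python
import re

# Assumed pattern: standalone four-digit years between 1000 and 2099.
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def temporal_compat(text: str, start_year: int, end_year: int,
                    match: float = 1.15, mismatch: float = 0.80) -> float:
    """Match if any target year appears, mismatch if only other years appear, neutral otherwise."""
    years = {int(y) for y in YEAR_RE.findall(text)}
    if not years:
        return 1.0  # neutral: no year mentioned at all
    if any(start_year <= y <= end_year for y in years):
        return match
    return mismatch
```

A cross-year window such as 2019 to 2020 is handled by passing `start_year=2019, end_year=2020`, so either year counts as a match.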

8. EntityPresence — Are the Right People Named? (any subclass)

When a question mentions specific people ("What did Trey Anastasio say about ..."), sources that fail to mention those people are usually wrong. EntityPresence checks for that.

Step 1 — Extract entities from the question

  • Find runs of 2+ capitalized words (so single proper nouns like "Starlink" don't qualify).
  • Stop words excluded; curly apostrophes normalized; possessive suffixes ('s, s') stripped and treated as a soft word boundary.
  • Example: "Elon Musk's Starlink" extracts as ["Elon Musk"].
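
A rough sketch of the extraction step. The stop-word list is a hypothetical placeholder, and treating trailing punctuation as a run boundary is an added assumption; the possessive handling follows the rules listed above.

```python
from typing import List

# Hypothetical stop-word list; the real list is not documented here.
STOP_WORDS = {"what", "who", "when", "where", "why", "how", "did", "does",
              "is", "are", "the", "a", "an", "in", "on", "of"}


def extract_entities(question: str) -> List[str]:
    """Return runs of two or more capitalized, non-stop-word tokens."""
    text = question.replace("\u2019", "'").replace("\u2018", "'")  # normalize curly apostrophes
    entities: List[str] = []
    run: List[str] = []

    def flush() -> None:
        if len(run) >= 2:
            entities.append(" ".join(run))
        run.clear()

    for raw in text.split():
        ends_clause = raw[-1] in ".,;:!?"  # assumption: punctuation closes a run
        token = raw.strip(".,;:!?\"()")
        possessive = token.endswith("'s") or token.endswith("s'")
        if possessive:
            token = token[:-2] if token.endswith("'s") else token[:-1]
        if token[:1].isupper() and token.lower() not in STOP_WORDS:
            run.append(token)
            if possessive or ends_clause:
                flush()  # possessive acts as a soft word boundary
        else:
            flush()
    flush()
    return entities
```

The possessive boundary is what keeps "Starlink" out of the "Elon Musk" run: the run is closed immediately after "Musk's", and the single remaining word does not qualify.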

Step 2 — Match each entity against the source's title and description

For each entity in each tier, the result is one of:

  • full — every word of the entity appears as a whole-word token
  • partial — some words appear (treated as ×1.00 neutral — e.g. matching "Brodsky" doesn't prove the article is about Stephen Brodsky)
  • none — no words appear

Step 3 — Apply boost or penalty

  • Boosts (when all entities matched fully): title → ×1.20, description → ×1.12.
  • Penalties, graduated by entity count:
    • 1 entity, none found → ×0.60
    • 2 entities, 1 of 2 found → ×0.90; none found → ×0.50
    • 3+ entities, majority found → ×0.95; minority → ×0.70; none → ×0.40

Toggle on/off via the Entity Presence panel. Applies to standard Alt6 ranking and to WCR rescue-state ranking.
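
Steps 2 and 3 can be sketched together. Several details below are assumptions, since the text leaves them open: an entity counts as "found" only on a full match, any partial match without a full set makes the result neutral, and entities split across title and description earn the description-level boost.

```python
import re
from typing import List


def match_entity(entity: str, text: str) -> str:
    """Classify a whole-word match as 'full', 'partial', or 'none'."""
    tokens = set(re.findall(r"\w+", text.lower()))
    words = entity.lower().split()
    hits = sum(word in tokens for word in words)
    if hits == len(words):
        return "full"
    return "partial" if hits else "none"


def entity_presence(entities: List[str], title: str, description: str) -> float:
    if not entities:
        return 1.0  # no entities extracted: neutral
    if all(match_entity(e, title) == "full" for e in entities):
        return 1.20  # title boost
    statuses = [match_entity(e, f"{title} {description}") for e in entities]
    found, n = statuses.count("full"), len(entities)
    if found == n:
        return 1.12  # description boost
    if "partial" in statuses:
        return 1.00  # assumption: partial evidence is treated as neutral
    if found == 0:
        return {1: 0.60, 2: 0.50}.get(n, 0.40)
    if n == 2:
        return 0.90  # 1 of 2 found
    return 0.95 if found > n / 2 else 0.70  # 3+ entities: majority vs minority
```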

9. Weak-Cluster Rescue (WCR) — Re-rank if the Top Group Looks Weak

After Alt6 sorting, the dashboard examines the top 8 (the "benchmark group"). If a buried source has notably stronger semantic similarity to the question than the benchmark average, the dashboard suspects the top group is dragging on weaker signals and runs a two-stage rescue.

Stage 1 — Detect a weak cluster

For each source NOT in the benchmark top-8, test:

  • Lift gate (primary): source's semantic score > (benchmark mean + 0.18). The 0.18 is tunable.
  • Rank cap: source's semantic rank across all candidates ≤ 5.
  • Two safety nets (rarely binding, hardcoded): semantic > benchmark P75; semantic > 0.50 absolute floor.

If at least one buried source passes all of the above, a weak cluster is detected.

Stage 2 — Build a rescue pool and re-rank

  • Pool = the original top 8 + up to 4 qualifying buried candidates (so up to 12 sources total).
  • Pool members are re-scored with semantic-led weights:

    wcrAlt6 = (0.75 × pSemLocal + 0.25 × pBM25 + 0.00 × pCross) × DecayFactor × AnchorFactor × WindowFactor × TemporalCompat × EntityPresence × DRImpact

    (BM25 and cross-encoder weights are tunable; defaults shown.)
  • The pool is sorted by wcrAlt6, and the top 6 (configurable) are sent to the LLM. Non-pool sources keep their original ranks below the pool.

Three possible outcomes per question:

  • no_weak_cluster — benchmark looked fine, nothing changed.
  • weak_cluster_no_candidates — lift gate passed but rank cap filtered everyone out.
  • rescue_active — pool was formed and re-ranked.

WCR is on by default. Toggle in the Weak-Cluster Rescue panel.
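
The two stages can be sketched as follows. This is a simplified illustration: the dictionary keys (`semantic`, `p_sem_local`, `p_bm25`, `p_cross`, `other_factors`) are hypothetical, `other_factors` stands for the product of the six non-relevance factors, P75 uses Python's default quantile method, and when more than four candidates qualify the strongest semantic scores are kept (an assumption).

```python
from statistics import mean, quantiles
from typing import Dict, List, Tuple

Source = Dict[str, float]


def detect_weak_cluster(ranked: List[Source], lift: float = 0.18, rank_cap: int = 5,
                        abs_floor: float = 0.50, top_n: int = 8,
                        max_rescued: int = 4) -> Tuple[str, List[Source]]:
    """ranked is already sorted by Alt6, best first."""
    benchmark, buried = ranked[:top_n], ranked[top_n:]
    if len(benchmark) < 2 or not buried:
        return "no_weak_cluster", []
    bench_sem = [s["semantic"] for s in benchmark]
    bench_mean = mean(bench_sem)
    p75 = quantiles(bench_sem, n=4)[2]
    by_sem = sorted(ranked, key=lambda s: s["semantic"], reverse=True)
    sem_rank = {id(s): i + 1 for i, s in enumerate(by_sem)}

    passed_lift = [s for s in buried if s["semantic"] > bench_mean + lift]
    qualified = [s for s in passed_lift
                 if sem_rank[id(s)] <= rank_cap
                 and s["semantic"] > p75
                 and s["semantic"] > abs_floor]
    if not passed_lift:
        return "no_weak_cluster", []
    if not qualified:
        return "weak_cluster_no_candidates", []
    qualified.sort(key=lambda s: s["semantic"], reverse=True)
    return "rescue_active", qualified[:max_rescued]


def rescue_rerank(ranked: List[Source], rescued: List[Source], send_n: int = 6,
                  w_sem: float = 0.75, w_bm25: float = 0.25, w_cross: float = 0.0,
                  top_n: int = 8) -> List[Source]:
    """Re-score the rescue pool with semantic-led weights and keep the top send_n."""
    pool = ranked[:top_n] + rescued

    def wcr_alt6(s: Source) -> float:
        relevance = w_sem * s["p_sem_local"] + w_bm25 * s["p_bm25"] + w_cross * s["p_cross"]
        return relevance * s["other_factors"]

    return sorted(pool, key=wcr_alt6, reverse=True)[:send_n]
```

With the default weights the cross-encoder percentile contributes nothing to the rescue score, so a source with strong semantic similarity can overtake benchmark members that ranked well mainly on cross-encoder strength.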

10. DRImpact — Domain Reliability Adjustment (currently a no-op)

DRImpact is a multiplier driven by the source's domain reliability score. Three buckets (premium / penalty / null) each have their own multiplier. Default values are all 1.00, so DRImpact is effectively neutral on every source today — the formula slot exists, the knob is wired up, but no boost or penalty fires until the multipliers are moved off 1.00 in the DR Impact panel.
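
A minimal sketch of the three-bucket lookup. The thresholds have no documented defaults here, so they are required arguments; the parameter names mirror the panel labels but the function itself is an assumption.

```python
from typing import Optional


def dr_impact(dr: Optional[float], premium_threshold: float, penalty_threshold: float,
              premium_multiplier: float = 1.0, penalty_multiplier: float = 1.0,
              null_multiplier: float = 1.0) -> float:
    """Pick the multiplier for the source's domain reliability bucket."""
    if dr is None:
        return null_multiplier       # Null DR bucket
    if dr >= premium_threshold:
        return premium_multiplier    # Premium bucket
    if dr <= penalty_threshold:
        return penalty_multiplier    # Penalty bucket
    return 1.0                       # between thresholds: neutral
```

With all multipliers left at 1.00 the function returns 1.0 for every source, which matches the current no-op behavior.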

11. Caching — What Changes When You Tune Knobs

Some pieces are computed once at session startup and locked; others recompute every time you change a parameter. This matters because tuning weights doesn't change percentile distributions, only how the percentiles are blended.

Locked at startup (do NOT recompute):

  • Per-signal raw-score percentile arrays (cross, BM25, semantic) — built from all sources in the session.
  • The session's Relevance distribution (used to convert blended Relevance to RelevancePct).

Recomputed on every config change:

  • Routing decisions (recency override, reference reroute, cascade) — threshold-driven, so they react to BRT/RWR/CRBN half-life and floor edits.
  • Decay, anchor, window, temporal-compat, entity-presence factors — recomputed per source.
  • Alt6 score and rank — recomputed per question.
  • Weak-Cluster Rescue — re-runs against new ranks.

12. One-Page Summary

For most questions, only two or three factors meaningfully move the score:

  • Breaking news (BRT): RelevancePct × aggressive recency decay (24-hour half-life, 0.10 floor). Older sources get crushed fast.
  • Recent recap (RWR): RelevancePct × moderate decay (14-day half-life, 0.25 floor).
  • Reference / historical (CRBN): RelevancePct × gentle decay (180-day half-life, 0.70 floor). Age barely matters; relevance dominates.
  • TAH event-anchored (EA): RelevancePct × AnchorFactor (proximity to event date) × TemporalCompat. Recency is irrelevant; anchor distance is the active signal.
  • TAH explicit-range (ER): RelevancePct × WindowFactor (in-window vs distance to boundary) × TemporalCompat. Same idea as EA but with a window instead of a single date.

EntityPresence boosts or penalizes on top of all of the above when the question names specific people. WCR can re-rank the top group when semantic similarity tells a different story than the blended Alt6.

Relevance distribution: Not yet built (will build at startup)
Source Metrics Scatter Plot (All Questions)
Answer Generation & Export: Generate Alt6 Raw answers and export a comprehensive CSV.
Random sample from currently filtered questions. Skips questions with 0 eligible Alt6 sources. Sampled query_uuids logged to the browser console for reproducibility.
Random sample from currently filtered questions. Each block contains: Prod answer + Prod top-8 sources + Alt6 answer (one prompt version) + Alt6 answer (a second prompt version, if generated) + Alt6 top-8 sources. Questions that have at least one Alt6 answer are preferred over Prod-only questions.
Exports prod answers and source-level scoring data for filtered questions. No answer generation needed.
Query UUID filter
No UUID filter active.
Study Review & Eval: Open the picker to review any completed study output. Filter by study type, whether the judge has run, or whether you've started recording verdicts. The modal shows aggregate stats and lets you walk question-by-question to compare your call to the judge's.
Rerouting Study: One active reroute group (tested 2-way): Ref-rerouted (TAH questions promoted to CRBN scoring via the reference reroute). Plus Generic-baseline: a stratified sample of non-rerouted questions tested against their default destination only (BRT / RWR / TAH / CRBN strata, capped at GENERIC_PER_STRATUM). Saves summary.csv, detail.md, sources.csv, raw_results.json, meta.json to dashboards/data/rerouting_study/<timestamp>/.
DR Impact Study: A/B test on the currently filtered question set. Baseline = all DR multipliers forced to 1.0; Variant = whatever you have entered in the DR Impact panel (premiumMultiplier, penaltyMultiplier, nullMultiplier). All other Global Controls (model, prompt version, weights, decay, entity presence, WCR, reroute) are held constant. Pre-flight pass reports the % of filtered questions whose top-8 actually changes under the variant. Saves summary.csv, sources.csv, detail.md, raw_results.json, meta.json to dashboards/data/dr_impact_study/<timestamp>/.
DR Overlay Study: Load a DR cache (e.g. dashboards/data/dr_overlay/v385_results.json): an array of {domain, dashboard_summary.new_score, ...} records OR a {domains: {...}} map. Both sides use cache DR for eligibility AND bucketing. Baseline = cache DR + all multipliers forced to 1.0 (DR Impact effectively off). Variant = cache DR + entry-box multipliers (DR Impact on, configured by user above). Sources missing from cache fall back to prod DR on both sides. Set the DR Impact entry-boxes (e.g. premiumThreshold 4.75, premiumMultiplier 1.15, penaltyThreshold 3.5, penaltyMultiplier 0.9, nullMultiplier 0.8) BEFORE running. Saves to dashboards/data/dr_overlay_study/<timestamp>/.
No cache loaded.
Factor Ablation Dump: Samples 25 questions (5 per reroute group + 5 generic) and dumps the COMPLETE candidate source list for each destination with every per-factor value (relevancePct, decayFactor, anchorFactor, expRngFactor, temporalCompat, entityPresence, drImpact). No LLM calls. Saves to dashboards/data/full_source_dump/<timestamp>/. ~30-60s.
Source Quality: With vs Without Backstop Sources
Initializing dashboard...

How to Use the Ask Dashboard

Last Updated: April 27, 2026

Overview

This dashboard monitors questions asked through the Ask module in real time. It displays questions, answers, source articles, named entities, and performance metrics.

Header Controls

  • Auto-refresh: Toggle to enable/disable automatic data refresh
  • Interval: Set how often the dashboard refreshes (1-15 minutes)
  • Last updated: Shows when data was last fetched

Summary Statistics

  • Total Questions: Number of questions in the current view
  • Backstop Queries: Questions that required fallback processing
  • Unique Cited Domains: Count of distinct domains actually cited in answers. Hover over this box to see a breakdown showing each domain, how many times it was cited, and its average Domain Reliability (DR) score.
  • Last updated: Shows when the dashboard last checked for new questions

Date Range Selection:

  • Preset Buttons: Quick selection options - Today, Yesterday, Last 7/15/30/60/90 days (default: Last 15)
  • Custom Date Range: Enter specific start and end dates, then click "Apply" to filter. Click "Clear" to return to preset mode.
  • Mutual Exclusivity: Preset buttons and custom dates are mutually exclusive - selecting one disables the other.
  • Auto-refresh Indicator: Shows whether auto-refresh is enabled (green dot = ON, gray dot = OFF).

Auto-refresh Behavior:

  • Auto-refresh ON: Today, Last 7, Last 15, Last 30, Last 60, Last 90 days (live monitoring)
  • Auto-refresh OFF: Yesterday and Custom date ranges (historical data, no updates expected)

Performance metrics are split into two rows:

  • Non-Backstop row (green): Averages for queries answered without fallback
  • Backstop row (red): Averages for queries that required fallback processing

Each row shows: Total, Search, Answer, Suggest, Other, Words, W/Sec, Cites, and Avg Score (average article score of cited sources).

Note: Click section headers to expand/collapse the Category/Classification Breakdown Tables and Alt6 Ranking Analysis sections.

Category / Classification / Temporal Subclass Breakdown tables:

All three tables share the same column structure, grouped by category, classification, or temporal subclass respectively. Updated 2026-03-12.

  • Name: The grouping key (category, classification, or temporal subclass)
  • #: Number of questions in that group
  • Prod Top 8 DR: Average domain reliability of top 8 sources by prod rank
  • Prod Cited DR: Average domain reliability of sources cited in the prod answer
  • Alt6 Raw: Average altScore6 of the top 8 Alt6 Raw sources (null DR allowed, below-threshold DR blocked)
  • Raw Top 8 DR: Average domain reliability of the top 8 Alt6 Raw sources
  • Raw Cited: (appears when Raw answers have been generated) Average altScore6 of sources cited in the generated Raw answer

Click the sort arrows on any column header to sort the table. Default sort is by count descending.

Source Quality: With vs Without Backstop Sources:

Compares Alt6 Raw top-8 metrics when backstop articles (source_origin 8/9/10) are excluded vs included in the source pool. Updated 2026-04-14.

  • Excluding BS Sources: Top 8 picked after removing backstop articles from the pool.
  • All Sources: Top 8 picked from the full pool including backstop articles.
  • Raw columns (Score, Rel, Decay, DR): Metrics for top 8 Alt6 Raw sources (null DR allowed, below-threshold DR blocked, classification excluded).
  • Delta row: All Sources minus Excluding BS. Green = backstop improves metric, Red = backstop dilutes.

Alt6 Ranking Analysis (Temporal Decay vs Prod):

Compares how Alt Score 6 (temporal decay scoring) ranks sources differently from Production. Alt6 applies subclass-specific half-life decay to relevance scores, which can significantly change rankings for time-sensitive content.

Two-Tier Ranking System: Alt6 ranking uses the same two-tier approach as Production ranking:

  • Tier 1 (Top positions): Non-excluded sources ranked by Alt6 score (highest = rank 1)
  • Tier 2 (Bottom positions): Excluded sources ranked by Alt6 score among themselves
  • An excluded source with a high Alt6 score will always rank below a non-excluded source with a lower Alt6 score
  • Exclusion criteria: backend is_excluded flag, classification (adult/conspiracy/gambling), and user-configured thresholds (DR, Score, Semantic, etc.)
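
The two-tier ordering amounts to a sort on a compound key. A minimal sketch, assuming each source is a dict with a boolean `excluded` flag and an `alt6` score (both key names are illustrative):

```python
from typing import Dict, List, Union

Source = Dict[str, Union[str, float, bool]]


def two_tier_rank(sources: List[Source]) -> List[Source]:
    """Non-excluded sources first, each tier ordered by Alt6 descending."""
    # False sorts before True, so non-excluded sources form Tier 1.
    return sorted(sources, key=lambda s: (bool(s["excluded"]), -float(s["alt6"])))


ranked_example = two_tier_rank([
    {"id": "x", "alt6": 0.90, "excluded": True},
    {"id": "a", "alt6": 0.20, "excluded": False},
    {"id": "b", "alt6": 0.55, "excluded": False},
])
```

Here the excluded source "x" lands last despite having the highest Alt6 score.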

Ranking Shift Metrics:

  • Avg absolute change: Average number of slots sources moved (regardless of direction)
  • Avg improvement/worsening: Average slot change for sources that improved/worsened
  • % improved/worsened: Percentage of sources that moved up/down in ranking
  • Bottom 1/3 -> Top 1/3: Sources that jumped from bottom third to top third
  • Top X dropped D+ slots: Top-ranked sources that dropped significantly
  • Bottom 50% -> Top X: Sources from bottom half that made it to top X (only for questions with 50+ sources)
  • Jumped/Dropped K+ slots: Sources with significant rank changes

Configure thresholds (Top X, K, D) in Ranking Metrics Settings. Summary shows percentages; per-question shows raw counts.
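
Several of these metrics can be computed from two rank maps. The sketch below covers a subset (average absolute change, improvement and worsening, and K+ slot moves); the function name, the dict-based input, and the sign convention (positive = moved up under Alt6) are assumptions.

```python
from typing import Dict


def ranking_shift(prod_rank: Dict[str, int], alt_rank: Dict[str, int], k: int = 5) -> Dict[str, float]:
    """Compare ranks (1 = best) for sources present in both rankings."""
    shifts = [prod_rank[sid] - alt_rank[sid] for sid in prod_rank if sid in alt_rank]
    n = len(shifts)
    if n == 0:
        return {}
    improved = [d for d in shifts if d > 0]    # moved up under Alt6
    worsened = [-d for d in shifts if d < 0]   # moved down under Alt6
    return {
        "avg_abs_change": sum(abs(d) for d in shifts) / n,
        "avg_improvement": sum(improved) / len(improved) if improved else 0.0,
        "avg_worsening": sum(worsened) / len(worsened) if worsened else 0.0,
        "pct_improved": 100.0 * len(improved) / n,
        "pct_worsened": 100.0 * len(worsened) / n,
        "jumped_k_plus": float(sum(d >= k for d in improved)),
        "dropped_k_plus": float(sum(d >= k for d in worsened)),
    }


shift_example = ranking_shift(
    {"a": 1, "b": 2, "c": 3, "d": 4},
    {"a": 3, "b": 2, "c": 1, "d": 4},
    k=2,
)
```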

Ranking Shift Metrics (per question):

Expand this section on any question card to see detailed ranking shift metrics for that specific question, comparing Alt6 vs Production ranking.

Master Article Decay Score Settings:

Configure alternative decay score formulas. The decay formula is: Decay Score = Base Score × e^(-(λ × (t^p)))

  • Prod Decay Score: Read-only display of production values (λ=0.03, p=1.25)
  • Alt Decay Score 1 & 2: Configure custom decay formulas

Variables (nested dependencies):

  • Base Score: Semantic/embedding score (always independently selectable)
  • e (Euler's number): Constant 2.71828 - enables exponential decay
  • ^(-(exponent)): Negative exponential (requires e to be checked)
  • λ (Decay Rate): Controls decay speed (requires ^(-(exponent)))
  • t (Days Since Publication): Time variable (requires ^(-(exponent)))
  • p (Power): Exponent for time (requires t)
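
With every variable enabled, the formula reduces to a one-line function. A minimal sketch, using the production values (λ = 0.03, p = 1.25) as defaults; the function name and the example inputs are illustrative.

```python
import math


def alt_decay_score(base_score: float, t_days: float, lam: float = 0.03, p: float = 1.25) -> float:
    """Decay Score = Base Score * e^(-(lambda * t^p))."""
    return base_score * math.exp(-(lam * t_days ** p))


# Illustrative input: base score 0.9, article 10 days old, production parameters.
ten_day_score = alt_decay_score(0.9, 10.0)
```

With these inputs the exponent is 0.03 × 10^1.25 ≈ 0.533, so the score falls to roughly 0.53, a little under 60% of the base score.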

Check the Save checkbox to persist your configuration across page refreshes.

Alt Decay Columns (AD1 & AD2):

When Alt Decay Score 1 or 2 is enabled, two new columns appear in the Sources table: AD1 and AD2. These columns display:

  • Computed Value: The decay score calculated using your configured formula (e.g., 0.847)
  • Formula Breakdown: Shows the actual values used in the computation (e.g., 0.923 × e^(-0.05 × 14.5^1.5))
  • Column Header Tooltip: Hover over AD1/AD2 header to see the configured formula

Key Details:

  • Alt Decay uses the raw semantic score (not percentile-normalized)
  • Days since publication (t) is calculated from source's published_at vs question's asked_at
  • Columns update immediately when you change configuration values while toggle is ON
  • When toggle is OFF, columns display "-" for all sources

Show PCT Mode:

  • When "Show PCT" is enabled, Alt Decay columns display normalized values (0-1 range)
  • Normalization is min-max across all sources within the question, not a true percentile rank
  • Column headers show a "(%)" suffix when PCT mode is ON
  • Highest Alt Decay value becomes 1.0, lowest becomes 0.0
  • In PCT mode, the formula breakdown is hidden (only the normalized value is shown)
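
Min-max normalization within one question can be sketched as below. How the dashboard treats missing values or a question where every source has the same score is not stated; passing None through and mapping a constant column to 1.0 are assumptions.

```python
from typing import List, Optional


def min_max_normalize(values: List[Optional[float]]) -> List[Optional[float]]:
    """Scale present values to [0, 1]; None entries are passed through."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    lo, hi = min(present), max(present)
    if hi == lo:
        # Assumption: with no spread, every present value maps to 1.0.
        return [None if v is None else 1.0 for v in values]
    return [None if v is None else (v - lo) / (hi - lo) for v in values]
```

Unlike a percentile rank, this scaling preserves the relative gaps between scores, so one outlier can compress the rest of the column toward 0.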

Filters

  • User: Filter questions by specific user
  • Classification: Filter by question type (Investigative, Temporally-Aware, etc.)
  • Category: Filter by content category (politics, business, sports, etc.)
  • Backstop: Filter by backstop status (All, Non-Backstop, or Backstop only)

All metrics and statistics update dynamically based on the current filter selection.

Question Cards

Each question is displayed as a card with expandable sections:

  • Answer: The generated response with citation numbers (tabs for Alt Answers in header)
  • Sources: Source articles used (click headers to expand)
  • Suggestions: Follow-up questions
  • Named Entities: People, organizations, locations identified
  • Performance Metrics: Timing data for the query
  • Ranking Shift Metrics: How Alt Score rankings compare to Prod rankings

Answer Section (Three-Column Layout):

The Answer section displays three columns side-by-side for easy comparison:

  • Answer (Column 1): The production answer from the original query
  • Alt Answer 1 (Column 2): Generate an alternative answer using top 8 sources ranked by Alt Score 1
  • Alt Answer 2 (Column 3): Generate an alternative answer using top 8 sources ranked by Alt Score 2

How to use the Answer Section:

  • Expand/Collapse: Click anywhere on the Answer header to toggle the section
  • Side-by-Side Comparison: All three answers are visible simultaneously when expanded
  • Vertical Dividers: Columns are separated by vertical lines for clarity
  • Equal Width: Each column takes 33% of the width

How Alt Answers work:

  • Alt Score 1 or 2 must be configured in Master Article Score Settings to enable generation
  • Click "Generate Alt Answer" to call the API with the top 8 sources by Alt Rank
  • Two-tier ranking ensures the top 8 sources are always non-excluded (excluded sources are ranked at the bottom)
  • The answer is generated using the same question but with different source articles
  • Generated answers are cached for the session (not persisted across page refreshes)
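
The two-tier ranking mentioned above can be sketched as a single sort with a compound key. The field names (`excluded`, `alt_score`) and the function name are hypothetical and chosen only for this example.

```python
def top_sources_two_tier(sources, k=8):
    """Return the first k sources, with non-excluded sources ranked first.

    Tier 1: non-excluded sources, highest alt score first.
    Tier 2: excluded sources, highest alt score first.
    """
    # False sorts before True, so non-excluded sources always come first.
    ranked = sorted(sources, key=lambda s: (s["excluded"], -s["alt_score"]))
    return ranked[:k]


# Hypothetical sources for one question.
sources = [
    {"id": "a", "alt_score": 0.9, "excluded": True},
    {"id": "b", "alt_score": 0.4, "excluded": False},
    {"id": "c", "alt_score": 0.7, "excluded": False},
]
print([s["id"] for s in top_sources_two_tier(sources, k=2)])
```

Note that in this sketch an excluded source can still enter the top k when a question has fewer than k non-excluded sources.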

Question Badges

CATEGORY Content category (politics, sports, etc.)
CLASSIFICATION Question type classification
BACKSTOP [codes] Question required fallback processing. Reason codes (e.g. [1000, 1003]) indicate which criteria triggered the backstop call. Hover over the badge to see all 7 criteria with triggered ones highlighted.
Avg Prod Cited Score: 0.XXX Average production score of cited sources only
Avg Prod Top 8 Score: 0.XXX Average production score of top 8 sources by rank (or all sources if fewer than 8)
Avg Alt1 Top 8 Score: 0.XXX Average Alt1 score of top 8 sources by Alt1 rank (n/a if Alt1 not configured)
Avg Alt2 Top 8 Score: 0.XXX Average Alt2 score of top 8 sources by Alt2 rank (n/a if Alt2 not configured)

Sources Section Badges

These badges appear in the Sources section for each question:

Avg Prod Cited Domain Reliability: 0.XXX Average domain reliability of cited sources only
Avg Prod Top 8 Domain Reliability: 0.XXX Average domain reliability of top 8 sources by rank
Avg Prod Cited Readability: 0.XXX Average readability of cited sources only
Avg Prod Top 8 Readability: 0.XXX Average readability of top 8 sources by rank
Avg Alt1 Top 8 Domain Reliability: 0.XXX Average DR of top 8 sources by Alt1 rank (n/a if Alt1 not configured)
Avg Alt1 Top 8 Readability: 0.XXX Average readability of top 8 sources by Alt1 rank (n/a if Alt1 not configured)
Avg Alt2 Top 8 Domain Reliability: 0.XXX Average DR of top 8 sources by Alt2 rank (n/a if Alt2 not configured)
Avg Alt2 Top 8 Readability: 0.XXX Average readability of top 8 sources by Alt2 rank (n/a if Alt2 not configured)

Sources Table

The sources table shows articles considered for the answer:

Age: Time since article was published
Rnk: Article rank (lower = higher relevance)
Score: Combined relevance score (0-1)
Sem: Semantic similarity score
BM25: Keyword matching score
Cross: Cross-encoder relevance score
DR: Domain Reliability score (0-100)
Depth: Content Depth Score (0-5)
Pos: Positive Sentiment
Neg: Negative Sentiment
Sent: Sentiment summary (sum of the positive and negative sentiment scores)
Read: Readability score
Decay: Time-adjusted semantic score
Class: Article classification
Excl: Whether the content is ignored

Source Row Colors

Pink background: Top-8 article NOT cited in the answer
White background: Article cited in the answer OR ranked below 8

Score Colors

0.600 High score (>= 0.5)
0.350 Medium score (>= 0.2 and < 0.5)
0.100 Low score (< 0.2)

Keyboard Shortcuts

  • Escape: Close this help dialog

LLM-as-Judge Evaluation

This feature uses OpenAI's GPT-4o-mini to evaluate and compare answer quality. It is located between the Answer and Sources sections in each question card.

How to Use:

  • Expand the "LLM-as-Judge Evaluation" section
  • Click "Evaluate Answers" button to trigger evaluation
  • Evaluation includes the Production answer, plus Alt Answer 1 and/or Alt Answer 2 if generated

Prompt Modes:

  • Original — Standard evaluation prompt with 7 criteria
  • Alternate-1 (Core Ask Aware) — Adds a "Core Ask" reasoning step that forces the judge to identify the specific relationship/action/event before scoring. Strengthens relevance criterion (score capped at 3 if core ask not addressed) and evidence utilization (prefers evidence addressing the core ask). Comparison guidance penalizes answers that are merely broader/safer without answering the question.
  • Alternate-2 (Core Ask Gate) — 5-step evaluation scaffold: (1) Extract the Core Ask with concrete-involvement framing and minimum facts, (2) Core Ask Summary returned in JSON (max 20 words — visible as a chip in the UI), (3) Coverage Check classifying each answer as direct/partial/missed (returned in JSON — visible as color-coded badge per answer), (4) Scoring Constraint with HARD binding caps on relevance (direct → 1-5, partial → ≤4, missed → ≤2) and evidence_utilization (non-core-ask evidence → ≤4, mostly contextual → ≤3), (5) Winner Constraint requiring direct-coverage answers to beat missed-coverage answers unless factual errors. Stronger comparison guidance that prohibits preferring broader/safer answers. Designed for cases where Alt-1's soft cap is insufficient.

Select prompt mode in the batch eval controls. Switching modes counts as a settings change. Each result records which prompt mode was used. The Prompt Comparison section supports pairwise comparison across all mode pairs.
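
The Alternate-2 relevance caps are stated as prompt instructions to the judge. As a sketch only, the same caps could be checked after the fact on a returned result. The function and dictionary names below are hypothetical, and nothing here implies the dashboard performs this check in code.

```python
# Maximum relevance score allowed for each coverage class (from the Alternate-2 prompt).
RELEVANCE_CAP = {"direct": 5, "partial": 4, "missed": 2}


def cap_relevance(raw_score, coverage):
    """Clamp a judge's relevance score to the cap for its coverage class."""
    return min(raw_score, RELEVANCE_CAP[coverage])


print(cap_relevance(5, "partial"))  # a partial-coverage answer cannot keep a 5
```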

Evaluation Criteria (1-5 scale):

Relevance & Task Alignment (25%): Does the response directly address the question, cover all sub-questions, and avoid topic drift?
Faithfulness / Groundedness (20%): Are all factual claims supported by the retrieved sources without hallucination?
Claim–Evidence Mapping (15%): Can each claim be explicitly mapped to supporting source evidence?
Temporal Correctness (15%): Does the response correctly align with the temporal scope of the question and sources?
Evidence Utilization (10%): Does the response incorporate all key relevant evidence from the sources?
Internal Consistency (10%): Is the response free of logical contradictions and conflicting assertions?
Appropriate Uncertainty (5%): Does the response appropriately qualify uncertainty when evidence is insufficient?
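
To show how the criterion weights could combine into one number, here is a minimal sketch of a weighted average. The dictionary keys and the function name are assumptions for this example, and the dashboard's actual aggregation may differ.

```python
# Default criterion weights from the list above, as fractions of 1.0.
WEIGHTS = {
    "relevance": 0.25,
    "faithfulness": 0.20,
    "claim_evidence_mapping": 0.15,
    "temporal_correctness": 0.15,
    "evidence_utilization": 0.10,
    "internal_consistency": 0.10,
    "appropriate_uncertainty": 0.05,
}


def weighted_total(scores):
    """Weighted average of per-criterion scores, on the same scale as the inputs."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical judge scores for one answer.
example_scores = {name: 4 for name in WEIGHTS}
example_scores["relevance"] = 2
print(round(weighted_total(example_scores), 3))
```

Because relevance carries the largest weight, dropping it by two points moves the total more than the same drop on any other criterion.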

Score Scale:

4 (VeryGood) Excellent performance on this criterion
3 (Good) Adequate performance with minor issues
2 (Bad) Noticeable problems affecting quality
1 (VeryBad) Significant failures on this criterion

Per-Criterion Reasoning:

Each criterion includes detailed reasoning (2-3 sentences) explaining why that score was assigned. The reasoning appears below each criterion score in italic text.

General Commentary:

Each answer receives a holistic assessment with specific examples and observations. This appears at the bottom of each score card in a highlighted box.

Comparison Analysis:

When multiple answers are available (Prod + Alt1 and/or Alt2), the LLM provides a comparative analysis explaining which answer is best and why, considering both content quality and source relevance.

Source Context:

The evaluation considers the top 8 ranked sources (title, description, and fragment) for each answer type, using Production ranking for the Prod answer and Alt Score ranking for Alt answers.

Formulas

Article Score

The article score is calculated differently based on whether the content is static (e.g., Wikipedia) or has a publication date.

For Static Content:

score = 0.50 × score_semantic_pct + 0.15 × score_bm_25_pct + 0.35 × score_cross_pct
Where:
score_semantic_pct = semantic similarity score (percentile-normalized)
score_bm_25_pct = BM25 keyword matching score (percentile-normalized)
score_cross_pct = cross-encoder score (percentile-normalized)
Weights: Semantic: 50% • BM25: 15% • Cross-encoder: 35%

For Non-Static Content (with publication date):

score = 0.15 × score_semantic_pct + 0.35 × score_decay_pct + 0.15 × score_bm_25_pct + 0.35 × score_cross_pct
Where:
score_semantic_pct = semantic similarity score (percentile-normalized)
score_decay_pct = time-decayed score (percentile-normalized)
score_bm_25_pct = BM25 keyword matching score (percentile-normalized)
score_cross_pct = cross-encoder score (percentile-normalized)
Weights: Semantic: 15% • Time decay: 35% • BM25: 15% • Cross-encoder: 35%
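
The two article score variants can be sketched as one function. Treating a missing `decay_pct` as the marker for static content is an assumption of this example. The function and parameter names are also illustrative.

```python
def article_score(semantic_pct, bm25_pct, cross_pct, decay_pct=None):
    """Combine percentile-normalized component scores into an article score.

    decay_pct=None is used here to mean static content (no publication date).
    """
    if decay_pct is None:
        # Static content: semantic 50%, BM25 15%, cross-encoder 35%.
        return 0.50 * semantic_pct + 0.15 * bm25_pct + 0.35 * cross_pct
    # Non-static content: semantic 15%, time decay 35%, BM25 15%, cross-encoder 35%.
    return (0.15 * semantic_pct + 0.35 * decay_pct
            + 0.15 * bm25_pct + 0.35 * cross_pct)


# Hypothetical component percentiles for the same article, scored both ways.
print(article_score(0.8, 0.4, 0.6))                 # treated as static
print(article_score(0.8, 0.4, 0.6, decay_pct=0.2))  # dated, with a low decay percentile
```

For dated content, most of the semantic weight moves to the time-decayed score, so an old article with strong semantic similarity scores lower than the same article treated as static.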

Decay Score

Time decay adjusts the semantic score based on article age, giving more weight to recent content.

score_decay = base_score × e^(-λ × t^p)
Where:
score_decay = the adjusted score after applying time decay
base_score = the original semantic score before decay
e = Euler's number (≈ 2.71828)
λ = decay rate = 0.03 day^-1
t = time since publication in days
p = power = 1.25
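
A minimal sketch of the decay score, reading the exponent as -λ × t raised to the power p. The constant and function names are illustrative.

```python
import math

DECAY_RATE = 0.03  # λ, per day
POWER = 1.25       # p


def decay_score(base_score, age_days):
    """Apply time decay to a semantic score: base × exp(-λ × t**p)."""
    return base_score * math.exp(-DECAY_RATE * age_days ** POWER)


# The same base score at several article ages (in days).
for age in (0, 1, 7, 30):
    print(age, round(decay_score(0.8, age), 4))
```

Because p is greater than 1, the decay accelerates with age: each additional day costs an older article proportionally more than a fresh one.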

Alt6 Scoring (Temporal-Aware Relevance)

Alt6 is the production scoring formula for source ranking. It multiplies seven factors. Most of them default to 1.0 (neutral) for any given question, so a source's score is driven by the two or three factors that matter for that question. For an end-to-end methodology overview, see dashboards/ALT6_METHODOLOGY.md.

Formula:

Alt6 = RelevancePct × DecayFactor × AnchorFactor × WindowFactor × TemporalCompat × EntityPresence × DRImpact
Where:
RelevancePct — relevance to the question, percentile-ranked against the session pool (always active)
DecayFactor — source freshness (BRT/RWR/CRBN questions only); 1.0 for TAH-EA/ER
AnchorFactor — proximity of source publish date to question's event date (TAH-EA only); 1.0 otherwise
WindowFactor — whether source falls inside the question's time window (TAH-ER only); 1.0 otherwise
TemporalCompat — title/description year-mention check (TAH only): match × 1.15, mismatch × 0.80, neutral × 1.00
EntityPresence — whether the people named in the question appear in the source: title boost × 1.20, description boost × 1.12, graduated penalties when missing
DRImpact — domain reliability adjustment (currently a no-op; all multipliers default to 1.00)
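
The product structure of Alt6 can be sketched with a small data class in which every factor except RelevancePct defaults to the neutral value 1.0. The class and field names are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass
class Alt6Factors:
    relevance_pct: float           # always active
    decay_factor: float = 1.0      # BRT/RWR/CRBN only
    anchor_factor: float = 1.0     # TAH-EA only
    window_factor: float = 1.0     # TAH-ER only
    temporal_compat: float = 1.0   # 1.15 match, 0.80 mismatch, 1.00 neutral
    entity_presence: float = 1.0   # 1.20 title boost, 1.12 description boost
    dr_impact: float = 1.0         # currently a no-op


def alt6(f):
    """Multiply the seven Alt6 factors together."""
    return (f.relevance_pct * f.decay_factor * f.anchor_factor * f.window_factor
            * f.temporal_compat * f.entity_presence * f.dr_impact)


# Hypothetical TAH-EA source: halfway down the anchor curve, year match, entity in title.
source = Alt6Factors(relevance_pct=0.80, anchor_factor=0.5,
                     temporal_compat=1.15, entity_presence=1.20)
print(round(alt6(source), 3))
```

Only the factors that differ from 1.0 change the result, which is why a handful of factors determine the score for any one question.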

Relevance Calculation:

Relevance = wCross × pCross + wBM25 × pBM25 + wSem × pSem
Where:
wCross = 0.75 (default; configurable in Global Controls)
wBM25 = 0.075 (default)
wSem = 0.175 (default; weights should sum to 1.0)
pCross = cross-encoder percentile
pBM25 = BM25 percentile
pSem = semantic (bi-encoder) percentile
RelevancePct = percentile_rank(Relevance) across the session pool — this is what enters the Alt6 formula
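
The relevance blend and its percentile rank can be sketched as follows. The exact percentile definition used by the dashboard (tie handling in particular) is not stated here, so the "fraction of pool values less than or equal to the value" rule below is an assumption, as are the function names.

```python
from bisect import bisect_right

W_CROSS, W_BM25, W_SEM = 0.75, 0.075, 0.175  # default weights, summing to 1.0


def relevance(p_cross, p_bm25, p_sem):
    """Weighted blend of the three component percentiles."""
    return W_CROSS * p_cross + W_BM25 * p_bm25 + W_SEM * p_sem


def percentile_rank(value, sorted_pool):
    """Fraction of pool values <= value (assumed tie handling)."""
    if not sorted_pool:
        raise ValueError("session pool is empty")
    return bisect_right(sorted_pool, value) / len(sorted_pool)


# Hypothetical session pool of raw Relevance values, sorted ascending.
pool = [0.10, 0.20, 0.30, 0.40]
rel = relevance(p_cross=0.8, p_bm25=0.4, p_sem=0.6)
print(round(rel, 3), percentile_rank(rel, pool))
```

Keeping the pool sorted lets each new source be ranked with a binary search, which suits a pool that is frozen at startup and queried for every polled question.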

Temporal Destinations & Decay Parameters:

Five surviving destinations after the 2026-04-25 simplification. Each question carries a "path" tag showing how it got there (cascade, override, reroute, future).

BRT: Breaking / Real-Time. Half-life ~24 hrs, floor 0.10. Aggressive decay for breaking news only.
RWR (+UNKNW, +cascade): Recent Window / Recap. Half-life 14 days, floor 0.25. Catches: native RWR, UNKNW, Unclassified, all BRT/RWR cascade fallbacks.
EvtAnch/TAH: TAH Event Anchored. Anchor-centered temporal factor (no recency decay). t½=120d, floor=0.27. Recent EA → BRT/RWR via recency override; old / reference-shaped EA → CRBN via reference reroute.
ExpRng/TAH: TAH Explicit Range. Window-compliance scoring (in-window = 1.0, outside-window decay by distance to nearest boundary). t½=180d, floor=0.27.
CRBN (+Future-CRBN, +reference-rerouted): Reference / historical / encyclopedic. Very gentle decay. t½=180d, floor=0.70. Catches: native CRBN, Future-CRBN, and TAH questions caught by the reference reroute (EA / ER / Comp → CRBN). Used for fiction references, old historical lookups, definitional / structural questions.

Decay formula: DecayFactor = max(floor, 0.5^(age_days / half_life))
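
A minimal sketch of the decay formula with the per-destination parameters listed above. The BRT half-life of about 24 hours is expressed as 1.0 day. The table and function names are illustrative.

```python
# (half_life_days, floor) per decaying destination, from the list above.
DESTINATION_DECAY = {
    "BRT": (1.0, 0.10),     # ~24 hours
    "RWR": (14.0, 0.25),
    "CRBN": (180.0, 0.70),
}


def decay_factor(age_days, destination):
    """DecayFactor = max(floor, 0.5 ** (age_days / half_life))."""
    half_life, floor = DESTINATION_DECAY[destination]
    return max(floor, 0.5 ** (age_days / half_life))


# The same 14-day-old source under each destination.
for dest in DESTINATION_DECAY:
    print(dest, round(decay_factor(14, dest), 3))
```

The floor keeps old sources from being zeroed out: once the exponential falls below it, every older source gets the same factor.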

EvtAnch/TAH anchor formula: AnchorFactor = max(floor, exp(-ln2 × |pub_date - anchor_date| / half_life)). Symmetric around anchor date.
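
The anchor formula can be sketched with the Event Anchored defaults (t½=120d, floor=0.27). The function and parameter names are assumptions for this example.

```python
import math
from datetime import date


def anchor_factor(pub_date, anchor_date, half_life_days=120.0, floor=0.27):
    """AnchorFactor = max(floor, exp(-ln2 × |pub_date - anchor_date| / half_life))."""
    distance_days = abs((pub_date - anchor_date).days)
    return max(floor, math.exp(-math.log(2) * distance_days / half_life_days))


# Hypothetical anchor date, with sources published before and after it.
anchor = date(2025, 1, 1)
for pub in (date(2024, 9, 3), date(2025, 1, 1), date(2025, 5, 1)):
    print(pub, round(anchor_factor(pub, anchor), 3))
```

Only the absolute distance matters, so a source published 120 days before the event gets the same factor as one published 120 days after it.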

Path Tags (how a question got to its destination):

cascade from X: BRT/RWR question backed off to RWR/CRBN because too few sources were within age-to-floor.
recency override (X→Y): TAH question's anchor/range was recent enough that BRT/RWR decay was applied instead of anchor/window.
reroute from X: EA/ER/Comp question rerouted to CRBN (e.g. fiction references, very old historical lookups).
Future-X from Y: TAH question about a future event routed to BRT/RWR/CRBN based on how far ahead the event is.
UNKNW / Unclassified: Question had no temporal subclass; collapsed to RWR with this path tag.

Unknown Publish Date Handling:

Sources with no publish date go through the normal decay calculation, which clamps at the subclass's floor:

BRT: 0.10 (floor)
RWR: 0.25 (floor)
TAH (EA / ER): 1.0. Anchor / window scoring handles temporal relevance; estimated dates get an additional ×0.80 penalty (the TAH estimatedDatePenalty)
CRBN: 0.70 (floor)

Sources with no real publish date show "(no date)" indicator in purple.

UI Indicators:

  • Rel(X.XXX)→RelPct(X.XX) - Shows raw Relevance score and its percentile rank in the session pool
  • q-rank: N/M - Shows the source's rank within this question (N of M sources, sorted by Relevance descending)
  • ⓘ icons - Hover for detailed breakdown tooltips showing raw values and calculations
  • Session pool - Displayed in the Alt6 config panel, shows total sources used for percentile calculation
  • "inputs missing" - Shown when a source lacks cross_pct or bm25_pct values needed for Alt6
  • Decay badge - Shows temporal subclass and decay parameters near the question/answer

Score Deltas:

  • ScoreΔ = Alt6 - ProdScore (positive = Alt6 scores higher)
  • RankΔ = ProdRank - Alt6Rank (positive = Alt6 ranks the source higher)
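
A short sketch of the two deltas and their sign conventions. The function names are illustrative.

```python
def score_delta(alt6_score, prod_score):
    """ScoreΔ: positive when Alt6 scores the source higher than Prod."""
    return alt6_score - prod_score


def rank_delta(prod_rank, alt6_rank):
    """RankΔ: positive when Alt6 ranks the source higher (a smaller rank number)."""
    return prod_rank - alt6_rank


# Hypothetical source that moves from Prod rank 5 to Alt6 rank 2.
print(rank_delta(prod_rank=5, alt6_rank=2))
```

The operands are swapped between the two formulas because a higher score is better while a lower rank number is better, so a positive value means "Alt6 favors this source" in both cases.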

Relevance Health Diagnostic

A per-question diagnostic that summarizes the absolute strength of the retrieved candidate set using RAW Relevance (not RelPct), to answer: "Do we have enough strong sources, or should we retrieve more / broaden search?"

Status Levels:

🟢 Strong max ≥ 0.75 AND count(Relevance ≥ 0.70) ≥ 3
🟡 Thin max ≥ 0.65 AND count(Relevance ≥ 0.60) ≥ 2
🔴 Weak — expand sources Otherwise (candidate set may need broader search)
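
The status thresholds above can be sketched as a small classifier over the raw Relevance values of one question. Treating an empty candidate set as Weak is an assumption of this example, and the function name is illustrative.

```python
def relevance_health(relevances):
    """Classify a question's candidate set as 'Strong', 'Thin', or 'Weak'."""
    if not relevances:
        return "Weak"  # assumption: no valid sources counts as weak
    top = max(relevances)
    n_070 = sum(r >= 0.70 for r in relevances)
    n_060 = sum(r >= 0.60 for r in relevances)
    if top >= 0.75 and n_070 >= 3:
        return "Strong"
    if top >= 0.65 and n_060 >= 2:
        return "Thin"
    return "Weak"


# Hypothetical raw Relevance values: one strong source, little depth behind it.
print(relevance_health([0.80, 0.62, 0.30]))
```

A single very strong source is not enough for Strong: the count conditions require depth as well as a high maximum.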

Detail Panel (click to expand):

  • Counts: N total sources, N valid (with pCross and pBM25)
  • Distribution: max, p75, median of Relevance
  • Strength: Count of sources ≥0.70, ≥0.60, ≥0.50

Note: This diagnostic uses raw Relevance (time-independent signal: meaning + keywords) to detect when the candidate set is weak even if ranks or percentiles look fine. It does not affect ranking or selection.

Batch LLM Evaluation

Run LLM-as-Judge evaluation across multiple questions in one batch. This panel is located between the analytics sections and the question list.

  • Evaluation Weights: Set per-criterion weights (must sum to 100%). These weights are applied to all questions in the batch.
  • Batch Size: Choose how many questions to evaluate (1, 5, 10, 20, or 30).
  • Source Scope: "All sources" uses unfiltered Alt6 Raw sources.
  • Start Evaluation: Picks the next N uncomputed questions from the current filtered view. For each question: generates an Alt6 Raw answer, then runs LLM-as-Judge on both the production answer and the Alt6 Raw answer.
  • Non-recompute: Questions already evaluated in the Last Run with the same weights are skipped.
  • Settings Change Warning: If you change weights, a confirmation dialog appears before starting.
  • Summary Grid: Shows all evaluated questions with scores, deltas, and metadata. Click a question to scroll to its card and auto-expand the LLM Judge section.
  • Breakdowns: Group results by temporal subclass, category, classification, and delta bins.
  • Charts: Grouped bar chart (avg scores by subclass) and scatter plot (Prod vs Alt6 totals with 45-degree reference line).
  • Export: Download results as CSV (flattened) or JSON (full run object). Includes run metadata, weights, and per-criterion scores.
  • Run History: Last Run and Previous Run are preserved in browser storage. On page reload, persisted runs are restored. Incomplete runs from crashes can be recovered.
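
The selection and skip behavior described in the list above can be sketched as follows. The shape of the `last_run` record (`weights`, `evaluated_ids`) and the function name are hypothetical. The real run object may be structured differently.

```python
def next_batch(filtered_question_ids, last_run, weights, batch_size):
    """Pick the next batch_size questions that still need evaluation.

    Questions evaluated in the last run are skipped only when that run
    used the same weights; otherwise everything is eligible again.
    """
    same_settings = last_run is not None and last_run["weights"] == weights
    done = set(last_run["evaluated_ids"]) if same_settings else set()
    pending = [qid for qid in filtered_question_ids if qid not in done]
    return pending[:batch_size]


# Hypothetical filtered view and last run.
view = ["q1", "q2", "q3", "q4", "q5"]
weights = {"relevance": 25, "faithfulness": 20}
last_run = {"weights": dict(weights), "evaluated_ids": ["q1", "q3"]}
print(next_batch(view, last_run, weights, batch_size=2))
```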

Last updated: March 3, 2026

Technical Documentation (for Claude Code)

Last updated: 2026-01-31

Dashboard Startup Commands

There are two dashboards in this project:

  • Ask Dashboard (Q&A monitoring) - Port 5002
    .venv/bin/python3 dashboards/run_ask_dashboard.py
    URL: http://localhost:5002
  • Article Dashboard (streaming article monitor) - Port 8083
    .venv/bin/python3 dashboards/backend/flask_dashboard.py
    URL: http://localhost:8083

Important: Dashboard File Locations

Warning: There is a deprecated flask_dashboard.py in the project root. Do NOT use it.

  • Correct location: dashboards/backend/flask_dashboard.py - Uses templates from dashboards/templates/
  • Deprecated (DO NOT USE): flask_dashboard.py (root) - Uses old dashboard_template.html with mismatched API endpoints

Directory Structure

dashboards/
├── backend/
│   └── flask_dashboard.py    # Article Dashboard backend (port 8083)
├── templates/
│   ├── ask_dashboard.html    # Ask Dashboard template
│   └── main_dashboard.html   # Article Dashboard template
└── run_ask_dashboard.py      # Ask Dashboard backend (port 5002)

Restarting Dashboards

To kill and restart a dashboard:

# Ask Dashboard
pkill -9 -f "run_ask_dashboard" 2>/dev/null; sleep 1; .venv/bin/python3 dashboards/run_ask_dashboard.py &

# Article Dashboard
pkill -9 -f "flask_dashboard" 2>/dev/null; sleep 1; .venv/bin/python3 dashboards/backend/flask_dashboard.py &

Article Score Formula

For Static Content:
score = 0.50 × score_semantic_pct + 0.15 × score_bm_25_pct + 0.35 × score_cross_pct
For Non-Static Content:
score = 0.15 × score_semantic_pct + 0.35 × score_decay_pct + 0.15 × score_bm_25_pct + 0.35 × score_cross_pct
score_semantic_pct = semantic similarity (percentile)
score_decay_pct = time-decayed score (percentile)
score_bm_25_pct = BM25 keyword matching (percentile)
score_cross_pct = cross-encoder score (percentile)

Decay Score Formula

score_decay = base_score × e^(-λ × t^p)
score_decay = adjusted score after time decay
base_score = original semantic score
e = Euler's number (≈ 2.71828)
λ = decay rate = 0.03 day^-1
t = time since publication (days)
p = power = 1.25