Overview
This dashboard monitors questions asked through the Ask module in real time. It displays questions, answers, source articles, named entities, and performance metrics.
Header Controls
- Auto-refresh: Toggle to enable/disable automatic data refresh
- Interval: Set how often the dashboard refreshes (1-15 minutes)
- Last updated: Shows when data was last fetched
Summary Statistics
- Total Questions: Number of questions in the current view
- Backstop Queries: Questions that required fallback processing
- Unique Cited Domains: Count of distinct domains actually cited in answers. Hover over this box to see a breakdown showing each domain, how many times it was cited, and its average Domain Reliability (DR) score.
- Last updated: Shows when the dashboard last checked for new questions
Date Range Selection:
- Preset Buttons: Quick selection options - Today, Yesterday, Last 7/15/30/60/90 days (default: Last 15 days)
- Custom Date Range: Enter specific start and end dates, then click "Apply" to filter. Click "Clear" to return to preset mode.
- Mutual Exclusivity: Preset buttons and custom dates are mutually exclusive - selecting one disables the other.
- Auto-refresh Indicator: Shows whether auto-refresh is enabled (green dot = ON, gray dot = OFF).
Auto-refresh Behavior:
- Auto-refresh ON: Today, Last 7, Last 15, Last 30, Last 60, Last 90 days (live monitoring)
- Auto-refresh OFF: Yesterday and Custom date ranges (historical data, no updates expected)
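The mapping from date range to auto-refresh state can be sketched as a small lookup. This is an illustration only: the preset keys (`today`, `last_7`, and so on) and the function name are hypothetical and are not taken from the dashboard code.

```python
from typing import Optional

# Hypothetical preset keys; the help text only names the button labels.
LIVE_PRESETS = {"today", "last_7", "last_15", "last_30", "last_60", "last_90"}


def auto_refresh_enabled(preset: Optional[str], custom_range: bool = False) -> bool:
    """Return True for live-monitoring ranges, False for historical ones."""
    if custom_range:
        # Custom date ranges are historical, so no updates are expected.
        return False
    # "yesterday" (or no preset at all) is not in the live set.
    return preset in LIVE_PRESETS
```

For example, `auto_refresh_enabled("last_15")` is true, while `auto_refresh_enabled("yesterday")` and any custom range are false.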
Performance metrics are split into two rows:
- Non-Backstop row (green): Averages for queries answered without fallback
- Backstop row (red): Averages for queries that required fallback processing
Each row shows: Total, Search, Answer, Suggest, Other, Words, W/Sec, Cites, and Avg Score (average article score of cited sources).
Note: Click section headers to expand/collapse the Category/Classification Breakdown Tables and Alt6 Ranking Analysis sections.
Category / Classification / Temporal Subclass Breakdown tables:
All three tables share the same column structure, grouped by category, classification, or temporal subclass respectively. Updated 2026-03-12.
- Name: The grouping key (category, classification, or temporal subclass)
- #: Number of questions in that group
- Prod Top 8 DR: Average domain reliability of top 8 sources by prod rank
- Prod Cited DR: Average domain reliability of sources cited in the prod answer
- Alt6 Raw: Average altScore6 of the top 8 Alt6 Raw sources (null DR allowed, below-threshold DR blocked)
- Raw Top 8 DR: Average domain reliability of the top 8 Alt6 Raw sources
- Raw Cited: (appears when Raw answers have been generated) Average altScore6 of sources cited in the generated Raw answer
Click the sort arrows on any column header to sort the table. Default sort is by count descending.
Source Quality: With vs Without Backstop Sources:
Compares Alt6 Raw top-8 metrics when backstop articles (source_origin 8/9/10) are excluded vs included in the source pool. Updated 2026-04-14.
- Excluding BS Sources: Top 8 picked after removing backstop articles from the pool.
- All Sources: Top 8 picked from the full pool including backstop articles.
- Raw columns (Score, Rel, Decay, DR): Metrics for top 8 Alt6 Raw sources (null DR allowed, below-threshold DR blocked, classification excluded).
- Delta row: All Sources minus Excluding BS. Green = backstop improves metric, Red = backstop dilutes.
Alt6 Ranking Analysis (Temporal Decay vs Prod):
Compares how Alt Score 6 (temporal decay scoring) ranks sources differently from Production. Alt6 applies subclass-specific half-life decay to relevance scores, which can significantly change rankings for time-sensitive content.
Two-Tier Ranking System: Alt6 ranking uses the same two-tier approach as Production ranking:
- Tier 1 (Top positions): Non-excluded sources ranked by Alt6 score (highest = rank 1)
- Tier 2 (Bottom positions): Excluded sources ranked by Alt6 score among themselves
- An excluded source with a high Alt6 score will always rank below a non-excluded source with a lower Alt6 score
- Exclusion criteria: backend is_excluded flag, classification (adult/conspiracy/gambling), and user-configured thresholds (DR, Score, Semantic, etc.)
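The two-tier ordering can be sketched with a single sort key. This is a minimal sketch, assuming the exclusion decision has already been reduced to one boolean per source; the `Source` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str        # hypothetical identifier
    alt6: float      # Alt6 score
    excluded: bool   # assumed to combine all exclusion criteria


def two_tier_rank(sources):
    """Rank non-excluded sources first, then excluded ones, each by Alt6 descending."""
    # False sorts before True, so tier 1 (non-excluded) always precedes tier 2.
    ordered = sorted(sources, key=lambda s: (s.excluded, -s.alt6))
    return {s.name: rank for rank, s in enumerate(ordered, start=1)}


ranks = two_tier_rank([
    Source("a", 0.40, False),
    Source("b", 0.95, True),   # highest score, but excluded
    Source("c", 0.70, False),
])
```

Here source `b` has the highest Alt6 score but still lands in the last position because it is excluded.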
Ranking Shift Metrics:
- Avg absolute change: Average number of slots sources moved (regardless of direction)
- Avg improvement/worsening: Average slot change for sources that improved/worsened
- % improved/worsened: Percentage of sources that moved up/down in ranking
- Bottom 1/3 -> Top 1/3: Sources that jumped from bottom third to top third
- Top X dropped D+ slots: Top-ranked sources that dropped significantly
- Bottom 50% -> Top X: Sources from bottom half that made it to top X (only for questions with 50+ sources)
- Jumped/Dropped K+ slots: Sources with significant rank changes
Configure thresholds (Top X, K, D) in Ranking Metrics Settings. Summary shows percentages; per-question shows raw counts.
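A few of these metrics can be sketched for a single question as follows. This is a hedged illustration, not the dashboard's implementation: the function name, the dictionary inputs, and the default `k` are assumptions. The sign convention (positive = moved up under Alt6) follows the RankΔ definition used elsewhere in this help text.

```python
def shift_metrics(prod_ranks, alt_ranks, k=5):
    """Compare two rankings given as dicts of source id -> rank (1 = best)."""
    # Positive delta means the source ranks higher under Alt6 than under Prod.
    deltas = [prod_ranks[s] - alt_ranks[s] for s in prod_ranks]
    n = len(deltas)
    if n == 0:
        return None
    improved = [d for d in deltas if d > 0]
    worsened = [-d for d in deltas if d < 0]
    return {
        "avg_abs_change": sum(abs(d) for d in deltas) / n,
        "avg_improvement": sum(improved) / len(improved) if improved else 0.0,
        "avg_worsening": sum(worsened) / len(worsened) if worsened else 0.0,
        "pct_improved": 100.0 * len(improved) / n,
        "pct_worsened": 100.0 * len(worsened) / n,
        "jumped_k_plus": sum(1 for d in deltas if d >= k),
        "dropped_k_plus": sum(1 for d in deltas if d <= -k),
    }


metrics = shift_metrics(
    {"a": 1, "b": 2, "c": 3, "d": 4},
    {"a": 3, "b": 2, "c": 1, "d": 4},
    k=2,
)
```

In this example one source moves up two slots and one moves down two slots, so the average absolute change is 1.0 and 25% of sources improve.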
Ranking Shift Metrics (per question):
Expand this section on any question card to see detailed ranking shift metrics for that specific question, comparing Alt6 vs Production ranking.
Master Article Decay Score Settings:
Configure alternative decay score formulas. The decay formula is: Decay Score = Base Score × e^(-(λ × (t^p)))
- Prod Decay Score: Read-only display of production values (λ=0.03, p=1.25)
- Alt Decay Score 1 & 2: Configure custom decay formulas
Variables (nested dependencies):
- Base Score: Semantic/embedding score (always independently selectable)
- e (Euler's number): Constant (approximately 2.71828) - enables exponential decay
- ^(-(exponent)): Negative exponential (requires e to be checked)
- λ (Decay Rate): Controls decay speed (requires ^(-(exponent)))
- t (Days Since Publication): Time variable (requires ^(-(exponent)))
- p (Power): Exponent for time (requires t)
Check the Save checkbox to persist your configuration across page refreshes.
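The full decay formula, with every variable enabled, can be written as a one-line function. This is a minimal sketch: the function and parameter names are hypothetical, and it does not model what happens when individual variables are unchecked.

```python
import math


def decay_score(base_score, lam, t_days, p):
    """Decay Score = Base Score * e^(-(lam * t^p))."""
    return base_score * math.exp(-(lam * t_days ** p))


# Production parameters from the read-only display: lam = 0.03, p = 1.25.
prod_example = decay_score(0.923, lam=0.03, t_days=14.5, p=1.25)
```

With the production parameters, a source with base score 0.923 published 14.5 days before the question scores roughly 0.395. A source published on the same day (t = 0) keeps its full base score.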
Alt Decay Columns (AD1 & AD2):
When Alt Decay Score 1 or 2 is enabled, two new columns appear in the Sources table: AD1 and AD2. These columns display:
- Computed Value: The decay score calculated using your configured formula (e.g., 0.058)
- Formula Breakdown: Shows the actual values used in the computation (e.g., 0.923 × e^(-0.05 × 14.5^1.5))
- Column Header Tooltip: Hover over AD1/AD2 header to see the configured formula
Key Details:
- Alt Decay uses the raw semantic score (not percentalized)
- Days since publication (t) is calculated as the difference between the source's published_at and the question's asked_at
- Columns update immediately when you change configuration values while the toggle is ON
- When the toggle is OFF, columns display "-" for all sources
Show PCT Mode:
- When "Show PCT" is enabled, Alt Decay columns display percentalized values (0-1 range)
- Percentalization uses min-max normalization across all sources within the question
- Column headers show a "(%)" suffix when PCT mode is ON
- Highest Alt Decay value becomes 1.0, lowest becomes 0.0
- In PCT mode, formula breakdown is hidden (only the normalized value is shown)
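Min-max normalization within one question can be sketched as below. The function name is hypothetical, and the handling of the degenerate case where all sources share the same value is an assumption, since the help text does not specify it.

```python
def min_max_normalize(values):
    """Scale values to the 0-1 range: lowest becomes 0.0, highest becomes 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: when every value is identical, return 0.0 for all of them.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([0.25, 0.5, 0.75])
```

For the three Alt Decay values 0.25, 0.5, and 0.75, the normalized values are 0.0, 0.5, and 1.0.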
Filters
- User: Filter questions by specific user
- Classification: Filter by question type (Investigative, Temporally-Aware, etc.)
- Category: Filter by content category (politics, business, sports, etc.)
- Backstop: Filter by backstop status (All, Non-Backstop, or Backstop only)
All metrics and statistics update dynamically based on the current filter selection.
Question Cards
Each question is displayed as a card with expandable sections:
- Answer: The generated response with citation numbers (tabs for Alt Answers in header)
- Sources: Source articles used (click headers to expand)
- Suggestions: Follow-up questions
- Named Entities: People, organizations, locations identified
- Performance Metrics: Timing data for the query
- Ranking Shift Metrics: How Alt Score rankings compare to Prod rankings
Answer Section (Three-Column Layout):
The Answer section displays three columns side-by-side for easy comparison:
- Answer (Column 1): The production answer from the original query
- Alt Answer 1 (Column 2): Generate an alternative answer using top 8 sources ranked by Alt Score 1
- Alt Answer 2 (Column 3): Generate an alternative answer using top 8 sources ranked by Alt Score 2
How to use the Answer Section:
- Expand/Collapse: Click anywhere on the Answer header to toggle the section
- Side-by-Side Comparison: All three answers are visible simultaneously when expanded
- Vertical Dividers: Columns are separated by vertical lines for clarity
- Equal Width: Each column takes 33% of the width
How Alt Answers work:
- Alt Score 1 or 2 must be configured in Master Article Score Settings to enable generation
- Click "Generate Alt Answer" to call the API with the top 8 sources by Alt Rank
- Two-tier ranking ensures the top 8 sources are always non-excluded (excluded sources are ranked at the bottom)
- The answer is generated using the same question but with different source articles
- Generated answers are cached for the session (not persisted across page refreshes)
Question Badges
CATEGORY
Content category (politics, sports, etc.)
CLASSIFICATION
Question type classification
BACKSTOP [codes]
Question required fallback processing. Reason codes (e.g. [1000, 1003]) indicate which criteria triggered the backstop call. Hover over the badge to see all 7 criteria with triggered ones highlighted.
Avg Prod Cited Score: 0.XXX
Average production score of cited sources only
Avg Prod Top 8 Score: 0.XXX
Average production score of top 8 sources by rank (or all sources if fewer than 8)
Avg Alt1 Top 8 Score: 0.XXX
Average Alt1 score of top 8 sources by Alt1 rank (n/a if Alt1 not configured)
Avg Alt2 Top 8 Score: 0.XXX
Average Alt2 score of top 8 sources by Alt2 rank (n/a if Alt2 not configured)
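The "top 8, or all sources if fewer than 8" averaging rule can be sketched as follows. The function name and the `(rank, score)` tuple layout are hypothetical, and returning `None` for an empty list (to be shown as n/a) is an assumption.

```python
def avg_top_n_score(sources, n=8):
    """Average the scores of the n best-ranked sources.

    sources: list of (rank, score) tuples, where rank 1 is the best.
    """
    if not sources:
        return None  # assumption: rendered as "n/a" by the caller
    # Slicing handles the "fewer than n sources" case automatically.
    top = sorted(sources, key=lambda rs: rs[0])[:n]
    return sum(score for _, score in top) / len(top)


ten_sources = [(rank, 0.5 if rank <= 8 else 0.0) for rank in range(1, 11)]
avg_of_ten = avg_top_n_score(ten_sources)
avg_of_two = avg_top_n_score([(1, 0.5), (2, 0.25)])
```

With ten sources only ranks 1 to 8 contribute, so the two zero-scored sources at ranks 9 and 10 do not lower the average.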
Sources Section Badges
These badges appear in the Sources section for each question:
Avg Prod Cited Domain Reliability: 0.XXX
Average domain reliability of cited sources only
Avg Prod Top 8 Domain Reliability: 0.XXX
Average domain reliability of top 8 sources by rank
Avg Prod Cited Readability: 0.XXX
Average readability of cited sources only
Avg Prod Top 8 Readability: 0.XXX
Average readability of top 8 sources by rank
Avg Alt1 Top 8 Domain Reliability: 0.XXX
Average DR of top 8 sources by Alt1 rank (n/a if Alt1 not configured)
Avg Alt1 Top 8 Readability: 0.XXX
Average readability of top 8 sources by Alt1 rank (n/a if Alt1 not configured)
Avg Alt2 Top 8 Domain Reliability: 0.XXX
Average DR of top 8 sources by Alt2 rank (n/a if Alt2 not configured)
Avg Alt2 Top 8 Readability: 0.XXX
Average readability of top 8 sources by Alt2 rank (n/a if Alt2 not configured)
Sources Table
The sources table shows articles considered for the answer:
| Age | Time since article was published |
| Rnk | Article rank (lower = higher relevance) |
| Score | Combined relevance score (0-1) |
| Sem | Semantic similarity score |
| BM25 | Keyword matching score |
| Cross | Cross-encoder relevance score |
| DR | Domain Reliability score (0-100) |
| Depth | Content Depth Score (0-5) |
| Pos | Positive Sentiment |
| Neg | Negative Sentiment |
| Sent | Sentiment summary (sum of the positive and negative sentiment scores) |
| Read | Readability score |
| Decay | Time adjusted semantic score |
| Class | Article classification |
| Excl | Whether the content is ignored |
Source Row Colors
Pink background: Top-8 article NOT cited in the answer
White background: Article cited in the answer OR ranked below 8
Score Colors
0.600
High score (>= 0.5)
0.350
Medium score (>= 0.2 and < 0.5)
0.100
Low score (< 0.2)
Keyboard Shortcuts
- Escape: Close this help dialog
LLM-as-Judge Evaluation
This feature uses OpenAI's GPT-4o-mini to evaluate and compare answer quality. It is located between the Answer and Sources sections in each question card.
How to Use:
- Expand the "LLM-as-Judge Evaluation" section
- Click the "Evaluate Answers" button to trigger evaluation
- Evaluation includes the Production answer, plus Alt Answer 1 and/or Alt Answer 2 if generated
Prompt Modes:
- Original: Standard evaluation prompt with 7 criteria
- Alternate-1 (Core Ask Aware): Adds a "Core Ask" reasoning step that forces the judge to identify the specific relationship, action, or event before scoring. It strengthens the relevance criterion (score capped at 3 if the core ask is not addressed) and evidence utilization (prefers evidence that addresses the core ask). Comparison guidance penalizes answers that are merely broader or safer without answering the question.
- Alternate-2 (Core Ask Gate): A 5-step evaluation scaffold, designed for cases where Alt-1's soft cap is insufficient:
  (1) Extract the Core Ask, with concrete-involvement framing and minimum facts.
  (2) Core Ask Summary: returned in JSON (max 20 words) and visible as a chip in the UI.
  (3) Coverage Check: classifies each answer as direct, partial, or missed; returned in JSON and visible as a color-coded badge per answer.
  (4) Scoring Constraint: hard, binding caps on relevance (direct → 1-5, partial → ≤4, missed → ≤2) and on evidence_utilization (non-core-ask evidence → ≤4, mostly contextual → ≤3).
  (5) Winner Constraint: direct-coverage answers must beat missed-coverage answers unless there are factual errors.
  Its comparison guidance is also stronger: it prohibits preferring broader or safer answers.
Select prompt mode in the batch eval controls. Switching modes counts as a settings change. Each result records which prompt mode was used. The Prompt Comparison section supports pairwise comparison across all mode pairs.
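The Alternate-2 relevance caps can be expressed as a small lookup, for example to sanity-check a judge result after the fact. This is a hypothetical validator sketch: in the dashboard the caps are part of the judge prompt, and the names below are not taken from its code.

```python
# Maximum relevance score allowed for each coverage class (Alternate-2 mode).
RELEVANCE_CAP = {"direct": 5, "partial": 4, "missed": 2}


def capped_relevance(score, coverage):
    """Return the relevance score after applying the coverage cap."""
    return min(score, RELEVANCE_CAP[coverage])


def violates_cap(score, coverage):
    """True when a judge-reported score exceeds the cap for its coverage class."""
    return score > RELEVANCE_CAP[coverage]
```

For example, a relevance score of 4 is valid for a "partial" answer but violates the cap for a "missed" answer, where it would be reduced to 2.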
Evaluation Criteria (1-5 scale):
| Relevance & Task Alignment (25%) | Does the response directly address the question, cover all sub-questions, and avoid topic drift? |
| Faithfulness / Groundedness (20%) | Are all factual claims supported by the retrieved sources without hallucination? |
| Claim–Evidence Mapping (15%) | Can each claim be explicitly mapped to supporting source evidence? |
| Temporal Correctness (15%) | Does the response correctly align with the temporal scope of the question and sources? |
| Evidence Utilization (10%) | Does the response incorporate all key relevant evidence from the sources? |
| Internal Consistency (10%) | Is the response free of logical contradictions and conflicting assertions? |
| Appropriate Uncertainty (5%) | Does the response appropriately qualify uncertainty when evidence is insufficient? |
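One way to combine the per-criterion scores with these weights is a weighted mean, sketched below. This assumes the overall score is a weighted average on the same 1-5 scale; the dictionary keys and function name are hypothetical.

```python
# Default weights from the criteria table, in percent (they sum to 100).
DEFAULT_WEIGHTS = {
    "relevance": 25,
    "faithfulness": 20,
    "claim_evidence_mapping": 15,
    "temporal_correctness": 15,
    "evidence_utilization": 10,
    "internal_consistency": 10,
    "appropriate_uncertainty": 5,
}


def weighted_total(scores, weights=DEFAULT_WEIGHTS):
    """Weighted mean of per-criterion scores; weights are percentages."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    return sum(scores[name] * w for name, w in weights.items()) / 100


all_fives = {name: 5 for name in DEFAULT_WEIGHTS}
weak_relevance = dict(all_fives, relevance=3)
```

Because relevance carries 25% of the weight, dropping it from 5 to 3 while every other criterion stays at 5 lowers the total from 5.0 to 4.5.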
Score Scale:
4 (VeryGood)
Excellent performance on this criterion
3 (Good)
Adequate performance with minor issues
2 (Bad)
Noticeable problems affecting quality
1 (VeryBad)
Significant failures on this criterion
Per-Criterion Reasoning:
Each criterion includes a detailed reasoning (2-3 sentences) explaining why that specific score was assigned. The reasoning appears below each criterion score in italic text.
General Commentary:
Each answer receives a holistic assessment with specific examples and observations. This appears at the bottom of each score card in a highlighted box.
Comparison Analysis:
When multiple answers are available (Prod + Alt1 and/or Alt2), the LLM provides a comparative analysis explaining which answer is best and why, considering both content quality and source relevance.
Source Context:
The evaluation considers the top 8 ranked sources (title, description, and fragment) for each answer type, using Production ranking for the Prod answer and Alt Score ranking for Alt answers.
Alt6 Scoring (Temporal-Aware Relevance)
Alt6 is the production scoring formula for source ranking. It multiplies seven factors. Most factors default to 1.0 (neutral) for any given question, so a source's score is driven mainly by the two or three factors that matter for that question. For an end-to-end methodology overview, see dashboards/ALT6_METHODOLOGY.md.
Formula:
Alt6 = RelevancePct × DecayFactor × AnchorFactor × WindowFactor × TemporalCompat × EntityPresence × DRImpact
Where:
• RelevancePct — relevance to the question, percentile-ranked against the session pool (always active)
• DecayFactor — source freshness (BRT/RWR/CRBN questions only); 1.0 for TAH-EA/ER
• AnchorFactor — proximity of source publish date to question's event date (TAH-EA only); 1.0 otherwise
• WindowFactor — whether source falls inside the question's time window (TAH-ER only); 1.0 otherwise
• TemporalCompat — title/description year-mention check (TAH only): match × 1.15, mismatch × 0.80, neutral × 1.00
• EntityPresence — whether the people named in the question appear in the source: title boost × 1.20, description boost × 1.12, graduated penalties when missing
• DRImpact — domain reliability adjustment (currently a no-op; all multipliers default to 1.00)
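The multiplicative structure can be sketched as a function whose optional factors default to the neutral value 1.0. The function and parameter names are hypothetical; only the formula itself comes from this help text.

```python
from math import prod


def alt6_score(relevance_pct, decay=1.0, anchor=1.0, window=1.0,
               temporal_compat=1.0, entity_presence=1.0, dr_impact=1.0):
    """Alt6 = product of the seven factors; inactive factors stay at 1.0."""
    return prod([relevance_pct, decay, anchor, window,
                 temporal_compat, entity_presence, dr_impact])


# A source with half its freshness left and a title entity boost.
boosted = alt6_score(0.9, decay=0.5, entity_presence=1.20)
```

A source with RelevancePct 0.9, DecayFactor 0.5, and a title entity boost of 1.20 scores about 0.54. With every optional factor neutral, the score equals RelevancePct.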
Relevance Calculation:
Relevance = wCross × pCross + wBM25 × pBM25 + wSem × pSem
Where:
• wCross = 0.75 (default; configurable in Global Controls)
• wBM25 = 0.075 (default)
• wSem = 0.175 (default; weights should sum to 1.0)
• pCross = cross-encoder percentile
• pBM25 = BM25 percentile
• pSem = semantic (bi-encoder) percentile
• RelevancePct = percentile_rank(Relevance) across the session pool — this is what enters the Alt6 formula
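The two steps above, a weighted sum followed by a percentile rank across the session pool, can be sketched as follows. The exact percentile definition is not stated here, so the "fraction of pool values at or below this value" rule is an assumption, as are the function names.

```python
def relevance(p_cross, p_bm25, p_sem, w_cross=0.75, w_bm25=0.075, w_sem=0.175):
    """Weighted sum of the three percentile inputs (default weights sum to 1.0)."""
    return w_cross * p_cross + w_bm25 * p_bm25 + w_sem * p_sem


def percentile_rank(value, pool):
    """Assumed definition: fraction of pool values less than or equal to value."""
    return sum(1 for v in pool if v <= value) / len(pool)


pool = [0.1, 0.5, 0.9, 0.7]
rel_pct = percentile_rank(0.5, pool)
```

In the example pool, two of the four Relevance values are at or below 0.5, so its RelevancePct is 0.5 under the assumed definition.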
Temporal Destinations & Decay Parameters:
Five destinations remain after the 2026-04-25 simplification. Each question carries a "path" tag showing how it got there (cascade, override, reroute, future).
| BRT | Breaking / Real-Time. Half-life ~24 hrs, floor 0.10. Aggressive decay for breaking news only. |
| RWR (+UNKNW, +cascade) | Recent Window / Recap. Half-life 14 days, floor 0.25. Catches: native RWR, UNKNW, Unclassified, all BRT/RWR cascade fallbacks. |
| EvtAnch/TAH | TAH Event Anchored. Anchor-centered temporal factor (no recency decay). t½=120d, floor=0.27. Recent EA → BRT/RWR via recency override; old / reference-shaped EA → CRBN via reference reroute. |
| ExpRng/TAH | TAH Explicit Range. Window-compliance scoring (in-window = 1.0, outside-window decay by distance to nearest boundary). t½=180d, floor=0.27. |
| CRBN (+Future-CRBN, +reference-rerouted) | Reference / historical / encyclopedic. Very gentle decay. t½=180d, floor=0.70. Catches: native CRBN, Future-CRBN, and TAH questions caught by the reference reroute (EA / ER / Comp → CRBN). Used for fiction references, old historical lookups, definitional / structural questions. |
Decay formula: DecayFactor = max(floor, 0.5^(age_days / half_life))
EvtAnch/TAH anchor formula: AnchorFactor = max(floor, exp(-ln2 × |pub_date - anchor_date| / half_life)). Symmetric around anchor date.
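Both formulas can be sketched directly from their definitions. The function names are hypothetical, and the BRT half-life is written as 1.0 day as an approximation of "~24 hrs"; the other parameters are the ones listed in the table above.

```python
import math

# (half_life_days, floor) per destination; BRT's ~24 hrs is approximated as 1 day.
DECAY_PARAMS = {"BRT": (1.0, 0.10), "RWR": (14.0, 0.25), "CRBN": (180.0, 0.70)}


def decay_factor(age_days, half_life, floor):
    """DecayFactor = max(floor, 0.5^(age_days / half_life))."""
    return max(floor, 0.5 ** (age_days / half_life))


def anchor_factor(offset_days, half_life=120.0, floor=0.27):
    """AnchorFactor for EvtAnch/TAH; offset_days = pub_date - anchor_date."""
    # abs() makes the factor symmetric around the anchor date.
    return max(floor, math.exp(-math.log(2) * abs(offset_days) / half_life))


rwr_half_life, rwr_floor = DECAY_PARAMS["RWR"]
rwr_two_weeks = decay_factor(14, rwr_half_life, rwr_floor)
```

An RWR source that is exactly one half-life old (14 days) gets a factor of 0.5, and very old sources are clamped at the floor of 0.25. A source published 120 days before or after the anchor date gets an AnchorFactor of about 0.5.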
Path Tags (how a question got to its destination):
| cascade from X | BRT/RWR question backed off to RWR/CRBN because too few sources were within age-to-floor. |
| recency override (X→Y) | TAH question's anchor/range was recent enough that BRT/RWR decay was applied instead of anchor/window. |
| reroute from X | EA/ER/Comp question rerouted to CRBN (e.g. fiction references, very old historical lookups). |
| Future-X from Y | TAH question about a future event routed to BRT/RWR/CRBN based on how far ahead the event is. |
| UNKNW / Unclassified | Question had no temporal subclass; collapsed to RWR with this path tag. |
Unknown Publish Date Handling:
Sources with no publish date go through the normal decay calculation, which clamps at the subclass's floor:
| BRT | 0.10 (floor) |
| RWR | 0.25 (floor) |
| TAH (EA / ER) | 1.0 — anchor / window scoring handles temporal relevance; estimated dates get an additional ×0.80 penalty (the TAH estimatedDatePenalty) |
| CRBN | 0.70 (floor) |
Sources with no real publish date show a "(no date)" indicator in purple.
UI Indicators:
- Rel(X.XXX)→RelPct(X.XX) - Shows raw Relevance score and its percentile rank in the session pool
- q-rank: N/M - Shows the source's rank within this question (N of M sources, sorted by Relevance descending)
- ⓘ icons - Hover for detailed breakdown tooltips showing raw values and calculations
- Session pool - Displayed in the Alt6 config panel, shows total sources used for percentile calculation
- "inputs missing" - Shown when a source lacks cross_pct or bm25_pct values needed for Alt6
- Decay badge - Shows temporal subclass and decay parameters near the question/answer
Score Deltas:
- ScoreΔ = Alt6 - ProdScore (positive = Alt6 scores higher)
- RankΔ = ProdRank - Alt6Rank (positive = Alt6 ranks the source higher)
Relevance Health Diagnostic
A per-question diagnostic that summarizes the absolute strength of the retrieved candidate set using RAW Relevance (not RelPct), to answer: "Do we have enough strong sources, or should we retrieve more / broaden search?"
Status Levels:
🟢 Strong
max ≥ 0.75 AND count(Relevance ≥ 0.70) ≥ 3
🟡 Thin
max ≥ 0.65 AND count(Relevance ≥ 0.60) ≥ 2
🔴 Weak — expand sources
Otherwise (candidate set may need broader search)
Detail Panel (click to expand):
- Counts: N total sources, N valid (with pCross and pBM25)
- Distribution: max, p75, median of Relevance
- Strength: Count of sources ≥0.70, ≥0.60, ≥0.50
Note: This diagnostic uses raw Relevance (time-independent signal: meaning + keywords) to detect when the candidate set is weak even if ranks or percentiles look fine. It does not affect ranking or selection.
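The status thresholds can be sketched as a simple classifier over the raw Relevance values of one question. The function name and the returned labels are hypothetical, and treating an empty candidate set as weak is an assumption.

```python
def relevance_health(relevances):
    """Classify a candidate set as 'strong', 'thin', or 'weak' from raw Relevance."""
    if not relevances:
        return "weak"  # assumption: no candidates counts as weak
    top = max(relevances)

    def count_at_least(threshold):
        return sum(1 for r in relevances if r >= threshold)

    if top >= 0.75 and count_at_least(0.70) >= 3:
        return "strong"
    if top >= 0.65 and count_at_least(0.60) >= 2:
        return "thin"
    return "weak"
```

A set with one excellent source and nothing else, such as `[0.9, 0.1]`, is still classified as weak because neither count threshold is met.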
Batch LLM Evaluation
Run LLM-as-Judge evaluation across multiple questions in one batch. The controls are located between the analytics sections and the question list.
- Evaluation Weights: Set per-criterion weights (must sum to 100%). These weights are applied to all questions in the batch.
- Batch Size: Choose how many questions to evaluate (1, 5, 10, 20, or 30).
- Source Scope: "All sources" uses unfiltered Alt6 Raw sources.
- Start Evaluation: Picks the next N uncomputed questions from the current filtered view. For each question: generates an Alt6 Raw answer, then runs LLM-as-Judge on both the production answer and the Alt6 Raw answer.
- Non-recompute: Questions already evaluated in the Last Run with the same weights are skipped.
- Settings Change Warning: If you change weights, a confirmation dialog appears before starting.
- Summary Grid: Shows all evaluated questions with scores, deltas, and metadata. Click a question to scroll to its card and auto-expand the LLM Judge section.
- Breakdowns: Group results by temporal subclass, category, classification, and delta bins.
- Charts: Grouped bar chart (avg scores by subclass) and scatter plot (Prod vs Alt6 totals with 45-degree reference line).
- Export: Download results as CSV (flattened) or JSON (full run object). Includes run metadata, weights, and per-criterion scores.
- Run History: Last Run and Previous Run are preserved in browser storage. On page reload, persisted runs are restored. Incomplete runs from crashes can be recovered.
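The "next N uncomputed questions" selection can be sketched as a filter followed by a slice. This is a minimal sketch under assumptions: the function name and arguments are hypothetical, and the set of already-evaluated ids is assumed to contain only results computed with the current weights.

```python
ALLOWED_BATCH_SIZES = (1, 5, 10, 20, 30)


def next_batch(question_ids, evaluated_ids, batch_size):
    """Pick the next batch_size questions from the filtered view that lack results."""
    if batch_size not in ALLOWED_BATCH_SIZES:
        raise ValueError("batch size must be one of 1, 5, 10, 20, 30")
    # Preserve the order of the current filtered view; skip evaluated questions.
    pending = [q for q in question_ids if q not in evaluated_ids]
    return pending[:batch_size]


batch = next_batch(["q1", "q2", "q3", "q4", "q5", "q6", "q7"], {"q2"}, 5)
```

Here `q2` is skipped because it was already evaluated, so the batch contains `q1`, `q3`, `q4`, `q5`, and `q6`.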
Last updated: March 3, 2026
Technical Documentation (for Claude Code)
Last updated: 2026-01-31
Dashboard Startup Commands
There are two dashboards in this project:
- Ask Dashboard (Q&A monitoring) - Port 5002
.venv/bin/python3 dashboards/run_ask_dashboard.py
URL: http://localhost:5002
- Article Dashboard (streaming article monitor) - Port 8083
.venv/bin/python3 dashboards/backend/flask_dashboard.py
URL: http://localhost:8083
Important: Dashboard File Locations
Warning: There is a deprecated flask_dashboard.py in the project root. Do NOT use it.
- Correct location:
dashboards/backend/flask_dashboard.py - Uses templates from dashboards/templates/
- Deprecated (DO NOT USE):
flask_dashboard.py (root) - Uses old dashboard_template.html with mismatched API endpoints
Directory Structure
dashboards/
├── backend/
│ └── flask_dashboard.py # Article Dashboard backend (port 8083)
├── templates/
│ ├── ask_dashboard.html # Ask Dashboard template
│ └── main_dashboard.html # Article Dashboard template
└── run_ask_dashboard.py # Ask Dashboard backend (port 5002)
Restarting Dashboards
To kill and restart a dashboard:
# Ask Dashboard
pkill -9 -f "run_ask_dashboard" 2>/dev/null; sleep 1; .venv/bin/python3 dashboards/run_ask_dashboard.py &
# Article Dashboard
pkill -9 -f "flask_dashboard" 2>/dev/null; sleep 1; .venv/bin/python3 dashboards/backend/flask_dashboard.py &