catalog-backend: add query performance battery and baseline

Adds a structured set of 11 query scenarios covering the main catalog
database access patterns: paginated listings, counts, facets, entity
lookups, full-text search, ancestry traversal, stitching reference
counts, and orphan detection.

Each scenario documents the SQL, what a healthy plan looks like, and
what anti-patterns to watch for. The baseline records execution times
and plan shapes from the staging database (545K entities, 13.8M search
rows).

The battery is intended to be run by humans or AI agents before and
after database-affecting changes to detect performance regressions. It
lives alongside the existing performance tests.

Signed-off-by: Fredrik Adelöw <freben@gmail.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Fredrik Adelöw
2026-05-16 23:50:15 +02:00
parent eab8f7a510
commit d2662df5de
6 changed files with 661 additions and 0 deletions
+63
@@ -0,0 +1,63 @@
---
name: catalog-db-performance
description: Run the catalog query performance battery against a database replica and compare to the previous baseline. Use when making changes to catalog database queries, indexes, or schema.
---
# Catalog Database Performance Battery
Run the query performance battery defined in
`plugins/catalog-backend/src/tests/performance/query-battery/queries.md`
and compare the results to the baseline in
`plugins/catalog-backend/src/tests/performance/query-battery/baseline.md`.
## Steps
1. **Ask the user** for database connection details (host, port, user,
database name). These are environment-specific and not stored in the
repo.
2. **Read the battery** from `queries.md`. For each scenario, the
preferred method is to run the TypeScript method call shown (e.g.,
`catalog.queryEntities(...)`) and capture the query plan. If that
isn't practical, run the reference SQL directly with
`EXPLAIN (ANALYZE, BUFFERS)` via `psql`.
3. **Read the previous baseline** from `baseline.md`.
4. **Run each scenario** (11 total). For each one, record:
- Execution time
- Planning time
- Plan shape (top-level nodes and index names)
- Anti-patterns detected (check against the scenario's list AND the
global anti-patterns at the bottom of `queries.md`)
- Buffer stats
5. **Compare to baseline**. Catalog size differences affect absolute
timings, so focus on plan shape changes and proportional regressions.
Flag:
- Execution time regressions >50%
- Plan shape changes (different index, new Sort/Seq Scan nodes)
- New anti-patterns that weren't in the previous run
6. **Update `baseline.md`** with the new results. Keep the same format.
Add a comparison section at the bottom noting significant changes.
7. **Report** a summary to the user: which scenarios improved, which
regressed, and whether any global anti-patterns were detected.
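
The comparison in step 5 can be sketched as a small function. This is only an illustration: the `ScenarioResult` and `Flag` shapes, the `WATCHED_NODES` list, and `compareToBaseline` itself are hypothetical names, and the battery stores its results as Markdown in `baseline.md`, not as objects like these.

```ts
// Hypothetical shape for one recorded scenario (not part of the battery).
interface ScenarioResult {
  scenario: number;
  executionMs: number;
  indexes: string[]; // index names seen in the plan shape
  nodeTypes: string[]; // node types seen in the plan shape
}

interface Flag {
  scenario: number;
  reason: string;
}

const REGRESSION_THRESHOLD = 0.5; // the ">50%" rule from step 5
const WATCHED_NODES = ['Seq Scan', 'Sort']; // nodes that should not newly appear

function compareToBaseline(
  previous: ScenarioResult[],
  current: ScenarioResult[],
): Flag[] {
  const flags: Flag[] = [];
  for (const now of current) {
    const before = previous.find(r => r.scenario === now.scenario);
    if (!before) continue; // new scenario, nothing to compare against

    // Relative change in execution time, measured against the previous run.
    const change = (now.executionMs - before.executionMs) / before.executionMs;
    if (change > REGRESSION_THRESHOLD) {
      flags.push({
        scenario: now.scenario,
        reason: `execution time +${Math.round(change * 100)}%`,
      });
    }

    // Node types that were absent before but are present now.
    for (const node of WATCHED_NODES) {
      if (now.nodeTypes.includes(node) && !before.nodeTypes.includes(node)) {
        flags.push({ scenario: now.scenario, reason: `new ${node} node` });
      }
    }

    // Any difference in the set of indexes counts as a plan shape change.
    const sameIndexes =
      now.indexes.length === before.indexes.length &&
      now.indexes.every(i => before.indexes.includes(i));
    if (!sameIndexes) {
      flags.push({ scenario: now.scenario, reason: 'index set changed' });
    }
  }
  return flags;
}
```

A flag is a prompt to look at the plan, not a verdict. A timing flag caused purely by catalog growth still needs a human or agent to judge whether it is proportional.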
## When to run
- Before and after changes to catalog database queries
- Before and after adding/removing/modifying indexes
- Before and after schema migrations
- Periodically to establish fresh baselines
## Important
- Do NOT store database connection details in the repo
- Use a 30-second timeout per query
- Some queries use placeholder entity refs that may not exist in the
target database — 0 rows returned is fine, the plan shape is what
matters
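
One way to apply the 30-second timeout when running reference SQL through `psql` is to generate a short script per scenario. The helper below is a minimal sketch; `buildExplainScript` is a hypothetical name, and it assumes the timeout is enforced with PostgreSQL's `statement_timeout` setting instead of a client-side timer.

```ts
// Wraps one reference query so it runs under EXPLAIN with a per-statement
// timeout. Note that EXPLAIN ANALYZE really executes the query.
function buildExplainScript(referenceSql: string, timeoutSeconds = 30): string {
  // Drop a trailing semicolon so the statement can be re-terminated cleanly.
  const statement = referenceSql.trim().replace(/;\s*$/, '');
  return [
    `SET statement_timeout = '${timeoutSeconds}s';`,
    'EXPLAIN (ANALYZE, BUFFERS)',
    `${statement};`,
  ].join('\n');
}
```

The resulting text can be written to a file and passed to `psql` with `-f`, using the connection details supplied by the user at run time.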
@@ -487,6 +487,7 @@ subpath
subpaths
subroute
subroutes
subquery
substring
subtree
superfences
+15
@@ -0,0 +1,15 @@
# Catalog Backend
## Database query performance
A query performance battery lives in
`src/tests/performance/query-battery/`. It contains scenarios
(`queries.md`) and a baseline (`baseline.md`) for detecting regressions
in the catalog database layer.
When changing database queries, indexes, or schema in this plugin:
1. Run `/catalog-db-performance` before and after your change
2. If your change alters the shape of a query tested by the battery,
update the reference SQL in `queries.md` to match
3. Update `baseline.md` with the new results if the change is intentional
@@ -0,0 +1,5 @@
# Query Performance Battery
Scenarios and baselines for detecting catalog database performance
regressions. Run via the `/catalog-db-performance` Claude skill, or
manually with `psql` using the reference SQL in `queries.md`.
@@ -0,0 +1,139 @@
# Query Performance Baseline
**Date**: 2026-05-18
**Database**: Production-scale replica
**Catalog size**: ~474K `final_entities`, ~13.2M `search` rows, ~3.5M `relations`, ~478K `refresh_state_references`, ~476K `refresh_state`
## Scenario 1: Paginated entity list (kind=component, ordered by name)
- **Execution time**: 12.531 ms
- **Planning time**: 1.469 ms
- **Plan shape**: Gather Merge (2 workers) -> Parallel Index Only Scan on `search_key_value_entity_idx` (key='metadata.name') -> Memoize -> Index Scan on `final_entities_pkey` -> Nested Loop Semi Join -> Index Only Scan on `search_key_value_entity_idx` (EXISTS kind=component); LIMIT short-circuits after 21 rows
- **Anti-patterns detected**: None
- **Buffers**: shared hit=6145
## Scenario 2: Count query (kind=component)
- **Execution time**: 1068.112 ms
- **Planning time**: 1.533 ms
- **Plan shape**: Index Only Scan on `search_key_value_entity_idx` (kind=component, ~55K rows) -> HashAggregate -> Index Scan on `final_entities_pkey` -> Index Only Scan on `search_entity_key_value_idx` (metadata.name) -> Aggregate
- **Anti-patterns detected**: None (inherent cost of counting ~55K components)
- **Buffers**: shared hit=619709
## Scenario 3: Paginated entity list (no filter, LIMIT 21)
- **Execution time**: 0.096 ms
- **Planning time**: 0.432 ms
- **Plan shape**: Index Scan on `final_entities_entity_ref_uniq` with LIMIT short-circuit
- **Anti-patterns detected**: None
- **Buffers**: shared hit=30
## Scenario 4: Facets query (kind=template, facet=spec.type)
- **Execution time**: 3.653 ms
- **Planning time**: 1.508 ms
- **Plan shape**: Index Only Scan on `search_key_value_entity_idx` (kind=template, 196 rows) -> HashAggregate -> Index Scan on `final_entities_pkey` -> Index Scan on `search_entity_key_value_idx` (spec.type) -> Sort -> GroupAggregate
- **Anti-patterns detected**: None
- **Buffers**: shared hit=2177
## Scenario 5: Facets query (kind=component, facet=spec.type) -- large result set
- **Execution time**: 972.453 ms
- **Planning time**: 1.533 ms
- **Plan shape**: Index Only Scan on `search_key_value_entity_idx` (kind=component, ~55K rows) -> HashAggregate -> Index Scan on `final_entities_pkey` -> Index Scan on `search_entity_key_value_idx` (spec.type) -> Sort -> GroupAggregate
- **Anti-patterns detected**: None (plan uses index scans throughout; no seq scans or temp spills)
- **Buffers**: shared hit=612386
## Scenario 6: Entity by ref lookup
- **Execution time**: 0.072 ms
- **Planning time**: 0.374 ms
- **Plan shape**: Index Scan on `final_entities_entity_ref_uniq`
- **Anti-patterns detected**: None (0 rows returned -- entity ref not present in test data; plan shape is correct)
- **Buffers**: shared hit=4
## Scenario 7: Full-text filter (metadata.name LIKE '%player%', kind=component)
- **Execution time**: 903.491 ms
- **Planning time**: 1.499 ms
- **Plan shape**: Index Only Scan on `search_key_value_entity_idx` (kind=component, ~55K rows) -> HashAggregate -> Index Scan on `final_entities_pkey` -> Index Only Scan on `search_entity_key_value_idx` (metadata.name, filtered by LIKE '%player%') -> Sort -> LIMIT 21
- **Anti-patterns detected**: Full scan of all ~55K components required because LIKE filter with leading wildcard cannot short-circuit via index ordering. The LIKE is applied as a filter on the index scan (not a seq scan), which is correct, but the query must evaluate all component entities before sorting and limiting.
- **Buffers**: shared hit=619712
## Scenario 8: Relations traversal (entity ancestry)
- **Execution time**: 0.088 ms
- **Planning time**: 0.957 ms
- **Plan shape**: Index Scan on `refresh_state_references_target_entity_ref_idx` -> Nested Loop -> Index Scan on `final_entities_entity_ref_uniq`; LIMIT short-circuits
- **Anti-patterns detected**: None (0 rows returned -- entity ref not present in test data; plan shape is correct)
- **Buffers**: shared hit=4
## Scenario 9: Stitching: incoming reference count
- **Execution time**: 0.095 ms
- **Planning time**: 0.380 ms
- **Plan shape**: Index Only Scan on `refresh_state_references_target_entity_ref_idx` -> Aggregate
- **Anti-patterns detected**: None (0 rows matched -- entity ref not present in test data; plan shape is correct)
- **Buffers**: shared hit=4
## Scenario 10: Adversarial: unfiltered count
- **Execution time**: 1317.422 ms
- **Planning time**: 1.020 ms
- **Plan shape**: Gather (2 workers) -> Parallel Index Only Scan on `search_key_value_entity_idx` (key='metadata.name') -> Memoize -> Index Scan on `final_entities_pkey` -> Partial Aggregate -> Finalize Aggregate
- **Anti-patterns detected**: Memoize cache evictions observed (~100-121K evictions per worker, 8MB cache cap). This is inherent to counting the full ~471K entity catalog. No seq scans detected.
- **Buffers**: shared hit=2334531
## Scenario 11: Relations: orphan detection anti-join
- **Execution time**: 255.679 ms
- **Planning time**: 1.013 ms
- **Plan shape**: Gather (2 workers) -> Parallel Hash Anti Join: Parallel Seq Scan on `refresh_state` -> Parallel Hash (Parallel Seq Scan on `refresh_state_references`); LIMIT 100
- **Anti-patterns detected**: Seq Scans on both `refresh_state` and `refresh_state_references`, but this is expected for a Hash Anti Join strategy. Temp file spills observed (temp read=7144, written=11160) due to the hash table exceeding `work_mem`. Despite the seq scans, the Parallel Hash Anti Join completes in ~256ms, which is a dramatic improvement over the previous Nested Loop Anti Join that timed out at >30s.
- **Buffers**: shared hit=279619, temp read=7144 written=11160
---
## Summary
| Scenario | Execution Time | Verdict |
| ---------------------------------- | -------------- | ----------------------------------------------------------- |
| 1. Paginated list (kind=component) | 12.5 ms | OK -- improved |
| 2. Count (kind=component) | 1068.1 ms | OK -- improved (counting ~55K components) |
| 3. Paginated list (no filter) | 0.1 ms | Excellent |
| 4. Facets (kind=template) | 3.7 ms | OK -- slight regression (196 templates vs previous 9) |
| 5. Facets (kind=component) | 972.5 ms | OK (large result set, index scans throughout) |
| 6. Entity by ref | 0.1 ms | Excellent |
| 7. Full-text filter (LIKE) | 903.5 ms | Acceptable -- regression (see comparison notes) |
| 8. Relations traversal | 0.1 ms | Excellent |
| 9. Stitching ref count | 0.1 ms | Excellent |
| 10. Unfiltered count | 1317.4 ms | OK -- improved (smaller catalog) |
| 11. Orphan detection | 255.7 ms | **FIXED** -- Hash Anti Join replaces Nested Loop (was >30s) |
---
## Comparison with previous baseline (2026-05-16)
### Catalog size changes
The catalog has shrunk since the last run: ~474K entities (was ~545K), ~476K `refresh_state` (was ~984K), ~478K `refresh_state_references` (was ~547K). The `refresh_state` table halved in size, which significantly affects scenarios that touch it.
### Improvements
- **Scenario 1** (Paginated list): 12.5ms vs 20.4ms (39% faster). Plan switched from serial to Gather Merge with 2 parallel workers while maintaining the same index-driven approach.
- **Scenario 2** (Count): 1068ms vs 1943ms (45% faster). Plan changed from Parallel Bitmap Heap Scan to a serial HashAggregate-driven approach. The improvement is partly from the smaller catalog and partly from a better plan choice.
- **Scenario 3** (Paginated no filter): 0.1ms vs 0.2ms. Consistently excellent.
- **Scenario 10** (Unfiltered count): 1317ms vs 2236ms (41% faster). Same plan shape. The catalog shrank by only about 13% (~471K vs ~544K), so the size reduction accounts for part of the improvement rather than all of it. Memoize evictions reduced (~100-121K vs 134K per worker).
- **Scenario 11** (Orphan detection): **256ms vs >30s TIMEOUT**. This is the most significant change. The planner now chooses a Parallel Hash Anti Join instead of the previous Nested Loop Anti Join. The Hash Anti Join scans both tables in parallel and builds a hash table for the join, which is far more efficient when most rows have matches. This fix was likely enabled by the smaller `refresh_state` table (476K vs 984K rows), which may have crossed a threshold in the planner's cost model. Note: temp file spills are observed but acceptable at this scale.
### Regressions
- **Scenario 4** (Facets kind=template): 3.7ms vs 1.1ms (3.3x slower). This is due to a data change: there are now 196 templates vs 9 previously. The plan shape is healthy (all index scans), and 3.7ms is still fast. Not a query regression.
- **Scenario 7** (Full-text LIKE filter): 903ms vs 566ms (60% slower). The plan strategy changed: the previous run drove from the kind=component index and applied the LIKE filter early via Memoize, while the current run uses a HashAggregate approach that evaluates all ~55K components before filtering. Both plans scan all components (unavoidable with a leading-wildcard LIKE), but the previous plan was more efficient at short-circuiting. The component count also grew from ~46K to ~55K. Worth monitoring.
- **Scenario 5** (Facets kind=component): 972ms vs 931ms (4% slower). Within noise. Plan changed from Parallel Gather Merge with `search_facets_covering_idx` to a serial HashAggregate approach. Both are healthy.
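
The percentages above appear to use the previous run as the denominator in both directions. The sketch below reproduces two of the reported figures under that assumption; `percentFaster` and `percentSlower` are hypothetical helper names, not part of the battery.

```ts
// "N% faster": reduction in execution time relative to the previous run.
function percentFaster(previousMs: number, currentMs: number): number {
  return Math.round(((previousMs - currentMs) / previousMs) * 100);
}

// "N% slower": increase in execution time relative to the previous run.
function percentSlower(previousMs: number, currentMs: number): number {
  return Math.round(((currentMs - previousMs) / previousMs) * 100);
}

console.log(percentFaster(20.4, 12.5)); // Scenario 1: prints 39
console.log(percentSlower(566, 903)); // Scenario 7: prints 60
```

Keeping the denominator fixed matters when checking a result against the >50% regression threshold: Scenario 7 exceeds it, while Scenario 5 (4% slower) does not.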
### Unchanged scenarios (no performance impact)
- **Scenario 6** (Entity by ref): Identical plan shape and timing.
- **Scenario 8** (Relations traversal): Identical plan shape. Timing consistent.
- **Scenario 9** (Stitching ref count): Identical plan shape. Timing consistent.
@@ -0,0 +1,438 @@
# Catalog Query Performance Battery
Each scenario describes a user-facing action, the catalog method that
serves it, and what a healthy query plan looks like. The goal is to
detect performance regressions when database queries, indexes, or
schema change.
## How to run
**Preferred method**: Instantiate `DefaultEntitiesCatalog` (or call the
REST endpoints) with the parameters shown for each scenario, and capture
the plan of each resulting query with `EXPLAIN (ANALYZE, BUFFERS)` on the
database side (e.g., via knex debug logging or a database proxy that
captures plans). This tests the actual query the code produces.
**Alternative**: Run the reference SQL directly against a
production-scale replica using `psql`. The SQL is a snapshot of what the
code produced at the time of writing — verify it still matches before
drawing conclusions.
Record execution time, plan shape, and buffer usage in `baseline.md`.
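
The numbers to record can be pulled out of the text output of `EXPLAIN (ANALYZE, BUFFERS)`. The following is a minimal sketch, assuming the default text format; `PlanSummary` and `parseExplainText` are hypothetical names, and the plan shape itself still has to be read from the node lines.

```ts
// Hypothetical summary of the figures that baseline.md records per scenario.
interface PlanSummary {
  planningMs: number | null;
  executionMs: number | null;
  sharedHit: number | null;
  tempRead: number | null;
  tempWritten: number | null;
}

function parseExplainText(output: string): PlanSummary {
  // Returns the first capture group of the first match, or null if absent.
  const num = (re: RegExp): number | null => {
    const match = re.exec(output);
    return match ? Number(match[1]) : null;
  };
  return {
    planningMs: num(/Planning Time: ([\d.]+) ms/),
    executionMs: num(/Execution Time: ([\d.]+) ms/),
    // Each figure comes from the first Buffers line that reports it. Upper
    // nodes include their children's counts, so that is the widest total.
    sharedHit: num(/Buffers:.*?shared hit=(\d+)/),
    tempRead: num(/Buffers:.*?temp read=(\d+)/),
    tempWritten: num(/Buffers:.*?temp(?: read=\d+)? written=(\d+)/),
  };
}
```

A `null` for `tempRead` and `tempWritten` means no temp file activity was reported, which is the expected result for most scenarios.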
---
## 1. Paginated entity list (kind=component, ordered by name)
**User action**: Opening the default catalog table view.
**Method call**:
```ts
catalog.queryEntities({
filter: { kind: 'component' },
orderFields: [{ field: 'metadata.name', order: 'asc' }],
limit: 20,
credentials,
});
```
**Reference SQL**:
```sql
SELECT final_entities.entity_id, final_entities.final_entity, search.value
FROM search
INNER JOIN final_entities ON final_entities.entity_id = search.entity_id
WHERE search.key = 'metadata.name'
AND search.value IS NOT NULL
AND final_entities.final_entity IS NOT NULL
AND EXISTS (
SELECT 1 FROM search AS s
WHERE s.entity_id = final_entities.entity_id
AND s.key = 'kind' AND s.value = 'component'
)
ORDER BY search.value ASC, final_entities.entity_id ASC
LIMIT 21;
```
**Healthy plan**: Index Scan on `search_key_value_entity_idx` driving
the query in sort order, LIMIT short-circuit after 21 rows. Execution
time <5ms.
**Anti-patterns**:
- Materialized CTE (means the query shape forced full-set evaluation)
- Sort node above a Seq Scan (means the index isn't providing order)
- Execution time >50ms
---
## 2. Count query (kind=component)
**User action**: The `totalItems` count shown in the catalog table
footer.
**Method call**:
```ts
catalog.queryEntities({
filter: { kind: 'component' },
orderFields: [{ field: 'metadata.name', order: 'asc' }],
limit: 20,
credentials,
});
// The count is the totalItems field in the response.
```
**Reference SQL** (the count portion, run in parallel with the list):
```sql
SELECT count(*) AS count
FROM search
INNER JOIN final_entities ON final_entities.entity_id = search.entity_id
WHERE search.key = 'metadata.name'
AND search.value IS NOT NULL
AND final_entities.final_entity IS NOT NULL
AND EXISTS (
SELECT 1 FROM search AS s
WHERE s.entity_id = final_entities.entity_id
AND s.key = 'kind' AND s.value = 'component'
);
```
**Healthy plan**: Index scan on `search_key_value_entity_idx` with
nested loop for the EXISTS filter. This is inherently expensive for
large result sets — the execution time is the floor for any query that
needs the count.
**Anti-patterns**:
- Seq Scan on `search` (missing index)
- Execution time growing super-linearly with entity count
---
## 3. Paginated entity list (no filter, LIMIT 21)
**User action**: The "show everything" view with no filters applied.
Worst case for pagination — LIMIT short-circuit is critical.
**Method call**:
```ts
catalog.queryEntities({
limit: 20,
credentials,
});
```
**Reference SQL**:
```sql
SELECT final_entities.entity_id, final_entities.final_entity
FROM final_entities
WHERE final_entities.final_entity IS NOT NULL
ORDER BY final_entities.entity_ref ASC
LIMIT 21;
```
**Healthy plan**: Index Scan on `final_entities_entity_ref_uniq`.
Execution time <1ms.
**Anti-patterns**:
- Sort node (means the index isn't providing order)
- Seq Scan on `final_entities`
---
## 4. Facets query (kind=template, facet=spec.type)
**User action**: Sidebar facet counts for a small result set.
**Method call**:
```ts
catalog.facets({
filter: { kind: 'template' },
facets: ['spec.type'],
credentials,
});
```
**Reference SQL**:
```sql
SELECT search.key AS facet, search.original_value AS value, count(*) AS count
FROM search
INNER JOIN (
SELECT final_entities.entity_id
FROM final_entities
WHERE final_entities.final_entity IS NOT NULL
AND EXISTS (
SELECT 1 FROM search AS s
WHERE s.entity_id = final_entities.entity_id
AND s.key = 'kind' AND s.value = 'template'
)
) AS filtered_entities ON search.entity_id = filtered_entities.entity_id
WHERE search.key IN ('spec.type')
AND search.original_value IS NOT NULL
GROUP BY search.key, search.original_value
ORDER BY search.key, search.original_value;
```
**Healthy plan**: Uses `search_facets_covering_idx` or
`search_key_value_entity_idx` for the facet aggregation. The filtered
entity subquery uses index-backed EXISTS.
**Anti-patterns**:
- Seq Scan on `search` for the outer query
- Hash Join instead of Nested Loop for small result sets
---
## 5. Facets query (kind=component, facet=spec.type) — large result set
**User action**: Same as above but with a large filtered set (tens of
thousands of components). Tests whether the plan stays efficient at
scale.
**Method call**:
```ts
catalog.facets({
filter: { kind: 'component' },
facets: ['spec.type'],
credentials,
});
```
**Reference SQL**: Same as scenario 4 but with `kind = 'component'`
instead of `'template'`.
**Healthy plan**: Similar to scenario 4 but may use Hash Join for the
larger filtered set. Execution time proportional to the number of
matching entities.
**Anti-patterns**:
- Seq Scan on the `search` table (outer or inner)
- Temp file spills (check Buffers: temp)
---
## 6. Entity by ref lookup
**User action**: Viewing a single entity page by name.
**Method call**:
```ts
catalog.entitiesBatch({
entityRefs: ['component:default/my-service'],
credentials,
});
```
**Reference SQL**:
```sql
SELECT final_entities.final_entity
FROM final_entities
WHERE final_entities.entity_ref = 'component:default/my-service';
```
**Healthy plan**: Index Scan on `final_entities_entity_ref_uniq`.
Execution time <1ms.
**Anti-patterns**:
- Seq Scan (catastrophic — means the unique index is missing)
---
## 7. Full-text filter (LIKE '%player%', kind=component)
**User action**: Typing in the search box on the catalog table. The
leading wildcard prevents index-ordered short-circuiting.
**Method call**:
```ts
catalog.queryEntities({
filter: { kind: 'component' },
orderFields: [{ field: 'metadata.name', order: 'asc' }],
fullTextFilter: { term: 'player' },
limit: 20,
credentials,
});
```
**Reference SQL**:
```sql
SELECT final_entities.entity_id, final_entities.final_entity, search.value
FROM search
INNER JOIN final_entities ON final_entities.entity_id = search.entity_id
WHERE search.key = 'metadata.name'
AND search.value IS NOT NULL
AND final_entities.final_entity IS NOT NULL
AND EXISTS (
SELECT 1 FROM search AS s
WHERE s.entity_id = final_entities.entity_id
AND s.key = 'kind' AND s.value = 'component'
)
AND search.value LIKE '%player%'
ORDER BY search.value ASC, final_entities.entity_id ASC
LIMIT 21;
```
**Healthy plan**: Index Scan on `search_key_value_entity_idx` for
`key = 'metadata.name'`, Filter for the LIKE. The LIKE cannot use an
index (leading wildcard) but the rest of the query should be
index-driven.
**Anti-patterns**:
- Seq Scan on `search` (the LIKE should be a filter on an index scan,
not a seq scan trigger)
---
## 8. Relations traversal (entity ancestry)
**User action**: The `/entities/by-name/.../ancestry` endpoint.
**Method call**:
```ts
catalog.entityAncestry('component:default/my-service', { credentials });
```
**Reference SQL** (one step of the iterative traversal):
```sql
SELECT
refresh_state_references.source_entity_ref,
final_entities.entity_ref,
final_entities.final_entity
FROM refresh_state_references
INNER JOIN final_entities
ON refresh_state_references.source_entity_ref = final_entities.entity_ref
WHERE refresh_state_references.target_entity_ref = 'component:default/my-service'
LIMIT 10;
```
**Healthy plan**: Index Scan on
`refresh_state_references_target_entity_ref_idx`, Nested Loop with
Index Scan on `final_entities_entity_ref_uniq`.
**Anti-patterns**:
- Seq Scan on `refresh_state_references` (missing target index)
- Seq Scan on `relations` (missing `target_entity_ref` index)
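
The reference SQL covers a single step, so it may help to see how such steps could compose into a full traversal. The sketch below is an assumption about the general shape, not the actual `entityAncestry` implementation: `collectAncestry` and `FetchParents` are hypothetical names, and `fetchParents` stands in for running the reference SQL for one `target_entity_ref` and returning the `source_entity_ref` values.

```ts
// Hypothetical callback: returns the source refs that point at targetRef.
type FetchParents = (targetRef: string) => Promise<string[]>;

async function collectAncestry(
  rootRef: string,
  fetchParents: FetchParents,
): Promise<string[]> {
  const seen = new Set<string>([rootRef]);
  const queue: string[] = [rootRef];

  // Breadth-first walk: one lookup (one reference query) per visited entity.
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const parent of await fetchParents(current)) {
      if (!seen.has(parent)) {
        seen.add(parent); // guards against cycles and repeated parents
        queue.push(parent);
      }
    }
  }
  return Array.from(seen);
}
```

Because the number of lookups grows with the size of the ancestry, the per-step plan matters: a Seq Scan in the single-step query would be repeated for every visited entity.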
---
## 9. Stitching: incoming reference count
**Context**: Run on every stitch to determine orphan status. Not a
user-facing action but critical for processing throughput.
**Reference SQL**:
```sql
SELECT count(*) AS count
FROM refresh_state_references
WHERE target_entity_ref = 'component:default/my-service';
```
**Healthy plan**: Index Only Scan on
`refresh_state_references_target_entity_ref_idx`. Execution time <1ms.
**Anti-patterns**:
- Seq Scan (missing index)
---
## 10. Adversarial: unfiltered count
**User action**: Count the entire catalog with no filters. Establishes
the ceiling for count performance.
**Method call**:
```ts
catalog.queryEntities({
limit: 0,
credentials,
});
// totalItems in the response is the full catalog count.
```
**Reference SQL**:
```sql
SELECT count(*) AS count
FROM search
INNER JOIN final_entities ON final_entities.entity_id = search.entity_id
WHERE search.key = 'metadata.name'
AND search.value IS NOT NULL
AND final_entities.final_entity IS NOT NULL;
```
**Healthy plan**: Index scan on `search_key_value_entity_idx`. Execution
time proportional to total catalog size.
**Anti-patterns**:
- Seq Scan on either table
- Execution time >30s on a 500K entity catalog
---
## 11. Orphan detection anti-join
**Context**: Periodic orphan cleanup (`deleteOrphanedEntities`). Runs
every 30 seconds by default. Not user-facing but a constant background
load.
**Reference SQL**:
```sql
SELECT refresh_state.entity_id, refresh_state.entity_ref
FROM refresh_state
LEFT OUTER JOIN refresh_state_references
ON refresh_state_references.target_entity_ref = refresh_state.entity_ref
WHERE refresh_state_references.target_entity_ref IS NULL
LIMIT 100;
```
**Healthy plan**: Uses index on
`refresh_state_references.target_entity_ref` for the anti-join.
Execution time <500ms.
**Anti-patterns**:
- Seq Scan on `refresh_state_references` (the main table to avoid
scanning)
- Hash Join pulling the full references table into memory
---
## Global anti-patterns
These should NEVER appear in any of the above queries:
1. **Seq Scan on `search`** — The search table is 11+ GB. Any seq scan
is catastrophic.
2. **Seq Scan on `relations`** — 714 MB heap, 3.5M rows. Must use
indexes.
3. **Materialized CTE** — Prevents LIMIT short-circuiting. Was the
original cause of slow paginated queries.
4. **Temp file spills** (look for `Buffers: temp` in EXPLAIN output) —
Indicates the query is materializing a large intermediate result.
5. **Nested Loop with Seq Scan inner** — Usually means a missing index
on the inner table's join column.
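
These checks can be partly automated by walking a plan captured with `EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)`. The following is a minimal sketch, not part of the battery: `PlanNode` lists only the JSON keys the checks need, and `findGlobalAntiPatterns` is a hypothetical name. It inspects direct children only, so a Seq Scan hidden under an intermediate node on the inner side of a Nested Loop would not be reported.

```ts
// Subset of the keys PostgreSQL emits per node in FORMAT JSON output.
interface PlanNode {
  'Node Type': string;
  'Relation Name'?: string;
  'Parent Relationship'?: string;
  'Temp Read Blocks'?: number;
  'Temp Written Blocks'?: number;
  Plans?: PlanNode[];
}

function findGlobalAntiPatterns(root: PlanNode): string[] {
  const found = new Set<string>();
  const isSeqScan = (n: PlanNode) => n['Node Type'].endsWith('Seq Scan');

  const visit = (node: PlanNode): void => {
    const relation = node['Relation Name'];
    // 1 and 2: sequential scans on the two large tables.
    if (isSeqScan(node) && (relation === 'search' || relation === 'relations')) {
      found.add(`Seq Scan on ${relation}`);
    }
    // 3: a CTE Scan node means the CTE was materialized instead of inlined.
    if (node['Node Type'] === 'CTE Scan') {
      found.add('Materialized CTE');
    }
    // 4: any temp block traffic indicates a spill to disk.
    if ((node['Temp Read Blocks'] ?? 0) > 0 || (node['Temp Written Blocks'] ?? 0) > 0) {
      found.add('Temp file spill');
    }
    // 5: a Nested Loop whose inner child is a sequential scan.
    const children = node.Plans ?? [];
    if (
      node['Node Type'] === 'Nested Loop' &&
      children.some(c => c['Parent Relationship'] === 'Inner' && isSeqScan(c))
    ) {
      found.add('Nested Loop with Seq Scan inner');
    }
    children.forEach(visit);
  };

  visit(root);
  return Array.from(found).sort();
}
```

With JSON output, PostgreSQL returns an array containing one object, and the root node to pass in is that object's `Plan` property.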