Mongo / Mongoose Hardening v1
Overviewβ
This plan addresses high-severity correctness, performance, and tenant-safety issues uncovered during a Mongo/Mongoose audit of attunelogic-api. Every phase is designed so code and data ship together β no maintenance windows, no big-bang migrations. Each phase is independently revertible and gated on telemetry from the prior phase.
Canonical decisions agreed up front:
parentCompanywill be unified toSchema.Types.ObjectIdwithref: "Customer"across every model.- Phase 6 (the unification) pilots on a single internal tenant first, then expands per-tenant, then globally.
- Phase 3 (cascading-hook removal) uses MongoDB transactions for billing/leg/job atomicity; transaction boundaries are designed in Phase 3a as part of the recalc service.
- This document is the source of truth for sequencing; phase numbers are referenced from PR titles.
Operating Principlesβ
- Additive first, destructive last. Pattern: add the new path β dual-read/dual-write β backfill β cut over β remove the old path.
- No offline migrations. All backfills are online, idempotent, resumable scripts with checkpoints in a
migrationscollection. - Feature flags for risky cutovers. Use the existing tenant feature-flag plumbing (
src/services/config/default-configs/feature-flags.js) to flip behavior instantly. - Atlas rolling index builds only. Never call
syncIndexes()on app boot in production. Index rollouts use thescripts/manage-indexes.jspath during off-peak windows. - Release gates. Each phase ships behind its own PR + release notes. No phase merges until the prior phase has been in production β₯ 24h with green telemetry.
Phase Summaryβ
| Phase | Goal | Est. | Depends on | Risk | Reversible |
|---|---|---|---|---|---|
| 0 | Foundation & safety nets | 1d | β | Low | Yes |
| 1 | Query-safety hot patches | Β½d | 0 | Low | Yes |
| 2 | Index coverage | 1β2d | 0 | Low | Yes |
| 3 | Kill cascading Leg/Job hooks | 1w | 0, 2 | Medium | Yes (flag) |
| 4 | Remove Invoice.post('find') | 2β3d | 3 | Medium | Yes |
| 5 | Retire mongoose-autopopulate | 1β2w | 0 | LowβMedium per model | Yes |
| 6 | Unify parentCompany to ObjectId | 3β4w | 0, 5 | High | Yes (dual-read) |
| 7 | Tenant-scoped uniques | 2d | 6 (partial) | Low | Yes |
| 8 | Ergonomics & hygiene | 1w | β | Low | Yes |
Total: ~6β8 engineering weeks. Any single week's output can ship to production on its own.
Phase 0 β Foundation & safety netsβ
Pure additive scaffolding the rest of the plan needs. No data changes.
Changes:
src/db/index.js- Set
mongoose.set("strictQuery", true)explicitly. - Set
mongoose.set("sanitizeFilter", true)globally to defend user-supplied filters. - Add pool/timeout options:
maxPoolSize,serverSelectionTimeoutMS,socketTimeoutMS,maxIdleTimeMS. - Add
SIGTERMhandler mirroring the existingSIGINTpath so containers drain cleanly.
- Set
- New
src/utils/tenantQuery.jsnormalizeTenantFilter(value)returns{ $in: [stringForm, objectIdForm] }when the value is a valid 24-char hex. Not yet wired up; load-bearing for Phase 6.
- New
src/models/plugins/tenantPlugin.js- Adds
parentCompanyindex if missing, logs/throws when a non-super-admin query omitsparentCompany. Modes:off | warn | throw. Defaultoff; enabled per env viaTENANT_QUERY_STRICT.
- Adds
- Expand
tests/security/tenantIsolation.test.jsto coverInvoice,Payout,Manifest,Route,Media,MaintenancegetById endpoints.
Backward compat: fully additive. No schema or data impact. Rollback: single PR revert.
Phase 1 β Query-safety hot patchesβ
Real security bugs that should not wait for the rest of the plan.
Changes:
src/controllers/legs/index.jsβ thecustomQuerymerge currently lets a client overrideparentCompanyviaJSON.parse. Fix:- Whitelist allowed keys from the parsed object.
- Re-apply
{ parentCompany }after the merge.
- Audit every other
JSON.parse/ spread-merge-into-Mongo-filter (start withsrc/controllers/reports.js). - Confirm
sanitizeFilterfrom Phase 0 is active per query (Mongoose 8 honors it).
Backward compat: legitimate client filters continue to work; only malicious overrides stop working. Rollback: trivial.
Phase 2 β Index coverageβ
Indexes ship before code that depends on them. Atlas builds compounds online.
Changes:
- Extend
scripts/manage-indexes.js(or addscripts/sync-mongoose-indexes.js) that runsModel.syncIndexes()for the listed models, with explicit dry-run, progress logging, and per-model gating. Do not wiresyncIndexes()into app boot. - Add the following indexes to the schemas (additive only; keep existing ones):
Leg:{ parentCompany: 1, driver: 1, appointmentDate: -1 }{ parentCompany: 1, jobId: 1, sortKey: 1 }{ parentCompany: 1, manifestId: 1 }{ parentCompany: 1, payoutId: 1 }{ parentCompany: 1, status: 1, appointmentDate: -1 }
Job:{ parentCompany: 1, status: 1, transactionDate: -1 }{ parentCompany: 1, client: 1 }{ parentCompany: 1, invoiceId: 1 }
Invoice:{ parentCompany: 1, status: 1, dueDate: -1 }{ parentCompany: 1, client: 1 }
Maintenance,Payout,Media,Accessory: at minimum{ parentCompany: 1, dateCreated: -1 }.Client:{ parentCompany: 1, active: 1, name: 1 }.
- Remove
parentCompanyfrom theClienttext-index weights (no type change, just a weight cleanup).
Backward compat: pure additive. If a build is slow or expensive we abort and drop the partially-built index. Rollback: drop the new indexes; existing query paths continue to work (just slower). Gate: Atlas slow-query log shows reduction before proceeding to Phase 3.
Phase 3 β Kill cascading hooks on Leg / Jobβ
The highest-ROI correctness fix. Three sub-phases so each deploy is independently safe.
3a. Introduce explicit recalc services (additive)β
- New
src/services/jobs/recalculateJobDuration.js(pure function, acceptsjobIdand optionalsession).- Source of truth should be the
Legcollection (Leg.find({ jobId, parentCompany })), notjob.legs, because some flows can create or update legs without keeping the embedded job leg array perfectly synchronized. - Query narrowly: select only fields needed for duration math and populate only route timing fields (
durationSeconds,fullData.duration). PreferdurationSecondsfirst, matching scheduler timing behavior. - Run only when duration-affecting data changes: leg create/delete, leg route changes, or route timing changes. Do not run on driver/status/note-only leg updates.
- Expected write-path cost: one indexed leg query, narrow route timing populate, and one
Job.updateOne. This should be no slower than the current hidden hook chain and more predictable as long as Phase 2 indexes are in place. - If high-leg-count jobs show measurable latency, add a leg-level duration snapshot so job duration can be summed without route populate.
- Source of truth should be the
- New
src/services/legs/recomputeLegTotals.js. - New
src/services/invoices/recalculateInvoiceTotals.js(also feeds Phase 4). - Define transaction boundaries:
- Leg write + dependent Job recalc β one transaction.
- Invoice write + Invoice totals recalc β one transaction.
- Payout creation + Leg
payoutIdassignment β one transaction. - Use
session.withTransaction()(Atlas replica set required β already true).
- Unit tests; integration test that proves each new service produces identical results to the pre-hook version against seeded fixtures.
3b. Feature-flag the hooks off and call services explicitlyβ
- Introduce
features.legacyMongoHooksflag (defaulttrue, per tenant). - Wrap every
Leg.pre("save"|"findOneAndUpdate"|"create")andjobMixinhook inif (!flags.legacyMongoHooks) return next();. - Update every controller/service that writes Legs/Jobs to call the explicit recalc inside the transaction.
- QA: run
tests/controllers/legs/*with the flag both on and off. Both must pass.
3c. Rollout & removalβ
- Flip flag to
falseon the internal tenant first, monitor 48h. - Flip per-tenant for a small pilot, then all.
- After 1 week clean, delete the hooks entirely.
Backward compat: flag flips behavior instantly. Data is unchanged β the hooks were only maintaining derived fields the new services also maintain.
Rollback: flip flag back on.
Gate: zero divergence between pre-flag and post-flag job.estimatedDuration in shadow comparison on a sampled tenant.
Phase 4 β Remove Invoice.post('find' | 'findOne')β
4a. Backfill calculatedTotalCharges (one-time migration)β
- New
scripts/backfill-invoice-calculated-totals.js. Iterates invoices in tenant-scoped batches, populatesjobswithcharges.*, writescalculatedTotalCharges. Idempotent (skip when already set), supports--dry-run, resume token inmigrationscollection.
4b. Read code changeβ
- Controllers always read
calculatedTotalCharges. If missing (0/null), callrecalculateInvoiceTotals(invoiceId)lazily β but only inside the write path, never inside a GET.
4c. Remove the post('find') and post('findOne') hooksβ
- Only after backfill is 100% complete and 48h of green telemetry on the new read path.
Backward compat: during 4a/4b both the hook and the backfill produce the same value, so double-compute is harmless. During 4b old clients still work. Only 4c removes behavior. Rollback: revert 4c; the hook returns.
Phase 5 β Retire mongoose-autopopulateβ
Model by model. Each model is one PR.
Order (lowest blast-radius first):
RouteLocationManifestPayoutInvoiceLegUserCustomer
Per-model checklist:
- Search every populate site for the model's references.
- Convert each to explicit
.populate({ path, select }). - Add
mongoose-lean-virtualsto the schema (solean({ virtuals: true })works). - Remove
autopopulate: truefrom schema paths and remove the plugin registration. - Run full test suite + relevant QA matrix before next model.
Backward compat: per-model. Missing explicit populate yields an unpopulated field rather than wrong data; caught in tests before merge. Rollback: revert the PR; plugin flips back on.
Phase 6 β Unify parentCompany to ObjectIdβ
The trickiest phase. Pattern: dual-read, single-write-canonical, background migrate, flip schema. Uses normalizeTenantFilter() from Phase 0 as the load-bearing abstraction.
6a. Dual-read helper, canonical-write (one PR)β
- Adopt
normalizeTenantFilter(value)in every query that filters byparentCompany. Audit by grepping for{ parentCompanyand converting each call site. - Introduce a write helper that always writes
parentCompanyas anObjectId. Existing string docs stay strings until the per-collection migration runs. - No schema type changes yet.
- Add metric:
tenant_query.fallback_hitsβ counter of queries where the dual-read fallback returned a string-shaped match.
6b. Per-collection background migrationβ
- One migration script per affected collection (Leg, Job, Client, Route, Address, Location, Task, Accessory, Driver, Equipment, History, InventoryItem, Media, Message, Organization, Recipe, Shift, TaskTemplate, TaskTemplateGroup, TimelineItem, ChatConversation).
- Each script:
- Walks the collection in batches.
- Skips already-ObjectId docs.
- Updates via
updateOnewith{ $set: { parentCompany: ObjectId(...) } }. - Tracks progress in
migrationscollection:{ name, lastId, completedAt }.
- Pilot order (least downtime):
- Internal tenant only (
--tenant=ferda-tech). Validate end-to-end for 48h. - Smallest 25% of tenants by document count. Validate 48h.
- Next 50%. Validate 24h.
- Largest tenants.
- Internal tenant only (
- Per-collection order within each tenant batch: leaf collections first (
Address,Accessory,Media), then mid-tier (Route,Location,Client), then high-volume (Leg,Job,Task).
6c. Flip schema typeβ
- Once a collection reports 100% ObjectId for all tenants AND
tenant_query.fallback_hits == 0for 48h, change the schema type fromStringtoSchema.Types.ObjectId, ref: "Customer". - Keep the dual-read helper in place for one more release in case of stragglers, then simplify it to a straight equality match.
6d. Standardize refβ
- While in there, ensure every model declares
ref: "Customer"on itsparentCompanysopopulateand$lookupwork uniformly.
Backward compat: the dual-read helper makes shape irrelevant to query results. Writes are canonical from 6a forward, so no drift accumulates. Rollback: at any point, revert the most recent PR; the dual-read helper keeps prior data readable. Gates:
- 6a ships only when every production query path uses
normalizeTenantFilter. - 6b marks a collection "done" only when
migrations.completedAtis set for every tenant andtenant_query.fallback_hits == 0for 24h. - 6c ships per-collection, never bulk.
Phase 7 β Tenant-scoped uniquesβ
Blocked on Phase 6c for the affected models, but Organization work can start during 6b.
Changes:
- Pre-flight collision scan:
scripts/check-tenant-unique-collisions.jsfor each(parentCompany, field)pair. Resolve collisions (rename) before applying the constraint.Organization:(parentCompany, name)Client:(parentCompany, slug)
- Build the non-unique compound index first; verify zero duplicates via aggregation.
- In a separate PR: drop the global
unique: true, rebuild as{ unique: true }compound index. - Slug/name generators stay tenant-aware (Organization is user-supplied; Client.slug already tenant-scoped semantically).
Backward compat: non-unique β validate β unique is fully safe. Old global unique is dropped only after the new compound unique is confirmed built. Rollback: drop the new unique, restore the old one if collisions appear (after another rename pass).
Phase 8 β Ergonomics & hygieneβ
Can interleave with later phases. Low risk.
Changes:
src/models/Leg.jsβ remove duplicate plugin registrations.- New
src/models/plugins/baseSchemaPlugin.jsthat applies the standard plugins (paginate, aggregate-paginate, lean-virtuals) once. Migrate every model to use it. - Standardize timestamps: keep
dateCreated/dateUpdated(renaming is too expensive across consumers), but add explicit indexes where list endpoints sort by them. - Remove deprecated
useNewUrlParser/useUnifiedTopologyfromscripts/*.js. Schema.Types.Mixedaudit: switch in-place mutations to assignment (doc.metadata = { ...doc.metadata, ... }) or addmarkModifiedcalls. Focus models:Job.metadata,Invoice.totals,Leg.calculationMetadata,Config.configs,Customer.metadata.- Extract typed sub-schemas for stable shapes:
Job.approval.history,Leg.statusHistory,Leg.statusExceptions.
Observability prerequisitesβ
Must be in place before Phase 3:
- Mongoose query timing histogram (extend
src/middlewares/performanceMonitor.js). - Counter:
tenant_query.fallback_hitsper collection (Phase 6). - Counter:
mongoose_autopopulate.activationsper collection (Phase 5). Wrap the plugin during the transition. - Atlas slow-query alerts on
legs,jobs,invoices. - Feature-flag state snapshot exported with every request log.
Rollback Strategy by Phaseβ
| Phase | Rollback action |
|---|---|
| 0 | PR revert |
| 1 | PR revert |
| 2 | Drop new indexes via scripts/manage-indexes.js --drop |
| 3 | Flip features.legacyMongoHooks back to true |
| 4 | Revert 4c PR; hook returns; backfill remains correct |
| 5 | Revert per-model PR; autopopulate plugin returns |
| 6 | Revert most recent PR; dual-read helper keeps both shapes readable |
| 7 | Drop new compound unique; restore prior global unique |
| 8 | PR revert |
Open Itemsβ
- Tenant pilot list for Phase 6: confirm
ferda-techas the canary, plus 2 friendly customers willing to be in the smallest 25% batch. - Transaction support: confirm Atlas tier supports cross-document transactions on the
legs/jobs/invoicescollections (M10+ replica set β already true). - CI gate: decide whether to add a CI check that fails when a new schema introduces
parentCompany: String(so we don't regress during Phase 6 rollout).
Referencesβ
- Audit transcript: chat that produced this plan (linked from team notes).
- Mongoose 8 docs
- MongoDB ESR rule
- Mongoose transactions
- Mongoose middleware caveats