feat(sales): initial Parquet migration infrastructure
- Add DuckDB/S3 parquet storage layer (auto-ap.storage.parquet)
- Add sales_to_parquet migration script for historical data
- Add cleanup_sales for post-migration Datomic cleanup
- Add sales_orders_new.clj with DuckDB read layer for SSR views
- Add test scaffolding for parquet storage
- Add plan document for move-detailed-sales-to-parquet

feat(sales): redirect production and read flows to Parquet/DuckDB

- U3: Square production (upsert) now buffers to parquet via flatten-order-to-parquet!
- U3: EzCater core import-order now buffers to parquet instead of Datomic transact
- U3: EzCater XLS upload-xls now buffers to parquet instead of audit-transact
- U4: Rewrite sales_orders.clj to read from DuckDB via pq/get-sales-orders
- U5: Rewrite sales_summaries to use parquet aggregation functions
  - get-payment-items-parquet, get-discounts-parquet, get-refund-items-parquet
  - get-tax-parquet, get-tip-parquet, get-sales-parquet
- Add sum-* aggregation functions to storage/sales_summaries.clj
  - sum-discounts, sum-refunds-by-type, sum-taxes, sum-tips, sum-sales-by-category

---
title: Move Detailed Sales Data to DuckDB and Parquet
type: refactor
status: active
date: 2026-04-24
---

# Move Detailed Sales Data to DuckDB and Parquet

## Overview

Detailed sales records (orders, charges, line items, refunds) are currently stored in Datomic. Because Datomic is append-only, this high-volume data causes significant storage bloat. We will move these details to Parquet files stored on S3, using DuckDB as the query engine for views and summaries, while keeping the high-level `sales-summaries` in Datomic for ledger calculations.

---

## Problem Frame

The system stores every individual sale and payment detail in Datomic. While useful for auditing, this data is rarely accessed in detail after a few weeks, yet it permanently increases the Datomic database size. The app needs a "colder" but still queryable storage layer for these details.

---

## Requirements Trace

- R1. Detailed sales/payment entities must be moved from Datomic to Parquet files on S3.
- R2. `sales-summaries` must remain in Datomic to ensure ledger calculations remain performant and stable.
- R3. The "Sales Orders" and "Payments" views must continue to function (filtering, sorting, pagination) by querying the Parquet files via DuckDB.
- R4. The daily sales summary job must be updated to aggregate data from DuckDB instead of Datomic.
- R5. The system must handle "voids" of payments/orders in an immutable file format.

---

## Scope Boundaries

- **In Scope:**
  - Implementation of Parquet writer for sales data.
  - DuckDB integration for reading S3 Parquet files.
  - Migration of existing detailed data from Datomic to S3.
  - Updating the summary aggregation job.
- **Out of Scope:**
  - Moving `sales-summaries` out of Datomic.
  - Implementing a real-time streaming pipeline (sticking to batch/daily flushes).

---

## Context & Research

### Relevant Code and Patterns

- **Production Flow:** `auto-ap.square.core3`, `auto-ap.ezcater.core`, and `auto-ap.routes.ezcater-xls` all produce tagged maps that are currently sent to `dc/transact`.
- **Read Flow:** `auto-ap.datomic.sales-orders` and `auto-ap.ssr.payments` perform the current Datomic queries.
- **Aggregation:** `auto-ap.jobs.sales-summaries` uses `dc/q` to sum totals for the day.

---

## Key Technical Decisions

- **Storage Format:** Parquet. It is columnar, highly compressed, and natively supported by DuckDB.
- **Storage Location:** AWS S3. This removes the need for a managed database server.
- **Query Engine:** DuckDB. It can query Parquet files directly on S3 without importing them into a local database.
- **Write Strategy:** Daily Batch. To avoid the "small file problem" in S3/Parquet, data will be buffered (locally or in a staging table) and flushed as one file per day: `s3://bucket/sales-details/YYYY-MM-DD.parquet`.
- **Voiding Strategy:** Append-only log. A "void" is simply a new record with the same `external-id` and a `status: voided`. The read query will always select the record with the latest timestamp for a given ID.
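
To make the write/read layout concrete, here is a minimal sketch of the per-day key scheme and a date-range read. It assumes next.jdbc over the DuckDB JDBC driver with `httpfs` already configured; the namespace and helper names are illustrative, only the bucket layout and `read_parquet` come from the decisions above.

```clojure
(ns auto-ap.storage.layout-sketch
  (:require [clojure.string :as str]
            [next.jdbc :as jdbc])
  (:import (java.time LocalDate)))

(defn day-keys
  "S3 keys for every business date in [from, to], following the one-file-per-day layout."
  [bucket ^LocalDate from ^LocalDate to]
  (->> (iterate #(.plusDays ^LocalDate % 1) from)
       (take-while #(not (.isAfter ^LocalDate % to)))
       (mapv #(format "s3://%s/sales-details/%s.parquet" bucket %))))

(defn read-range
  "Queries only the files for the requested range, so DuckDB never scans other days."
  [conn bucket ^LocalDate from ^LocalDate to]
  (let [file-list (str/join ", " (map #(str "'" % "'") (day-keys bucket from to)))]
    (jdbc/execute! conn [(str "SELECT * FROM read_parquet([" file-list "])")])))
```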

---

## Implementation Units

- U1. **S3 Storage & DuckDB Infrastructure**

**Goal:** Set up the S3 bucket structure and the DuckDB connection utility.

**Requirements:** R1, R3

**Dependencies:** None

**Files:**
- Create: `src/clj/auto_ap/storage/parquet.clj` (DuckDB connection and S3 config)

**Approach:**
- Implement a `with-duckdb` wrapper that initializes DuckDB, loads the `httpfs` extension, and configures S3 credentials.
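
A hedged sketch of what the wrapper could look like, assuming next.jdbc over the DuckDB JDBC driver; the credential env vars and the in-memory connection string are placeholders, not the final configuration.

```clojure
(ns auto-ap.storage.parquet
  (:require [next.jdbc :as jdbc]))

(defn configure-s3!
  "Loads httpfs and points DuckDB at our S3 credentials (env var names are placeholders)."
  [conn]
  (doseq [stmt ["INSTALL httpfs" "LOAD httpfs"
                (str "SET s3_region='" (or (System/getenv "AWS_REGION") "us-east-1") "'")
                (str "SET s3_access_key_id='" (System/getenv "AWS_ACCESS_KEY_ID") "'")
                (str "SET s3_secret_access_key='" (System/getenv "AWS_SECRET_ACCESS_KEY") "'")]]
    (jdbc/execute! conn [stmt])))

(defmacro with-duckdb
  "Opens an in-memory DuckDB connection, configures S3 access, and closes it afterwards."
  [[sym] & body]
  `(with-open [~sym (jdbc/get-connection {:jdbcUrl "jdbc:duckdb:"})]
     (configure-s3! ~sym)
     ~@body))

(comment
  ;; Verification from U1: a trivial query through the wrapper.
  (with-duckdb [conn]
    (jdbc/execute-one! conn ["SELECT 1 AS ok"])))
```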

**Verification:**
- A test that can run a simple `SELECT 1` via DuckDB.

---

- U2. **Parquet Writer Implementation**

**Goal:** Create a service to convert sales maps into Parquet files and upload them to S3.

**Requirements:** R1

**Dependencies:** U1

**Files:**
- Modify: `src/clj/auto_ap/storage/parquet.clj`
- Test: `test/clj/auto_ap/storage/parquet_test.clj`

**Approach:**
- Implement a `flush-to-parquet` function that takes a collection of maps and uses a library to create the file.
- Implement the S3 upload logic.
- **Recovery:** Implement a "flush-log" in the local SQLite WAL. Mark records as `flushed: true` only after receiving a successful 200 OK from S3. On startup, the system should check for unflushed records and trigger a retry.
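
One possible shape for `flush-to-parquet`, using DuckDB's own `COPY ... (FORMAT PARQUET)` to write straight to S3 rather than a separate Parquet library plus an upload step. The staging-table approach, column names, and namespace are assumptions for the sketch, not the final design.

```clojure
(ns auto-ap.storage.parquet-flush-sketch
  (:require [next.jdbc :as jdbc]
            [next.jdbc.sql :as sql]))

;; Assumed row shape for the sketch; the real schema is richer (timestamps,
;; tax/tip/discount columns, etc.) and strictly typed.
(def ^:private columns
  [:external_id :business_date :category :amount :status :sequence_number])

(defn flush-to-parquet!
  "Writes `rows` (maps keyed by `columns`) as the single Parquet file for `business-date`.
   Returns the S3 key on success, or nil when there is nothing to flush (no empty files).
   The caller should mark rows as flushed in the local SQLite WAL only after this returns."
  [conn bucket business-date rows]
  (when (seq rows)
    (let [s3-key (format "s3://%s/sales-details/%s.parquet" bucket business-date)]
      ;; Stage the rows in a DuckDB table, then let COPY write the Parquet file to S3
      ;; (httpfs must already be configured; see the with-duckdb wrapper in U1).
      (jdbc/execute! conn ["DROP TABLE IF EXISTS staging"])
      (jdbc/execute! conn [(str "CREATE TEMP TABLE staging ("
                                "external_id VARCHAR, business_date VARCHAR, category VARCHAR, "
                                "amount DOUBLE, status VARCHAR, sequence_number BIGINT)")])
      (sql/insert-multi! conn :staging columns (mapv (apply juxt columns) rows))
      (jdbc/execute! conn [(str "COPY (SELECT * FROM staging) TO '" s3-key "' (FORMAT PARQUET)")])
      s3-key)))
```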

**Test scenarios:**
- Happy path: Write a list of 10 sales orders to a Parquet file and verify it exists on S3.
- Error path: Simulate an S3 connection failure during flush and verify that records remain in the local WAL and are successfully flushed on the next attempt.
- Edge case: Handle empty data sets without creating empty files.

**Verification:**
- Successful upload of a Parquet file that is readable by an external DuckDB CLI.

---

- U3. **Redirect Production Flow**

**Goal:** Change the Square/EzCater integrations to write to the Parquet writer instead of Datomic.

**Requirements:** R1

**Dependencies:** U2

**Files:**
- Modify: `src/clj/auto_ap/square/core3.clj`
- Modify: `src/clj/auto_ap/ezcater/core.clj`
- Modify: `src/clj/auto_ap/routes/ezcater_xls.clj`

**Approach:**
- Replace `dc/transact` calls for detailed sales/charges with calls to the new `parquet/write` service.
- *Note:* Keep the transaction for any related entities that must stay in Datomic (e.g., Client updates).
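
A self-contained sketch of the buffering side of this change: where the cores previously built Datomic tx-data for detail entities, they would call something like `buffer-rows!` instead (the commit's `flatten-order-to-parquet!` would flatten an order and delegate here). The names and the in-memory atom are illustrative; the real buffer is backed by the local SQLite WAL described in U2.

```clojure
(ns auto-ap.storage.buffer-sketch)

;; Current day's detail rows, keyed by business date. The real implementation would
;; back this with the local SQLite write-ahead log so a crash before the daily flush
;; does not lose data.
(defonce ^:private buffer (atom {}))

(defn buffer-rows!
  "Queues detail rows for the daily Parquet flush instead of transacting them into Datomic."
  [business-date rows]
  (swap! buffer update business-date (fnil into []) rows))

(defn drain!
  "Returns and clears the rows for one business date; called by the daily flush job."
  [business-date]
  (let [[old _] (swap-vals! buffer dissoc business-date)]
    (get old business-date [])))
```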

**Verification:**
- Run a Square import and verify that no new detailed entities appear in Datomic, but a new Parquet file is created.

---

- U4. **DuckDB Read Layer for Views**

**Goal:** Update the "Sales Orders" and "Payments" views to fetch data from DuckDB.

**Requirements:** R3, R5

**Dependencies:** U1

**Files:**
- Modify: `src/clj/auto_ap/datomic/sales_orders.clj`
- Modify: `src/clj/auto_ap/ssr/payments.clj`
- Test: `test/clj/auto_ap/integration/graphql/checks.clj`

**Approach:**
- Replace Datomic `q` and `pull` calls with DuckDB SQL queries.
- **Performance:** To optimize pagination, implement a "Metadata Index" file on S3 (or a Datomic entity) that stores the total record count per day. Use this to calculate pagination totals without scanning all Parquet files.
- **Deterministic Voids:** Use a combination of `timestamp` and a monotonic `sequence_number` in the `QUALIFY` clause to ensure deterministic results for records updated in the same millisecond.
- Map DuckDB result sets back to the existing map formats used by the views to minimize frontend changes.
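
A sketch of the latest-record-wins query shape behind the read layer, assuming `external_id`, `ts`, `status`, and `sequence_number` columns in the Parquet schema; the file list comes from the per-day layout sketch under Key Technical Decisions, and the namespace is illustrative.

```clojure
(ns auto-ap.storage.sales-orders-read-sketch
  (:require [clojure.string :as str]
            [next.jdbc :as jdbc]))

(defn get-sales-orders
  "Latest-record-wins read over the per-day Parquet files for one date range.
   `files` is the vector of S3 keys for the range; limit/offset drive pagination
   (the total still comes from the metadata index, not from this query)."
  [conn files {:keys [limit offset] :or {limit 50 offset 0}}]
  (let [file-list (str/join ", " (map #(str "'" % "'") files))]
    (jdbc/execute! conn
      [(str "SELECT * FROM read_parquet([" file-list "]) "
            ;; One row per external_id: newest ts wins, and the monotonic sequence_number
            ;; breaks ties for updates that land in the same millisecond.
            "QUALIFY row_number() OVER "
            "(PARTITION BY external_id ORDER BY ts DESC, sequence_number DESC) = 1 "
            "ORDER BY ts DESC LIMIT " (long limit) " OFFSET " (long offset))])))
```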

**Test scenarios:**
- Happy path: List payments for a client across a date range.
- Integration: Void a payment in S3 and verify the view shows it as voided.
- Performance: Verify pagination totals load in < 200ms using the metadata index.
- Edge case: Handle two updates to the same record in the same millisecond and verify the latest sequence number wins.

**Verification:**
- The Payments table in the UI loads correctly and reflects the data in S3.

---

- U5. **Update Summary Aggregation Job**

**Goal:** Update the `sales-summaries` job to calculate totals using DuckDB.

**Requirements:** R2, R4

**Dependencies:** U1

**Files:**
- Modify: `src/clj/auto_ap/jobs/sales_summaries.clj`

**Approach:**
- In `get-payment-items`, `get-discounts`, `get-tax`, etc., replace the `dc/q` calls with DuckDB SQL `SUM` and `GROUP BY` queries against the daily Parquet files.
- Ensure the results are still written to the `sales-summary` entities in Datomic.
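
A sketch of one `sum-*` style helper. It reuses the latest-record-wins `QUALIFY` from U4 before aggregating so voided records do not inflate the totals; the column names and namespace are assumptions.

```clojure
(ns auto-ap.storage.sales-summaries-sketch
  (:require [next.jdbc :as jdbc]))

(defn sum-sales-by-category
  "Aggregates one business day's detail file into per-category totals. The job itself
   still writes the results to the `sales-summary` entities in Datomic."
  [conn bucket business-date]
  (let [s3-key (format "s3://%s/sales-details/%s.parquet" bucket business-date)]
    (jdbc/execute! conn
      [(str "SELECT category, SUM(amount) AS total "
            "FROM ("
            "  SELECT * FROM read_parquet('" s3-key "') "
            ;; Latest record per external_id wins, as in the U4 read layer.
            "  QUALIFY row_number() OVER "
            "    (PARTITION BY external_id ORDER BY ts DESC, sequence_number DESC) = 1"
            ") AS latest "
            ;; Drop records whose latest state is voided so they do not count toward totals.
            "WHERE status IS DISTINCT FROM 'voided' "
            "GROUP BY category ORDER BY category")])))
```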

**Verification:**
- Run the `sales-summaries-v2` job and verify that the resulting Datomic summaries match the values in the S3 Parquet files.

---

- U6. **Historical Data Migration**

**Goal:** Move all existing detailed sales data from Datomic to Parquet files.

**Requirements:** R1

**Dependencies:** U2

**Files:**
- Create: `src/clj/auto_ap/migration/sales_to_parquet.clj`

**Approach:**
- Write a script that iterates through all historical sales orders and payments in Datomic.
- Group them by **Business Date** (the date of the sale, not the transaction date) to ensure consistency with future DuckDB queries.
- Write each day's data to the corresponding `YYYY-MM-DD.parquet` file on S3.
- Log any records with missing dates to a "dead-letter" file for manual review.
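
A sketch of the migration loop. The fetch and flush functions are passed in (the real script would use the project's Datomic query helpers and U2's `flush-to-parquet!`), and the row keys and dead-letter file name are placeholders.

```clojure
(ns auto-ap.migration.sales-to-parquet-sketch
  (:require [clojure.java.io :as io]))

(defn migrate!
  "Groups historical detail rows by business date and writes one Parquet file per day.
   Rows whose business date is missing go to a dead-letter file for manual review."
  [datomic-conn duck-conn bucket fetch-all-detail-rows flush-to-parquet!]
  (let [rows (fetch-all-detail-rows datomic-conn)
        {dated true undated false} (group-by #(some? (:business-date %)) rows)]
    ;; One file per business date, matching the future daily flush layout.
    (doseq [[date day-rows] (sort-by key (group-by :business-date dated))]
      (flush-to-parquet! duck-conn bucket date day-rows))
    (when (seq undated)
      (spit (io/file "sales-migration-dead-letter.edn") (pr-str undated)))
    {:migrated (count dated) :dead-letter (count undated)}))
```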

**Verification:**
- The count of records migrated to S3 matches the count of detailed records in Datomic.

---

- U7. **Datomic Cleanup**

**Goal:** Remove the detailed data from Datomic to reclaim space.

**Requirements:** R1

**Dependencies:** U6

**Files:**
- Create: `src/clj/auto_ap/migration/cleanup_sales.clj`

**Approach:**
- Use `[:db/retractEntity ...]` to remove all `#:sales-order`, `#:charge`, and `#:sales-refund` entities.
- **Batching:** Perform retractions in batches (e.g., by month) with a cooldown period between batches to avoid excessive Datomic transaction log bloat and performance degradation.
- *Safety:* Only run this after verifying U6 and U4.
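
A sketch of the batched retraction with a cooldown. It batches by entity count rather than by month (either works), and `transact!` stands in for the project's Datomic transact wrapper (e.g. `dc/transact`); batch size and cooldown are placeholders.

```clojure
(ns auto-ap.migration.cleanup-sales-sketch)

(defn retract-in-batches!
  "Retracts detail entities in fixed-size batches with a cooldown between transactions,
   so the transactor and transaction log are not hammered. `eids` are the entity ids
   of the #:sales-order, #:charge, and #:sales-refund entities to remove."
  [transact! conn eids {:keys [batch-size cooldown-ms] :or {batch-size 1000 cooldown-ms 5000}}]
  (doseq [batch (partition-all batch-size eids)]
    (transact! conn {:tx-data (mapv (fn [eid] [:db/retractEntity eid]) batch)})
    (Thread/sleep cooldown-ms)))
```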

**Verification:**
- Datomic database size decreases; detailed queries in Datomic return empty, while DuckDB queries return data.

---
|
||||
|
||||
## System-Wide Impact
|
||||
|
||||
- **Interaction graph:** The integration cores now depend on the Parquet/S3 service. The SSR views and Background Jobs now depend on the DuckDB service.
|
||||
- **Error propagation:** S3 downtime will now cause "Sales Orders" views to fail and the Summary Job to fail. We should implement basic retry logic in the DuckDB wrapper.
|
||||
- **State lifecycle risks:** There is a window between the "production" of a sale and the "flush" to Parquet. If the app crashes before a flush, data could be lost. *Mitigation:* Use a small local SQLite file as a write-ahead log for the daily buffer.
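
A minimal sketch of the retry logic mentioned above, intended to wrap DuckDB/S3 calls so transient S3 errors do not immediately fail the views or the summary job; attempt counts and backoff are placeholders.

```clojure
(ns auto-ap.storage.retry-sketch)

(defn with-retry
  "Runs `f`, retrying on exception up to `max-attempts` times with linear backoff.
   Intended to wrap DuckDB/S3 reads in the views and the summary job."
  [f & {:keys [max-attempts backoff-ms] :or {max-attempts 3 backoff-ms 500}}]
  (loop [attempt 1]
    (let [result (try {:ok (f)}
                      (catch Exception e
                        (if (< attempt max-attempts) {:retry e} (throw e))))]
      (if (contains? result :ok)
        (:ok result)
        (do (Thread/sleep (* backoff-ms attempt))
            (recur (inc attempt)))))))
```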

---

## Risks & Dependencies

| Risk | Mitigation |
|------|------------|
| S3 Latency for Views | Use DuckDB's caching and only query the files for the requested date range. |
| Data Loss before Flush | Implement a local SQLite staging file for the current day's data. |
| Schema Drift | Use a strict schema for Parquet files; handle missing columns in SQL with `COALESCE`. |

---

## Sources & References

- Related code: `src/clj/auto_ap/jobs/sales_summaries.clj`
- Related code: `src/clj/auto_ap/ssr/payments.clj`
- External docs: [DuckDB S3 Integration](https://duckdb.org/docs/extensions/httpfs)