
---
title: Move Detailed Sales Data to DuckDB and Parquet
type: refactor
status: active
date: 2026-04-24
---

Move Detailed Sales Data to DuckDB and Parquet

Overview

Detailed sales records (orders, charges, line items, refunds) are currently stored in Datomic. Because Datomic is append-only, this high-volume data causes significant storage bloat. We will move these details to Parquet files stored on S3, using DuckDB as the query engine for views and summaries, while keeping the high-level sales-summaries in Datomic for ledger calculations.


Problem Frame

The system stores every individual sale and payment detail in Datomic. While useful for auditing, this data is rarely accessed in detail after a few weeks, yet it permanently increases the Datomic database size. The app needs a "colder" but still queryable storage layer for these details.


Requirements Trace

  • R1. Detailed sales/payment entities must be moved from Datomic to Parquet files on S3.
  • R2. sales-summaries must stay in Datomic so that ledger calculations remain performant and stable.
  • R3. The "Sales Orders" and "Payments" views must continue to function (filtering, sorting, pagination) by querying the Parquet files via DuckDB.
  • R4. The daily sales summary job must be updated to aggregate data from DuckDB instead of Datomic.
  • R5. The system must handle "voids" of payments/orders in an immutable file format.

Scope Boundaries

  • In Scope:
    • Implementation of Parquet writer for sales data.
    • DuckDB integration for reading S3 Parquet files.
    • Migration of existing detailed data from Datomic to S3.
    • Updating the summary aggregation job.
  • Out of Scope:
    • Moving sales-summaries out of Datomic.
    • Implementing a real-time streaming pipeline (sticking to batch/daily flushes).

Context & Research

Relevant Code and Patterns

  • Production Flow: auto-ap.square.core3, auto-ap.ezcater.core, and auto-ap.routes.ezcater-xls all produce tagged maps that are currently sent to dc/transact.
  • Read Flow: auto-ap.datomic.sales-orders and auto-ap.ssr.payments perform the current Datomic queries.
  • Aggregation: auto-ap.jobs.sales-summaries uses dc/q to sum totals for the day.

Key Technical Decisions

  • Storage Format: Parquet. It is columnar, highly compressed, and natively supported by DuckDB.
  • Storage Location: AWS S3. This removes the need for a managed database server.
  • Query Engine: DuckDB. It can query Parquet files directly on S3 without importing them into a local database.
  • Write Strategy: Daily Batch. To avoid the "small file problem" in S3/Parquet, data will be buffered (locally or in a staging table) and flushed as one file per day: s3://bucket/sales-details/YYYY-MM-DD.parquet.
  • Voiding Strategy: Append-only log. A "void" is simply a new record with the same external-id and a status: voided. The read query will always select the record with the latest timestamp for a given ID.
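
Concretely, the append-only voiding strategy means a void is just a newer row for the same id. The field names below are illustrative assumptions, not a committed schema:

```clojure
;; Two rows for the same sale under the append-only strategy.
[{:external-id     "sq_123"
  :status          "completed"
  :ts              #inst "2026-04-24T10:00:00Z"
  :sequence-number 1}
 {:external-id     "sq_123"                    ;; same id...
  :status          "voided"                    ;; ...new status
  :ts              #inst "2026-04-24T12:30:00Z"
  :sequence-number 2}]                          ;; readers keep only the latest row per id
```

The read layer then keeps, per external-id, the row with the greatest timestamp (tie-broken by sequence number, see U4); a latest status of voided means the record is void.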

Implementation Units

  • U1. S3 Storage & DuckDB Infrastructure

Goal: Set up the S3 bucket structure and the DuckDB connection utility.

Requirements: R1, R3

Dependencies: None

Files:

  • Create: src/clj/auto_ap/storage/parquet.clj (DuckDB connection and S3 config)

Approach:

  • Implement a with-duckdb wrapper that initializes DuckDB, loads the httpfs extension, and configures S3 credentials.
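
A minimal sketch of such a wrapper, assuming the DuckDB JDBC driver; the region and credential lookup are placeholders for the app's real config:

```clojure
(ns auto-ap.storage.parquet)

(defmacro with-duckdb
  "Opens an in-memory DuckDB connection, loads the httpfs extension, and
   configures S3 credentials before running `body`. Sketch only; credential
   handling is a placeholder."
  [[conn-sym] & body]
  `(with-open [~conn-sym (java.sql.DriverManager/getConnection "jdbc:duckdb:")]
     (with-open [stmt# (.createStatement ~conn-sym)]
       (.execute stmt# "INSTALL httpfs")
       (.execute stmt# "LOAD httpfs")
       ;; In practice these would come from the app's config/secret store.
       (.execute stmt# "SET s3_region='us-west-2'")
       (.execute stmt# "SET s3_access_key_id='<key-id>'")
       (.execute stmt# "SET s3_secret_access_key='<secret>'"))
     ~@body))

;; Usage, matching the U1 verification step:
;; (with-duckdb [conn]
;;   (with-open [stmt (.createStatement conn)
;;               rs   (.executeQuery stmt "SELECT 1")]
;;     (.next rs)
;;     (.getInt rs 1)))
```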

Verification:

  • A test that can run a simple SELECT 1 via DuckDB.

  • U2. Parquet Writer Implementation

Goal: Create a service to convert sales maps into Parquet files and upload them to S3.

Requirements: R1

Dependencies: U1

Files:

  • Modify: src/clj/auto_ap/storage/parquet.clj
  • Test: test/clj/auto_ap/storage/parquet_test.clj

Approach:

  • Implement a flush-to-parquet function that takes a collection of maps and writes them out as a single Parquet file (e.g. via DuckDB's COPY ... TO, or a Parquet writer library).
  • Implement the S3 upload logic.
  • Recovery: Implement a "flush-log" in the local SQLite WAL. Mark records as flushed: true only after receiving a successful 200 OK from S3. On startup, the system should check for unflushed records and trigger a retry.
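
A hedged sketch of the flush path, assuming rows are staged in a DuckDB table fed from the SQLite WAL; the bucket name, table name, and `mark-flushed!` helper are hypothetical:

```clojure
(defn flush-day-to-parquet!
  "Writes one day's buffered rows to S3 as a single Parquet file via DuckDB's
   COPY ... TO (requires the httpfs extension to be loaded). `mark-flushed!`
   is a hypothetical helper over the SQLite flush-log."
  [conn wal day]
  (with-open [stmt (.createStatement conn)]
    (.execute stmt
      (str "COPY (SELECT * FROM staging_sales "
           "      WHERE business_date = DATE '" day "') "
           "TO 's3://bucket/sales-details/" day ".parquet' "
           "(FORMAT PARQUET)")))
  ;; Only mark WAL rows as flushed once the COPY (and therefore the S3 PUT)
  ;; has succeeded; a thrown exception leaves them unflushed for retry.
  (mark-flushed! wal day))
```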

Test scenarios:

  • Happy path: Write a list of 10 sales orders to a Parquet file and verify it exists on S3.
  • Error path: Simulate an S3 connection failure during flush and verify that records remain in the local WAL and are successfully flushed on the next attempt.
  • Edge case: Handle empty data sets without creating empty files.

Verification:

  • Successful upload of a Parquet file that is readable by an external DuckDB CLI.

  • U3. Redirect Production Flow

Goal: Change the Square/EzCater integrations to write to the Parquet writer instead of Datomic.

Requirements: R1

Dependencies: U2

Files:

  • Modify: src/clj/auto_ap/square/core3.clj
  • Modify: src/clj/auto_ap/ezcater/core.clj
  • Modify: src/clj/auto_ap/routes/ezcater_xls.clj

Approach:

  • Replace dc/transact calls for detailed sales/charges with calls to the new parquet/write service.
  • Note: Keep the transaction for any related entities that must stay in Datomic (e.g., Client updates).
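
Illustratively, the change in each integration is a short redirect (the function names are assumptions based on this plan):

```clojure
;; Before: detail entities went straight to Datomic.
;; (dc/transact conn {:tx-data detail-entities})

;; After: detail rows go to the daily parquet buffer; only entities that
;; must stay in Datomic (e.g. Client updates) are still transacted.
(pq/flatten-order-to-parquet! order)          ;; hypothetical buffer call
(dc/transact conn {:tx-data client-updates})  ;; unrelated entities unchanged
```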

Verification:

  • Run a Square import and verify that no new detailed entities appear in Datomic, but a new Parquet file is created.

  • U4. DuckDB Read Layer for Views

Goal: Update the "Sales Orders" and "Payments" views to fetch data from DuckDB.

Requirements: R3, R5

Dependencies: U1

Files:

  • Modify: src/clj/auto_ap/datomic/sales_orders.clj
  • Modify: src/clj/auto_ap/ssr/payments.clj
  • Test: test/clj/auto_ap/integration/graphql/checks.clj

Approach:

  • Replace Datomic q and pull calls with DuckDB SQL queries.
  • Performance: To optimize pagination, implement a "Metadata Index" file on S3 (or a Datomic entity) that stores the total record count per day. Use this to calculate pagination totals without scanning all Parquet files.
  • Deterministic Voids: Use a combination of timestamp and a monotonic sequence_number for the QUALIFY clause to ensure deterministic results for records updated in the same millisecond.
  • Map DuckDB result sets back to the existing map formats used by the views to minimize frontend changes.
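
A sketch of the deduplicating view query; the bucket path and column names are assumptions:

```clojure
(def sales-orders-sql
  "Latest-record-wins dedup via DuckDB's QUALIFY clause; sequence_number
   breaks ties for same-millisecond updates. Sketch only."
  "SELECT *
   FROM read_parquet('s3://bucket/sales-details/*.parquet')
   WHERE business_date BETWEEN ? AND ?
   QUALIFY row_number() OVER (
             PARTITION BY external_id
             ORDER BY ts DESC, sequence_number DESC) = 1
   ORDER BY ts DESC
   LIMIT ? OFFSET ?")
```

In practice the glob should be narrowed to the files for the requested date range, so DuckDB only fetches those S3 objects.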

Test scenarios:

  • Happy path: List payments for a client across a date range.
  • Integration: Void a payment in S3 and verify the view shows it as voided.
  • Performance: Verify pagination totals load in < 200ms using the metadata index.
  • Edge case: Handle two updates to the same record in the same millisecond and verify the latest sequence number wins.

Verification:

  • The Payments table in the UI loads correctly and reflects the data in S3.

  • U5. Update Summary Aggregation Job

Goal: Update the sales-summaries job to calculate totals using DuckDB.

Requirements: R2, R4

Dependencies: U1

Files:

  • Modify: src/clj/auto_ap/jobs/sales_summaries.clj

Approach:

  • In get-payment-items, get-discounts, get-tax, etc., replace the dc/q calls with DuckDB SQL SUM and GROUP BY queries against the daily Parquet files.
  • Ensure the results are still written to the sales-summary entities in Datomic.
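
One such aggregation might look like this sketch: dedup to the latest row per id first (so voids drop out), then SUM/GROUP BY. Path and column names are illustrative:

```clojure
(def sum-sales-by-category-sql
  "Dedup to the latest row per external-id, drop voided rows, then aggregate.
   Sketch only."
  "SELECT category, SUM(amount) AS total
   FROM (SELECT *
         FROM read_parquet('s3://bucket/sales-details/2026-04-24.parquet')
         QUALIFY row_number() OVER (
                   PARTITION BY external_id
                   ORDER BY ts DESC, sequence_number DESC) = 1)
   WHERE status <> 'voided'
   GROUP BY category")
```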

Verification:

  • Run the sales-summaries-v2 job and verify that the resulting Datomic summaries match the values in the S3 Parquet files.

  • U6. Historical Data Migration

Goal: Move all existing detailed sales data from Datomic to Parquet files.

Requirements: R1

Dependencies: U2

Files:

  • Create: src/clj/auto_ap/migration/sales_to_parquet.clj

Approach:

  • Write a script that iterates through all historical sales orders and payments in Datomic.
  • Group them by Business Date (the date of the sale, not the transaction date) to ensure consistency with future DuckDB queries.
  • Write each day's data to the corresponding YYYY-MM-DD.parquet file on S3.
  • Log any records with missing dates to a "dead-letter" file for manual review.
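
The migration loop above can be sketched as follows; `all-sales-details`, `write-day-parquet!`, and `append-dead-letter!` are hypothetical helpers:

```clojure
(defn migrate-historical!
  "Groups Datomic detail rows by business date, writes one Parquet file per
   day, and routes rows without a date to a dead-letter file. Sketch only."
  [db conn]
  (let [rows (all-sales-details db)  ;; pull detail maps out of Datomic
        {dated true dateless false} (group-by #(some? (:business-date %)) rows)]
    (doseq [[day day-rows] (sort-by key (group-by :business-date dated))]
      (write-day-parquet! conn day day-rows))  ;; one YYYY-MM-DD.parquet per day
    (append-dead-letter! dateless)))           ;; missing dates -> manual review
```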

Verification:

  • The record count in Datomic matches the record count readable from the S3 Parquet files.

  • U7. Datomic Cleanup

Goal: Remove the detailed data from Datomic to reclaim space.

Requirements: R1

Dependencies: U6

Files:

  • Create: src/clj/auto_ap/migration/cleanup_sales.clj

Approach:

  • Use [:db/retractEntity ...] to remove all #:sales-order, #:charge, and #:sales-refund entities.
  • Batching: Perform retractions in batches (e.g., by month) with a cooldown period between batches to avoid excessive Datomic transaction log bloat and performance degradation.
  • Safety: Only run this after verifying U6 and U4.
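
The batched retraction could be sketched like this; the batch size and cooldown are tunable assumptions:

```clojure
(defn cleanup-sales!
  "Retracts detail entities month by month, in small batches with a cooldown,
   to avoid flooding the Datomic transaction log. Sketch only."
  [conn eids-by-month]
  (doseq [[_month eids] (sort-by key eids-by-month)]
    (doseq [batch (partition-all 500 eids)]
      (dc/transact conn {:tx-data (mapv (fn [eid] [:db/retractEntity eid]) batch)})
      (Thread/sleep 5000))))  ;; cooldown between batches
```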

Verification:

  • Datomic database size decreases; detailed queries in Datomic return empty, while DuckDB queries return data.

System-Wide Impact

  • Interaction graph: The integration cores now depend on the Parquet/S3 service. The SSR views and Background Jobs now depend on the DuckDB service.
  • Error propagation: S3 downtime will now cause "Sales Orders" views to fail and the Summary Job to fail. We should implement basic retry logic in the DuckDB wrapper.
  • State lifecycle risks: There is a window between the "production" of a sale and the "flush" to Parquet. If the app crashes before a flush, data could be lost. Mitigation: Use a small local SQLite file as a write-ahead log for the daily buffer.

Risks & Dependencies

| Risk | Mitigation |
| --- | --- |
| S3 latency for views | Use DuckDB's caching and only query the files for the requested date range. |
| Data loss before flush | Implement a local SQLite staging file for the current day's data. |
| Schema drift | Use a strict schema for Parquet files; handle missing columns in SQL with COALESCE. |

Sources & References

  • Related code: src/clj/auto_ap/jobs/sales_summaries.clj
  • Related code: src/clj/auto_ap/ssr/payments.clj
  • External docs: DuckDB S3 Integration