email-organizer/docs/design/processed-emails-spec.md
2025-08-06 15:38:49 -07:00

Processed Emails Feature Specification

Overview

This document outlines the specification for implementing a feature to persistently track which emails have been processed by the Email Organizer system. The goal is to maintain a record of email processing status to avoid reprocessing the same emails during synchronization and provide accurate pending email counts.

Requirements

1. Email Tracking Requirements

  • Unique Email Identification: Identify each email by the UID that the IMAP server assigns, combined with the folder name and user ID, because an IMAP UID is unique only within a single folder
  • Processing Status: Mark emails as either "pending" (unprocessed) or "processed"
  • Minimal Data Storage: Store only essential information - email UID, folder, user, and processing status - not email content, subjects, or bodies
  • Persistence: Maintain processing status across application restarts and synchronization cycles
  • Efficient Lookup: Quickly determine which emails in a folder are pending processing

2. Synchronization Requirements

  • Initial Sync: During first synchronization of a folder, all emails should be marked as "pending"
  • Incremental Sync: On subsequent syncs, only emails that haven't been processed should be identified as pending
  • Status Update: When an email is processed, update its status from "pending" to "processed"
  • Cleanup: Remove records for emails that no longer exist on the IMAP server (optional for future enhancement)
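
The synchronization rules above reduce to set operations over UIDs. The following is a minimal sketch, assuming the stored state for one folder is available as a mapping from UID to its processed flag. The names `plan_sync` and `SyncPlan` are hypothetical and are not part of this specification.

```python
from typing import Dict, Iterable, List, NamedTuple


class SyncPlan(NamedTuple):
    """Hypothetical result type: what one sync pass should do for a folder."""
    new_pending: List[str]  # UIDs seen for the first time; insert as pending
    pending: List[str]      # every UID that is pending after this sync
    orphaned: List[str]     # stored UIDs no longer on the server (cleanup candidates)


def plan_sync(stored: Dict[str, bool], server_uids: Iterable[str]) -> SyncPlan:
    """Compare stored records (uid -> is_processed) with the server's UID list."""
    server = set(server_uids)
    known = set(stored)
    new = server - known
    # Known UIDs that still exist on the server and have not been processed yet.
    still_pending = {uid for uid in known & server if not stored[uid]}
    orphaned = known - server
    # Assumption: UIDs are numeric strings, so sort them numerically.
    return SyncPlan(
        new_pending=sorted(new, key=int),
        pending=sorted(new | still_pending, key=int),
        orphaned=sorted(orphaned, key=int),
    )
```

On an initial sync `stored` is empty, so every server UID lands in both `new_pending` and `pending`. On later syncs, processed UIDs drop out of `pending` and only unseen UIDs are inserted.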

3. Performance Requirements

  • Efficient Storage: Use appropriate database indexing for fast lookups
  • Minimal Memory Usage: Store only essential data to keep memory footprint low
  • Batch Processing: Support batch operations for processing multiple emails efficiently

Data Model Design

ProcessedEmails Table

erDiagram
    USER {
        int id PK "Primary Key"
        string email "Unique, Not Null"
        string first_name "Not Null"
        string last_name "Not Null"
        string password_hash "Not Null"
        json imap_config "JSON Configuration"
        datetime created_at "Default: UTC Now"
        datetime updated_at "Default: UTC Now, On Update"
    }
    
    FOLDER {
        int id PK "Primary Key"
        int user_id FK "Foreign Key to User"
        string name "Not Null"
        text rule_text "Natural Language Rule"
        int priority "Processing Order"
        boolean organize_enabled "Default: True"
        int total_count "Default: 0"
        int pending_count "Default: 0"
        json recent_emails "JSON Array"
        datetime created_at "Default: UTC Now"
        datetime updated_at "Default: UTC Now, On Update"
    }
    
    PROCESSED_EMAIL {
        int id PK "Primary Key"
        int user_id FK "Foreign Key to User"
        int folder_id FK "Foreign Key to Folder"
        string email_uid "Not Null, IMAP Email UID"
        string folder_name "Not Null, IMAP Folder Name"
        boolean is_processed "Default: False, Processing Status"
        datetime first_seen_at "Default: UTC Now, First seen during sync"
        datetime processed_at "Nullable, When email was processed"
        datetime created_at "Default: UTC Now"
        datetime updated_at "Default: UTC Now, On Update"
    }
    
    USER ||--o{ FOLDER : "has"
    USER ||--o{ PROCESSED_EMAIL : "has"
    FOLDER ||--o{ PROCESSED_EMAIL : "has"

Column Specifications

| Table | Column | Data Type | Constraints | Description |
| --- | --- | --- | --- | --- |
| PROCESSED_EMAIL | id | Integer | Primary Key, Autoincrement | Unique identifier for each processed email record |
| PROCESSED_EMAIL | user_id | Integer | Foreign Key to User, Not Null | Reference to the user who owns this email |
| PROCESSED_EMAIL | folder_id | Integer | Foreign Key to Folder, Not Null | Reference to the folder this email belongs to |
| PROCESSED_EMAIL | email_uid | String(255) | Not Null | Unique ID of the email from IMAP server |
| PROCESSED_EMAIL | folder_name | String(255) | Not Null | Name of the IMAP folder (for redundancy) |
| PROCESSED_EMAIL | is_processed | Boolean | Default: False | Processing status (false=pending, true=processed) |
| PROCESSED_EMAIL | first_seen_at | DateTime | Default: datetime.utcnow | First time this email was detected during sync |
| PROCESSED_EMAIL | processed_at | DateTime | Nullable | When the email was marked as processed |
| PROCESSED_EMAIL | created_at | DateTime | Default: datetime.utcnow | Record creation timestamp |
| PROCESSED_EMAIL | updated_at | DateTime | Default: datetime.utcnow, On Update | Record update timestamp |

Relationships

  • User to ProcessedEmail: One-to-many relationship - each user can have multiple processed email records
  • Folder to ProcessedEmail: One-to-many relationship - each folder can have multiple processed email records
  • Uniqueness Constraint: The combination of (user_id, folder_name, email_uid) must be unique to prevent duplicate records; this is a unique constraint, and id remains the primary key

Database Indexes

  • Primary key index on id
  • Foreign key indexes on user_id and folder_id
  • Composite unique index on (user_id, folder_name, email_uid)
  • Index on folder_name for faster folder-based queries
  • Index on is_processed for filtering pending emails
  • Index on first_seen_at for tracking recently added emails
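
To make the index list concrete, here is one possible rendering of the table and its indexes. It is a sketch only: it assumes SQLite (the specification does not name a database engine), omits the foreign key clauses so it runs standalone, and uses hypothetical index names and a hypothetical helper `add_pending`.

```python
import sqlite3

# Assumption: SQLite syntax. Foreign keys to the user and folder tables are
# left out so the sketch is self-contained, and the "On Update" behaviour of
# updated_at is not implemented here (SQLite would need a trigger for it).
SCHEMA = """
CREATE TABLE processed_emails (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id       INTEGER NOT NULL,
    folder_id     INTEGER NOT NULL,
    email_uid     TEXT    NOT NULL,
    folder_name   TEXT    NOT NULL,
    is_processed  INTEGER NOT NULL DEFAULT 0,
    first_seen_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processed_at  TEXT,
    created_at    TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at    TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE UNIQUE INDEX ux_processed_emails_identity
    ON processed_emails (user_id, folder_name, email_uid);
CREATE INDEX ix_processed_emails_user_id ON processed_emails (user_id);
CREATE INDEX ix_processed_emails_folder_id ON processed_emails (folder_id);
CREATE INDEX ix_processed_emails_folder_name ON processed_emails (folder_name);
CREATE INDEX ix_processed_emails_is_processed ON processed_emails (is_processed);
CREATE INDEX ix_processed_emails_first_seen_at ON processed_emails (first_seen_at);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the table and every index listed in the specification."""
    conn.executescript(SCHEMA)


def add_pending(conn: sqlite3.Connection, user_id: int, folder_id: int,
                folder_name: str, email_uid: str) -> bool:
    """Insert a pending record; return False if the unique index rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed_emails "
                "(user_id, folder_id, folder_name, email_uid) VALUES (?, ?, ?, ?)",
                (user_id, folder_id, folder_name, email_uid),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The composite unique index is what prevents a second record for the same (user_id, folder_name, email_uid) triple; the same UID under a different user or folder is still accepted.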

Service Design

ProcessedEmailsService

A new service class will be responsible for managing processed email records:

from typing import List  # User is the application's existing user model

class ProcessedEmailsService:
    def __init__(self, user: User):
        self.user = user
    
    def get_pending_emails(self, folder_name: str) -> List[str]:
        """Get list of email UIDs that are pending processing in a folder."""
        
    def mark_email_processed(self, folder_name: str, email_uid: str) -> bool:
        """Mark an email as processed."""
        
    def mark_emails_processed(self, folder_name: str, email_uids: List[str]) -> int:
        """Mark multiple emails as processed in bulk."""
        
    def sync_folder_emails(self, folder_name: str, email_uids: List[str]) -> int:
        """Sync email UIDs for a folder, adding new ones as pending."""
        
    def get_pending_count(self, folder_name: str) -> int:
        """Get count of pending emails for a folder."""
        
    def cleanup_old_records(self, folder_name: str, current_uids: List[str]) -> int:
        """Remove records for emails that no longer exist in the folder."""
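
The method stubs above leave the storage layer open. As an illustration only, the sketch below implements four of them against SQLite, assuming a plain integer `user_id` in place of the `User` model and a reduced table with just the columns these methods touch. The class name `SqliteProcessedEmails` is hypothetical.

```python
import sqlite3
from typing import List


class SqliteProcessedEmails:
    """Hypothetical stand-in for ProcessedEmailsService, keyed by a plain user_id."""

    def __init__(self, conn: sqlite3.Connection, user_id: int):
        self.conn = conn
        self.user_id = user_id
        # Reduced table for the sketch: only the columns used below.
        conn.execute(
            """CREATE TABLE IF NOT EXISTS processed_emails (
                   user_id      INTEGER NOT NULL,
                   folder_name  TEXT    NOT NULL,
                   email_uid    TEXT    NOT NULL,
                   is_processed INTEGER NOT NULL DEFAULT 0,
                   processed_at TEXT,
                   UNIQUE (user_id, folder_name, email_uid))"""
        )

    def sync_folder_emails(self, folder_name: str, email_uids: List[str]) -> int:
        """Insert unseen UIDs as pending; return how many records were new."""
        before = self.conn.total_changes
        self.conn.executemany(
            "INSERT OR IGNORE INTO processed_emails (user_id, folder_name, email_uid) "
            "VALUES (?, ?, ?)",
            [(self.user_id, folder_name, uid) for uid in email_uids],
        )
        self.conn.commit()
        return self.conn.total_changes - before

    def mark_emails_processed(self, folder_name: str, email_uids: List[str]) -> int:
        """Flip pending records to processed; return how many actually changed."""
        before = self.conn.total_changes
        self.conn.executemany(
            "UPDATE processed_emails "
            "SET is_processed = 1, processed_at = CURRENT_TIMESTAMP "
            "WHERE user_id = ? AND folder_name = ? AND email_uid = ? AND is_processed = 0",
            [(self.user_id, folder_name, uid) for uid in email_uids],
        )
        self.conn.commit()
        return self.conn.total_changes - before

    def get_pending_emails(self, folder_name: str) -> List[str]:
        """Return pending UIDs for the folder in ascending numeric order."""
        rows = self.conn.execute(
            "SELECT email_uid FROM processed_emails "
            "WHERE user_id = ? AND folder_name = ? AND is_processed = 0 "
            "ORDER BY CAST(email_uid AS INTEGER)",
            (self.user_id, folder_name),
        )
        return [row[0] for row in rows]

    def get_pending_count(self, folder_name: str) -> int:
        """Count pending records for the folder."""
        row = self.conn.execute(
            "SELECT COUNT(*) FROM processed_emails "
            "WHERE user_id = ? AND folder_name = ? AND is_processed = 0",
            (self.user_id, folder_name),
        ).fetchone()
        return row[0]
```

Every query filters on `user_id`, so one user's records never leak into another user's results. `INSERT OR IGNORE` relies on the unique constraint to make repeated syncs idempotent.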

IMAPService Integration

The existing IMAP service will be enhanced to use the ProcessedEmailsService:

from typing import List, Tuple  # User and ProcessedEmailsService come from the application

class IMAPService:
    def __init__(self, user: User):
        self.user = user
        self.config = user.imap_config or {}
        self.connection = None
        self.processed_emails_service = ProcessedEmailsService(user)
    
    def get_folder_email_count(self, folder_name: str) -> int:
        """Get the count of emails in a specific folder, considering processed status."""
        
    def get_pending_emails(self, folder_name: str) -> List[str]:
        """Get email UIDs that are pending processing."""
        
    def sync_folders(self) -> Tuple[bool, str]:
        """Sync IMAP folders with local database, tracking email processing status."""
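
To track UIDs, the IMAP service needs a way to list them per folder. The sketch below shows one way to do that with the standard library's `imaplib`. The helper names `parse_uid_search` and `fetch_folder_uids` are hypothetical, and the sketch assumes an already authenticated connection.

```python
import imaplib
from typing import List, Optional, Sequence


def parse_uid_search(data: Sequence[Optional[bytes]]) -> List[str]:
    """Turn the payload of a UID SEARCH response into a list of UID strings."""
    uids: List[str] = []
    for chunk in data:
        if chunk:  # an empty folder yields an empty or missing payload
            uids.extend(chunk.decode("ascii").split())
    return uids


def fetch_folder_uids(conn: imaplib.IMAP4, folder_name: str) -> List[str]:
    """List every message UID in a folder without changing message flags."""
    # readonly=True avoids altering flags such as \Seen.
    # Assumption: folder names containing spaces may need quoting by the caller.
    typ, _ = conn.select(folder_name, readonly=True)
    if typ != "OK":
        raise RuntimeError(f"cannot select folder {folder_name!r}")
    typ, data = conn.uid("SEARCH", None, "ALL")
    if typ != "OK":
        raise RuntimeError("UID SEARCH failed")
    return parse_uid_search(data)
```

One caveat worth weighing in the design: IMAP guarantees that UIDs stay stable only while a folder's UIDVALIDITY value is unchanged. This specification does not store UIDVALIDITY, so a server that resets it could make stored UIDs point at different messages.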

API Endpoints

New HTMX Endpoints for Processed Email Management

  1. Get Pending Emails for a Folder

    • Method: GET
    • Path: /api/folders/<folder_id>/pending-emails
    • Response: A dialog listing metadata for each pending email (subject, date, UID), with a button to preview the email (fetched from the IMAP server)
  2. Mark Email as Processed

    • Method: POST
    • Path: /api/folders/<folder_id>/emails/<email_uid>/process
    • Action: Mark a specific email as processed
    • Response: Updated dialog body.

Workflow Integration

Email Processing Flow

sequenceDiagram
    participant U as User
    participant B as Browser
    participant M as Main Blueprint
    participant I as IMAP Service
    participant P as ProcessedEmails Service
    participant DB as Database
    
    U->>B: Click "Sync Folders"
    B->>M: POST /api/imap/sync
    M->>I: Sync folders with processed email tracking
    I->>I: Connect to IMAP server
    I->>I: Get list of email UIDs for folder
    I->>P: sync_folder_emails(folder_name, email_uids)
    P->>DB: Create pending email records
    P->>I: Return count of newly added pending emails
    I->>M: Return sync results
    M->>B: Update UI with pending counts

Email Processing Status Update

sequenceDiagram
    participant U as User
    participant B as Browser
    participant M as Main Blueprint
    participant P as ProcessedEmails Service
    participant DB as Database
    
    U->>B: Trigger email processing
    B->>M: POST /api/folders/<folder_id>/process-emails
    M->>P: mark_emails_processed(folder_name, email_uids)
    P->>DB: Update email processing status
    P->>M: Return success count
    M->>B: Update UI with new counts

Migration Strategy

Phase 1: Data Model Implementation

  1. Create the processed_emails table with appropriate indexes
  2. Implement the ProcessedEmailsService class
  3. Add basic CRUD operations for email processing records

Phase 2: IMAP Service Integration

  1. Update IMAPService to use ProcessedEmailsService
  2. Modify folder synchronization to track email UIDs
  3. Update email count methods to consider processing status

Phase 3: API and UI Integration

  1. Add API endpoints for processed email management
  2. Update UI to display accurate pending counts
  3. Add bulk processing capabilities

Phase 4: Optimization and Cleanup

  1. Implement batch processing for performance
  2. Add periodic cleanup of orphaned records
  3. Optimize database queries for large datasets

Security Considerations

  1. Access Control: Ensure users can only access their own email processing records
  2. Data Validation: Validate all email UIDs and folder names to prevent injection attacks
  3. Rate Limiting: Implement rate limiting for email processing endpoints to prevent abuse
  4. Data Privacy: Ensure no sensitive email content is stored in the database
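
The data validation point can be made concrete with two small checks. This is a sketch under stated assumptions: the function names are hypothetical, the UID range follows the IMAP definition of a UID as a non-zero unsigned 32-bit integer, and the folder name rules (length limit from the String(255) column, no control characters) are an illustrative policy rather than one this specification mandates.

```python
MAX_IMAP_UID = 2**32 - 1        # IMAP UIDs are non-zero unsigned 32-bit integers
MAX_FOLDER_NAME_LENGTH = 255    # matches the String(255) column


def is_valid_email_uid(value: str) -> bool:
    """Accept only ASCII digit strings in the range 1..2**32-1."""
    if not (value.isascii() and value.isdigit()):
        return False
    return 1 <= int(value) <= MAX_IMAP_UID


def is_valid_folder_name(value: str) -> bool:
    """Reject empty, overlong, or control-character-bearing folder names."""
    if not value or len(value) > MAX_FOLDER_NAME_LENGTH:
        return False
    # Control characters (including CR and LF) could be used to smuggle
    # extra commands into an IMAP request line.
    return not any(ord(ch) < 32 or ord(ch) == 127 for ch in value)
```

Checks like these complement, rather than replace, parameterized SQL queries and the per-user filtering required by the access control point.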

Performance Considerations

  1. Database Indexing: Index the columns used in lookups and filters (user_id, folder_id, folder_name, email_uid, is_processed, first_seen_at)
  2. Batch Operations: Use batch operations for processing multiple emails
  3. Memory Management: Process emails in batches to avoid memory issues with large mailboxes
  4. Caching: Consider caching frequently accessed email processing status
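
Batching, as described in the list above, can be as simple as slicing the UID list into fixed-size groups before each database or IMAP call. A minimal sketch follows; the helper name `chunked` is hypothetical, and the right batch size is something to measure rather than assume.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

A caller could then loop with `for batch in chunked(email_uids, 500):` and pass each batch to `mark_emails_processed`, which keeps memory use and statement size bounded for large mailboxes. The value 500 is only a placeholder.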

Future Enhancements

  1. Email Movement Tracking: Track when emails are moved between folders
  2. Processing History: Maintain a history of email processing actions
  3. Email Deduplication: Handle duplicate emails across folders
  4. Automated Cleanup: Periodic cleanup of old or orphaned processing records
  5. Analytics: Provide insights into email processing patterns and efficiency