email-organizer/docs/design/processed-emails-spec.md
2025-08-09 21:04:21 -07:00

Processed Emails Feature Specification

Overview

This document specifies how the Email Organizer system persistently tracks which emails it has processed. Recording each email's processing status prevents the same emails from being reprocessed during synchronization and keeps pending email counts accurate.

Current Implementation Status

The Processed Emails feature is fully implemented and operational:

Core Implementation

Key Features

  • Email UID tracking for processing status
  • Pending email counts and management
  • Bulk email processing operations
  • Email metadata display and management
  • Integration with IMAP synchronization process

Requirements

1. Email Tracking Requirements

  • Unique Email Identification: Track emails using a unique identifier (UID) provided by the IMAP server, along with the folder name and user ID
  • Processing Status: Mark emails as either "pending" (unprocessed) or "processed"
  • Minimal Data Storage: Store only essential information - email UID, folder, user, and processing status - not email content, subjects, or bodies
  • Persistence: Maintain processing status across application restarts and synchronization cycles
  • Efficient Lookup: Quickly determine which emails in a folder are pending processing

2. Synchronization Requirements

  • Initial Sync: During first synchronization of a folder, all emails should be marked as "pending"
  • Incremental Sync: On subsequent syncs, only emails that haven't been processed should be identified as pending
  • Status Update: When an email is processed, update its status from "pending" to "processed"
  • Cleanup: Remove records for emails that no longer exist on the IMAP server
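
The four synchronization rules above reduce to set arithmetic over UIDs. The following is a minimal sketch, not the project's actual code: `SyncPlan` and `plan_folder_sync` are hypothetical names, and the local state is modelled as a plain dict mapping each UID to its `is_processed` flag for one user and folder.

```python
from typing import Dict, Iterable, List, NamedTuple


class SyncPlan(NamedTuple):
    to_add: List[str]     # UIDs on the server with no local record yet
    to_remove: List[str]  # local records whose UID is gone from the server
    pending: List[str]    # UIDs still awaiting processing after the sync


def plan_folder_sync(known: Dict[str, bool], server_uids: Iterable[str]) -> SyncPlan:
    """Compare local records with the server's UID list for one folder.

    `known` maps email UID -> is_processed (hypothetical in-memory stand-in
    for the rows of the processed-emails table).
    """
    server = set(server_uids)
    local = set(known)
    to_add = server - local      # initial or incremental sync: new UIDs become pending
    to_remove = local - server   # cleanup: the email no longer exists on the server
    still_pending = {uid for uid in local & server if not known[uid]}
    pending = to_add | still_pending
    # IMAP UIDs are numeric strings, so sort them by numeric value.
    return SyncPlan(
        sorted(to_add, key=int),
        sorted(to_remove, key=int),
        sorted(pending, key=int),
    )
```

On a first sync `known` is empty, so every server UID lands in both `to_add` and `pending`, which matches the Initial Sync rule.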

3. Performance Requirements

  • Efficient Storage: Use appropriate database indexing for fast lookups
  • Minimal Memory Usage: Store only essential data to keep memory footprint low
  • Batch Processing: Support batch operations for processing multiple emails efficiently

Data Model Design

ProcessedEmails Table

erDiagram
    USER {
        int id PK "Primary Key"
        string email "Unique, Not Null"
        string first_name "Not Null"
        string last_name "Not Null"
        string password_hash "Not Null"
        json imap_config "JSON Configuration"
        datetime created_at "Default: UTC Now"
        datetime updated_at "Default: UTC Now, On Update"
    }
    
    FOLDER {
        int id PK "Primary Key"
        int user_id FK "Foreign Key to User"
        string name "Not Null"
        text rule_text "Natural Language Rule"
        int priority "Processing Order"
        boolean organize_enabled "Default: True"
        int total_count "Default: 0"
        int pending_count "Default: 0"
        json recent_emails "JSON Array"
        datetime created_at "Default: UTC Now"
        datetime updated_at "Default: UTC Now, On Update"
    }
    
    PROCESSED_EMAIL {
        int id PK "Primary Key"
        int user_id FK "Foreign Key to User"
        int folder_id FK "Foreign Key to Folder"
        string email_uid "Not Null, IMAP Email UID"
        string folder_name "Not Null, IMAP Folder Name"
        boolean is_processed "Default: False, Processing Status"
        datetime first_seen_at "Default: UTC Now, First seen during sync"
        datetime processed_at "Nullable, When email was processed"
        datetime created_at "Default: UTC Now"
        datetime updated_at "Default: UTC Now, On Update"
    }
    
    USER ||--o{ FOLDER : "has"
    USER ||--o{ PROCESSED_EMAIL : "has"
    FOLDER ||--o{ PROCESSED_EMAIL : "has"

Column Specifications

| Table | Column | Data Type | Constraints | Description |
|---|---|---|---|---|
| PROCESSED_EMAIL | id | Integer | Primary Key, Autoincrement | Unique identifier for each processed email record |
| PROCESSED_EMAIL | user_id | Integer | Foreign Key to User, Not Null | Reference to the user who owns this email |
| PROCESSED_EMAIL | folder_id | Integer | Foreign Key to Folder, Not Null | Reference to the folder this email belongs to |
| PROCESSED_EMAIL | email_uid | String(255) | Not Null | UID assigned to the email by the IMAP server (unique within a folder) |
| PROCESSED_EMAIL | folder_name | String(255) | Not Null | Name of the IMAP folder (for redundancy) |
| PROCESSED_EMAIL | is_processed | Boolean | Default: False | Processing status (false=pending, true=processed) |
| PROCESSED_EMAIL | first_seen_at | DateTime | Default: datetime.utcnow | First time this email was detected during sync |
| PROCESSED_EMAIL | processed_at | DateTime | Nullable | When the email was marked as processed |
| PROCESSED_EMAIL | created_at | DateTime | Default: datetime.utcnow | Record creation timestamp |
| PROCESSED_EMAIL | updated_at | DateTime | Default: datetime.utcnow, On Update | Record update timestamp |

Relationships

  • User to ProcessedEmail: One-to-many relationship - each user can have multiple processed email records
  • Folder to ProcessedEmail: One-to-many relationship - each folder can have multiple processed email records
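  • Composite Uniqueness: The combination of (user_id, folder_name, email_uid) must be unique to prevent duplicate records. The folder name is part of the constraint because an IMAP UID is only unique within a single folder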

Database Indexes

  • Primary key index on id
  • Foreign key indexes on user_id and folder_id
  • Composite unique index on (user_id, folder_name, email_uid)
  • Index on folder_name for faster folder-based queries
  • Index on is_processed for filtering pending emails
  • Index on first_seen_at for tracking recently added emails
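
As a rough illustration of the constraints and indexes listed above, the sketch below builds an equivalent schema with Python's standard `sqlite3` module. The spec does not say which database or ORM the application uses, so SQLite, the index names, and the helper `add_pending` are assumptions; foreign-key clauses and the `created_at`/`updated_at` columns are left out for brevity.

```python
import sqlite3

# Assumed SQLite rendering of the PROCESSED_EMAIL table; index names are illustrative.
SCHEMA = """
CREATE TABLE processed_emails (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id       INTEGER NOT NULL,
    folder_id     INTEGER NOT NULL,
    email_uid     TEXT    NOT NULL,
    folder_name   TEXT    NOT NULL,
    is_processed  INTEGER NOT NULL DEFAULT 0,
    first_seen_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processed_at  TEXT,
    UNIQUE (user_id, folder_name, email_uid)
);
CREATE INDEX ix_processed_emails_user_id ON processed_emails (user_id);
CREATE INDEX ix_processed_emails_folder_id ON processed_emails (folder_id);
CREATE INDEX ix_processed_emails_folder_name ON processed_emails (folder_name);
CREATE INDEX ix_processed_emails_is_processed ON processed_emails (is_processed);
CREATE INDEX ix_processed_emails_first_seen_at ON processed_emails (first_seen_at);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the table and its indexes."""
    conn.executescript(SCHEMA)


def add_pending(conn: sqlite3.Connection, user_id: int, folder_id: int,
                folder_name: str, email_uid: str) -> bool:
    """Insert a pending record; return False if the unique constraint already holds it."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_emails "
        "(user_id, folder_id, folder_name, email_uid) VALUES (?, ?, ?, ?)",
        (user_id, folder_id, folder_name, email_uid),
    )
    return cur.rowcount == 1
```

Relying on the composite unique constraint lets a sync re-submit UIDs it has already seen without creating duplicates, and the same UID can still appear under a different folder.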

Service Design

ProcessedEmailsService

The ProcessedEmailsService (app/processed_emails_service.py) provides:

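# Postponed annotation evaluation lets the User type hint stand without an import;
# User is the application's user model, whose module this spec does not name.
from __future__ import annotations

from typing import List


class ProcessedEmailsService: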
    def __init__(self, user: User):
        self.user = user
    
    def get_pending_emails(self, folder_name: str) -> List[str]:
        """Get list of email UIDs that are pending processing in a folder."""
        
    def mark_email_processed(self, folder_name: str, email_uid: str) -> bool:
        """Mark an email as processed."""
        
    def mark_emails_processed(self, folder_name: str, email_uids: List[str]) -> int:
        """Mark multiple emails as processed in bulk."""
        
    def sync_folder_emails(self, folder_name: str, email_uids: List[str]) -> int:
        """Sync email UIDs for a folder, adding new ones as pending."""
        
    def get_pending_count(self, folder_name: str) -> int:
        """Get count of pending emails for a folder."""
        
    def cleanup_old_records(self, folder_name: str, current_uids: List[str]) -> int:
        """Remove records for emails that no longer exist in the folder."""

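One possible way to back three of these methods with SQL is sketched below. It is an illustration under stated assumptions, not the actual service: `SketchProcessedEmails` is a hypothetical class, it takes a `sqlite3` connection and a bare `user_id` instead of a `User` object, and it uses a trimmed-down table so the example stays self-contained.

```python
import sqlite3
from datetime import datetime, timezone
from typing import List


class SketchProcessedEmails:
    """Hypothetical stand-in for ProcessedEmailsService, backed by sqlite3."""

    def __init__(self, conn: sqlite3.Connection, user_id: int):
        self.conn = conn
        self.user_id = user_id
        # Reduced table: only the columns these three methods need.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS processed_emails ("
            "user_id INTEGER NOT NULL, folder_name TEXT NOT NULL, "
            "email_uid TEXT NOT NULL, is_processed INTEGER NOT NULL DEFAULT 0, "
            "processed_at TEXT, UNIQUE (user_id, folder_name, email_uid))"
        )

    def sync_folder_emails(self, folder_name: str, email_uids: List[str]) -> int:
        """Add unseen UIDs as pending; return how many records were new."""
        before = self.conn.total_changes
        self.conn.executemany(
            "INSERT OR IGNORE INTO processed_emails (user_id, folder_name, email_uid) "
            "VALUES (?, ?, ?)",
            [(self.user_id, folder_name, uid) for uid in email_uids],
        )
        self.conn.commit()
        return self.conn.total_changes - before

    def mark_emails_processed(self, folder_name: str, email_uids: List[str]) -> int:
        """Flip pending records to processed; return how many actually changed."""
        now = datetime.now(timezone.utc).isoformat()
        before = self.conn.total_changes
        self.conn.executemany(
            "UPDATE processed_emails SET is_processed = 1, processed_at = ? "
            "WHERE user_id = ? AND folder_name = ? AND email_uid = ? AND is_processed = 0",
            [(now, self.user_id, folder_name, uid) for uid in email_uids],
        )
        self.conn.commit()
        return self.conn.total_changes - before

    def get_pending_count(self, folder_name: str) -> int:
        """Count records in the folder that are still pending."""
        row = self.conn.execute(
            "SELECT COUNT(*) FROM processed_emails "
            "WHERE user_id = ? AND folder_name = ? AND is_processed = 0",
            (self.user_id, folder_name),
        ).fetchone()
        return row[0]
```

Every query filters on `user_id`, so one user's calls cannot touch another user's records, and all values are passed as bound parameters rather than interpolated into the SQL text.
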
IMAPService Integration

The IMAP service (app/imap_service.py) integrates with the ProcessedEmailsService:

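# Postponed annotation evaluation lets the User type hint stand without an import;
# User is the application's user model, whose module this spec does not name.
from __future__ import annotations

from typing import List, Tuple

from app.processed_emails_service import ProcessedEmailsService


class IMAPService: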
    def __init__(self, user: User):
        self.user = user
        self.config = user.imap_config or {}
        self.connection = None
        self.processed_emails_service = ProcessedEmailsService(user)
    
    def get_folder_email_count(self, folder_name: str) -> int:
        """Get the count of emails in a specific folder, considering processed status."""
        
    def get_pending_emails(self, folder_name: str) -> List[str]:
        """Get email UIDs that are pending processing."""
        
    def sync_folders(self) -> Tuple[bool, str]:
        """Sync IMAP folders with local database, tracking email processing status."""

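The list of UIDs handed to `sync_folder_emails` has to come from the IMAP server. A minimal sketch of that step using the standard `imaplib` module follows; `fetch_folder_uids` and `parse_uid_search` are hypothetical helper names, and how `IMAPService` actually obtains UIDs is not stated in this spec.

```python
import imaplib
from typing import List, Optional


def parse_uid_search(data: List[Optional[bytes]]) -> List[str]:
    """Turn the payload of a UID SEARCH response into a list of UID strings."""
    uids: List[str] = []
    for chunk in data:
        if chunk:  # an empty folder yields an empty payload
            uids.extend(chunk.decode("ascii").split())
    return uids


def fetch_folder_uids(conn: imaplib.IMAP4, folder_name: str) -> List[str]:
    """List every message UID in a folder without changing message flags."""
    # readonly=True opens the folder with EXAMINE, so nothing is marked as seen.
    status, _ = conn.select(folder_name, readonly=True)
    if status != "OK":
        raise RuntimeError(f"cannot select folder {folder_name!r}")
    status, data = conn.uid("SEARCH", None, "ALL")
    if status != "OK":
        raise RuntimeError(f"UID SEARCH failed for folder {folder_name!r}")
    return parse_uid_search(data)
```

Folder names containing spaces or special characters generally need to be quoted before being passed to `select`; that handling is omitted here.
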
API Endpoints

HTMX Endpoints for Processed Email Management

  1. Get Pending Emails for a Folder

    • Method: GET
    • Path: /api/folders/<folder_id>/pending-emails
    • Response: Dialog with list of email metadata for pending emails (subject, date, UID)
    • Features: Email preview, individual processing buttons
  2. Mark Email as Processed

    • Method: POST
    • Path: /api/folders/<folder_id>/emails/<email_uid>/process
    • Action: Mark a specific email as processed
    • Response: Updated dialog body with new counts
  3. Sync Emails for a Folder

    • Method: POST
    • Path: /api/folders/<folder_id>/sync-emails
    • Action: Sync emails for a specific folder with processed email tracking
    • Response: Updated counts and sync status
  4. Process Multiple Emails

    • Method: POST
    • Path: /api/folders/<folder_id>/process-emails
    • Action: Process multiple emails in a folder (mark as processed)
    • Response: Success message with updated counts

Workflow Integration

Email Processing Flow

sequenceDiagram
    participant U as User
    participant B as Browser
    participant M as Main Blueprint
    participant I as IMAP Service
    participant P as ProcessedEmails Service
    participant DB as Database
    
    U->>B: Click "Sync Folders"
    B->>M: POST /api/imap/sync
    M->>I: Sync folders with processed email tracking
    I->>I: Connect to IMAP server
    I->>I: Get list of email UIDs for folder
    I->>P: sync_folder_emails(folder_name, email_uids)
    P->>DB: Create pending email records
    P->>I: Return count of newly added pending emails
    I->>M: Return sync results
    M->>B: Update UI with pending counts

Email Processing Status Update

sequenceDiagram
    participant U as User
    participant B as Browser
    participant M as Main Blueprint
    participant P as ProcessedEmails Service
    participant DB as Database
    
    U->>B: Trigger email processing
    B->>M: POST /api/folders/<folder_id>/process-emails
    M->>P: mark_emails_processed(folder_name, email_uids)
    P->>DB: Update email processing status
    P->>M: Return success count
    M->>B: Update UI with new counts

Migration Strategy

Current Implementation Status

Phase 1: Data Model Implementation

  1. Create the processed_emails table with appropriate indexes
  2. Implement the ProcessedEmailsService class
  3. Add basic CRUD operations for email processing records

Phase 2: IMAP Service Integration

  1. Update IMAPService to use ProcessedEmailsService
  2. Modify folder synchronization to track email UIDs
  3. Update email count methods to consider processing status

Phase 3: API and UI Integration

  1. Add API endpoints for processed email management
  2. Update UI to display accurate pending counts
  3. Add bulk processing capabilities

Phase 4: Optimization and Cleanup

  1. Implement batch processing for performance
  2. Add periodic cleanup of orphaned records
  3. Optimize database queries for large datasets

Security Considerations

  1. Access Control: Ensure users can only access their own email processing records
  2. Data Validation: Validate all email UIDs and folder names to prevent injection attacks
  3. Rate Limiting: Implement rate limiting for email processing endpoints to prevent abuse
  4. Data Privacy: Ensure no sensitive email content is stored in the database
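
The Data Validation item can be made concrete with a small sketch. Both function names are hypothetical; the 255-character limit mirrors the String(255) columns in the data model, and the UID range reflects that IMAP UIDs are non-zero unsigned 32-bit integers. Validation of this kind complements parameterized queries and does not replace them.

```python
MAX_IMAP_UID = 2**32 - 1  # IMAP UIDs are non-zero unsigned 32-bit integers


def is_valid_email_uid(value: str) -> bool:
    """Accept only ASCII decimal strings within the IMAP UID range."""
    if not (value.isascii() and value.isdigit()):
        return False
    return 1 <= int(value) <= MAX_IMAP_UID


def is_valid_folder_name(value: str, max_length: int = 255) -> bool:
    """Reject empty, over-long, or control-character folder names."""
    if not value or len(value) > max_length:
        return False
    return not any(ord(ch) < 32 or ord(ch) == 127 for ch in value)
```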

Performance Considerations

  1. Database Indexing: Proper indexing on frequently queried fields
  2. Batch Operations: Use batch operations for processing multiple emails
  3. Memory Management: Process emails in batches to avoid memory issues with large mailboxes
  4. Caching: Consider caching frequently accessed email processing status

Future Enhancements

  1. Email Movement Tracking: Track when emails are moved between folders
  2. Processing History: Maintain a history of email processing actions
  3. Email Deduplication: Handle duplicate emails across folders
  4. Automated Cleanup: Periodic cleanup of old or orphaned processing records
  5. Analytics: Provide insights into email processing patterns and efficiency