# Processed Emails Feature Specification
## Overview
This document specifies a feature that persistently tracks which emails the Email Organizer system has processed. Recording each email's processing status prevents the same emails from being reprocessed during synchronization and allows pending email counts to be reported accurately.
## Requirements
### 1. Email Tracking Requirements
- **Unique Email Identification**: Track emails using a unique identifier (UID) provided by the IMAP server, along with the folder name and user ID
- **Processing Status**: Mark emails as either "pending" (unprocessed) or "processed"
- **Minimal Data Storage**: Store only essential information - email UID, folder, user, and processing status - not email content, subjects, or bodies
- **Persistence**: Maintain processing status across application restarts and synchronization cycles
- **Efficient Lookup**: Quickly determine which emails in a folder are pending processing
### 2. Synchronization Requirements
- **Initial Sync**: During first synchronization of a folder, all emails should be marked as "pending"
- **Incremental Sync**: On subsequent syncs, only emails that haven't been processed should be identified as pending
- **Status Update**: When an email is processed, update its status from "pending" to "processed"
- **Cleanup**: Remove records for emails that no longer exist on the IMAP server (optional; deferred to a future enhancement)
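The synchronization rules above reduce to set comparisons between the UIDs already recorded for a folder and the UIDs the server currently reports. The following is a minimal sketch of that comparison, not the project's implementation; the function name `plan_sync` and the shape of its return value are assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def plan_sync(known: Dict[str, bool], server_uids: Iterable[str]) -> Dict[str, List[str]]:
    """Compare stored records for one folder with the UIDs reported by the server.

    `known` maps email UID -> is_processed for records already stored (hypothetical shape).
    """
    server = set(server_uids)
    stored = set(known)
    # UIDs seen for the first time are recorded as pending.
    new_pending = sorted(server - stored)
    # Stored records that still exist on the server and are not yet processed.
    still_pending = sorted(uid for uid in server & stored if not known[uid])
    # Stored records whose UID is gone from the server: candidates for the optional cleanup.
    stale = sorted(stored - server)
    # Lists are sorted (as strings) only to make the result deterministic.
    return {
        "new_pending": new_pending,
        "pending": sorted(new_pending + still_pending),
        "stale": stale,
    }
```

On an initial sync `known` is empty, so every server UID lands in `new_pending`. On later syncs, already processed UIDs are excluded from `pending`.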
### 3. Performance Requirements
- **Efficient Storage**: Use appropriate database indexing for fast lookups
- **Minimal Footprint**: Store only essential data to keep storage and memory use low
- **Batch Processing**: Support batch operations for processing multiple emails efficiently
## Data Model Design
### ProcessedEmails Table
```mermaid
erDiagram
USER {
int id PK "Primary Key"
string email "Unique, Not Null"
string first_name "Not Null"
string last_name "Not Null"
string password_hash "Not Null"
json imap_config "JSON Configuration"
datetime created_at "Default: UTC Now"
datetime updated_at "Default: UTC Now, On Update"
}
FOLDER {
int id PK "Primary Key"
int user_id FK "Foreign Key to User"
string name "Not Null"
text rule_text "Natural Language Rule"
int priority "Processing Order"
boolean organize_enabled "Default: True"
int total_count "Default: 0"
int pending_count "Default: 0"
json recent_emails "JSON Array"
datetime created_at "Default: UTC Now"
datetime updated_at "Default: UTC Now, On Update"
}
PROCESSED_EMAIL {
int id PK "Primary Key"
int user_id FK "Foreign Key to User"
int folder_id FK "Foreign Key to Folder"
string email_uid "Not Null, IMAP Email UID"
string folder_name "Not Null, IMAP Folder Name"
boolean is_processed "Default: False, Processing Status"
datetime first_seen_at "Default: UTC Now, First seen during sync"
datetime processed_at "Nullable, When email was processed"
datetime created_at "Default: UTC Now"
datetime updated_at "Default: UTC Now, On Update"
}
USER ||--o{ FOLDER : "has"
USER ||--o{ PROCESSED_EMAIL : "has"
FOLDER ||--o{ PROCESSED_EMAIL : "has"
```
### Column Specifications
| Table | Column | Data Type | Constraints | Description |
|-------|--------|-----------|--------------|-------------|
| PROCESSED_EMAIL | id | Integer | Primary Key, Autoincrement | Unique identifier for each processed email record |
| PROCESSED_EMAIL | user_id | Integer | Foreign Key to User, Not Null | Reference to the user who owns this email |
| PROCESSED_EMAIL | folder_id | Integer | Foreign Key to Folder, Not Null | Reference to the folder this email belongs to |
| PROCESSED_EMAIL | email_uid | String(255) | Not Null | UID assigned to the email by the IMAP server (unique only within its folder) |
| PROCESSED_EMAIL | folder_name | String(255) | Not Null | Name of the IMAP folder (for redundancy) |
| PROCESSED_EMAIL | is_processed | Boolean | Default: False | Processing status (false=pending, true=processed) |
| PROCESSED_EMAIL | first_seen_at | DateTime | Default: datetime.utcnow | First time this email was detected during sync |
| PROCESSED_EMAIL | processed_at | DateTime | Nullable | When the email was marked as processed |
| PROCESSED_EMAIL | created_at | DateTime | Default: datetime.utcnow | Record creation timestamp |
| PROCESSED_EMAIL | updated_at | DateTime | Default: datetime.utcnow, On Update | Record update timestamp |
### Relationships
- **User to ProcessedEmail**: One-to-many relationship - each user can have multiple processed email records
- **Folder to ProcessedEmail**: One-to-many relationship - each folder can have multiple processed email records
- **Composite Unique Constraint**: The combination of (user_id, folder_name, email_uid) must be unique to prevent duplicate records
### Database Indexes
- Primary key index on `id`
- Foreign key indexes on `user_id` and `folder_id`
- Composite unique index on `(user_id, folder_name, email_uid)`
- Index on `folder_name` for faster folder-based queries
- Index on `is_processed` for filtering pending emails
- Index on `first_seen_at` for tracking recently added emails
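To make the column and index lists concrete, here is a sketch of the table expressed as SQLite DDL through Python's standard `sqlite3` module. The SQLite column types, the index names, and the helpers `create_schema` and `insert_pending` are assumptions for illustration; the foreign key references are omitted because the `USER` and `FOLDER` tables are not created here.

```python
import sqlite3

# SQLite rendering of the PROCESSED_EMAIL columns; types and index names are assumptions.
SCHEMA = """
CREATE TABLE processed_emails (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id       INTEGER NOT NULL,
    folder_id     INTEGER NOT NULL,
    email_uid     TEXT    NOT NULL,
    folder_name   TEXT    NOT NULL,
    is_processed  INTEGER NOT NULL DEFAULT 0,
    first_seen_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processed_at  TEXT,
    created_at    TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at    TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (user_id, folder_name, email_uid)
);
CREATE INDEX ix_processed_emails_user_id ON processed_emails (user_id);
CREATE INDEX ix_processed_emails_folder_id ON processed_emails (folder_id);
CREATE INDEX ix_processed_emails_folder_name ON processed_emails (folder_name);
CREATE INDEX ix_processed_emails_is_processed ON processed_emails (is_processed);
CREATE INDEX ix_processed_emails_first_seen_at ON processed_emails (first_seen_at);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the table and the indexes listed in the specification."""
    conn.executescript(SCHEMA)


def insert_pending(conn: sqlite3.Connection, user_id: int, folder_id: int,
                   folder_name: str, email_uid: str) -> bool:
    """Insert a pending record; return False if the composite key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed_emails (user_id, folder_id, folder_name, email_uid) "
                "VALUES (?, ?, ?, ?)",
                (user_id, folder_id, folder_name, email_uid),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The `UNIQUE` clause creates the composite unique index, so a second insert of the same (user_id, folder_name, email_uid) is rejected by the database rather than by application code.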
## Service Design
### ProcessedEmailsService
A new service class will be responsible for managing processed email records:
```python
from __future__ import annotations  # keeps the `User` annotation unevaluated in this stub

from typing import List


class ProcessedEmailsService:
    def __init__(self, user: User):  # `User` is the application's existing user model
        self.user = user

    def get_pending_emails(self, folder_name: str) -> List[str]:
        """Get list of email UIDs that are pending processing in a folder."""

    def mark_email_processed(self, folder_name: str, email_uid: str) -> bool:
        """Mark an email as processed."""

    def mark_emails_processed(self, folder_name: str, email_uids: List[str]) -> int:
        """Mark multiple emails as processed in bulk."""

    def sync_folder_emails(self, folder_name: str, email_uids: List[str]) -> int:
        """Sync email UIDs for a folder, adding new ones as pending."""

    def get_pending_count(self, folder_name: str) -> int:
        """Get count of pending emails for a folder."""

    def cleanup_old_records(self, folder_name: str, current_uids: List[str]) -> int:
        """Remove records for emails that no longer exist in the folder."""
```
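The interface leaves the meaning of its return values open. The sketch below is an in-memory stand-in, useful for example in unit tests, that fixes one plausible reading: the `int` results are counts of records added, updated, or removed, and the `bool` reports whether a pending record was actually changed. The class name `InMemoryProcessedEmails` and these semantics are assumptions, not part of the specification.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


class InMemoryProcessedEmails:
    """Hypothetical in-memory stand-in mirroring the service's method names."""

    def __init__(self, user_id: int):
        self.user_id = user_id
        # (folder_name, email_uid) -> processed_at, or None while the email is pending
        self._records: Dict[Tuple[str, str], Optional[datetime]] = {}

    def sync_folder_emails(self, folder_name: str, email_uids: List[str]) -> int:
        """Record unseen UIDs as pending; return how many were added."""
        added = 0
        for uid in email_uids:
            if (folder_name, uid) not in self._records:
                self._records[(folder_name, uid)] = None
                added += 1
        return added

    def get_pending_emails(self, folder_name: str) -> List[str]:
        """Return pending UIDs for the folder in first-seen order."""
        return [uid for (folder, uid), processed_at in self._records.items()
                if folder == folder_name and processed_at is None]

    def get_pending_count(self, folder_name: str) -> int:
        return len(self.get_pending_emails(folder_name))

    def mark_emails_processed(self, folder_name: str, email_uids: List[str]) -> int:
        """Mark known, still-pending UIDs as processed; return how many changed."""
        now = datetime.now(timezone.utc)
        updated = 0
        for uid in email_uids:
            key = (folder_name, uid)
            if key in self._records and self._records[key] is None:
                self._records[key] = now
                updated += 1
        return updated

    def mark_email_processed(self, folder_name: str, email_uid: str) -> bool:
        return self.mark_emails_processed(folder_name, [email_uid]) == 1

    def cleanup_old_records(self, folder_name: str, current_uids: List[str]) -> int:
        """Drop records whose UID is no longer in the folder; return how many were removed."""
        keep = set(current_uids)
        stale = [key for key in self._records
                 if key[0] == folder_name and key[1] not in keep]
        for key in stale:
            del self._records[key]
        return len(stale)
```

Under this reading, marking an already processed or unknown UID is a no-op that reports `False` (or contributes nothing to the count), which keeps repeated requests idempotent.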
### IMAPService Integration
The existing IMAP service will be enhanced to use the ProcessedEmailsService:
```python
from __future__ import annotations  # keeps the `User` annotation unevaluated in this stub

from typing import List, Tuple


class IMAPService:
    def __init__(self, user: User):  # `User` is the application's existing user model
        self.user = user
        self.config = user.imap_config or {}
        self.connection = None
        self.processed_emails_service = ProcessedEmailsService(user)  # defined above

    def get_folder_email_count(self, folder_name: str) -> int:
        """Get the count of emails in a specific folder, considering processed status."""

    def get_pending_emails(self, folder_name: str) -> List[str]:
        """Get email UIDs that are pending processing."""

    def sync_folders(self) -> Tuple[bool, str]:
        """Sync IMAP folders with local database, tracking email processing status."""
```
## API Endpoints
### New HTMX Endpoints for Processed Email Management
1. **Get Pending Emails for a Folder**
   - Method: GET
   - Path: `/api/folders/<folder_id>/pending-emails`
   - Response: A dialog listing metadata for each pending email (subject, date, UID), with a button to preview the email (fetched from the IMAP server)
2. **Mark Email as Processed**
   - Method: POST
   - Path: `/api/folders/<folder_id>/emails/<email_uid>/process`
   - Action: Mark a specific email as processed
   - Response: Updated dialog body
## Workflow Integration
### Email Processing Flow
```mermaid
sequenceDiagram
participant U as User
participant B as Browser
participant M as Main Blueprint
participant I as IMAP Service
participant P as ProcessedEmails Service
participant DB as Database
U->>B: Click "Sync Folders"
B->>M: POST /api/imap/sync
M->>I: Sync folders with processed email tracking
I->>I: Connect to IMAP server
I->>I: Get list of email UIDs for folder
I->>P: sync_folder_emails(folder_name, email_uids)
P->>DB: Create pending email records
P->>I: Return count of newly added pending emails
I->>M: Return sync results
M->>B: Update UI with pending counts
```
### Email Processing Status Update
```mermaid
sequenceDiagram
participant U as User
participant B as Browser
participant M as Main Blueprint
participant P as ProcessedEmails Service
participant DB as Database
U->>B: Trigger email processing
B->>M: POST /api/folders/<folder_id>/process-emails
M->>P: mark_emails_processed(folder_name, email_uids)
P->>DB: Update email processing status
P->>M: Return success count
M->>B: Update UI with new counts
```
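The "Update email processing status" step can be done as a single parameterized `UPDATE` rather than one statement per email. The sketch below uses the standard `sqlite3` module; the standalone function form, the reduced demo table built by `make_demo_db`, and the sample rows are assumptions for illustration.

```python
import sqlite3
from typing import List


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory table with only the columns this sketch touches."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE processed_emails ("
        "user_id INTEGER NOT NULL, folder_name TEXT NOT NULL, email_uid TEXT NOT NULL, "
        "is_processed INTEGER NOT NULL DEFAULT 0, processed_at TEXT, updated_at TEXT, "
        "UNIQUE (user_id, folder_name, email_uid))"
    )
    rows = [(1, "INBOX", "101"), (1, "INBOX", "102"), (1, "INBOX", "103"), (2, "INBOX", "101")]
    conn.executemany(
        "INSERT INTO processed_emails (user_id, folder_name, email_uid) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def mark_emails_processed(conn: sqlite3.Connection, user_id: int,
                          folder_name: str, email_uids: List[str]) -> int:
    """Flip pending rows to processed in one transaction; return the number of rows changed."""
    if not email_uids:
        return 0
    # One "?" per UID: values are bound as parameters, never interpolated into the SQL text.
    placeholders = ",".join("?" for _ in email_uids)
    sql = (
        "UPDATE processed_emails "
        "SET is_processed = 1, processed_at = CURRENT_TIMESTAMP, updated_at = CURRENT_TIMESTAMP "
        "WHERE user_id = ? AND folder_name = ? AND is_processed = 0 "
        f"AND email_uid IN ({placeholders})"
    )
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(sql, [user_id, folder_name, *email_uids])
    return cursor.rowcount
```

Filtering on `user_id` in the `WHERE` clause also scopes the update to the requesting user's records. SQLite limits the number of bound parameters per statement, so very long UID lists would need to be split into several calls.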
## Migration Strategy
### Phase 1: Data Model Implementation
1. Create the `processed_emails` table with appropriate indexes
2. Implement the `ProcessedEmailsService` class
3. Add basic CRUD operations for email processing records
### Phase 2: IMAP Service Integration
1. Update `IMAPService` to use `ProcessedEmailsService`
2. Modify folder synchronization to track email UIDs
3. Update email count methods to consider processing status
### Phase 3: API and UI Integration
1. Add API endpoints for processed email management
2. Update UI to display accurate pending counts
3. Add bulk processing capabilities
### Phase 4: Optimization and Cleanup
1. Implement batch processing for performance
2. Add periodic cleanup of orphaned records
3. Optimize database queries for large datasets
## Security Considerations
1. **Access Control**: Ensure users can only access their own email processing records
2. **Data Validation**: Validate all email UIDs and folder names to prevent injection attacks
3. **Rate Limiting**: Implement rate limiting for email processing endpoints to prevent abuse
4. **Data Privacy**: Ensure no sensitive email content is stored in the database
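For the data validation item, a sketch of input checks that could run before any database or IMAP call. IMAP defines a UID as a non-zero unsigned 32-bit number (RFC 3501), so UID path parameters can be validated strictly. The folder name rule (non-empty, at most 255 characters to match the column width, no control characters) and both function names are assumptions. Parameterized queries remain the primary defense against SQL injection.

```python
import re

MAX_IMAP_UID = 2**32 - 1  # IMAP UIDs are non-zero unsigned 32-bit numbers (RFC 3501)
_UID_RE = re.compile(r"[1-9][0-9]{0,9}")  # ASCII digits only, no leading zero


def parse_email_uid(raw: str) -> int:
    """Return the UID as an int, or raise ValueError if it is not a valid IMAP UID."""
    if not _UID_RE.fullmatch(raw):
        raise ValueError(f"invalid email UID: {raw!r}")
    uid = int(raw)
    if uid > MAX_IMAP_UID:
        raise ValueError(f"email UID out of range: {raw!r}")
    return uid


def validate_folder_name(name: str, max_length: int = 255) -> str:
    """Reject empty, over-long, or control-character folder names (assumed policy)."""
    if not name or len(name) > max_length:
        raise ValueError("folder name must be between 1 and 255 characters")
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in name):
        raise ValueError("folder name must not contain control characters")
    return name
```

Rejecting control characters matters in particular for values that are later sent to the IMAP server, where a stray CR or LF could terminate a command early.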
## Performance Considerations
1. **Database Indexing**: Index the columns used to filter and look up records, such as `user_id`, `folder_name`, and `is_processed`
2. **Batch Operations**: Use batch operations for processing multiple emails
3. **Memory Management**: Process emails in batches to avoid memory issues with large mailboxes
4. **Caching**: Consider caching frequently accessed email processing status
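The batching items can be supported by a small helper that splits a UID list into fixed-size chunks, so that each database call or IMAP fetch handles a bounded number of emails. This is a minimal sketch; the name `chunked` and the choice of batch size are assumptions.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

A caller could then loop with `for batch in chunked(email_uids, 500)` and pass each `batch` to `mark_emails_processed`, keeping memory use and statement size bounded for large mailboxes.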
## Future Enhancements
1. **Email Movement Tracking**: Track when emails are moved between folders
2. **Processing History**: Maintain a history of email processing actions
3. **Email Deduplication**: Handle duplicate emails across folders
4. **Automated Cleanup**: Periodic cleanup of old or orphaned processing records
5. **Analytics**: Provide insights into email processing patterns and efficiency