Phase 4: PDF Ingest - Research

Researched: 2026-03-19
Domain: PDF file ingestion, SkySlope/URE forms API investigation, PDF rendering, file storage
Confidence: MEDIUM

Summary

Phase 4 adds document management to the agent portal: pulling PDF forms from the SkySlope/URE forms library, copying them into a client-specific storage folder, and rendering them in the browser. The critical unknown going into this phase was whether the utahrealestate.com vendor API exposes a forms library. Research confirms the URE Web API is a RESO OData v4 MLS listing data API — it does not expose forms/PDF downloads. The SkySlope forms integration is a member-facing web application (skyslope.utahrealestate.com), not a programmatic API accessible to third-party apps. There is no public SkySlope API for fetching form PDFs.

The practical conclusion: forms must be seeded manually. An agent (or developer) downloads the relevant forms from the SkySlope member portal and places them in a seed directory. The application reads that seed directory to populate the forms library. A cron job or manual re-seed script handles the monthly sync requirement (DOC-02). This approach is explicitly called out in REQUIREMENTS.md as the fallback if the vendor API does not expose forms — and it does not.

For PDF rendering, PDF.js (mozilla/pdf.js) is the correct choice. It renders PDFs natively in the browser with no iframe/plugin quirks, and — critically — it provides a canvas-based layer system that Phase 5 field overlay work will build on. Using a raw <iframe> or <embed> would make Phase 5 field placement impossible.

Primary recommendation: Seed forms from a static directory of manually downloaded PDFs; use react-pdf (which wraps PDF.js) for browser rendering; store per-client document copies under uploads/clients/{clientId}/.

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

  • Forms library source: Primary source is SkySlope / URE Legacy Forms Library from utahrealestate.com (MLS Forms + URE Legacy Forms Library). Research needed on public API; scraping is a route if no API exists. Forms library syncs at least monthly. File picker upload is a backup for custom/non-standard forms.
  • Upload entry point: Agent uploads from client profile page via "Add Document" button. Modal shows forms library list with search. File picker option also in modal.
  • Document naming: Modal pre-fills name from template name. Agent can edit. Name only — no extra metadata. Status auto-sets to Draft. Multiple instances of same template allowed.
  • File storage: When agent adds a template to a client, a copy is saved to uploads/clients/{clientId}/. Soft delete: record hidden from UI, file kept on disk.
  • PDF viewer: Minimal chrome. PDF fills the page. Controls: page nav, zoom, download. Back link to client profile. Render method: Claude's discretion (researcher evaluate PDF.js vs browser embed — PDF.js preferred if it sets up Phase 5 field overlay cleanly).
  • Post-upload flow: Stay on client profile after upload. New document appears in documents list. Progress indicator inside modal while saving.

Claude's Discretion

  • PDF rendering library choice (PDF.js vs iframe/embed)
  • Exact storage path conventions within uploads/clients/{id}/
  • Error handling for failed uploads or missing templates
  • Forms library sync mechanism implementation details

Deferred Ideas (OUT OF SCOPE)

  • None: discussion stayed within phase scope

</user_constraints>

<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| DOC-01 | Agent can browse and import PDF forms from the utahrealestate.com vendor API (vendor.utahrealestate.com/webapi). Investigate API capability; fall back to manual upload if forms API is not available. | URE Web API is MLS listing data only (RESO OData v4). No forms/PDF endpoint exists. Fallback to manual seed directory is the correct path. |
| DOC-02 | Forms library syncs automatically on at least a monthly basis to reflect new/updated forms. | Implement a seed script (scripts/seed-forms.ts) that reads a local seeds/forms/ directory. "Sync" = re-run seed script after agent manually downloads updated PDFs from SkySlope portal. Can be triggered via npm script or cron. |
| DOC-03 | Agent can view an imported PDF document in the browser. | Use react-pdf (PDF.js wrapper). Renders pages on canvas, which the Phase 5 field overlay requires. Supports page navigation and zoom. |
</phase_requirements>

Standard Stack

Core

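| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| react-pdf | ^9.x | PDF rendering in browser | Wraps PDF.js; canvas-based rendering enables Phase 5 field overlays; maintained by wojtekmaj; widely used in React ecosystem |
| multer | ^1.4.x | Multipart form handling for file picker uploads | De facto standard Express-style middleware for multipart uploads; it expects Node req/res objects, so it fits Pages Router API routes and needs an adapter (or a formData() alternative) in App Router route handlers |
| uuid | ^11.x | Generate unique filenames for stored PDFs | Prevents collisions; already likely in project from Phase 3 |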

Supporting

| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| pdfjs-dist | ^4.x | PDF.js core (dependency of react-pdf) | Installed automatically with react-pdf |
| node:fs/promises | built-in | Read seed directory, copy files to client folders | Server-side file operations without extra deps |
| node:path | built-in | Safe path construction for uploads dir | Helps prevent path traversal bugs when combined with a base-directory check |
| node:crypto | built-in | Generate unique doc IDs if uuid not present | Fallback only |

Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| react-pdf | <iframe src="..."> or <embed> | iframe/embed works for DOC-03 in isolation but makes Phase 5 field placement on canvas impossible; ruled out |
| react-pdf | @react-pdf/renderer | @react-pdf/renderer is for GENERATING PDFs, not VIEWING them; wrong tool |
| multer | Next.js formData() built-in | Built-in formData() buffers the whole upload, which is fine for small files; multer adds streaming and size limits but targets Express-style req/res, so it suits Pages Router API routes rather than App Router handlers |
| manual seed dir | Scraping SkySlope portal | Scraping likely violates ToS (listed in the REQUIREMENTS.md Out of Scope table); no public API exists |

Installation:

npm install react-pdf multer uuid
npm install --save-dev @types/multer

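Note: react-pdf v9 pulls in pdfjs-dist as its own dependency, so no separate install is needed. If pdfjs-dist is also listed in package.json, keep its version identical to the one react-pdf resolves; otherwise the API and worker versions can mismatch at runtime. Follow react-pdf docs for worker configuration in Next.js.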

Architecture Patterns

src/
├── app/
│   ├── portal/
│   │   ├── clients/[id]/
│   │   │   └── page.tsx              # Already exists (Phase 3) — add "Add Document" button
│   │   └── documents/[docId]/
│   │       └── page.tsx              # NEW: PDF viewer page
│   └── api/
│       ├── documents/
│       │   ├── route.ts              # POST: create document record (from library or file picker)
│       │   └── [id]/
│       │       ├── route.ts          # DELETE: soft delete
│       │       └── file/
│       │           └── route.ts      # GET: serve PDF file (authenticated)
│       └── forms-library/
│           └── route.ts              # GET: list available seed forms
seeds/
└── forms/                            # Manually downloaded PDFs from SkySlope portal
    ├── purchase-agreement.pdf
    ├── listing-agreement.pdf
    └── ...
uploads/
└── clients/
    └── {clientId}/
        └── {uuid}.pdf               # Per-client document copies
scripts/
└── seed-forms.ts                    # Reads seeds/forms/, upserts into form_templates table

Pattern 1: Forms Library Seed Table

What: A form_templates table in the database holds metadata (name, filename) for each seeded PDF. The seeds/forms/ directory holds the actual PDF files. The seed script syncs the directory to the table.

When to use: All forms library browsing queries hit the DB (fast, searchable). File reads only happen when a form is copied to a client folder.

Example:

// scripts/seed-forms.ts
import { readdir } from 'node:fs/promises';
import path from 'node:path';
import { db } from '@/db';
import { formTemplates } from '@/db/schema';

const SEEDS_DIR = path.join(process.cwd(), 'seeds/forms');

async function seedForms() {
  const files = await readdir(SEEDS_DIR);
  // Match the extension case-insensitively so "Form.PDF" is not skipped.
  const pdfs = files.filter(f => f.toLowerCase().endsWith('.pdf'));

  for (const filename of pdfs) {
    // Strip only the trailing extension, then turn hyphens into spaces.
    const name = path.parse(filename).name.replace(/-/g, ' ');
    await db.insert(formTemplates)
      .values({ name, filename })
      .onConflictDoUpdate({
        target: formTemplates.filename,
        set: { name, updatedAt: new Date() }
      });
  }
  console.log(`Seeded ${pdfs.length} forms`);
}

// Exit non-zero on failure so a cron job or npm script can detect it.
seedForms().catch(err => {
  console.error('Seeding failed:', err);
  process.exit(1);
});

Pattern 2: Copy-on-Add (Template → Client Document)

What: When agent adds a form to a client, copy the seed PDF to uploads/clients/{clientId}/{uuid}.pdf. Insert a documents record pointing to the copy. Phase 5 writes field data against the copy only, never the template.

When to use: Every time agent clicks "Add Document" from the library.

Example:

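// src/app/api/documents/route.ts (POST handler)
import { copyFile, mkdir } from 'node:fs/promises';
import path from 'node:path';
import { eq } from 'drizzle-orm';
import { v4 as uuidv4 } from 'uuid';
import { auth } from '@/auth';
import { db } from '@/db';
import { documents, formTemplates } from '@/db/schema';

export async function POST(req: Request) {
  // Same session check as the other API routes in this document.
  const session = await auth();
  if (!session) return new Response('Unauthorized', { status: 401 });

  // NOTE: clientId comes from the request body; confirm it matches an
  // existing client before using it to build a filesystem path.
  const { clientId, formTemplateId, name } = await req.json();

  const template = await db.query.formTemplates.findFirst({
    where: eq(formTemplates.id, formTemplateId)
  });
  // Without this guard, template.filename below throws on an unknown ID.
  if (!template) return new Response('Template not found', { status: 404 });

  const docId = uuidv4();
  // String() keeps path.join from throwing if clientId arrives as a number.
  const destDir = path.join(process.cwd(), 'uploads/clients', String(clientId));
  const destPath = path.join(destDir, `${docId}.pdf`);
  const srcPath = path.join(process.cwd(), 'seeds/forms', template.filename);

  await mkdir(destDir, { recursive: true });
  await copyFile(srcPath, destPath);

  const [doc] = await db.insert(documents).values({
    id: docId,
    clientId,
    formTemplateId,
    name,
    // Relative to uploads/, so the record survives a change of mount point.
    filePath: `clients/${clientId}/${docId}.pdf`,
    status: 'draft'
  }).returning();

  return Response.json(doc, { status: 201 });
}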
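
Because clientId ends up inside a filesystem path, it may be worth building the stored path in one place and rejecting anything that is not a plain identifier. The sketch below is an illustration only: buildClientDocPath and the SAFE_SEGMENT pattern are hypothetical names, and the assumption that client and document IDs contain only letters, digits, "_" and "-" would need to be checked against the Phase 3 schema.

```typescript
import * as path from 'node:path';

// Hypothetical helper, not part of the original handler.
// Assumption: IDs consist only of letters, digits, "_" and "-".
const SAFE_SEGMENT = /^[A-Za-z0-9_-]+$/;

export function buildClientDocPath(clientId: string, docId: string): string {
  for (const segment of [clientId, docId]) {
    if (!SAFE_SEGMENT.test(segment)) {
      throw new Error(`Unsafe path segment: ${JSON.stringify(segment)}`);
    }
  }
  // POSIX separators keep the stored value identical on every platform.
  return path.posix.join('clients', clientId, `${docId}.pdf`);
}
```

The returned value is the relative path that would go into the filePath column; the absolute destination is then that value joined onto the uploads base directory.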

Pattern 3: Authenticated PDF Serving

What: PDFs are served via an API route that checks auth before streaming the file. Never expose the uploads/ directory as a static asset.

When to use: The PDF viewer page fetches the PDF via /api/documents/{id}/file — an authenticated GET route.

Example:

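// src/app/api/documents/[id]/file/route.ts
import { readFile } from 'node:fs/promises';
import path from 'node:path';
import { eq } from 'drizzle-orm';
import { auth } from '@/auth';
import { db } from '@/db';
import { documents } from '@/db/schema';

const UPLOADS_BASE = path.join(process.cwd(), 'uploads');

// Note: in Next.js 15+, params is a Promise and must be awaited before use.
export async function GET(req: Request, { params }: { params: { id: string } }) {
  const session = await auth();
  if (!session) return new Response('Unauthorized', { status: 401 });

  const doc = await db.query.documents.findFirst({
    where: eq(documents.id, params.id)
  });
  if (!doc) return new Response('Not found', { status: 404 });

  const filePath = path.join(UPLOADS_BASE, doc.filePath);
  // Refuse any stored path that resolves outside uploads/ (see Pitfall 2).
  if (!filePath.startsWith(UPLOADS_BASE + path.sep)) {
    return new Response('Forbidden', { status: 403 });
  }
  const buffer = await readFile(filePath);

  return new Response(buffer, {
    headers: { 'Content-Type': 'application/pdf' }
  });
}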

Pattern 4: react-pdf Viewer Component

What: Client component using react-pdf to render PDF pages on canvas. Supports page navigation and zoom.

When to use: Document detail page (/portal/documents/[docId]).

Example:

// src/app/portal/documents/[docId]/_components/PdfViewer.tsx
'use client';
import { useState } from 'react';
import { Document, Page, pdfjs } from 'react-pdf';
import 'react-pdf/dist/Page/AnnotationLayer.css';
import 'react-pdf/dist/Page/TextLayer.css';

// Worker setup — required for Next.js
pdfjs.GlobalWorkerOptions.workerSrc = new URL(
  'pdfjs-dist/build/pdf.worker.min.mjs',
  import.meta.url
).toString();

export function PdfViewer({ docId }: { docId: string }) {
  const [numPages, setNumPages] = useState<number>(0);
  const [pageNumber, setPageNumber] = useState(1);
  const [scale, setScale] = useState(1.0);

  return (
    <div>
      <Document
        file={`/api/documents/${docId}/file`}
        onLoadSuccess={({ numPages }) => setNumPages(numPages)}
      >
        <Page pageNumber={pageNumber} scale={scale} />
      </Document>
      <div>
        <button onClick={() => setPageNumber(p => Math.max(1, p - 1))}>Prev</button>
        <span>{pageNumber} / {numPages}</span>
        <button onClick={() => setPageNumber(p => Math.min(numPages, p + 1))}>Next</button>
        <button onClick={() => setScale(s => s + 0.2)}>Zoom In</button>
        <button onClick={() => setScale(s => Math.max(0.4, s - 0.2))}>Zoom Out</button>
      </div>
    </div>
  );
}
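
One edge case in the controls above: until the document has loaded, numPages is 0, so the Next button clamps the page number down to 0, and Zoom In has no upper bound. A small pair of pure helpers could keep both values in range. This is a sketch under assumptions: clampPage, stepScale and the bounds MIN_SCALE, MAX_SCALE and SCALE_STEP are hypothetical names and values, not part of react-pdf.

```typescript
// Hypothetical helpers; the bounds below are illustrative assumptions.
export const MIN_SCALE = 0.4;
export const MAX_SCALE = 3.0;
export const SCALE_STEP = 0.2;

// Keep the page number within 1..numPages; fall back to 1 before load.
export function clampPage(page: number, numPages: number): number {
  if (numPages < 1) return 1;
  return Math.min(numPages, Math.max(1, Math.trunc(page)));
}

// Move the zoom level one step in or out, staying inside the bounds.
export function stepScale(scale: number, direction: 1 | -1): number {
  const next = scale + direction * SCALE_STEP;
  const clamped = Math.min(MAX_SCALE, Math.max(MIN_SCALE, next));
  // Round to two decimals so repeated steps do not accumulate float drift.
  return Math.round(clamped * 100) / 100;
}
```

In the component these would replace the inline Math.min/Math.max calls, for example setPageNumber(p => clampPage(p + 1, numPages)) and setScale(s => stepScale(s, 1)).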

Pattern 5: Add Document Modal

What: Client component modal. Fetches form templates list from /api/forms-library. Filters client-side by name as agent types. Has separate "Browse files" button for custom PDF upload.

When to use: "Add Document" button on client profile page.
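
The client-side search can stay a pure function over the list returned by /api/forms-library, which keeps it easy to test apart from the modal. The following is a minimal sketch, assuming the response items have the { id, name } shape used by the forms library route in this document; filterTemplates and FormTemplateSummary are hypothetical names.

```typescript
// Assumed shape of each item returned by GET /api/forms-library.
export interface FormTemplateSummary {
  id: number;
  name: string;
}

// Hypothetical helper: case-insensitive search in which every
// whitespace-separated term must appear somewhere in the template name.
export function filterTemplates(
  templates: FormTemplateSummary[],
  query: string
): FormTemplateSummary[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return templates; // empty query shows everything
  return templates.filter(t => {
    const name = t.name.toLowerCase();
    return terms.every(term => name.includes(term));
  });
}
```

Matching on every term rather than the raw string lets an agent type "agreement purchase" and still find "purchase agreement"; a plain substring match would be a simpler alternative if that flexibility is not wanted.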

Schema Additions

// Addition to src/db/schema.ts

export const formTemplates = pgTable('form_templates', {
  id: serial('id').primaryKey(),
  name: text('name').notNull(),
  filename: text('filename').notNull().unique(),  // filename in seeds/forms/
  createdAt: timestamp('created_at').defaultNow(),
  updatedAt: timestamp('updated_at').defaultNow(),
});

// documents table already partially defined in Phase 3 (documentStatusEnum exists)
// Add columns:
//   formTemplateId: integer('form_template_id').references(() => formTemplates.id)
//   filePath: text('file_path').notNull()  // relative path within uploads/
//   name: text('name').notNull()

Anti-Patterns to Avoid

  • Serving uploads as static assets: Never put uploads/ under public/. All PDF serving must go through authenticated API routes.
  • Storing absolute paths in DB: Store relative paths only (e.g., clients/{id}/{uuid}.pdf). Absolute paths break when the Docker volume mount changes.
  • Mutating seed templates: The seeds/forms/ directory is read-only source of truth. Never write or delete files there programmatically.
  • Using <iframe src="/api/documents/id/file"> for the viewer: Works for DOC-03 but blocks Phase 5 canvas overlay. Use react-pdf.
  • Loading the PDF worker from CDN: In local/Docker environment with no internet guarantee, configure the worker from node_modules via import.meta.url.

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| PDF rendering in browser | Custom canvas renderer | react-pdf (PDF.js) | PDF rendering handles fonts, embedded images, encodings, cross-browser canvas quirks: thousands of edge cases |
| Multipart file upload parsing | Custom body parser | multer or Next.js built-in formData | Boundary parsing, temp file buffering, size limits: complex protocol details |
| Unique filename generation | Date.now() + Math.random() | uuid v4 | UUID v4 provides genuine collision resistance; timestamp-based names can collide under concurrent requests |
| Path traversal prevention | String sanitization | path.join() + allowlist check | path.join normalizes ../ sequences; add explicit check that resolved path starts with allowed base dir |

Key insight: PDF rendering looks simple (just show the bytes) but requires a complete PDF interpreter for production use. PDF.js is the de facto standard open-source PDF renderer for the browser.

Common Pitfalls

Pitfall 1: react-pdf Worker Not Configured for Next.js

What goes wrong: pdfjs.GlobalWorkerOptions.workerSrc is not set, or set to a CDN URL. PDF renders blank or throws "Setting up fake worker" warning. In Next.js with webpack, the worker import requires special handling.

Why it happens: PDF.js uses a Web Worker for parsing. Next.js webpack doesn't automatically handle .worker.mjs imports.

How to avoid: Set workerSrc using new URL('pdfjs-dist/build/pdf.worker.min.mjs', import.meta.url).toString() at module scope, in the same file that renders the Document and Page components (as in Pattern 4). This is the pattern react-pdf documents for bundlers. Do NOT use a cdnjs.cloudflare.com URL in a local/offline environment.

Warning signs: Console shows "Setting up fake worker" or PDF page renders blank.

Pitfall 2: Path Traversal in File Serving Route

What goes wrong: doc.filePath from the DB contains ../../etc/passwd. The file serving route reads it and returns sensitive data.

Why it happens: DB record was tampered with, or filePath validation was skipped.

How to avoid: After building the absolute path with path.join(), assert that it starts with the expected uploads/ base directory before reading:

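const UPLOADS_BASE = path.join(process.cwd(), 'uploads');
// Compare against the base plus a trailing separator so that a sibling
// directory such as "uploads-old" cannot pass the prefix check.
if (!filePath.startsWith(UPLOADS_BASE + path.sep)) {
  return new Response('Forbidden', { status: 403 });
}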

Warning signs: Any .. sequence appearing in a file path.
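
The base-directory check can also be wrapped in a reusable helper built on path.relative, which avoids a subtle gap in plain prefix matching: a sibling directory whose name merely begins with "uploads" would pass a bare startsWith test. This is a sketch only; resolveInsideBase is a hypothetical name and the null return is an assumed convention for "reject this path".

```typescript
import * as path from 'node:path';

// Hypothetical helper: resolve a stored relative path against a base
// directory and return null if the result would fall outside that base.
export function resolveInsideBase(baseDir: string, relativePath: string): string | null {
  const base = path.resolve(baseDir);
  const target = path.resolve(base, relativePath);
  const rel = path.relative(base, target);
  const escapes =
    rel === '' ||                        // target is the base dir itself
    rel === '..' ||
    rel.startsWith('..' + path.sep) ||   // target climbs above the base
    path.isAbsolute(rel);                // e.g. a different drive on Windows
  return escapes ? null : target;
}
```

A serving route could call resolveInsideBase(UPLOADS_BASE, doc.filePath) and respond with 403 when it returns null.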

Pitfall 3: uploads/ Excluded from Docker Volume

What goes wrong: Files are saved to uploads/ successfully in development but disappear when the container is recreated, because uploads/ is not mounted as a Docker volume.

Why it happens: A container's writable layer survives a plain restart but is discarded whenever the container is removed and recreated (for example docker compose down followed by up, or a redeploy after an image rebuild). Anything written outside a mounted volume is lost at that point.

How to avoid: Ensure docker-compose.yml mounts uploads as a volume, either a bind mount (./uploads:/app/uploads) or a named volume (uploads:/app/uploads). Validate survival of uploads across container recreation as an explicit acceptance criterion.

Warning signs: Uploaded files exist in dev but disappear after docker compose down && docker compose up, or after a redeploy.

Pitfall 4: react-pdf CSS Not Imported

What goes wrong: PDF renders but text is missing or annotation links are invisible.

Why it happens: react-pdf requires explicit CSS imports for the annotation and text layers.

How to avoid: Import both CSS files in the viewer component:

import 'react-pdf/dist/Page/AnnotationLayer.css';
import 'react-pdf/dist/Page/TextLayer.css';

Pitfall 5: File Size Limits on API Route

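What goes wrong: Agent uploads a large PDF via the file picker (some real estate contracts are 10-15MB) and the request is rejected because it exceeds a body size limit.

Why it happens: In the Pages Router, API routes parse the body with a default bodyParser sizeLimit of 1MB (the 4MB figure often quoted is the response size limit). App Router route handlers do not use that bodyParser setting; any limit there comes from the handler's own checks or from a reverse proxy in front of the app.

How to avoid: In a Pages Router API route, export config to raise the limit, or disable body parsing and let multer stream the upload:

export const config = { api: { bodyParser: { sizeLimit: '20mb' } } };

In an App Router route handler this config export has no effect. Read the upload with await req.formData() and enforce the size limit explicitly; multer expects Express-style req/res objects and does not plug into route handlers directly.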
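
For the App Router path, the size limit has to be enforced in application code once the file has been read from formData(). The sketch below shows one way that check might look. validatePdfUpload, UploadCheck and MAX_PDF_BYTES are hypothetical names, the 20 MB ceiling simply mirrors the sizeLimit value used above, and requiring the "%PDF-" marker at byte 0 is a deliberately strict assumption (some viewers tolerate leading junk).

```typescript
// Hypothetical limit, mirroring the '20mb' value used for the Pages Router.
export const MAX_PDF_BYTES = 20 * 1024 * 1024;

export type UploadCheck =
  | { ok: true }
  | { ok: false; status: number; message: string };

// ASCII bytes for "%PDF-", the marker that opens a PDF file.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Validate raw upload bytes: non-empty, within the size limit, PDF header.
export function validatePdfUpload(
  bytes: Uint8Array,
  maxBytes: number = MAX_PDF_BYTES
): UploadCheck {
  if (bytes.length === 0) {
    return { ok: false, status: 400, message: 'Empty file' };
  }
  if (bytes.length > maxBytes) {
    return { ok: false, status: 413, message: 'File too large' };
  }
  const hasMagic = PDF_MAGIC.every((b, i) => bytes[i] === b);
  if (!hasMagic) {
    return { ok: false, status: 415, message: 'Not a PDF' };
  }
  return { ok: true };
}
```

In a route handler this might be called as validatePdfUpload(new Uint8Array(await file.arrayBuffer())) after confirming that formData().get('file') returned a File. Checking file.size against the limit before calling arrayBuffer() avoids buffering an oversized upload in memory.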

Pitfall 6: Missing seeds/forms/ Directory in Docker Build

What goes wrong: Container starts but forms library is empty because seeds/forms/ was not included in the Docker image or volume mount.

Why it happens: .dockerignore or volume mounts exclude the seeds directory.

How to avoid: Either include seeds/forms/ in the Docker image (acceptable since these are non-sensitive read-only PDFs), or mount it as a read-only volume. Document the required setup step in a project README.

Code Examples

Seed Script Run Command

# package.json scripts
"seed:forms": "npx tsx scripts/seed-forms.ts"
# Monthly re-sync: download new PDFs to seeds/forms/, re-run seed:forms
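
The seed script as written only upserts, so a form that was removed from seeds/forms/ stays in the form_templates table. Before each monthly re-sync it may help to report what changed. The following sketch compares the filenames already in the table with those on disk; diffSeedDirectory and SeedDiff are hypothetical names, and how "missing" templates should be handled (hide, flag, or delete) is left open because the source does not specify it.

```typescript
export interface SeedDiff {
  added: string[];     // on disk, not yet in form_templates
  missing: string[];   // in form_templates, no longer on disk
  unchanged: string[]; // present in both
}

// Hypothetical helper: compare DB filenames with the seed directory listing.
export function diffSeedDirectory(
  dbFilenames: string[],
  diskFilenames: string[]
): SeedDiff {
  const inDb = new Set(dbFilenames);
  // Only PDFs count as seed forms; ignore stray files such as .DS_Store.
  const onDisk = new Set(
    diskFilenames.filter(f => f.toLowerCase().endsWith('.pdf'))
  );
  return {
    added: Array.from(onDisk).filter(f => !inDb.has(f)).sort(),
    missing: Array.from(inDb).filter(f => !onDisk.has(f)).sort(),
    unchanged: Array.from(onDisk).filter(f => inDb.has(f)).sort(),
  };
}
```

A re-sync run could log the added and missing lists so the agent can confirm the monthly download was complete before the table is updated.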

Forms Library API Route

// src/app/api/forms-library/route.ts
import { auth } from '@/auth';
import { db } from '@/db';
import { formTemplates } from '@/db/schema';

export async function GET() {
  const session = await auth();
  if (!session) return new Response('Unauthorized', { status: 401 });

  const forms = await db.select({
    id: formTemplates.id,
    name: formTemplates.name,
  }).from(formTemplates).orderBy(formTemplates.name);

  return Response.json(forms);
}

Drizzle Migration for New Tables

npx drizzle-kit generate  # generates new migration
npx drizzle-kit migrate   # applies migration

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| pdf.js manual integration | react-pdf wrapper | ~2018 | react-pdf handles React lifecycle and canvas management; the worker source still has to be configured by the app (see Pitfall 1) |
| URE RETS data feed | RESO Web API (OData v4) | 2020 | Listing data only; no forms endpoint is documented publicly |
| SkySlope separate product | SkySlope integrated into URE member portal | ~2022 | Forms are now SkySlope-managed; no independent URE forms API |

Deprecated/outdated:

  • RETS protocol: URE retired RETS in 2020; all data feeds are now served via the RESO Web API
  • @react-pdf/renderer: This is a PDF generator not a viewer — completely different use case, do not confuse

Open Questions

  1. Does the vendor.utahrealestate.com/webapi expose any forms endpoints?

    • What we know: The API is RESO-certified OData v4. Documented endpoints cover MLS listing resources (Property, Member, Office, Media, OpenHouse). No forms/documents endpoint is documented publicly.
    • What's unclear: Whether a vendor account with elevated permissions might expose additional endpoints. The SkySlope integration is member-facing only.
    • Recommendation: Proceed with manual seed directory approach (DOC-01 fallback explicitly stated in REQUIREMENTS.md). If Teressa has vendor API credentials, check the /api/$metadata endpoint for any Form or Document resource — but don't block Phase 4 on this.
  2. react-pdf v9 Next.js App Router compatibility

    • What we know: react-pdf v9 ships as ESM. Next.js App Router has improved ESM support but some configurations require transpilePackages.
    • What's unclear: Whether react-pdf needs to be added to next.config.js transpilePackages array for this project's Next.js version.
    • Recommendation: After installing, if build errors appear referencing react-pdf ESM, add transpilePackages: ['react-pdf', 'pdfjs-dist'] to next.config.js.
  3. How many seed forms are needed initially?

    • What we know: Teressa uses forms from the SkySlope/URE portal. The exact list is not known.
    • What's unclear: Whether 5 forms or 50+ forms are typical for a Utah real estate agent's workflow.
    • Recommendation: Start with the most common forms (Purchase Agreement, Listing Agreement, Buyer Representation Agreement, Counter Offer, Addendum). Agent can expand the seed directory as needed.
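
For the $metadata check suggested under question 1, an OData v4 service describes its resources as EntityType elements in a CSDL XML document, so a quick scan for resource names that look forms-related is possible once the XML has been fetched with vendor credentials. The sketch below is illustrative only: findCandidateEntityTypes is a hypothetical helper, the regex scan is a shortcut rather than a real XML parse, and the default pattern is loose (it also matches names such as "Information"), so any hits need manual review.

```typescript
// Hypothetical helper: list EntityType names in an OData $metadata (CSDL XML)
// document whose names match a pattern. Regex-based, not a full XML parser.
export function findCandidateEntityTypes(
  metadataXml: string,
  pattern: RegExp = /form|document/i // avoid the "g" flag: test() would be stateful
): string[] {
  const names: string[] = [];
  const entityType = /<EntityType\b[^>]*\bName="([^"]+)"/g;
  let match: RegExpExecArray | null;
  while ((match = entityType.exec(metadataXml)) !== null) {
    const name = match[1];
    if (pattern.test(name)) names.push(name);
  }
  return names;
}
```

An empty result against the real metadata document would support the finding that no forms resource is exposed; a non-empty one would be a reason to revisit DOC-01 without blocking Phase 4.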

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

  • SkySlope press release: URE partnership — Confirms SkySlope Forms is the forms product for URE members; member benefit, not API
  • react-pdf documentation and GitHub (wojtekmaj/react-pdf) — Canvas-based PDF rendering pattern, worker configuration for bundlers
  • REQUIREMENTS.md Out of Scope table — Explicitly lists "utahrealestate.com forms scraping (automated credential login) — Likely violates ToS; use vendor API or manual upload instead"

Tertiary (LOW confidence)

  • WebSearch results on URE GitHub organization — No forms-related repositories found; confirms no open-source forms API

Metadata

Confidence breakdown:

  • Standard stack (react-pdf, multer, uuid): MEDIUM — react-pdf is well-established but Next.js App Router ESM compatibility needs verification at install time
  • Architecture (seed dir, copy-on-add, authenticated serving): HIGH — standard file management patterns, no novel approaches
  • SkySlope/URE API finding (no forms API): MEDIUM — based on public documentation; a vendor account might reveal more endpoints, but public evidence strongly indicates forms are member-portal-only
  • Pitfalls: HIGH — react-pdf worker config, path traversal, Docker volume persistence are well-documented real-world issues

Research date: 2026-03-19
Valid until: 2026-04-19 (stable domain; re-check react-pdf major version if >30 days)