Phase 4: PDF Ingest - Research
Researched: 2026-03-19
Domain: PDF file ingestion, SkySlope/URE forms API investigation, PDF rendering, file storage
Confidence: MEDIUM
Summary
Phase 4 adds document management to the agent portal: pulling PDF forms from the SkySlope/URE forms library, copying them into a client-specific storage folder, and rendering them in the browser. The critical unknown going into this phase was whether the utahrealestate.com vendor API exposes a forms library. Research confirms the URE Web API is a RESO OData v4 MLS listing data API — it does not expose forms/PDF downloads. The SkySlope forms integration is a member-facing web application (skyslope.utahrealestate.com), not a programmatic API accessible to third-party apps. There is no public SkySlope API for fetching form PDFs.
The practical conclusion: forms must be seeded manually. An agent (or developer) downloads the relevant forms from the SkySlope member portal and places them in a seed directory. The application reads that seed directory to populate the forms library. A cron job or manual re-seed script handles the monthly sync requirement (DOC-02). This approach is explicitly called out in REQUIREMENTS.md as the fallback if the vendor API does not expose forms — and it does not.
For PDF rendering, PDF.js (mozilla/pdf.js) is the correct choice. It renders PDFs natively in the browser with no iframe/plugin quirks, and — critically — it provides a canvas-based layer system that Phase 5 field overlay work will build on. Using a raw <iframe> or <embed> would make Phase 5 field placement impossible.
Primary recommendation: Seed forms from a static directory of manually downloaded PDFs; use react-pdf (which wraps PDF.js) for browser rendering; store per-client document copies under uploads/clients/{clientId}/.
<user_constraints>
User Constraints (from CONTEXT.md)
Locked Decisions
- Forms library source: Primary source is SkySlope / URE Legacy Forms Library from utahrealestate.com (MLS Forms + URE Legacy Forms Library). Research needed on public API; scraping is a route if no API exists. Forms library syncs at least monthly. File picker upload is a backup for custom/non-standard forms.
- Upload entry point: Agent uploads from client profile page via "Add Document" button. Modal shows forms library list with search. File picker option also in modal.
- Document naming: Modal pre-fills name from template name. Agent can edit. Name only — no extra metadata. Status auto-sets to Draft. Multiple instances of same template allowed.
- File storage: When agent adds a template to a client, a copy is saved to uploads/clients/{clientId}/. Soft delete: record hidden from UI, file kept on disk.
- PDF viewer: Minimal chrome. PDF fills the page. Controls: page nav, zoom, download. Back link to client profile. Render method: Claude's discretion (researcher evaluate PDF.js vs browser embed; PDF.js preferred if it sets up Phase 5 field overlay cleanly).
- Post-upload flow: Stay on client profile after upload. New document appears in documents list. Progress indicator inside modal while saving.
Claude's Discretion
- PDF rendering library choice (PDF.js vs iframe/embed)
- Exact storage path conventions within uploads/clients/{id}/
- Error handling for failed uploads or missing templates
- Forms library sync mechanism implementation details
Deferred Ideas (OUT OF SCOPE)
- None; discussion stayed within phase scope
</user_constraints>
<phase_requirements>
Phase Requirements
| ID | Description | Research Support |
|---|---|---|
| DOC-01 | Agent can browse and import PDF forms from the utahrealestate.com vendor API (vendor.utahrealestate.com/webapi) — investigate API capability; fall back to manual upload if forms API is not available | URE Web API is MLS listing data only (RESO OData v4). No forms/PDF endpoint exists. Fallback to manual seed directory is the correct path. |
| DOC-02 | Forms library syncs automatically on at least a monthly basis to reflect new/updated forms | Implement a seed script (scripts/seed-forms.ts) that reads a local seeds/forms/ directory. "Sync" = re-run seed script after agent manually downloads updated PDFs from SkySlope portal. Can be triggered via npm script or cron. |
| DOC-03 | Agent can view an imported PDF document in the browser | Use react-pdf (PDF.js wrapper). Renders pages on canvas — required for Phase 5 field overlay. Supports page navigation and zoom. |
</phase_requirements>
Standard Stack
Core
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| react-pdf | ^9.x | PDF rendering in browser | Wraps PDF.js; canvas-based rendering enables Phase 5 field overlays; maintained by wojtekmaj; widely used in React ecosystem |
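| multer | ^1.4.x | Multipart form handling for file picker uploads | De facto standard for Express-style (req, res) handlers such as Pages Router API routes; App Router route handlers read uploads with request.formData() instead |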
| uuid | ^11.x | Generate unique filenames for stored PDFs | Prevents collisions; already likely in project from Phase 3 |
Supporting
| Library | Version | Purpose | When to Use |
|---|---|---|---|
| pdfjs-dist | ^4.x | PDF.js core (direct dependency of react-pdf) | Installed automatically with react-pdf |
| node:fs/promises | built-in | Read seed directory, copy files to client folders | Server-side file operations without extra deps |
| node:path | built-in | Safe path construction for uploads dir | Prevents path traversal bugs |
| node:crypto | built-in | Generate unique doc IDs if uuid not present | Fallback only |
Alternatives Considered
| Instead of | Could Use | Tradeoff |
|---|---|---|
| react-pdf | <iframe src="..."> or <embed> | iframe/embed works for DOC-03 in isolation but makes Phase 5 field placement on canvas impossible; ruled out |
| react-pdf | @react-pdf/renderer | renderer is for GENERATING PDFs, not VIEWING them — wrong tool |
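| multer | Next.js formData() built-in | formData() is built into App Router route handlers but reads the whole upload into memory; multer streams to disk with size limits but is Express-style middleware and does not plug directly into route handlers |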
| manual seed dir | Scraping SkySlope portal | Scraping violates ToS (explicit in REQUIREMENTS.md Out of Scope table); no public API exists |
Installation:
npm install react-pdf multer uuid
npm install --save-dev @types/multer
Note: react-pdf v9 pulls in pdfjs-dist as its own dependency, so the worker file must come from that same pdfjs-dist version. Follow the react-pdf docs for worker configuration in Next.js.
Architecture Patterns
Recommended Project Structure
src/
├── app/
│   ├── portal/
│   │   ├── clients/[id]/
│   │   │   └── page.tsx          # Already exists (Phase 3); add "Add Document" button
│   │   └── documents/[docId]/
│   │       └── page.tsx          # NEW: PDF viewer page
│   └── api/
│       ├── documents/
│       │   ├── route.ts          # POST: create document record (from library or file picker)
│       │   └── [id]/
│       │       ├── route.ts      # DELETE: soft delete
│       │       └── file/
│       │           └── route.ts  # GET: serve PDF file (authenticated)
│       └── forms-library/
│           └── route.ts          # GET: list available seed forms
seeds/
└── forms/                        # Manually downloaded PDFs from SkySlope portal
    ├── purchase-agreement.pdf
    ├── listing-agreement.pdf
    └── ...
uploads/
└── clients/
    └── {clientId}/
        └── {uuid}.pdf            # Per-client document copies
scripts/
└── seed-forms.ts                 # Reads seeds/forms/, upserts into form_templates table
Pattern 1: Forms Library Seed Table
What: A form_templates table in the database holds metadata (name, filename) for each seeded PDF. The seeds/forms/ directory holds the actual PDF files. The seed script syncs the directory to the table.
When to use: All forms library browsing queries hit the DB (fast, searchable). File reads only happen when a form is copied to a client folder.
Example:
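// scripts/seed-forms.ts
import { readdir } from 'node:fs/promises';
import path from 'node:path';
import { db } from '@/db';
import { formTemplates } from '@/db/schema';

const SEEDS_DIR = path.join(process.cwd(), 'seeds/forms');

async function seedForms() {
  const files = await readdir(SEEDS_DIR);
  // Match the extension case-insensitively so "Form.PDF" is not skipped
  const pdfs = files.filter(f => f.toLowerCase().endsWith('.pdf'));

  for (const filename of pdfs) {
    // "purchase-agreement.pdf" -> "purchase agreement"
    const name = path.parse(filename).name.replace(/-/g, ' ');
    // Upsert keyed on filename so re-running the script is idempotent
    await db.insert(formTemplates)
      .values({ name, filename })
      .onConflictDoUpdate({
        target: formTemplates.filename,
        set: { name, updatedAt: new Date() }
      });
  }
  console.log(`Seeded ${pdfs.length} forms`);
}

// Exit non-zero on failure so a cron job or CI run notices a broken sync
seedForms().catch((err) => {
  console.error('Seeding failed:', err);
  process.exit(1);
});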
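The seed script above only inserts or updates rows; it never notices a PDF that has been removed from seeds/forms/. A minimal sketch of how a monthly re-sync could be planned before touching the database, assuming the caller passes in the directory listing and the filenames already stored in form_templates. The names planFormSync and SyncPlan are hypothetical and not part of the project.

```typescript
// Hypothetical helper: compares the seed directory listing with the
// filenames already recorded in form_templates and reports the difference.
export interface SyncPlan {
  toUpsert: string[]; // PDFs present on disk (insert or refresh)
  stale: string[];    // rows whose file no longer exists in seeds/forms/
}

export function planFormSync(filesOnDisk: string[], filenamesInDb: string[]): SyncPlan {
  // Same case-insensitive extension filter a seed script would apply
  const pdfs = filesOnDisk.filter((f) => f.toLowerCase().endsWith('.pdf'));
  const onDisk = new Set(pdfs);
  return {
    toUpsert: [...pdfs].sort(),
    stale: filenamesInDb.filter((f) => !onDisk.has(f)).sort(),
  };
}
```

Stale rows are probably better flagged than deleted, since existing client documents may still reference the template id.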
Pattern 2: Copy-on-Add (Template → Client Document)
What: When agent adds a form to a client, copy the seed PDF to uploads/clients/{clientId}/{uuid}.pdf. Insert a documents record pointing to the copy. Phase 5 writes field data against the copy only, never the template.
When to use: Every time agent clicks "Add Document" from the library.
Example:
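// src/app/api/documents/route.ts (POST handler)
import { copyFile, mkdir } from 'node:fs/promises';
import path from 'node:path';
import { eq } from 'drizzle-orm';
import { v4 as uuidv4 } from 'uuid';
import { auth } from '@/auth';
import { db } from '@/db';
import { documents, formTemplates } from '@/db/schema';

export async function POST(req: Request) {
  // Same auth gate as the file-serving route
  const session = await auth();
  if (!session) return new Response('Unauthorized', { status: 401 });

  const { clientId, formTemplateId, name } = await req.json();
  // NOTE: confirm clientId matches an existing client before using it in a path

  const template = await db.query.formTemplates.findFirst({
    where: eq(formTemplates.id, formTemplateId)
  });
  if (!template) return new Response('Template not found', { status: 404 });

  const docId = uuidv4();
  const destDir = path.join(process.cwd(), 'uploads/clients', clientId);
  const destPath = path.join(destDir, `${docId}.pdf`);
  const srcPath = path.join(process.cwd(), 'seeds/forms', template.filename);

  // Copy the template; the seed file itself is never modified
  await mkdir(destDir, { recursive: true });
  await copyFile(srcPath, destPath);

  const [doc] = await db.insert(documents).values({
    id: docId,
    clientId,
    formTemplateId,
    name,
    filePath: `clients/${clientId}/${docId}.pdf`, // relative to uploads/
    status: 'draft'
  }).returning();

  return Response.json(doc, { status: 201 });
}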
Pattern 3: Authenticated PDF Serving
What: PDFs are served via an API route that checks auth before streaming the file. Never expose the uploads/ directory as a static asset.
When to use: The PDF viewer page fetches the PDF via /api/documents/{id}/file — an authenticated GET route.
Example:
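// src/app/api/documents/[id]/file/route.ts
import { readFile } from 'node:fs/promises';
import path from 'node:path';
import { eq } from 'drizzle-orm';
import { auth } from '@/auth';
import { db } from '@/db';
import { documents } from '@/db/schema';

const UPLOADS_BASE = path.join(process.cwd(), 'uploads');

// Note: on Next.js 15+, params is a Promise and must be awaited
export async function GET(req: Request, { params }: { params: { id: string } }) {
  const session = await auth();
  if (!session) return new Response('Unauthorized', { status: 401 });

  const doc = await db.query.documents.findFirst({
    where: eq(documents.id, params.id)
  });
  if (!doc) return new Response('Not found', { status: 404 });

  const filePath = path.join(UPLOADS_BASE, doc.filePath);
  // Reject any stored path that resolves outside uploads/
  if (!filePath.startsWith(UPLOADS_BASE + path.sep)) {
    return new Response('Forbidden', { status: 403 });
  }

  const buffer = await readFile(filePath);
  return new Response(buffer, {
    headers: { 'Content-Type': 'application/pdf' }
  });
}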
Pattern 4: react-pdf Viewer Component
What: Client component using react-pdf to render PDF pages on canvas. Supports page navigation and zoom.
When to use: Document detail page (/portal/documents/[docId]).
Example:
// src/app/portal/documents/[docId]/_components/PdfViewer.tsx
'use client';
import { useState } from 'react';
import { Document, Page, pdfjs } from 'react-pdf';
import 'react-pdf/dist/Page/AnnotationLayer.css';
import 'react-pdf/dist/Page/TextLayer.css';

// Worker setup (required for Next.js)
pdfjs.GlobalWorkerOptions.workerSrc = new URL(
  'pdfjs-dist/build/pdf.worker.min.mjs',
  import.meta.url
).toString();

export function PdfViewer({ docId }: { docId: string }) {
  const [numPages, setNumPages] = useState<number>(0);
  const [pageNumber, setPageNumber] = useState(1);
  const [scale, setScale] = useState(1.0);

  return (
    <div>
      <Document
        file={`/api/documents/${docId}/file`}
        onLoadSuccess={({ numPages }) => setNumPages(numPages)}
      >
        <Page pageNumber={pageNumber} scale={scale} />
      </Document>
      <div>
        <button onClick={() => setPageNumber(p => Math.max(1, p - 1))}>Prev</button>
        <span>{pageNumber} / {numPages}</span>
        <button onClick={() => setPageNumber(p => Math.min(numPages, p + 1))}>Next</button>
        <button onClick={() => setScale(s => s + 0.2)}>Zoom In</button>
        <button onClick={() => setScale(s => Math.max(0.4, s - 0.2))}>Zoom Out</button>
      </div>
    </div>
  );
}
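The inline handlers above clamp the page number and the lower zoom bound, but zoom-in is unbounded and repeated 0.2 steps accumulate floating-point error. One possible way to pull that logic into pure functions that are easy to unit test, assuming zoom limits of 0.4 to 3.0. The upper limit and the helper names stepPage and stepZoom are assumptions, not taken from the requirements.

```typescript
const MIN_SCALE = 0.4;
const MAX_SCALE = 3.0; // assumed upper bound; not specified anywhere in the phase docs
const ZOOM_STEP = 0.2;

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Move forward or back by delta pages, staying inside 1..numPages.
export function stepPage(current: number, delta: number, numPages: number): number {
  // numPages is 0 until the document has loaded; stay on page 1 in that case
  return clamp(current + delta, 1, Math.max(1, numPages));
}

// Zoom one step in (+1) or out (-1), rounded to one decimal place.
export function stepZoom(scale: number, direction: 1 | -1): number {
  const next = Math.round((scale + direction * ZOOM_STEP) * 10) / 10;
  return clamp(next, MIN_SCALE, MAX_SCALE);
}
```

In the component these would be called as setPageNumber(p => stepPage(p, 1, numPages)) and setScale(s => stepZoom(s, 1)).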
Pattern 5: Forms Library Modal with Search
What: Client component modal. Fetches form templates list from /api/forms-library. Filters client-side by name as agent types. Has separate "Browse files" button for custom PDF upload.
When to use: "Add Document" button on client profile page.
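No example accompanies this pattern, so here is a minimal sketch of the client-side search step. It assumes the modal already holds the { id, name } list returned by /api/forms-library; the function name filterTemplates and the word-by-word matching rule are illustrative choices, not requirements.

```typescript
export interface FormTemplateSummary {
  id: number;
  name: string;
}

// Returns the templates whose name contains every whitespace-separated
// term in the query, ignoring case. An empty query returns the full list.
export function filterTemplates(
  templates: FormTemplateSummary[],
  query: string
): FormTemplateSummary[] {
  const terms = query.trim().toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return templates;
  return templates.filter((t) => {
    const name = t.name.toLowerCase();
    return terms.every((term) => name.includes(term));
  });
}
```

Matching each term separately lets "agreement listing" find "listing agreement", which a plain substring match would miss.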
Schema Additions
// Addition to src/db/schema.ts
export const formTemplates = pgTable('form_templates', {
  id: serial('id').primaryKey(),
  name: text('name').notNull(),
  filename: text('filename').notNull().unique(), // filename in seeds/forms/
  createdAt: timestamp('created_at').defaultNow(),
  updatedAt: timestamp('updated_at').defaultNow(),
});

// documents table already partially defined in Phase 3 (documentStatusEnum exists)
// Add columns:
//   formTemplateId: integer('form_template_id').references(() => formTemplates.id)
//   filePath: text('file_path').notNull() // relative path within uploads/
//   name: text('name').notNull()
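The locked decision on soft delete (record hidden from UI, file kept on disk) is not reflected in the columns listed above. A sketch of the intended behavior, assuming a nullable deletedAt timestamp is added to the documents table. That column, the DocumentRow shape and both helper names are hypothetical and do not come from CONTEXT.md.

```typescript
export interface DocumentRow {
  id: string;
  name: string;
  filePath: string;
  deletedAt: Date | null; // hypothetical column: null means the document is visible
}

// Marks a document as deleted without touching the file at doc.filePath.
// Deleting an already-deleted document keeps the original timestamp.
export function softDelete(doc: DocumentRow, now: Date = new Date()): DocumentRow {
  return doc.deletedAt ? doc : { ...doc, deletedAt: now };
}

// What the client profile's document list would show.
export function visibleDocuments(docs: DocumentRow[]): DocumentRow[] {
  return docs.filter((d) => d.deletedAt === null);
}
```

In the real route the same idea would be an UPDATE that sets the timestamp, plus a filter on every list query.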
Anti-Patterns to Avoid
- Serving uploads as static assets: Never put uploads/ under public/. All PDF serving must go through authenticated API routes.
- Storing absolute paths in DB: Store relative paths only (e.g., clients/{id}/{uuid}.pdf). Absolute paths break when the Docker volume mount changes.
- Mutating seed templates: The seeds/forms/ directory is read-only source of truth. Never write or delete files there programmatically.
- Using <iframe src="/api/documents/id/file"> for the viewer: Works for DOC-03 but blocks Phase 5 canvas overlay. Use react-pdf.
- Loading the PDF worker from CDN: In local/Docker environment with no internet guarantee, configure the worker from node_modules via import.meta.url.
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| PDF rendering in browser | Custom canvas renderer | react-pdf (PDF.js) | PDF rendering handles fonts, embedded images, encodings, cross-browser canvas quirks; thousands of edge cases |
| Multipart file upload parsing | Custom body parser | multer or Next.js built-in formData | Boundary parsing, temp file buffering, size limits; complex protocol details |
| Unique filename generation | Date.now() + Math.random() | uuid v4 | UUID v4 provides genuine collision resistance; timestamp-based names can collide under concurrent requests |
| Path traversal prevention | String sanitization | path.join() + allowlist check | path.join normalizes ../ sequences but does not stop the result from escaping the base dir; add an explicit check that the resolved path stays inside the allowed base dir |
Key insight: PDF rendering looks simple (just show the bytes) but requires a complete PDF interpreter for production use. PDF.js is the most widely deployed open-source PDF renderer for the browser (it powers Firefox's built-in viewer), which is why react-pdf builds on it.
Common Pitfalls
Pitfall 1: react-pdf Worker Not Configured for Next.js
What goes wrong: pdfjs.GlobalWorkerOptions.workerSrc is not set, or set to a CDN URL. PDF renders blank or throws "Setting up fake worker" warning. In Next.js with webpack, the worker import requires special handling.
Why it happens: PDF.js uses a Web Worker for parsing. Next.js webpack doesn't automatically handle .worker.mjs imports.
How to avoid: Set workerSrc using new URL('pdfjs-dist/build/pdf.worker.min.mjs', import.meta.url).toString() at module level, in the same file as the component that renders Document. This is the pattern the react-pdf docs describe for bundlers. Do NOT use a cdnjs.cloudflare.com URL in a local/offline environment.
Warning signs: Console shows "Setting up fake worker" or PDF page renders blank.
Pitfall 2: Path Traversal in File Serving Route
What goes wrong: doc.filePath from the DB contains ../../etc/passwd. The file serving route reads it and returns sensitive data.
Why it happens: DB record was tampered with, or filePath validation was skipped.
How to avoid: After building the absolute path with path.join(), assert that it starts with the expected uploads/ base directory before reading:
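const UPLOADS_BASE = path.join(process.cwd(), 'uploads');
// Append the separator so a sibling directory such as uploads-old/ does not pass the prefix check
if (!filePath.startsWith(UPLOADS_BASE + path.sep)) {
  return new Response('Forbidden', { status: 403 });
}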
Warning signs: Any .. sequence appearing in a file path.
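The containment check can also be written as a small reusable helper, so the same rule applies wherever a stored relative path is turned into an absolute one. A minimal sketch using only node:path; the name resolveInside is hypothetical, and returning null (rather than throwing) is just one possible convention.

```typescript
import path from 'node:path';

// Resolves relativePath against baseDir and returns the absolute path,
// or null if the result is the base itself or falls outside it.
export function resolveInside(baseDir: string, relativePath: string): string | null {
  const base = path.resolve(baseDir);
  const target = path.resolve(base, relativePath);
  const rel = path.relative(base, target);
  const escapes =
    rel === '' ||                      // points at the base directory itself
    rel === '..' ||
    rel.startsWith('..' + path.sep) || // climbs above the base
    path.isAbsolute(rel);              // different root or drive
  return escapes ? null : target;
}
```

Comparing via path.relative avoids the prefix pitfall where a sibling directory whose name merely starts with "uploads" would pass a plain startsWith test.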
Pitfall 3: uploads/ Excluded from Docker Volume
What goes wrong: Files are saved to uploads/ successfully in development but disappear on container restart because uploads/ is not mounted as a Docker volume.
Why it happens: Docker containers have ephemeral filesystems by default. If the volume mount targets /app but excludes subdirectories, files don't persist.
How to avoid: Ensure docker-compose.yml persists /app/uploads, either as a bind mount (./uploads:/app/uploads) or as a named volume. Validate survival of uploads across container restarts as an explicit acceptance criterion.
Warning signs: Uploaded files exist in dev but disappear after docker compose restart.
Pitfall 4: react-pdf CSS Not Imported
What goes wrong: PDF renders but text is missing or annotation links are invisible.
Why it happens: react-pdf requires explicit CSS imports for the annotation and text layers.
How to avoid: Import both CSS files in the viewer component:
import 'react-pdf/dist/Page/AnnotationLayer.css';
import 'react-pdf/dist/Page/TextLayer.css';
Pitfall 5: File Size Limits on API Route
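What goes wrong: Agent uploads a large PDF via the file picker (some real estate contracts are 10-15MB) and the request is rejected because it exceeds the body size limit.
Why it happens: Pages Router API routes parse request bodies with a bodyParser whose default limit is 1MB (the 4MB figure that is often quoted is the API route response limit). App Router route handlers have no bodyParser setting; any limit there comes from the reverse proxy or hosting platform, or has to be enforced in the handler.
How to avoid: In a Pages Router API route, export config to raise the limit, or disable body parsing and stream the upload with multer:
export const config = { api: { bodyParser: { sizeLimit: '20mb' } } };
In an App Router route handler (the layout this phase uses), that config export has no effect. Read the upload with request.formData() and reject files whose size exceeds an explicit limit before writing to disk.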
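For the file picker path, the size limit can also be enforced in application code, independent of framework configuration. A minimal sketch, assuming the handler has already read the upload into bytes (for example from request.formData() followed by file.arrayBuffer() in a route handler). The name checkPdfUpload and the 413/415 status choices are illustrative assumptions.

```typescript
export type UploadCheck =
  | { ok: true }
  | { ok: false; status: 413 | 415; reason: string };

// ASCII bytes for "%PDF-", the marker that opens a well-formed PDF file
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Validates an uploaded file before it is written under uploads/.
export function checkPdfUpload(bytes: Uint8Array, maxBytes: number): UploadCheck {
  if (bytes.byteLength > maxBytes) {
    return { ok: false, status: 413, reason: `File exceeds ${maxBytes} bytes` };
  }
  // Strict check at offset 0; some real-world PDFs carry leading junk bytes
  // and would need a more lenient scan.
  const looksLikePdf = PDF_MAGIC.every((b, i) => bytes[i] === b);
  if (!looksLikePdf) {
    return { ok: false, status: 415, reason: 'File is not a PDF' };
  }
  return { ok: true };
}
```

A caller would map a failed check straight to a response, e.g. new Response(result.reason, { status: result.status }).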
Pitfall 6: Missing seeds/forms/ Directory in Docker Build
What goes wrong: Container starts but forms library is empty because seeds/forms/ was not included in the Docker image or volume mount.
Why it happens: .dockerignore or volume mounts exclude the seeds directory.
How to avoid: Either include seeds/forms/ in the Docker image (acceptable since these are non-sensitive read-only PDFs), or mount it as a read-only volume. Document the required setup step in a project README.
Code Examples
Seed Script Run Command
# package.json scripts
"seed:forms": "npx tsx scripts/seed-forms.ts"
# Monthly re-sync: download new PDFs to seeds/forms/, re-run seed:forms
Forms Library API Route
// src/app/api/forms-library/route.ts
import { auth } from '@/auth';
import { db } from '@/db';
import { formTemplates } from '@/db/schema';

export async function GET() {
  const session = await auth();
  if (!session) return new Response('Unauthorized', { status: 401 });

  const forms = await db.select({
    id: formTemplates.id,
    name: formTemplates.name,
  }).from(formTemplates).orderBy(formTemplates.name);

  return Response.json(forms);
}
Drizzle Migration for New Tables
npx drizzle-kit generate # generates new migration
npx drizzle-kit migrate # applies migration
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| pdf.js manual integration | react-pdf wrapper | ~2018 | react-pdf handles React lifecycle and canvas management; the worker path still has to be configured by the app |
| URE RETS data feed | RESO Web API (OData v4) | 2020 | Listing data only — no forms endpoint has ever been exposed |
| SkySlope separate product | SkySlope integrated into URE member portal | ~2022 | Forms are now SkySlope-managed; no independent URE forms API |
Deprecated/outdated:
- RETS protocol: URE killed RETS in 2020; all data feeds now via RESO Web API
- @react-pdf/renderer: This is a PDF generator, not a viewer; completely different use case, do not confuse
Open Questions
- Does the vendor.utahrealestate.com/webapi expose any forms endpoints?
  - What we know: The API is RESO-certified OData v4. Documented endpoints cover MLS listing resources (Property, Member, Office, Media, OpenHouse). No forms/documents endpoint is documented publicly.
  - What's unclear: Whether a vendor account with elevated permissions might expose additional endpoints. The SkySlope integration is member-facing only.
  - Recommendation: Proceed with manual seed directory approach (DOC-01 fallback explicitly stated in REQUIREMENTS.md). If Teressa has vendor API credentials, check the /api/$metadata endpoint for any Form or Document resource, but don't block Phase 4 on this.
- react-pdf v9 Next.js App Router compatibility
  - What we know: react-pdf v9 ships as ESM. Next.js App Router has improved ESM support but some configurations require transpilePackages.
  - What's unclear: Whether react-pdf needs to be added to next.config.js transpilePackages array for this project's Next.js version.
  - Recommendation: After installing, if build errors appear referencing react-pdf ESM, add transpilePackages: ['react-pdf', 'pdfjs-dist'] to next.config.js.
- How many seed forms are needed initially?
  - What we know: Teressa uses forms from the SkySlope/URE portal. The exact list is not known.
  - What's unclear: Whether 5 forms or 50+ forms are typical for a Utah real estate agent's workflow.
  - Recommendation: Start with the most common forms (Purchase Agreement, Listing Agreement, Buyer Representation Agreement, Counter Offer, Addendum). Agent can expand the seed directory as needed.
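For the first open question, the $metadata check can be done by eye or with a small scan over the downloaded CSDL XML. A rough sketch, assuming the metadata document has already been fetched with valid vendor credentials and is available as a string. findFormLikeEntities is a hypothetical helper, and matching entity names against "form" or "document" is only a heuristic.

```typescript
// Scans an OData v4 CSDL metadata document for EntityType names that
// look forms-related. Returns the matching names in document order.
export function findFormLikeEntities(metadataXml: string): string[] {
  const names: string[] = [];
  const pattern = /<EntityType\b[^>]*\bName="([^"]+)"/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(metadataXml)) !== null) {
    const name = match[1];
    if (/form|document/i.test(name)) names.push(name);
  }
  return names;
}
```

An empty result against the real metadata would support the manual-seed conclusion; a non-empty one would be worth a follow-up before Phase 5.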
Sources
Primary (HIGH confidence)
- vendor.utahrealestate.com/webapi/docs — Verified URE Web API is RESO OData v4, listing data only, no forms endpoint documented
- resoapi.utahrealestate.com/login — URE RESO API login page
- skyslope.utahrealestate.com — SkySlope/URE integration is member-facing portal, no public API for forms
Secondary (MEDIUM confidence)
- SkySlope press release: URE partnership — Confirms SkySlope Forms is the forms product for URE members; member benefit, not API
- react-pdf documentation and GitHub (wojtekmaj/react-pdf) — Canvas-based PDF rendering pattern, worker configuration for bundlers
- REQUIREMENTS.md Out of Scope table — Explicitly lists "utahrealestate.com forms scraping (automated credential login) — Likely violates ToS; use vendor API or manual upload instead"
Tertiary (LOW confidence)
- WebSearch results on URE GitHub organization — No forms-related repositories found; confirms no open-source forms API
Metadata
Confidence breakdown:
- Standard stack (react-pdf, multer, uuid): MEDIUM — react-pdf is well-established but Next.js App Router ESM compatibility needs verification at install time
- Architecture (seed dir, copy-on-add, authenticated serving): HIGH — standard file management patterns, no novel approaches
- SkySlope/URE API finding (no forms API): MEDIUM — based on public documentation; a vendor account might reveal more endpoints, but public evidence strongly indicates forms are member-portal-only
- Pitfalls: HIGH — react-pdf worker config, path traversal, Docker volume persistence are well-documented real-world issues
Research date: 2026-03-19
Valid until: 2026-04-19 (stable domain; re-check react-pdf major version if >30 days)