A client-side web application that parses PDF files from the Brazilian Official Gazette (Diário Oficial da União - DOU) and extracts legal acts into a structured, searchable table.
- Client-side PDF Processing: All parsing happens in the browser using pdfjs-dist
- Smart Document Detection: Automatically identifies different types of legal acts (Portarias, Decretos, etc.)
- Ministry/Organization Tracking: Associates documents with their issuing ministry or organization
- Advanced Filtering: Search across all fields, filter by document type or ministry
- Sortable Table: Click column headers to sort documents
- CSV Export: Export filtered results to CSV format
- Detail Modal: View full document content with a single click
- Responsive Design: Works on desktop, tablet, and mobile devices
- Brazilian Government Aesthetic: Clean, professional UI with government color scheme
- Drag and drop a DOU PDF file onto the upload zone, or click to select a file
- The application will process the PDF and extract all text
- A progress indicator shows the processing status
The first pages of DOU PDFs contain a table of contents (Sumário). The parser:
- Detects pages with more than 3 lines containing 10+ consecutive dots
- Automatically skips these summary pages
- Removes any lines with 5+ consecutive dots from content
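The dot-leader rules above can be sketched as two small helpers. This is a minimal illustration, assuming each page has already been split into an array of line strings; the names `isSummaryPage` and `stripDotLines` are hypothetical and not taken from parser.js.

```javascript
// Hypothetical helpers illustrating the thresholds described above.
const TOC_LINE = /\.{10,}/; // 10+ consecutive dots marks a table-of-contents line
const DOT_RUN = /\.{5,}/;   // 5+ consecutive dots is treated as a leftover artifact

// A page counts as Sumário when more than 3 of its lines contain a dot leader.
function isSummaryPage(lines) {
  return lines.filter((line) => TOC_LINE.test(line)).length > 3;
}

// Remove any remaining dot-leader lines from a content page.
function stripDotLines(lines) {
  return lines.filter((line) => !DOT_RUN.test(line));
}
```

A page would typically be checked with `isSummaryPage` first, and only pages that fail that check would be passed through `stripDotLines`.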
Recognizes the following document types:
- PORTARIA (Nº XXX)
- DECRETO (Nº XXX)
- RESOLUÇÃO (Nº XXX)
- LEI (Nº XXX)
- EDITAL (Nº XXX)
- DESPACHO
- INSTRUÇÃO NORMATIVA (Nº XXX)
- MEDIDA PROVISÓRIA (Nº XXX)
- ATO (Nº XXX)
- ORDEM (Nº XXX)
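One way to recognize these headers is a single regular expression built from the type list. The sketch below is an assumption about how such matching could work, not the parser's actual pattern; `matchDocumentHeader` is a hypothetical name, and plural headers such as "PORTARIAS DE ..." are deliberately left unhandled.

```javascript
// Document types from the list above.
const DOC_TYPES = [
  "INSTRUÇÃO NORMATIVA", "MEDIDA PROVISÓRIA", "PORTARIA", "DECRETO",
  "RESOLUÇÃO", "LEI", "EDITAL", "DESPACHO", "ATO", "ORDEM",
];

// Type at line start, not followed by another letter (so "LEILÃO" is not "LEI"),
// then an optional "Nº 1.431" style number.
const DOC_HEADER = new RegExp(
  "^(" + DOC_TYPES.join("|") + ")(?!\\p{L})(?:\\s+N[º°o]\\.?\\s*(\\d+(?:\\.\\d+)*))?",
  "u"
);

// Returns { type, number } for a header line, or null for anything else.
function matchDocumentHeader(line) {
  const m = DOC_HEADER.exec(line.trim());
  if (!m) return null;
  return { type: m[1], number: m[2] ?? null };
}
```

Types without a number, such as DESPACHO, come back with `number` set to `null`.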
Automatically identifies and tracks:
- MINISTÉRIO DA/DO [NAME]
- PRESIDÊNCIA DA REPÚBLICA
- CASA CIVIL
- ADVOCACIA-GERAL DA UNIÃO
- CONTROLADORIA-GERAL DA UNIÃO
- AGÊNCIA [NAME]
- INSTITUTO [NAME]
- CONSELHO [NAME]
- SECRETARIA [NAME]
- TRIBUNAL [NAME]
- COMANDO [NAME]
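Because a ministry header applies to every document that follows it, tracking can be modeled as a running "current organization" value. The following is a rough sketch under that assumption; `isOrgHeader` and `assignOrganizations` are hypothetical names, and accepting the plural forms DAS/DOS is an addition not stated in the list above.

```javascript
// Header patterns from the list above (DAS/DOS accepted as an assumption).
const ORG_HEADER = new RegExp(
  "^(MINISTÉRIO D[AO]S? .+|PRESIDÊNCIA DA REPÚBLICA|CASA CIVIL|" +
  "ADVOCACIA-GERAL DA UNIÃO|CONTROLADORIA-GERAL DA UNIÃO|" +
  "(?:AGÊNCIA|INSTITUTO|CONSELHO|SECRETARIA|TRIBUNAL|COMANDO) .+)$"
);

// Headers are expected to be fully upper case and to match one of the patterns.
function isOrgHeader(line) {
  const text = line.trim();
  return text === text.toUpperCase() && ORG_HEADER.test(text);
}

// Tag each line with the most recent organization header seen so far.
function assignOrganizations(lines) {
  let current = null;
  return lines.map((line) => {
    if (isOrgHeader(line)) current = line.trim();
    return { line, organization: current };
  });
}
```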
Extracts dates in Portuguese format: "DD DE [MONTH] DE YYYY"
- Example: "19 DE DEZEMBRO DE 2025" → "19/12/2025"
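The conversion in the example can be illustrated with a month lookup table and one regular expression. This is a sketch rather than the parser's own code; `extractDate` is a hypothetical name, and support for ordinal days such as "1º" is an assumption.

```javascript
// Portuguese month names mapped to two-digit month numbers.
const MONTHS = {
  JANEIRO: "01", FEVEREIRO: "02", "MARÇO": "03", ABRIL: "04",
  MAIO: "05", JUNHO: "06", JULHO: "07", AGOSTO: "08",
  SETEMBRO: "09", OUTUBRO: "10", NOVEMBRO: "11", DEZEMBRO: "12",
};

// "DD DE [MONTH] DE YYYY", with an optional ordinal marker after the day.
const DATE_PATTERN = /\b(\d{1,2})º?\s+DE\s+([A-ZÇ]+)\s+DE\s+(\d{4})/;

// Returns the first date found as DD/MM/YYYY, or null when none is present.
function extractDate(text) {
  const m = DATE_PATTERN.exec(text.toUpperCase());
  if (!m) return null;
  const month = MONTHS[m[2]];
  if (!month) return null; // word between the two "DE"s was not a month name
  return `${m[1].padStart(2, "0")}/${month}/${m[3]}`;
}
```

Only the first date in the text is returned, which suits header lines that carry a single date.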
Identifies signatories by:
- Looking for ALL CAPS names
- Checking the last ~10 lines of each document
- Filtering out common headers and footer text
- Requiring reasonable name length (5-100 characters)
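These heuristics can be combined into a backwards scan over the tail of a document. The sketch below assumes the document is an array of lines; `findSignatory` is a hypothetical name, and the `NON_NAME` markers are illustrative guesses at header and footer text, not the parser's actual filter list.

```javascript
// Illustrative header/footer markers to reject (assumed, not from parser.js).
const NON_NAME = /^(DIÁRIO OFICIAL|SEÇÃO|ISSN|PÁGINA)/;

// Upper-case letters, spaces, and a few punctuation marks found in names.
const ALL_CAPS = /^[\p{Lu}\s.'-]+$/u;

// Scan the last 10 lines from the bottom up and return the first plausible name.
function findSignatory(lines) {
  const tail = lines.slice(-10);
  for (let i = tail.length - 1; i >= 0; i--) {
    const text = tail[i].trim();
    const lengthOk = text.length >= 5 && text.length <= 100;
    const hasLetter = /\p{Lu}/u.test(text);
    if (lengthOk && hasLetter && ALL_CAPS.test(text) && !NON_NAME.test(text)) {
      return text;
    }
  }
  return null;
}
```

Scanning from the bottom means an all-caps header near the top of a short document is only chosen when nothing closer to the end qualifies.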
Open index.html in a modern web browser. No server or installation is required, but pdfjs-dist is loaded from a CDN, so the browser needs network access.
- Upload a DOU PDF file
- Wait for processing (progress bar will show status)
- View extracted documents in the table
- Search: Type in the search box to filter across all fields
- Type Filter: Select a specific document type from the dropdown
- Ministry Filter: Select a specific ministry/organization
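The three filters can be thought of as one predicate applied to each extracted document. This is a minimal sketch, assuming each document is a plain object with hypothetical fields such as `type`, `ministry`, and `signatory`; `filterDocuments` is an illustrative name, not necessarily what app.js uses.

```javascript
// Keep documents that satisfy every active filter; empty filters match everything.
function filterDocuments(docs, { query = "", type = "", ministry = "" } = {}) {
  const needle = query.trim().toLowerCase();
  return docs.filter((doc) => {
    if (type && doc.type !== type) return false;
    if (ministry && doc.ministry !== ministry) return false;
    if (!needle) return true;
    // Free-text search: case-insensitive substring match over all field values.
    return Object.values(doc).some((value) =>
      String(value).toLowerCase().includes(needle)
    );
  });
}
```

For example, `filterDocuments(docs, { type: "PORTARIA", query: "nomear" })` would keep only Portarias whose fields mention "nomear".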
Click any column header to sort by that column. Click again to reverse sort direction.
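The click-to-toggle behavior can be sketched as a small sort state plus a comparator. The names `nextSortState` and `sortDocuments` are hypothetical, and the use of `Intl.Collator` is one possible choice rather than a description of app.js. Note that DD/MM/YYYY date strings would not sort chronologically with this comparator and would need their own handling.

```javascript
// Clicking the same column flips direction; clicking a new column starts ascending.
function nextSortState(state, column) {
  if (state.column === column) return { column, ascending: !state.ascending };
  return { column, ascending: true };
}

// Return a sorted copy; numeric collation orders "9" before "10".
function sortDocuments(docs, { column, ascending }) {
  const collator = new Intl.Collator("pt-BR", { numeric: true, sensitivity: "base" });
  const sorted = [...docs].sort((a, b) =>
    collator.compare(String(a[column] ?? ""), String(b[column] ?? ""))
  );
  return ascending ? sorted : sorted.reverse();
}
```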
Click the "Exportar CSV" button to download filtered results as a CSV file.
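Building the CSV text mostly comes down to quoting fields that contain commas, quotes, or line breaks. The sketch below shows one way to do that; the column keys are hypothetical field names mirroring the table columns, and `toCsv` is an illustrative name.

```javascript
// Hypothetical field names, one per table column.
const CSV_COLUMNS = ["type", "number", "date", "ministry", "page", "preview", "signatory"];

// Quote a field only when needed, doubling any embedded quotes.
function escapeCsvField(value) {
  const text = String(value ?? "");
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Header row followed by one row per document, joined with CRLF.
function toCsv(docs, columns = CSV_COLUMNS) {
  const rows = [columns, ...docs.map((doc) => columns.map((c) => doc[c]))];
  return rows.map((row) => row.map(escapeCsvField).join(",")).join("\r\n");
}
```

In the browser, the resulting string could be wrapped in a `Blob` and offered through a temporary download link. Prefixing the text with a UTF-8 byte order mark (`\uFEFF`) may help spreadsheet software display accented characters correctly.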
Click any row in the table to open a modal with the complete document content.
- HTML5: Semantic markup
- CSS3: Modern styling with flexbox and grid
- Vanilla JavaScript: No framework dependencies
- pdfjs-dist: PDF text extraction library (loaded from CDN)
dou-parser/
├── index.html # Main HTML structure
├── styles.css # All styling and responsive design
├── app.js # UI logic, filtering, sorting, export
├── parser.js # PDF parsing and document extraction
└── README.md # This file
Works in all modern browsers that support:
- ES6+ JavaScript
- CSS Grid and Flexbox
- Fetch API
- File API
Tested in:
- Chrome 90+
- Firefox 88+
- Safari 14+
- Edge 90+
The application uses the Brazilian government color scheme:
- Primary (Navy Blue): #003366 - Headers, buttons
- Accent (Yellow): #FFCC00 - Export button, highlights
- Background: #f5f5f5 - Page background
- Text: #333 - Main text color
- White: #ffffff - Cards, table background
- ✅ Correctly skips Sumário pages (detects dots pattern)
- ✅ Extracts documents from content pages only
- ✅ Associates correct ministry with each document
- ✅ Clean data without dots artifacts
- ✅ Functional search and filters
- ✅ Working CSV export
- ✅ Professional, clean UI
- ✅ Sortable table columns
- ✅ Document detail modal
- ✅ Responsive design
- ✅ Progress indicator
For a document like:
CASA CIVIL
PORTARIAS DE 19 DE DEZEMBRO DE 2025
O MINISTRO DE ESTADO DA CASA CIVIL...
Nº 1.431 - NOMEAR
ALLAN DE ALCÂNTARA, para exercer o cargo...
FLAVIO JOSÉ ROMAN
The parser extracts:
| Tipo | Número | Data | Ministério | Página | Prévia | Signatário |
|---|---|---|---|---|---|---|
| PORTARIA | 1.431 | 19/12/2025 | CASA CIVIL | 2 | NOMEAR ALLAN DE ALCÂNTARA... | FLAVIO JOSÉ ROMAN |
- Inconsistent Formatting: DOU PDFs have varying formats across different sections
- Text Extraction: PDF.js provides basic text extraction; line breaks aren't always reliable
- Ministry Tracking: Ministries are declared at section headers and apply to subsequent documents
- Signatory Detection: Heuristic-based approach looking for ALL CAPS names at document end
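Since extracted line breaks are not always reliable, one common workaround is to rebuild lines from the position of each text fragment. The sketch below assumes items shaped like PDF.js text items, with a `str` string and a `transform` array whose elements 4 and 5 hold the x and y position. `groupItemsIntoLines` and the `tolerance` value are hypothetical, and the approach ignores multi-column layouts, so it is an illustration rather than what parser.js does.

```javascript
// Group text fragments into lines by vertical position, top of page first.
function groupItemsIntoLines(items, tolerance = 2) {
  // PDF coordinates grow upward, so a larger y is closer to the top.
  const byY = [...items].sort((a, b) => b.transform[5] - a.transform[5]);
  const lines = [];
  for (const item of byY) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - item.transform[5]) <= tolerance) {
      last.items.push(item); // same baseline, within tolerance
    } else {
      lines.push({ y: item.transform[5], items: [item] });
    }
  }
  // Within each line, order fragments left to right and join them.
  return lines
    .map((line) =>
      line.items
        .sort((a, b) => a.transform[4] - b.transform[4])
        .map((item) => item.str)
        .join(" ")
        .replace(/\s+/g, " ")
        .trim()
    )
    .filter((text) => text.length > 0);
}
```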
- Support for batch PDF processing
- More document types
- Better handling of multi-page documents
- Export to other formats (JSON, Excel)
- Document content search highlighting
- Save/load filtered results
- API integration for automated processing
This is a demonstration project created for parsing DOU documents. Use at your own discretion.
For issues or questions, please check the console for error messages and verify that you're using a valid DOU PDF file.