This project automates the classification of German business documents using transformer-based language models. By fine-tuning pretrained German BERT models, it accurately categorizes documents such as invoices, contracts, and complaints, providing a trained model and a REST API prototype.

📘 German Document Classifier

This document provides a guide for developers working on the German Document Classifier.

  • Included: Installation steps, project structure, and usage instructions.
  • Deep Dive: For architecture and technical details, please see the Technical System Documentation.

0.0 🚀 Getting Started

1.0 Running in Google Colab / Kaggle

Clone the Repository

Run the following commands in a Colab/Kaggle cell to clone the project and navigate into the directory:

!git clone https://github.com/ha981muk-git/german_document_classifier.git
%cd german_document_classifier

Install Dependencies

!pip install uv
!uv pip install --system -r pyproject.toml

Fine-Tune the BERT Model

Execute the main script to start the fine-tuning process:

!python -m app.main --train

2.0 🛠️ Installation (Local Development)

Prerequisites

Before cloning the repository, ensure you have uv installed:

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

2.1 Clone the repository

git clone https://github.com/ha981muk-git/german_document_classifier.git
cd german_document_classifier

2.2 Create virtual environment

uv sync

# macOS/Linux
source .venv/bin/activate

# Windows
.venv\Scripts\activate

2.3 Synthetic Data Generation (Optional)

python -m app.main --generate
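The generation logic itself lives in app/sampler/doc_generator.py and app/sampler/make_synthetic_data.py. As a rough, hypothetical sketch of what a template-based generator for this task can look like (the templates and field names below are invented for illustration):

# Hypothetical sketch of template-based synthetic data generation;
# the project's real logic is in app/sampler/make_synthetic_data.py.
import csv
import random

TEMPLATES = {
    "invoice": "Rechnung Nr. {num}: Bitte überweisen Sie {amount} EUR bis zum {date}.",
    "contract": "Vertrag Nr. {num}: Dieser Vertrag tritt am {date} in Kraft.",
    "complaint": "Beschwerde: Die Lieferung vom {date} war beschädigt.",
}

def make_sample(label: str) -> dict:
    # str.format ignores unused keyword arguments, so each template
    # pulls only the fields it needs.
    text = TEMPLATES[label].format(
        num=random.randint(1000, 9999),
        amount=round(random.uniform(50.0, 5000.0), 2),
        date=f"{random.randint(1, 28):02d}.{random.randint(1, 12):02d}.2024",
    )
    return {"text": text, "label": label}

with open("synthetic_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    for _ in range(300):
        writer.writerow(make_sample(random.choice(list(TEMPLATES))))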

2.4 Prepare the CSV Dataset for Training (Optional)

python -m app.main --prepare
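The real preparation logic is in app/core/prepare_data.py. A minimal sketch of this step, under the assumption of a CSV with "text" and "label" columns and a stratified train/test split:

# Illustrative only; the project's real logic is in app/core/prepare_data.py.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("synthetic_data.csv")
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
print(f"{len(train_df)} training rows, {len(test_df)} test rows")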

2.5 Training the BERT Models

python -m app.main --train
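The training entry point is app/core/train.py. A condensed sketch of a typical Hugging Face fine-tuning loop for this setup; the model name comes from the API examples below, while the file paths and hyperparameters are assumptions:

# Condensed, illustrative fine-tuning sketch; see app/core/train.py for the real code.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
labels = sorted(set(dataset["train"]["label"]))
label2id = {label: i for i, label in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

def tokenize(batch):
    # Tokenize the German text and map string labels to integer ids.
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)
    enc["labels"] = [label2id[label] for label in batch["label"]]
    return enc

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-german-cased", num_labels=len(labels)
)
args = TrainingArguments(output_dir="models/bert-base-german-cased",
                         num_train_epochs=3, per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"]).train()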

2.6 Generating Evaluation Results

After training, you can generate comprehensive visualizations, CSV files, and performance summaries without re-training.

python -m app.main --results
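Under the hood this is handled by app/core/evaluate.py and app/statistics/result.py. A scikit-learn sketch of the core of such a summary (the predictions file and column names are assumptions):

# Illustrative sketch of the kind of summary this step produces.
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Assumes a CSV of saved predictions with gold "label" and model "predicted" columns.
df = pd.read_csv("predictions.csv")
print(classification_report(df["label"], df["predicted"]))
print(confusion_matrix(df["label"], df["predicted"]))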

2.7 Alternatively, Run Generation, Preparation, Training, and Evaluation (All at Once)

python -m app.main --all

2.8 FastAPI Web Server

The FastAPI service wraps the trained DocumentClassifier and exposes a single /predict endpoint that powers both the web UI and any programmatic client. It accepts either a text form field (for raw strings) or a file upload (for PDFs, images, or DOCs) and routes the request to the right inference path. Because the server also mounts the static frontend under /, you only need one process to serve both the UI and the API.
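The shape of that endpoint, as a minimal sketch (the real implementation lives in app/api/api.py and will differ in detail):

# Minimal sketch of the endpoint shape described above; see app/api/api.py.
from typing import Optional
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.staticfiles import StaticFiles

app = FastAPI()
# classifier = DocumentClassifier(...)  # loaded once at startup

@app.post("/predict")
async def predict(
    model_name: str = Form(...),
    text: Optional[str] = Form(None),
    file: Optional[UploadFile] = File(None),
):
    if file is not None:
        # For uploads, classifier.predict_file extracts text (OCR/loaders) first.
        return {"model": model_name, "prediction": "..."}  # placeholder result
    # For raw strings, the text goes straight to the classifier.
    return {"model": model_name, "prediction": "..."}  # placeholder result

# Routes are matched before the mount, so one process serves both UI and API.
app.mount("/", StaticFiles(directory="app/static", html=True), name="static")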

Start the server:

uvicorn app.api.api:app --reload --port 8080

Send free‑form text for classification:

curl -X POST http://127.0.0.1:8080/predict \
     -F "model_name=bert-base-german-cased" \
     -F "text=Dies ist eine deutsche Beispielrechnung."

Send a PDF for classification:

curl -X POST http://127.0.0.1:8080/predict \
     -F "model_name=bert-base-german-cased" \
     -F "file=@app/data/raw/contracts/01_Vertrag.pdf"

Open the UI:

👉 http://localhost:8080

2.9 🐳 Running with Docker (Alternative)

For easier dependency management and deployment, you can build and run the entire application using Docker. This is the recommended way to run the service in production. Note that you need to train the model first, since the container serves an already-trained model.

Build the Docker Image

docker build -t german-document-classifier .

Run the Docker Container

docker run -p 8080:8080 german-document-classifier

The containerized service supports:

  1. ✔ Uploading PDFs, images: classifier.predict_file extracts text via OCR/loader logic before inference.
  2. ✔ Text classification: Directly send German text via the form field or curl request.
  3. ✔ Real-time inference: The model is loaded once at startup, keeping latency low for repeated predictions.

2.10 Hyperparameter Search

python -m app.hyperparamsearch
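The optuna_studies.db file in the project tree suggests the search is backed by Optuna. A minimal sketch of how such a study is typically wired up; the search space and the train_and_evaluate stub below are hypothetical:

# Minimal Optuna sketch; see app/hyperparamsearch.py for the real search.
import optuna

def train_and_evaluate(lr: float, batch_size: int, epochs: int) -> float:
    # Hypothetical stand-in: the real objective would fine-tune the model
    # with these hyperparameters and return a validation metric.
    return 0.0

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    epochs = trial.suggest_int("num_train_epochs", 2, 5)
    return train_and_evaluate(lr, batch_size, epochs)

study = optuna.create_study(
    direction="maximize",
    storage="sqlite:///optuna_studies.db",  # matches the file in the repo tree
    study_name="german_document_classifier",
    load_if_exists=True,
)
study.optimize(objective, n_trials=20)
print("Best parameters:", study.best_params)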

3.0 📁 Project Structure

german_document_classifier/
│
├── config.yaml
├── environment.yaml
├── requirements.txt
├── optuna_studies.db
│
├── app/
│   ├── main.py
│   ├── flow.py
│   ├── hyperparamsearch.py
│   │
│   ├── api/
│   │   └── api.py
│   │
│   ├── core/
│   │   ├── evaluate.py
│   │   ├── paths.py
│   │   ├── prepare_data.py
│   │   ├── predict.py
│   │   └── train.py
│   │
│   ├── data/
│   │   └── (data files not shown)
│   │
│   ├── sampler/
│   │   ├── doc_generator.py
│   │   └── make_synthetic_data.py
│   │
│   ├── static/
│   │   ├── index.html
│   │   └── style.css
│   │
│   ├── statistics/
│   │   └── result.py
│   │
│   └── notebooks/
│       ├── 01_data_exploration.ipynb
│       ├── 02_model_training.ipynb
│       ├── 03_evaluation.ipynb
│       ├── 04_data_extraction.ipynb
│       ├── colab.ipynb
│       └── kaggle.ipynb
│
└── README.md
