This document provides a guide for developers working on the German Document Classifier.
- Included: installation steps, project structure, and usage instructions.
- Deep dive: for architecture and technical details, see the Technical System Documentation.
## Quick Start (Colab/Kaggle)

Run the following commands in a Colab/Kaggle cell to clone the project, navigate into the directory, and install the dependencies:

```shell
!git clone https://github.com/ha981muk-git/german_document_classifier.git
%cd german_document_classifier
!pip install uv
!uv pip install --system -r pyproject.toml
```

Execute the main script to start the fine-tuning process:

```shell
!python -m app.main --train
```

## Local Installation

Ensure you have uv installed:

```shell
# MacOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
```

Clone the repository and sync the dependencies:

```shell
git clone https://github.com/ha981muk-git/german_document_classifier.git
cd german_document_classifier
uv sync
```

Activate the virtual environment:

```shell
# MacOS/Linux
source .venv/bin/activate
# Windows
.venv\Scripts\activate
```

## Usage

Run the pipeline steps in order:

```shell
python -m app.main --generate   # generate synthetic documents
python -m app.main --prepare    # prepare the dataset
python -m app.main --train      # fine-tune the model
```

After training, you can generate comprehensive visualizations, CSV files, and performance summaries without re-training:
```shell
python -m app.main --results
```

To run the complete pipeline in one command:

```shell
python -m app.main --all
```

## API Server

The FastAPI service wraps the trained DocumentClassifier and exposes a single `/predict` endpoint that powers both the web UI and any programmatic client. It accepts either a `text` form field (for raw strings) or a `file` upload (for PDFs, images, or DOCs) and routes the request to the appropriate inference path. Because the server also mounts the static frontend under `/`, a single process serves both the UI and the API.
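The text-vs-file dispatch described above can be sketched as follows. This is an illustrative outline, not the actual handler in `app/api/api.py`, and the function name and return labels are hypothetical:

```python
from typing import Optional

def route_predict(text: Optional[str] = None,
                  filename: Optional[str] = None) -> str:
    """Mimic the /predict routing: raw text is classified directly,
    while uploads go through text extraction (OCR/loader) first."""
    if text is not None and text.strip():
        return "text-inference"   # classify the raw string as-is
    if filename is not None:
        return "file-inference"   # extract text from the upload, then classify
    raise ValueError("request must include either 'text' or 'file'")
```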
Start the server:

```shell
uvicorn app.api.api:app --reload --port 8080
```

Send free-form text for classification:
```shell
curl -X POST http://127.0.0.1:8080/predict \
  -F "model_name=bert-base-german-cased" \
  -F "text=Dies ist eine deutsche Beispielrechnung."
```

Send a PDF for classification:
```shell
curl -X POST http://127.0.0.1:8080/predict \
  -F "model_name=bert-base-german-cased" \
  -F "file=@app/data/raw/contracts/01_Vertrag.pdf"
```

## Docker

For easier dependency management and deployment, you can build and run the entire application with Docker. This is the recommended way to run the service in production. Note that you must train the model first so that there is a trained model to test.
```shell
docker build -t german-document-classifier .
docker run -p 8080:8080 german-document-classifier
```

## Features

- ✔ PDF and image uploads: `classifier.predict_file` extracts text via OCR/loader logic before inference.
- ✔ Text classification: send German text directly via the form field or a curl request.
- ✔ Real-time inference: The model is loaded once at startup, keeping latency low for repeated predictions.
## Hyperparameter Search

Run the hyperparameter search:

```shell
python -m app.hyperparamsearch
```

## Project Structure

```
german_document_classifier/
│
├── config.yaml
├── environment.yaml
├── requirements.txt
├── optuna_studies.db
│
├── app/
│   ├── main.py
│   ├── flow.py
│   ├── hyperparamsearch.py
│   │
│   ├── api/
│   │   └── api.py
│   │
│   ├── core/
│   │   ├── evaluate.py
│   │   ├── paths.py
│   │   ├── prepare_data.py
│   │   ├── predict.py
│   │   └── train.py
│   │
│   ├── data/
│   │   └── (data files not shown)
│   │
│   ├── sampler/
│   │   ├── doc_generator.py
│   │   └── make_synthetic_data.py
│   │
│   ├── static/
│   │   ├── index.html
│   │   └── style.css
│   │
│   ├── statistics/
│   │   └── result.py
│   │
│   └── notebooks/
│       ├── 01_data_exploration.ipynb
│       ├── 02_model_training.ipynb
│       ├── 03_evaluation.ipynb
│       ├── 04_data_extraction.ipynb
│       ├── colab.ipynb
│       └── kaggle.ipynb
│
└── README.md
```
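`app/hyperparamsearch.py` drives the hyperparameter search whose trials are stored in `optuna_studies.db`. As a rough illustration of the idea only (the search space and objective below are hypothetical, not the project's actual ones), each trial samples a configuration and the best-scoring one is kept:

```python
import random

# Hypothetical search space; the real one is defined in app/hyperparamsearch.py.
SPACE = {
    "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5],
    "batch_size": [8, 16, 32],
    "epochs": [2, 3, 4],
}

def objective(params: dict) -> float:
    # Stand-in for a real training run that would return validation accuracy.
    return 1.0 - abs(params["learning_rate"] - 2e-5) * 1e4 - params["batch_size"] / 100.0

def random_search(n_trials: int = 25, seed: int = 0):
    """Sample n_trials configurations and return (best_score, best_params)."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in SPACE.items()}
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params
```

The `optuna_studies.db` file suggests the real script runs an Optuna study, which layers smarter samplers and trial pruning on top of this naive random loop.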