Clarify benchmark task definition and labeling rubric in README

Several reviewers assumed truth_text represents full-page content. In reality, contractors labeled snippets from rendered pages: ~100 words of important_content and ~10 words of not_important_content. The README should state this explicitly so that drift and scoring results are not misinterpreted.
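
To illustrate why the distinction matters, here is a minimal sketch of how a snippet-based label could be checked against a scraper's output. It is an assumption for discussion, not the benchmark's actual scoring code: the record layout, the sample text, and the helper names `normalize` and `snippet_found` are hypothetical, and only the field names important_content and not_important_content come from this issue.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def snippet_found(snippet: str, scraped: str) -> bool:
    """Return True if the labeled snippet appears in the scraped text."""
    return normalize(snippet) in normalize(scraped)


# Hypothetical record: short labeled snippets, NOT the full page content.
record = {
    "important_content": "Quarterly revenue grew 12 percent year over year",
    "not_important_content": "Subscribe to our newsletter",
}

# Hypothetical scraper output for the same page.
scraped_page = """
Acme Corp Results

Quarterly revenue   grew 12 percent
year over year, driven by new customers.
"""

# A good scrape keeps the important snippet and drops the unimportant one.
has_signal = snippet_found(record["important_content"], scraped_page)
has_noise = snippet_found(record["not_important_content"], scraped_page)
print(has_signal, has_noise)
```

Under this reading, a scrape is judged on whether the short labeled snippets are present or absent, so comparing the label against the full page text would overstate drift.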

TODO:
- Update the datasets README (https://github.com/firecrawl/scrape-evals/blob/main/datasets/README.md) to state the task definition and labeling rubric explicitly