Published December 20, 2023 | Version 1.0.0
Dataset Open

Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification

  • 1. Czech Technical University in Prague
  • 2. École Polytechnique Fédérale de Lausanne

Description

Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

Key Features:

  • LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.
  • Improved multi-label classification: Fine-tuning Homepage2Vec on these datasets improved its macro F1 score from 38% to 43% when evaluated on a human-labeled dataset, which suggests that the LLM annotations capture a broader range of website topics.
  • Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.
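
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of macro-averaged F1 for multi-label predictions. It is not the project's evaluation code: the function name `macro_f1`, the toy label matrices, and the convention of scoring a label with no positives as 0 are all assumptions made for illustration.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Macro-averaged F1 over labels, given 0/1 indicator rows (one row per website)."""
    n_labels = len(y_true[0])
    f1_scores: List[float] = []
    for j in range(n_labels):
        # Count true positives, false positives, and false negatives for label j.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label with no true or predicted positives scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    # Macro averaging weights every label equally, regardless of how often it occurs.
    return sum(f1_scores) / n_labels


# Toy data (hypothetical): 3 websites, 3 topic labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

Because every label contributes equally to the average, rarely occurring topics influence the score as much as common ones, which is why macro F1 is a common choice for judging coverage across topics.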

Dataset Composition:

  • curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and one-shot
  • curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot
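
This record does not describe the column layout of the CSV files, so a first step when working with either subset is to inspect the header and row count. The sketch below does that with Python's standard `csv` module. The helper name `summarize_csv` and the column names in the in-memory stand-in are hypothetical and do not reflect the real schema.

```python
import csv
import io
from typing import List, TextIO, Tuple


def summarize_csv(handle: TextIO) -> Tuple[List[str], int]:
    """Return the header row and the number of data rows of an open CSV file."""
    reader = csv.reader(handle)
    header = next(reader, [])          # empty list if the file has no rows
    n_rows = sum(1 for _ in reader)    # count the remaining (data) rows
    return header, n_rows


# In-memory stand-in with hypothetical column names, NOT the dataset's real schema.
sample = io.StringIO("col_a,col_b\n1,x\n2,y\n")
header, n_rows = summarize_csv(sample)
print(header, n_rows)
```

With a downloaded file, the same helper could be called as `with open("curlie-gpt3.5-10k.csv", newline="", encoding="utf-8") as f: summarize_csv(f)`, assuming the file is UTF-8 encoded and has a header row.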

Intended Use:

  • Fine-tuning and advancing Homepage2Vec or similar website classification models
  • Research on LLM-generated datasets for text classification tasks
  • Exploration of multilingual website classification

Additional Information:

  • Project and report repository: https://github.com/CS-433/ml-project-2-mlp

Acknowledgments:

This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.

Files

curlie-gpt3.5-10k.csv

Files (1.0 MB)

Name Size
md5:40fa7530cf7ca386423200d06ea54d23 501.9 kB
md5:dd10e6b2a1c9b42fb8a76cf092f7a795 501.9 kB

Additional details

Related works

Is supplement to
Project deliverable: https://github.com/CS-433/ml-project-2-mlp (URL)