FineWeb-C: A Community-Built Dataset For Improving Language Models In ALL Languages
FineWeb2 significantly advances multilingual pretraining datasets, covering over 1,000 languages with high-quality data. The dataset comprises approximately 8 terabytes of compressed text and contains nearly 3 trillion words, sourced from 96 CommonCrawl snapshots spanning 2013 to 2024. Processed with the datatrove library, FineWeb2 outperforms established datasets such as CC-100, mC4, CulturaX, and HPLT across nine diverse languages. The ablation and evaluation setup is available in the project's GitHub repository.

Hugging Face community researchers introduced FineWeb-C, a collaborative, community-driven project that builds on FineWeb2 to create high-quality educational content annotations across hundreds of languages. The project lets community members rate the educational value of web content and flag problematic elements through the Argilla platform. Languages that reach 1,000 annotations qualify for inclusion in the dataset. This annotation process serves a dual purpose: identifying high-quality educational content and improving LLM development across all languages. So far, 318 Hugging Face community members have submitted 32,863 annotations, contributing to the development of high-quality LLMs for underrepresented languages.

FineWeb-Edu is a dataset built on the original FineWeb dataset that employs an educational quality classifier trained on Llama-3-70B-Instruct annotations to identify and retain the most educational content. This approach has proven successful, outperforming FineWeb on popular benchmarks while reducing the data volume needed to train effective LLMs. The project aims to extend FineWeb-Edu's capabilities to all world languages by collecting community annotations to train language-specific educational quality classifiers.

The project prioritizes human-generated annotations over LLM-based ones, particularly for low-resource languages where LLM performance cannot be reliably validated. This community-driven approach parallels Wikipedia's collaborative model, emphasizing open access and the democratization of AI technology. Contributors join a broader movement to break language barriers in AI development, as commercial companies typically focus on profitable languages. The dataset's open nature enables anyone to build AI systems tailored to specific community needs while facilitating learning about effective approaches across different languages.

FineWeb-C provides multiple annotations per page for some languages, allowing flexible calculation of annotator agreement. Quality control measures include plans to increase annotation overlap in heavily annotated languages. The data contains a boolean column, problematic_content_label_present, that identifies pages flagged for problematic content, often the result of incorrect language detection. Users can filter content based on either individual problematic labels or on annotator agreement through the problematic_content_label_agreement column. The dataset is released under the ODC-By v1.0 license and is subject to CommonCrawl's Terms of Use.

In conclusion, FineWeb2's community-driven extension, FineWeb-C, has gathered 32,863 annotations from 318 contributors, focusing on educational content labeling. Through FineWeb-Edu's specialized educational content classifier, the approach outperforms existing datasets while requiring less training data. Unlike commercial approaches, this open-source initiative prioritizes human annotations over LLM-based ones, particularly for low-resource languages.
The dataset features robust quality control measures, including multiple annotation layers and problematic content filtering, while operating under the ODC-By v1.0 license.
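For readers who want to work with the annotations directly, here is a minimal sketch of loading one per-language configuration of FineWeb-C with the Hugging Face datasets library and filtering on the problematic-content columns described above. The repository id "data-is-better-together/fineweb-c", the configuration name "dan_Latn", and the assumption that problematic_content_label_agreement is a fraction between 0 and 1 are assumptions for illustration; only the column names come from the project description.

```python
# Sketch: load a FineWeb-C language configuration and drop flagged pages.
# Repo id and config name are assumed; column names come from the article above.
from datasets import load_dataset

ds = load_dataset("data-is-better-together/fineweb-c", "dan_Latn", split="train")

# Option 1: drop any page that received at least one problematic-content flag.
strict = ds.filter(lambda row: not row["problematic_content_label_present"])

# Option 2: keep a page unless a majority of annotators flagged it, treating
# problematic_content_label_agreement as a 0-1 fraction (assumed).
lenient = ds.filter(lambda row: row["problematic_content_label_agreement"] < 0.5)

print(len(ds), len(strict), len(lenient))
```

The two filters trade recall for precision: the strict variant discards a page on any single flag, while the agreement-based variant only discards pages where annotators mostly agree the content is problematic.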
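As for the classifier-based filtering idea behind FineWeb-Edu, which FineWeb-C aims to extend to more languages, the sketch below scores documents with a sequence-classification model and keeps only high-scoring ones. The checkpoint name "HuggingFaceFW/fineweb-edu-classifier", the single regression output, and the score threshold of 3 are assumptions made for illustration, not details confirmed in this article.

```python
# Sketch: score documents for educational quality and keep the high scorers,
# in the spirit of FineWeb-Edu. Checkpoint name, score scale, and threshold
# are assumptions, not confirmed by the article.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "HuggingFaceFW/fineweb-edu-classifier"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

def educational_score(text: str) -> float:
    """Return a scalar educational-quality score for one document."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()  # assumed single regression output

docs = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "CLICK HERE for the best deals of the season!!!",
]
kept = [d for d in docs if educational_score(d) >= 3.0]  # illustrative threshold
print(kept)
```

Community annotations collected through FineWeb-C would serve as training and evaluation data for language-specific versions of such a classifier, replacing the English-centric LLM-generated labels with human judgments.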