ParaCrawl

@ParaCrawl

Joined December 2018

Hunting for parallel data for Asian Languages? ParaCrawl just added 9 new bonus corpora. More info & paper by Philipp Koehn from @jhuclsp to be presented at WMT24 (#EMNLP2024): paracrawl.eu/moredata The 9 datasets, as Bonus Release: paracrawl.eu

Hi there, three new Bonus ParaCrawl languages have just been released:
- English-Azerbaijani
- English-Tajik
- English-Armenian
Go to the ParaCrawl website, scroll down to Bonus Languages (Low-Resource), and download your preferred version: paracrawl.eu


ParaCrawl Reposted

HPLT News and Tools!!! If you are interested in filtering your datasets for quality and using them to train MT and LLMs, you are interested in this thread 👇

[1/6] After about 14 months of hard work, together with multiple people we present you with OpusTrainer and OpusCleaner! OpusCleaner is your one-stop data fetching/preprocessing/cleaning pipeline, complete with GUI and designed to implicitly visualise your data before ...



ParaCrawl Reposted

Interested in Open and Community-Driven MT initiatives? CrowdMT is for you! 🎙️Invited speakers from Wikimedia Foundation and Apertium announced. 📜Accepted papers and abstracts announced. Time to register at events.tuni.fi/eamt23/registr… Details: hplt-project.org/events


ParaCrawl Reposted

#MT people: the submission date for the CrowdMT workshop, where you can present work on Open Source and Community-Driven MT, has been extended to 21st April 2023! Abstracts and papers wanted! You are also wanted in Tampere, for the whole #EAMT23 conference or at least for this workshop on the 15th of June!

A new ParaCrawl parallel corpus is available! 🌍 languages: Polish-Czech 🎒 size: 24 million sentences 🗒️ license: CC0 🎯 location: paracrawl.eu bonus section 🧐 more info: paracrawl.eu/moredata


ParaCrawl Reposted

Indeed, this is the first data release of the #MaCoCu effort. You will find both monolingual and bilingual (with English) corpora on the ELRC-Share and CLARIN repositories and on the website. Insights coming soon! Most of the code is also ready for you to try out!

Massive AND high-quality corpora for Bulgarian, Croatian, Slovene, Macedonian, Icelandic, Maltese and Turkish, collected by the #MaCoCu project, are now available in our repository! Check them out and share the word: ➡️macocu.eu ➡️clarin.si/repository/xml…


Check out MultiParaCrawl 9, including 36 parallel corpora for Ukrainian and a total of 705 bitexts. Thanks to OPUS and @TiedemannJoerg for sharing this great resource! paracrawl.eu/news/item/18-m…


ParaCrawl Reposted

If you have an MT system, try bleualign (github.com/bitextor/bleua…) from @ParaCrawl . Scales to ParaCrawl-sized data.


We're back with more language resources: an English-Ukrainian parallel corpus with approx. 13M sentence pairs has been released. More info and downloads: paracrawl.eu/news/item/17-e… Please spread the word and use it!


Done! All #ParaCrawl v9 corpora are now available at paracrawl.eu. Some are also on Corset (corset.paracrawl.eu), where you can inspect or filter them further, and a new Bitextor is out too: github.com/bitextor! Thanks to #CEF and the EU for co-funding this great project!


Summer was for work! Now the #ParaCrawl v9 corpora are done and again bigger than the previous ones!🤩 Extrinsic evaluation through MT is almost finished and, according to both old BLEU and new COMET, the quality of the MT output improves! 🥳 We will share corpora and more results soon!🕑


Very clear TODO from #ParaCrawl's last stakeholder board meeting: we need better language identification, especially for closely related languages and for under-resourced ones. Such a basic thing! We are trying to improve current results here by mixing FastText and Hunspell, take a look👇

I've just given birth to FastSpell (in-house codenamed "El Engendrito"), a targeted language identifier, based on FastText and Hunspell: github.com/mbanon/fastspe… Give it a try and let me know your thoughts, it's open and free (as in freedom AND as in free beer) #NLP



A new version of ParaCrawl is being cooked! We are aiming for not only more bilingual data but also monolingual data. And this time we are applying neural cleaning with bicleaner-ai (github.com/bitextor/bicle…). Stay tuned!🧐


Milestone reached! We just published Corset, a data selection portal to get relevant data from massive amounts of parallel data such as ParaCrawl corpora. Thanks #CEFTelecom! Users welcome! Test it here: corset.paracrawl.eu Code & docs here: github.com/paracrawl/cors…


We almost forgot to tell you, ParaCrawl 8 is out! First highlight: wow, the size of it! Check for yourself at paracrawl.eu/releases #ParaCrawl #crawling #parallelcorpus #CEFTelecom #MT

ParaCrawl Reposted

Bitextor 8 is out! Many improvements and features that will make it into the next @ParaCrawl data release, including ones from Snakemake 6 by @johanneskoester. Check all the changes: github.com/bitextor/bitex…


Our corpora were evaluated as part of the great effort behind "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets" (🧐 arxiv.org/pdf/2103.12028…). We will keep working to deliver high-quality corpora out of web-crawled content. ParaCrawl v8 is about to come!

Key table: C = correct, X = incorrect translation, WL = wrong language, NL = not language