Hunting for parallel data for Asian Languages? ParaCrawl just added 9 new bonus corpora. More info & paper by Philipp Koehn from @jhuclsp to be presented at WMT24 (#EMNLP2024): paracrawl.eu/moredata The 9 datasets, as Bonus Release: paracrawl.eu
Hi there, three new Bonus ParaCrawl languages have just been released: - English-Azerbaijani - English-Tajik - English-Armenian Go to the ParaCrawl website, scroll down to Bonus Languages (Low-Resource), and download your preferred version: paracrawl.eu
HPLT News and Tools!!! If you are interested in filtering your datasets for quality and using them to train MT and LLMs, you are interested in this thread 👇
[1/6] After about 14 months of hard work, together with multiple people we present you with OpusTrainer and OpusCleaner! OpusCleaner is your one-stop data fetching/preprocessing/cleaning pipeline, complete with GUI and designed to implicitly visualise your data before ...
Interested in Open and Community-Driven MT initiatives? CrowdMT is for you! 🎙️Invited speakers from Wikimedia Foundation and Apertium announced. 📜Accepted papers and abstracts announced. Time to register at events.tuni.fi/eamt23/registr… Details: hplt-project.org/events
#MT people: the submission deadline for the CrowdMT workshop, where you can present work on Open Source and Community-Driven MT, has been extended to 21st April 2023! Abstracts and papers wanted! You are also wanted in Tampere, for the whole #EAMT23 conference or at least for this workshop on the 15th of June!
A new ParaCrawl parallel corpus is available! 🌍 languages: Polish-Czech 🎒 size: 24 million sentences 🗒️ license: CC0 🎯 location: paracrawl.eu bonus section 🧐 more info: paracrawl.eu/moredata
Indeed, this is the first data release of the #MaCoCu effort. You will find both monolingual and bilingual (with English) corpora on ELRC-Share and CLARIN repositories and the website. Insights coming soon! Most of the code is also ready for you to try out!
Massive AND high-quality corpora for Bulgarian, Croatian, Slovene, Macedonian, Icelandic, Maltese and Turkish, collected by the #MaCoCu project, are now available in our repository! Check them out and spread the word: ➡️macocu.eu ➡️clarin.si/repository/xml…
Check out MultiParaCrawl 9, including 36 parallel corpora for Ukrainian and a total of 705 bitexts. Thanks to OPUS and @TiedemannJoerg for sharing this great resource! paracrawl.eu/news/item/18-m…
If you have an MT system, try bleualign (github.com/bitextor/bleua…) from @ParaCrawl . Scales to ParaCrawl-sized data.
We're back with more language resources: an English-Ukrainian parallel corpus with approx. 13M sentence pairs has been released. More info and downloads: paracrawl.eu/news/item/17-e… Please spread the word and use it!
Done! All #ParaCrawl v9 corpora are now available at paracrawl.eu, and some are also on Corset (corset.paracrawl.eu) for further inspection or filtering. A new Bitextor is also out: github.com/bitextor! Thanks to #CEF and the EU for co-funding this great project!
Summer was for work! Now #ParaCrawl v9 corpora are done and again bigger than the previous ones!🤩 Extrinsic evaluation through MT almost finished and, according to old BLEU and new COMET, the quality of the MT output improves! 🥳 We will share corpora and more results soon!🕑
Very clear TODO from #ParaCrawl's last stakeholder board meeting: we need better language identification, especially for closely-related languages and for under-resourced ones. Such a basic thing! Trying here to improve current results by mixing FastText and Hunspell, take a look👇
I've just given birth to FastSpell (in-house codenamed "El Engendrito"), a targeted language identifier, based on FastText and Hunspell: github.com/mbanon/fastspe… Give it a try and let me know your thoughts, it's open and free (as in freedom AND as in free beer) #NLP
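For readers wondering what "mixing" a classifier with a spell checker can look like, here is a rough sketch of the general idea, not FastSpell's actual code or API. It assumes a first-pass identifier (FastText in the posts above) has already produced a label, and it uses tiny made-up word sets in place of real Hunspell dictionaries. The names `refine_language`, `error_rate`, `TOY_VOCAB` and `SIMILAR` are all hypothetical.

```python
from typing import Dict, List, Set


def error_rate(tokens: List[str], vocabulary: Set[str]) -> float:
    """Fraction of tokens that the (toy) dictionary does not recognise."""
    if not tokens:
        return 1.0
    misses = sum(1 for tok in tokens if tok not in vocabulary)
    return misses / len(tokens)


def refine_language(
    text: str,
    predicted: str,
    similar: Dict[str, Set[str]],
    vocabularies: Dict[str, Set[str]],
) -> str:
    """Re-check a first-pass language label against closely related languages.

    `predicted` stands in for the label from a general-purpose identifier.
    `vocabularies` stands in for per-language spell-check dictionaries.
    """
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    # The predicted label goes first; its close relatives follow.
    candidates = [predicted] + sorted(similar.get(predicted, set()) - {predicted})
    candidates = [c for c in candidates if c in vocabularies]
    if not candidates:
        return predicted
    # min() keeps the first minimum, so when the predicted label has a
    # vocabulary it wins ties against its relatives.
    return min(candidates, key=lambda lang: error_rate(tokens, vocabularies[lang]))


# Toy data, for illustration only: a handful of Spanish and Galician words.
TOY_VOCAB = {
    "es": {"el", "gato", "come", "pescado", "la", "casa"},
    "gl": {"o", "gato", "come", "peixe", "a", "casa"},
}
SIMILAR = {"es": {"gl"}, "gl": {"es"}}

corrected = refine_language("O gato come peixe", "es", SIMILAR, TOY_VOCAB)
confirmed = refine_language("El gato come pescado", "es", SIMILAR, TOY_VOCAB)
```

In this toy run the first sentence is relabelled from Spanish to Galician because every word is found in the Galician word set, while the second keeps its Spanish label. A real setup would swap the word sets for actual spell-check dictionaries and the hard-coded label for a classifier's output.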
A new version of ParaCrawl is being cooked! We are aiming at not only more bilingual but also monolingual data. And we are applying neural cleaning this time with bicleaner-ai (github.com/bitextor/bicle…). Stay tuned!🧐
Milestone reached! We just published Corset, a data selection portal to get relevant data from massive amounts of parallel data such as ParaCrawl corpora. Thanks #CEFTelecom! Users welcome! Test it here: corset.paracrawl.eu Code & docs here: github.com/paracrawl/cors…
We almost forgot to tell you, ParaCrawl 8 is out! First highlight: wow the size of it! Check for yourself at paracrawl.eu/releases #ParaCrawl #crawling #parallelcorpus #CEFTelecom #MT
Bitextor 8 is out! Many improvements and features that will make it into the next @ParaCrawl data release, including ones from Snakemake 6 by @johanneskoester. Check all the changes: github.com/bitextor/bitex…
Our corpora were evaluated as part of the great effort at "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets" (🧐 arxiv.org/pdf/2103.12028…). We will keep working to deliver high-quality corpora out of web-crawled content. ParaCrawl v8 about to come!
Key table: C = correct, X = incorrect translation, WL = wrong language, NL = not language