So you want to build your own MT engine? (part 2)
It’s been six months since part 1 and more and more folks have started mentioning the new kid on the block – Neural MT. The good news is that you need the same resources in order to train an NMT system as you do for a statistical one (SMT), so these two posts should still be relevant. Better still (or not, depending on how you look at it), to date NMT is a genuine black box, so if you were daunted by the many parameters which you can play with when setting up an instance of the Moses statistical MT, you’ll be happy to hear that you can’t do much in the case of NMT. You just need patience, lots of linguistic resources, and processing power. Lots of it!
To resume our discussion of building your own SMT, though, after acquiring translation memories and large monolingual corpora using the pointers in Part 1 of this guide, you may also want to include a bilingual glossary in order to ensure that your SMT handles specialised terminology well. For that, you can start with your own termbase and enhance it with relevant entries from available resources (note: always make sure to read the copyright details). Such resources can include the EU terminology repository (IATE & its relatives), Microsoft’s specialised Language Portal, and many more – as an aside, if terminology is your thing, keep an eye on the social media activity of the European Parliament’s Terminology Coordination team: they really, *really* love their terminology!
Depending on which SMT system you are configuring, you may need to save your glossary in tab-delimited, UTF-8 plain text format, or .csv, or leave it as Excel or TBX (knowing how to dance with all these formats will still be good for your brain, anyway).
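If juggling those formats sounds tedious, the conversions are usually one-liners. As a quick illustration (file names here are made up for the example, not tied to any particular SMT system), here is a minimal Python sketch that reads a comma-separated glossary and writes it back out as tab-delimited, UTF-8 plain text:

```python
import csv

def csv_to_tab_delimited(src_path, dst_path):
    """Convert a comma-separated glossary to tab-delimited UTF-8 text."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, delimiter="\t")
        for row in reader:
            writer.writerow(row)

# Hypothetical file names, purely for illustration:
# csv_to_tab_delimited("glossary.csv", "glossary.txt")
```

Excel and TBX need a proper library or a tool export, but CSV-to-tab round-trips like this one cover most of the day-to-day cases.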
Now that you have: a large, in-domain translation memory; a large monolingual corpus; and also a sizeable, relevant glossary, you are all set for configuring your MT engine. It's true, it does help to also have a set of bilingual tuning data and a set of bilingual testing data (particularly when you want to train several systems with exactly the same resources, and then test them to see which ones give you better results). However, most online MT engines will happily generate a tuning and a testing dataset for you, and you can download, check and re-use them in all the other systems if you don't fancy creating your own.
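If you do want to carve out your own tuning and testing sets, the idea is simple: shuffle the aligned sentence pairs with a fixed seed (so the split is reproducible across all the systems you train) and slice off a small held-out portion for each set. A minimal sketch, with sizes and names of my own invention rather than from any particular toolkit:

```python
import random

def split_corpus(pairs, tune_size, test_size, seed=42):
    """Split aligned (source, target) sentence pairs into train/tune/test.

    Shuffling with a fixed seed keeps the split reproducible, so several
    systems can be trained and compared on exactly the same data.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    tune = pairs[:tune_size]
    test = pairs[tune_size:tune_size + test_size]
    train = pairs[tune_size + test_size:]
    return train, tune, test
```

A couple of thousand pairs each for tuning and testing is a common ballpark; the key point is that neither set overlaps with the training data, otherwise your scores will flatter the engine.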
So which SMT system should you configure? I’d say try a few, because you’ll learn lots, occasionally feel like a computer prodigy, and you will also get to know their principles better.
MT systems you can install on your own computers:
- If you’re the brave kind, you can jump right into configuring your own Moses system. The user manual will help, but you will still need some solid IT skills to start with.
- If you’re not that brave, but you still want to gather street cred quickly, have a go at setting up Moses’ more human incarnation: Moses for Mere Mortals (MMM). Be aware that you will need access to an Ubuntu machine (virtual or real, it’s up to you), because MMM was created for Ubuntu 12 or Ubuntu 14. Also, back in November, I noticed a couple of things about setting up MMM which I would have appreciated reading about somewhere:
- make sure you create and then work within the recommended directory: /Desktop/Machine-Translation, otherwise some of the MMM scripts may fall over;
- the version numbers used in the commands are 1 ahead of what was available at the time, so adjust them accordingly (let me know if this has been rectified);
- you can also have a go at configuring Apertium, a competent MT engine which may surprise you very pleasantly for some language pairs.
MT systems which you can use/configure online:
- again, have a look at Apertium: it’s lesser-known, but really worth looking into, especially if you are interested in MT for related languages
- Microsoft Translator Hub – the attraction here is that you can essentially fine-tune and improve the already-capable Microsoft MT engines by uploading your in-domain data. If you do that, though, your MT engine will no longer be private, so you may want to stick to using only your own data and seeing how good your personal Microsoft MT engine is
- If you love analytics, visual dashboards, and the like, you will want to give KantanMT a whirl, too. Among the very useful things you will be able to do after (painlessly) configuring your MT engine will be to create a list of collaborators and assign post-editing tasks to them in order to see how good your MT engine output really is (of course, all MT engines will give you BLEU scores, but BLEU never tells the full story. BTW, if you want to read the full paper on BLEU, here it is. What you will also find useful is KantanMT’s quick explanation of what BLEU scores mean for their system.)
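To demystify BLEU a little: the score combines modified n-gram precision (usually up to 4-grams) with a brevity penalty that punishes candidates shorter than the reference. Here is a bare-bones, single-reference, sentence-level sketch of the standard formula; real toolkits add smoothing and aggregate counts over the whole test corpus, so treat this as intuition-building rather than a production scorer:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        # Modified precision: clip each n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalise candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

Even this toy version shows why BLEU never tells the full story: a perfectly fluent paraphrase that shares few n-grams with the reference scores terribly, which is exactly why human post-editing checks remain so valuable.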
- Another very user-friendly online customised MT engine option is Tilde. The company’s specialists have already built a number of engines you can test using openly-available resources, so you may find that this saves you part of the job of uploading TMs, glossaries or corpora in order to build your customised system.
Whatever you do, whichever system you choose, keeping an eye on the regular TAUS reports, webinars and YouTube videos will help you stay up to date with recent advances in MT – and, contrary to what some folks like to say, things don’t really advance that quickly. What’s more, you will also pretty quickly become wise to the amount of hot air produced by some researchers and companies around automatic MT Quality Estimation. You will start appreciating that quality source texts (correct grammar and lexis, lack of ambiguity, short sentences) matter much more than you may have thought. You will also come to appreciate that not many folks really know when to use MT, because solid methods for evaluating source content to determine its suitability for MT still need more practical results – so we’re still looking at a combination of “we’ve always used MT for these domains” + “human post-editors don’t complain that much about editing MT output in these domains”. Things will not stay like this for long, though: developers such as KantanMT have been trialling MT confidence scores similar to fuzzy match scores for a while now, so while the academic world may continue to debate approaches for years to come, the industry will offer practical implementations which will become increasingly accurate through use and feedback.
All in all, I see MT as a useful developing technology in itself, but also as a very welcome reason for linguists to come out of our comfort zone and re-evaluate our role, services, and value in general. I hope that the few pointers above have shown that there is no real “us” versus “them” side to the MT story, but rather an opportunity for all of us to learn something new about our fellow linguists, technologists, and even language services buyers, as well as push our brains a little and solve a few science problems.