So you want to build your own MT engine? (part 1)

The 11th of November, 2016, was a super fun day for me: I left Leeds nice and early to visit an area of London I’d never seen before and join STP’s (Sandberg Translation Partners) Anna Norek, CIoL’s (Chartered Institute of Linguists) Nigel Goffe and just under 40 CIoL members at Machine Translation: What does it mean to the working translator & how can we use it to our advantage?

My contribution was a practical demonstration of what it’s like to set up a Machine Translation (MT) engine. I thought of all the things I wished someone had told me when I first started working with MT, so I went through the whole process, from gathering all the necessary resources to using a few systems, comparing and evaluating their performance, and seeing what else we could do both with the resources and the systems themselves.

The two hours I had turned out to be really short and, although I think pretty much everyone who came got excited about at least one thing, the participants also asked me for the link to my Google Doc with the MT story I’d told them. Given that I’d volunteered to do this presentation, I thought others in the professional linguists’ community might also find the following pointers useful, and perhaps be willing to get in touch and let me know about other aspects of MT configuration and use which I should have covered.

Chapter 1: Why would you want an MT system in the first place?

From our discussions on the day, it was clear that some of the freelancers were quite frustrated by some Language Service Providers’ (LSPs) misuse of MT for the sole purpose of driving down rates. Consequently, in at least a few cases, configuring one’s own MT engine was seen as a way to get back at some LSPs and the ever-decreasing rates they offer.

This was quite interesting: freelancers preferring to use their own MT engine, instead of feeding back to the LSP on how the LSP’s own engine could be improved, signals a clearly unpleasant and unproductive relationship. It also makes the whole process rather more complicated, as it is surely going to invalidate pretty much all of the LSP’s tracking information (such as edit-distance measures or the time the freelancer spent working on individual segments). It’s a bit of a guerrilla move, in fact; it says: “Dear LSPs, please stop messing professional linguists about. They are creative people and they have ways to mess with your data and statistics. Stop passing off Post-Editing Machine Translation (PEMT) jobs as human revision ones, and stop using your end-clients’ supposed ignorance or poverty as the reason you’re doing it.”

As an aside, I also find this whole tense conversation a bit unhelpful, because it signals to me a certain reluctance on each side to fully understand the other’s situation. As an LSP, thanks to the vast number of jobs handled every year, you have the opportunity to acquire extremely large TMs (something which freelancers all wish they had but, by not wanting to share resources, never will, I’m afraid). What would you do if you had this treasure chest of knowledge at your disposal? Of course you’d want to make the most of it! What options do you have? No more than three, and unfortunately the only one that tends to make the vocal freelancers happy is also the least realistic:

  1. send your full TMs to your freelancers so that they can use the Assemble from Portions / Fragment Assembly functionalities in their CAT tools in addition to the fuzzy matches. I am pretty sure everyone agrees that this is not a realistic expectation.
  2. customise and deploy your own online translation environment (as quite a few large LSPs did around 2010, and as the EU is doing from January 2017 with its Cat4Trad tool). Back in 2010, freelancers protested against and boycotted this approach because they could no longer use and add to their own TMs – the whole translation process happened online, after all – so they felt under constant surveillance (which they were), underpaid and unable to improve their own linguistic resources (quite rightly, too, if you consider this simple question: if you want to sell your translation business and retire to a sunny place, what do you actually have to sell if you’ve only ever worked on your clients’ online platforms?).
  3. use your TMs to develop customised MT engines and ask your freelancers to post-edit the results if they judge that the output is actually editable.

Blaming LSPs for wanting to use their TMs to build customised MT engines seems a bit disingenuous to me, given that a fair number of freelancers would do exactly the same thing if only they had the linguistic resources. It also misses the point that, if any freelancer is unhappy with post-editing or with working for LSPs altogether, the industry has not run out of direct clients – they just need to be won over. But I digress…

In any case, before even looking at what an MT engine needs, we started talking about why the freelancers were considering customising their own engines in the first place.

  • Some said consistency, so we talked about the topic of, and recent developments in, sub-segment matching, which Deja Vu X has offered for a very long time through its “portions” features, which has been available in memoQ for a few versions too, and which SDL Studio introduced in its 2017 version. If you’re not familiar with sub-segment matching, think of it this way: traditional Translation Memory (TM) systems take entire source segments (most often sentences, but not only – the segmentation rules can be changed to suit the user’s needs) and look for similar source segments in the Translation Memory. If they find anything above a pre-defined threshold (around the 70–75% mark in most Computer-Assisted Translation (CAT) tools), they will show the user the target segment stored in the TM, as well as the difference between the new source segment and the source segment stored in the TM. If the similarity score is below the pre-defined threshold, then the only other source of useful suggestions has traditionally been the termbase, but that only applies to individual terms, not the whole segment.
    With sub-segment matching, though, the source segment is broken up into smaller parts and the CAT tool uses statistical algorithms on the Translation Memory to look for those smaller parts and then guess their translations. If the source sub-segments do not exist in the TM at all, or exist in a statistically insignificant proportion, they are copied as they are into the generated target segment – exactly as you see in some machine-translated segments where certain source terms are not in the vocabulary and are therefore copied unchanged into the proposed target segment. What’s the point of this long discussion? All in all, it’s good news: if you want consistency, work with a CAT tool which offers sub-segment matches, and have large TMs and termbases and a powerful computer, you may not need to build your own MT engine at all, because you’ll get the sub-segment suggestions and you’ll be free to arrange them in whichever way your target segment requires (there is a small toy sketch of the idea after this list).
  • Some said speed, so we talked about and did a little demo of two Automatic Speech Recognition tools: Dragon NaturallySpeaking and Google Voice Dictation. Although its desktop version only supports a handful of languages, Dragon remains the preferred dictation tool of many professional subtitlers and translators alike, offering excellent on-the-fly correction functionalities and working really well out of the box (English is my second language and, still, whenever I need to write something longer than a few lines I simply reach for my microphone; guess what I am using to write this blog post). Having said that, make sure that your CAT tool supports all the features of Dragon (historically, for instance, Dragon’s dynamic correction functionality was not available when translating in SDL Studio). Google Voice Dictation also performed well in our little test, its obvious advantage being the much larger number of languages supported. The workshop participants also liked its availability not only in Google Docs but also in cloud-based tools such as Matecat.
    Google Voice typing in Google Docs

    In addition to these tools, we also talked about the speech recognition functionalities built into everyone’s operating system. Personally, one of the advantages of having a MacBook on which Windows (version 7, in my case) and all the Windows-based CAT tools are just guests, thanks to Parallels, is that I can use Dragon for English at the same time as the Romanian and French recognition systems built into macOS, so I do not have to keep changing keyboard layouts and then remembering where to find the accented characters.

  • When talking about the benefits and pitfalls of building and using one’s own MT engine, we also spoke of research indicating that MT post-editing results in target texts with the lowest degree of lexical variety. Slightly better in this respect are target texts produced by editing TM fuzzy matches. The best in terms of lexical variety, however, are still translations done from scratch. It is therefore important to keep in mind that if the translation brief requires the translation to be creative and lexically varied, using an MT engine is likely to speed up the first draft but slow you down in the long run, because you will then have to go back to the beginning and think of ways to make the repetitive and often literal MT output match the brief.
  • I also asked the rather obvious question: how much data did the individual translators in the room actually have with which to train an MT engine from scratch? When you are talking about millions of translated segments, millions more of authentic target-language sentences, and high-quality glossaries, not many freelancers have enough data to configure their own MT engines. Unless they cooperate and share resources, perhaps a more sensible approach would be to ensure the quality of their translation memories and termbases, become familiar with using speech recognition in their work, and work with a CAT tool which supports sub-segment matching. It is therefore no surprise that LSPs and large end-clients are way ahead of freelance translators in the use of MT: they have much bigger quantities of data – both bilingual and monolingual – available to them, and they are also much more willing to share data with each other. Unless freelancers also get over their fears of sharing data and empowering other freelancers who are also their direct competitors, their efforts in the MT arena might not prove all that beneficial…
  • Finally, I was curious to see how many participants were getting genuinely excited at the prospect of hacking scripts, using Linux and going back to the command line (all of which would be necessary to some extent in order to configure Moses with all its features, following the very comprehensive user manual available on the Moses website). I think it is fair to say that the majority of participants were not sold on my enthusiastic predictions of how good learning to code would make them all feel…
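
To make the sub-segment idea from the consistency discussion a bit more tangible, here is a toy sketch in Python. It is emphatically not how Deja Vu, memoQ or SDL Studio implement their fragment features – just the bare concept of indexing TM n-grams and looking up fragments of a new segment – and the tiny French–English “TM” is invented for illustration.

```python
# Toy illustration of the idea behind sub-segment matching: break a new
# source segment into n-grams and look each one up in a tiny "TM".
# A sketch of the concept only, not any CAT tool's actual implementation.

from collections import defaultdict

# A miniature, hypothetical French-English translation memory.
tm = [
    ("le contrat de travail est signé", "the employment contract is signed"),
    ("le contrat de vente", "the sales contract"),
    ("la période d'essai est terminée", "the probation period is over"),
]

def ngrams(words, n):
    """All contiguous word n-grams of a token list."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Index every 2- to 4-gram of the source side back to the TM entries containing it.
index = defaultdict(list)
for src, tgt in tm:
    tokens = src.split()
    for n in (2, 3, 4):
        for gram in ngrams(tokens, n):
            index[gram].append((src, tgt))

# A new segment with no full-segment fuzzy match above the usual threshold.
new_segment = "le contrat de travail est terminé"
tokens = new_segment.split()

for n in (4, 3, 2):                       # prefer longer fragments first
    for gram in ngrams(tokens, n):
        for src, tgt in index.get(gram, []):
            print(f'fragment "{gram}" found in TM entry: "{src}" -> "{tgt}"')
```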

Chapter 2: What do you need in order to build your own MT engine?

Given that the discussions we had had so far had not put anyone off the main idea of the workshop, we started looking at what is needed for the training of an MT engine and how we could go about sourcing it if we did not have it already.

1. First of all, any statistical machine translation (SMT) or neural machine translation (NMT) system needs to learn how to translate. For this it needs a lot of pairs of source and target segments (some MT providers suggest over 1 million) for the engine to analyse, break down into smaller parts and guess their translations in the target language. There they are again – sub-segments are never far away – except that in SMT lingo they are called n-grams, where n stands for a number: we generally talk about 1-grams, 2-grams, 3-grams and 4-grams to refer to sub-segments made up of 1, 2, 3 or 4 words. The statistical identification of the source-language n-grams and the statistical guessing of their translations into the target language are used to generate a translation model.
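
If you want to see that statistical “guessing” in (very simplified) action, the sketch below counts which target words co-occur with which source words across a handful of invented aligned segments and picks the strongest association for each. Real SMT training (word alignment and phrase extraction, as done by Moses and friends) is far more sophisticated; this only illustrates the intuition.

```python
# A crude illustration of the intuition behind a translation model:
# count which target words co-occur with which source words across
# aligned segment pairs, and score the association so the most likely
# translation of each source word bubbles up. Not how real SMT works.

from collections import Counter, defaultdict

# Hypothetical aligned French-English segment pairs.
parallel = [
    ("le chat dort", "the cat sleeps"),
    ("le chien dort", "the dog sleeps"),
    ("le chat mange", "the cat eats"),
]

cooc = defaultdict(Counter)   # cooc[source_word][target_word] = co-occurrence count
src_freq = Counter()
tgt_freq = Counter()

for src, tgt in parallel:
    s_words, t_words = src.split(), tgt.split()
    src_freq.update(s_words)
    tgt_freq.update(t_words)
    for s in s_words:
        for t in t_words:
            cooc[s][t] += 1

# Dice-style association score, so very frequent words such as articles
# do not win every pairing just by showing up everywhere.
for s in src_freq:
    best = max(cooc[s], key=lambda t: 2 * cooc[s][t] / (src_freq[s] + tgt_freq[t]))
    print(f"{s} -> {best}")
```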

The question now is: where can anyone get over 1 million pairs of source and target segments if they do not share resources? One possible source to start you off in your experiments comparing various MT providers could be the EU DGT’s repository of translation memories, which dates back to 2004 (provided, of course, that you work in the domains the DGT TMs cover).

Generating bilingual TMX files in the languages in which you work takes a tiny bit of command-line fun, but the great news is that the DGT has included all the necessary instructions on its website mentioned above. Add a Terminal, Java, a projector and a great photographer (and even better translator + lecturer) like @tradutoto and you get the photo below.

Working with the DGT TMs (courtesy of @tradutoto)
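
If you fancy a slightly different flavour of command-line fun, here is a minimal Python sketch of pulling one language pair out of a TMX file, as an alternative to the DGT’s own Java-based extractor. The file name and language codes are placeholders – check the codes used in the files you actually download.

```python
# Minimal sketch: extract one language pair from a TMX file.
# File name and language codes below are placeholders / assumptions.

import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_pairs(path, src_lang="FR", tgt_lang="EN"):
    """Yield (source, target) segment pairs from a TMX file."""
    tree = ET.parse(path)
    for tu in tree.getroot().iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").upper()
            seg = tuv.find("seg")
            if seg is not None:
                text = "".join(seg.itertext()).strip()
                if text:
                    segs[lang] = text
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]

if __name__ == "__main__":
    for fr, en in tmx_pairs("dgt_volume.tmx"):   # placeholder file name
        print(f"{fr}\t{en}")
```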

2. Secondly, any SMT/NMT engine also needs to learn how the target language is used. For this it needs a separate, large monolingual corpus (a corpus is a collection of texts) of authentic material in the target domain and in the target language (not the source language!). I hear you ask: “How large is ‘large’?” Tilde, for instance, one of the current leading MT providers, advises at least 5 million sentences.
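
To see why the monolingual corpus matters, here is a toy illustration of what a language model learns from it: which target-language word sequences look plausible. Production systems use dedicated toolkits such as KenLM or SRILM; the handful of sentences below is invented and the scoring is deliberately crude.

```python
# The intuition behind the monolingual side of an MT engine: a language
# model learns which target-language word sequences are plausible.
# Toy bigram counts over a few hypothetical sentences, nothing more.

from collections import Counter

corpus = [
    "the contract is signed by both parties",
    "the contract is terminated",
    "both parties sign the contract",
]

bigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    bigrams.update(zip(words, words[1:]))

def score(sentence):
    """Very rough fluency score: how often the sentence's bigrams were seen."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(bigrams[b] for b in zip(words, words[1:]))

print(score("the contract is signed"))   # word order seen in the corpus
print(score("contract the signed is"))   # same words, implausible order
```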

If you don’t have 5 million sentences relaxing in a corner of your hard drive, all is not lost: you could use tools such as BootCaT on your own machine or the Sketch Engine in the cloud to gather such a corpus. BootCaT uses the Microsoft Bing Search API, which you need to sign up for and then link to BootCaT, while the SketchEngine works with the Google Search API. Both need a list of key terms and phrases (which both call “seeds”) for the language and domain in which you need to build the corpus. When using the same 10 seeds with the same settings, for both English (my target language for the CIoL demo) and French (my source language), the SketchEngine (SE) gathered over 60% more words than BootCaT.
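
For the curious, the gist of the “seeds to queries” step can be sketched in a few lines of Python. The seed terms below are invented, and BootCaT’s own tuple selection is more configurable than this, so treat it purely as an illustration of the idea.

```python
# BootCaT-style "tuple" generation: combine domain seed terms into small
# groups that become web-search queries. Seeds are hypothetical examples.

import itertools
import random

seeds = [
    "employment contract", "probation period", "notice period",
    "collective agreement", "severance pay", "working time",
    "annual leave", "gross salary", "trade union", "dismissal",
]

tuple_size = 3      # seeds per query
n_queries = 10      # how many queries to send to the search API

all_tuples = list(itertools.combinations(seeds, tuple_size))
queries = random.sample(all_tuples, n_queries)

for q in queries:
    # Quoting each phrase keeps multi-word seeds together in the query.
    print(" ".join(f'"{term}"' for term in q))
```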

Corpora in the SketchEngine

Of course, when it comes to corpora it’s not only about quantity. Quality is also extremely important, and both BootCaT and the SketchEngine present the user with a list of websites which match the combinations of seed words (called ‘tuples’) and from which content will be extracted. Should some of those web addresses ring any alarm bells, they can be deselected, thus improving the quality of the texts making up the corpus.

Why did I gather a source-language corpus when the MT engine only needs a target-language one? Because I am a great believer in creating resources for multiple uses (after all, if we’re taking the time to learn all these new tools, we may as well use them more than once). Having the corpora available to query in the SketchEngine (to look for authentic contexts in which certain terms appear, to compare apparent synonyms, to generate a thesaurus for a term I want to understand better, etc.) is extremely valuable when translating or acquiring knowledge in a new domain. Moreover, memoQ is one CAT tool which also allows the use of corpora, and I wanted to test how it handled a combined 9.5 million words of English and French as two separate monolingual texts in one LiveDocs corpus linked to my French-English translation project. I had to re-import both texts in order to make sure UTF-8 was the chosen encoding, but otherwise it did not complain at all.
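
And if you would like to poke at such a corpus locally, outside the SketchEngine, a bare-bones keyword-in-context (KWIC) search is only a few lines of Python. The file name and search term are placeholders; the sketch assumes a plain-text corpus with one sentence or paragraph per line.

```python
# Bare-bones keyword-in-context (KWIC) search over a local plain-text corpus.
# File name and search term are placeholders.

import re

def kwic(path, term, width=40):
    """Print each occurrence of `term` with `width` characters of context."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    with open(path, encoding="utf-8") as corpus:
        for line in corpus:
            for match in pattern.finditer(line):
                start, end = match.start(), match.end()
                left = line[max(0, start - width):start].rjust(width)
                right = line[end:end + width].rstrip()
                print(f"{left} [{match.group()}] {right}")

kwic("english_corpus.txt", "probation period")   # placeholder file and term
```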

I guess that’s a fair bit to think about and start preparing (and I also hope I haven’t offended anyone with some of the comments above; the conversation needs to be had and sustainable ways of supporting both freelancers and LSPs need to be found).

In the next part I’ll share some of the suggestions and ideas we had in the CIoL session regarding acquiring terminology, customising a few online MT engines, comparing and evaluating their output, deploying them, and integrating them into your workflows. Looking forward to hearing from you on Twitter!