Can automatic speech recognition already help verbatim reporters?
The role of “verbatim reporter” is rather specific to international organisations such as the UN, to law courts, and to a few other institutions. So far it has been performed either live, by people with specialised training as palantypists or stenographers, or after the event, by people who listen to a recording and type as quickly as they can, or who re-speak (dictate) what they hear into a speech recognition system which converts their voice into electronic text.
But what if a certain meeting is so important that an accurate verbatim record must be released immediately after it ends? Is there any way to cope with such a situation right now, while a growing number of palantypists complete their extensive training? I asked a friend if he knew of any environment which would take a live audio stream and convert it into text while also allowing humans to post-edit the text already recognised, even as new text keeps being added to the page. He remembered a set-up where journalists had created a bot which used the Google Voice API to ‘type’ automatically into a shared Google Doc, while the humans were editing the same doc. What a brilliant idea! Thank you, Laur!
Given that my bot-creation skills are on par with my flamenco skills (no, don’t get your hopes up…), I tried to mirror this set-up as follows:
- one browser window was playing a live stream from the UN (I wanted a stream with several voices to test how well Google Voice would deal with it);
- an external microphone picked up the streamed sound and fed it back into the Voice Typing functionality in Google Docs (in the end I used the mic on my Microsoft webcam because my Plantronics headset was not picking up the sound, and my attempts to use my Mac-Parallels-Windows set-up, as well as Soundflower, to capture the sound and use it directly in GDocs failed miserably);
- initially, I had a separate browser window (first on the Mac, then in Parallels) for the post-editing part of the job, but losing mouse focus switched off Voice Typing in GDocs, so I fired up a separate laptop for the post-editing.
What did I learn? Quite a bit; some things were common-sense, some were new to me – I hope they also help you somehow.
- first of all, changes of speaker seemed to confuse Google Voice, which either froze or took a fair while to come up with anything; it was therefore necessary to keep an eye on the microphone icon in GDocs and to switch it off and back on in order to get recognition started again.
- secondly, at first glance (please see the video to check for yourselves), the speed and manner of speaking influenced the quality of the automatic recognition much more than the gender of the speaker. For instance, I thought that the lady speaking from 2′45″ was much better recognised than the lady who was the Arabic interpreter, and the differences between their manners of speaking were clear. Their settings were different, too, and could not really be changed: unlike the delegate from the USG DGACM, the interpreter did not have the luxury of speaking at whatever pace she preferred, because she had to keep up with the Arabic delegates.
- thirdly, and most interestingly for me, GDocs did not allow me to edit the automatically-transcribed text immediately after an error was made. Instead, the recognised text was uneditable for a few seconds and occasionally would change, too. I assume this is because it was kept as context, and further live disambiguation and correction were performed on it even after the speaker had uttered those words. This live disambiguation and correction was quite fascinating to watch, but it also meant that my short-term memory could not hold all the errors that needed to be corrected, so in the end this set-up was not ideal from my point of view.
In conclusion, perhaps people with better-trained short-term memories could use this set-up live in a meeting. For the rest of us, myself included, a better arrangement would be to keep the live recognition in Google Docs but to listen on a separate audio channel with the sound stream delayed by 5-7 seconds. The delay would give Google Voice time to finish running its own disambiguation and correction algorithms, thus making it possible for the human to post-edit effectively. Ideally, the separate sound stream could also be paused, slowed down or sped up as needed.
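For the technically minded, here is a minimal sketch of what the delayed audio channel might look like in code. It is not something I built or tested in the set-up above; it only illustrates the idea of a fixed delay. The class name `DelayLine`, its parameters and the toy sample values are all hypothetical, and the sketch assumes the audio arrives as plain lists of numeric samples at a known sample rate.

```python
from collections import deque


class DelayLine:
    """Fixed audio delay: every sample comes out `delay_seconds` after it went in.

    Hypothetical helper for illustration only; it is not part of Google Docs
    or of any speech recognition API.
    """

    def __init__(self, delay_seconds: float, sample_rate: int) -> None:
        # Number of samples that corresponds to the requested delay.
        self.delay_samples = int(round(delay_seconds * sample_rate))
        # Pre-fill with silence so the listener hears nothing until
        # the delay has elapsed.
        self._buffer = deque([0] * self.delay_samples)

    def process(self, chunk):
        """Push a chunk of samples in and get the same number of delayed samples out."""
        delayed = []
        for sample in chunk:
            self._buffer.append(sample)
            delayed.append(self._buffer.popleft())
        return delayed


if __name__ == "__main__":
    # Toy numbers: 4 samples per second and a 1.5-second delay (6 samples).
    line = DelayLine(delay_seconds=1.5, sample_rate=4)
    print(line.process([1, 2, 3, 4, 5, 6, 7, 8]))  # [0, 0, 0, 0, 0, 0, 1, 2]
```

With real audio, the live stream would go to the recogniser unchanged while the post-editor listens to the output of `process`. At 16,000 samples per second, a 6-second delay means holding 96,000 samples in the buffer. Pausing, slowing down or speeding up the stream would need more than this simple buffer.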
Of course, such a set-up would be better, but still not perfect, because it would still depend on the quality of interpreting in multilingual meetings. UN précis writers often use their languages whenever they can, combining the interpretation provided with their own knowledge of the speaker’s language in order to fill in any gaps or inaccuracies in the interpretation.
However, rather than being disheartened by not having reached perfection yet, I think we should be suitably impressed with the major progress which automatic speech recognition technology has been making, and we should find creative ways to make the most of it in our daily work.
Here is the video (if you click on its title at the top of the embedded window, then you will be able to watch it directly in YouTube in high-def so that you can also see clearly what is happening on the screens):