kata ta biblia

a blog exploring Christian origins, biblical studies, social/cultural history, method, education and the journey through academia


Imagining a Google (Ancient) Translate

Continuing my theme of imagining how interesting technologies might be useful to biblical scholars and ancient historians: I have been hearing quite a bit lately about how innovative Google’s online translator is. I wonder what would happen if its resources were directed towards ancient languages and texts.

First, let me point you to what I’ve been hearing. Two of the podcasts I listen to regularly recently discussed Google’s translator: On The Media’s story “Bridging the Online Language Barrier” and a story from The World in Words podcast. The former includes an interesting discussion of the history of automatic translation that’s worth sharing here:

PROFESSOR DAVID BELLOS: After the war, just as computers were being invented, the bright idea came that maybe you could use these wonderful new machines to do code cracking and that maybe languages could be looked at as if they were in code, as if the real meaning of the thing was actually the English and the Russian was just, you know, one of these complicated ways of masking what the real meaning was.

MARK PHILLIPS: First you teach the computer vocabulary, apple equals yablaka, and then you teach it all the rules and grammar, do it for every language and, boom, you’ve got a Star Trek-style universal translator.

PROFESSOR DAVID BELLOS: It didn’t produce the results they wanted.

MARK PHILLIPS: David Bellos:

PROFESSOR DAVID BELLOS: The reason it didn’t was that it was based on not a very sophisticated idea of what language actually is. What I am saying isn’t in code for something else, it is what I’m saying. So there are really very strict limits on what you can do with machine translation, based on the idea of code. By the early 1960s, they’d pretty much given up.

MARK PHILLIPS: This rules-based machine translation was a failure, but there was still another method called statistical translation. Think of it as a behavioral approach. The underlying grammar and syntax don’t matter, but repeated exposure to language, as it’s actually used, does. It’s like how babies learn. You don’t diagram sentences for them. They just hear you say stuff and copy you.

The catch is to teach the machine, you have to load huge amounts of text into the computer. Back in the 1960s, they didn’t have enough data to make a statistical machine translation work. Now we do, says Michael Galvez, a project manager at Google Translate.

MICHAEL GALVEZ: What we do is we actually use hundreds of billions of words that Google infrastructure has access to.

MARK PHILLIPS: It’s a two-step process. First, Google’s computers pull it all in, recognize the language and create what they call a language model. There’s one for each of the 52 languages currently on the service. As they get more data for a particular language, the computers get a better feel for it. It knows from a statistical standpoint that in English, the sentence “The boy are sad” is very rare, just as a five-year-old knows that sounds weird.

But the language model only teaches the computer how to speak each language by itself. The next step is to learn how to go between multiple languages. Google’s Michael Galvez says, for that:

MICHAEL GALVEZ: We also build what’s called a translation model, using previous human translation that we have access to, documents from the EU, the United Nations, very high-quality translation corpora.

MARK PHILLIPS: Everything spoken or written at the United Nations is automatically translated into six languages.

[U.N. HUBBUB/MANY LANGUAGES AT ONCE]

Google uses U.N. and European Union transcripts, along with tons of other professional high-quality translations, to build this translation model, which allows their computers to take a sentence and predict what it would be in another language. Michael Galvez:

MICHAEL GALVEZ: We take the language model and the translation model and we put these two models together, and we basically create the machine translation system out of this.

MARK PHILLIPS: It produces startlingly accurate results. Plug in an article from a Spanish-language newspaper and it reads like an English article that just needs a trip to the copy editor.
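To make the two-model idea from the transcript concrete, here is a toy sketch in Python. It is not Google's system: the five-sentence English corpus, the Spanish-to-English word table and its probabilities, and the function names are all invented for illustration, and it assumes a word-for-word translation with no reordering. The language model is a simple bigram counter with add-one smoothing, and the "translation model" is just a lookup table. The translator scores every candidate by adding the two log-probabilities together.

```python
import math
from collections import Counter
from itertools import product

# Tiny invented English corpus; a real system would use billions of words.
ENGLISH = [
    "the boy is sad",
    "the girl is happy",
    "the boys are sad",
    "the girls are happy",
    "the boy is happy",
]

def train_language_model(sentences):
    """Count words and word pairs (bigrams) in monolingual text."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"<s>", "</s>"}
    return contexts, bigrams, len(vocab)

def lm_logprob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )

# Hypothetical translation model: P(English word | Spanish word).
# In practice these numbers would be estimated from human translations.
WORD_TABLE = {
    "el": {"the": 1.0},
    "niño": {"boy": 0.8, "child": 0.2},
    "está": {"is": 0.5, "are": 0.5},
    "triste": {"sad": 0.9, "sorrowful": 0.1},
}

def translate(source, table, model):
    """Try every word-for-word candidate and keep the best combined score."""
    options = [table[word].items() for word in source.split()]
    best_text, best_score = None, -math.inf
    for combo in product(*options):
        text = " ".join(word for word, _ in combo)
        tm_score = sum(math.log(p) for _, p in combo)
        score = tm_score + lm_logprob(text, model)
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score

model = train_language_model(ENGLISH)

# The language model alone already finds the ungrammatical sentence less likely.
print(lm_logprob("the boy is sad", model) > lm_logprob("the boy are sad", model))

# The word table cannot choose between "is" and "are"; the language model can.
print(translate("el niño está triste", WORD_TABLE, model)[0])
```

Notice that the word table is deliberately undecided about "está". It is the language model, which has only ever seen English, that tips the balance towards "the boy is sad". That division of labour is the point of combining the two models.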

So, what if we took Google Translate and plugged in all the hundreds of ancient texts in ancient Greek, Latin, Hebrew, Aramaic, etc., along with the best translations of those texts available? Imagine what it might be able to do for previously untranslated ancient texts! Of course, as the reporter notes, it would still “need a trip to the copy editor” or the scholar of ancient texts, in this case. One feature that would be nice would be for Google to offer a few options, so that you could choose the translation that seems to make the most sense in this particular context. This could be amazing for epigraphy: as new inscriptions are found, they can be run through the ancient translator and then just fixed up a bit.
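The "few options" feature I have in mind could be as simple as showing the top-scoring candidates instead of only the winner. Here is a minimal sketch in Python, assuming the translator has already produced a score for each candidate rendering. The candidate renderings of λόγος and their scores are made up for illustration, and `top_options` is a hypothetical helper, not part of any real translation service.

```python
import heapq
import math

def top_options(scored, k=3):
    """Return the k best (translation, probability) pairs, best first.

    `scored` maps each candidate translation to a log-score. Scores are
    turned into probabilities over all candidates so a reader can see how
    confident the system is in each option.
    """
    best = heapq.nlargest(k, scored.items(), key=lambda item: item[1])
    highest = max(scored.values())
    total = sum(math.exp(score - highest) for score in scored.values())
    return [(text, math.exp(score - highest) / total) for text, score in best]

# Invented log-scores for renderings of the Greek word λόγος in some context.
candidates = {
    "word": -1.2,
    "account": -1.9,
    "reason": -2.3,
    "speech": -3.0,
}

for text, probability in top_options(candidates, k=3):
    print(f"{text}: {probability:.2f}")
```

The scholar, not the machine, then picks whichever rendering fits the context of the inscription or text at hand.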

Scholarship has become highly specialized. Imagine the possibilities that might be available if these sorts of resources could be made available to people in different fields. Maybe New Testament scholars would finally start paying attention to inscriptions. I know that a lot of scholars would cry foul about this sort of thing and how you still need human translators. Well, of course we do, but why not enhance accessibility for a greater number of scholars? Or even for our students for that matter? As one person interviewed for the story noted: “The solution isn’t machine translation just getting better or human translators just getting more pervasive. The solution is some combination of the two.”

Now all we need is for Google to get on those ancient texts with their translator. How about it, Google? Do you want to take over the ancient world as well?
