Nearly 100 years after Joyce wrote his seminal polyglot work, are we any closer to a technological solution to breaking down the barriers of language? Not if the recent scuffle over Google Translate is any indication.
Joyce and the Limits of the Twentieth Century
In celebration of Bloomsday (June 16, the 107th anniversary of the fictional events that occur in his Ulysses), I'll reach beyond time, death, and the limits of my own or anyone else's knowledge to affirm that James Joyce would have adored Google Translate.
The Irish novelist was a translator first, and a student and teacher of modern languages. He composed Ulysses over eight years in exile, on the run from World War One, supporting himself by teaching English to the Italian, German, and French speakers of Trieste, Zürich, and Paris. Part of Ulysses's celebrated difficulty is its untranslated bits of these three languages, plus snatches of Latin, Greek, Hebrew, Spanish, Irish Gaelic, Norwegian, and more.
After finishing Ulysses, Joyce set to work on Finnegans Wake, an even more ambitious stew of musical puns in languages he barely knew, all incorporated into his own dream-logic idiolect of Irish English. Joyce listed 40 separate languages on the final page of Finnegans Wake's manuscript. To represent reality in all its fullness, it wasn't enough to mine English; he had to mine all human languages that ever were, are, or could be.
That's what Joyce did, starting a century ago, with an amazing education but using the most meager resources. Today, with so much more at our disposal, our instinct is not to push the boundaries of language, but to retreat.
The Death and Resurrection of the Google Translate API
Case in point: Last month, Google began the process of killing its Translate API. If it was a retreat, it's probably most fair to call it a tactical one. Nevertheless, this announcement shocked a lot of people, especially developers who'd baked Google Translate into their products:
Google "deprecates" its public APIs all the time, usually folding their functionality into new products. In May alone, the company announced 18 APIs would be shuttered. Yet neither the APIs for Wave or Virtual Keyboard imminent demise led to the public rending of garments. Translate's did. Seriously, read the comment thread. There was even a "Don't Shut Down Google Translate API" Facebook Group.
On June 3, Google backpedaled, realizing it could solve its growing developer-unrest problem by charging developers for something the company once gave away for free. (Would that it were ever thus!)
In the days since we announced the deprecation of the Translate API, we've seen the passion and interest expressed by so many of you, through comments here (believe me, we read every one of them) and elsewhere. I'm happy to share that we're working hard to address your concerns, and will be releasing an updated plan to offer a paid version of the Translate API. Please stay tuned; we'll post a full update as soon as possible.
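For developers who had built on the free service, the practical upshot of a paid API is mostly a billing-enabled key attached to the same kind of REST call. Here's a minimal sketch in Python, assuming the v2 endpoint and response shape as documented at the time; the API key is a placeholder, and the exact parameters may differ from what Google ultimately ships:

    # A hedged sketch of calling the Translate API v2 over REST.
    # The endpoint, query parameters, and JSON shape below are assumptions
    # based on the v2 documentation of the time; YOUR_API_KEY is a placeholder.
    import json
    import urllib.parse
    import urllib.request

    API_KEY = "YOUR_API_KEY"  # hypothetical billing-enabled key

    def translate(text, source="en", target="fr"):
        params = urllib.parse.urlencode({
            "key": API_KEY,
            "q": text,
            "source": source,
            "target": target,
        })
        url = "https://www.googleapis.com/language/translate/v2?" + params
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        # v2 nests results under data -> translations -> translatedText
        return data["data"]["translations"][0]["translatedText"]

    print(translate("Ireland is the old sow that eats her farrow."))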
What, exactly, was the problem? There is, as Google's official announcement notes, the issue of overuse and abuse (translating spam, for instance) of a free, common resource. But Asia Online's Dion Wiggins, writing in a guest post for the machine translation blog eMpTy Pages, zeroed in on a bigger and more unexpected problem with Google allowing anyone and everyone unfettered access to the API: It was being destroyed by its own use. To borrow Joyce's quip in A Portrait of the Artist as a Young Man: "Ireland is the old sow that eats her farrow." Instead of Ireland, read Google Translate. Says Wiggins:
An increasing amount of the website data that Google has been gathering has been translated from one language to another using Google's own Translate API. Often, this data has been published online with no human editing or quality checking, and is then represented as high-quality local language content....
It is not easy to determine if local language content has been translated by machine or by humans or perhaps whether it is in its original authored language. By crawling and processing local language web content that has been published without any human proofreading after being translated using the Google Translate API, Google is in reality "polluting its own drinking water."...
The increasing amount of "polluted drinking water" is becoming more statistically relevant. Over time, instead of improving each time more machine learning data is added, the opposite can occur. Errors in the original translation of web content can result in good statistical patterns becoming less relevant, and bad patterns becoming more statistically relevant. Poor translations are feeding back into the learning system, creating software that repeats previous mistakes and can even exaggerate them.
Wiggins has a stake in this, since he's in the machine-translation-plus-human-correction business, but he's absolutely right. James Fallows breaks it down in a blog post for The Atlantic:
That is the problem with a rapidly increasing volume of machine-translated material. These computerized translations are better than nothing, but at best they are pretty rough. Try it for yourself: Go to the People's Daily Chinese-language home site; plug any story's URL (for instance, this one) into the Google Translate site; and see how closely the result resembles real English. You will get the point of the story, barely. Moreover, since these side-by-side versions reflect the computerized-system's current level of skill, by definition they offer no opportunity for improvement.
That's the problem. The more of this auto-translated material floods onto the world's websites, the smaller the proportion of good translations the computers can learn from. In engineering terms, the signal-to-noise ratio is getting worse. It's getting worse faster in part because of the popularity of Google's Translate API, which allows spam-bloggers and SEO operations to slap up the auto-translated material in large quantities. This is the computer-world equivalent of sloppy overuse of antibiotics creating new strains of drug-resistant bacteria.
To do machine translation well, you don't just need "big" data, you need "good" data. The more natural-language translations you have to suck data from, the better machine translations can become. The more translations that are done badly by machines, then corrected by humans cursorily or poorly (if at all), the worse the raw data gets.
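The dynamic is simple enough to caricature in a few lines of code. What follows is a toy model, not a description of how any real statistical MT system trains: the corpus sizes and growth rates are invented for illustration, and "signal" and "noise" are reduced to counts of human versus unedited machine sentence pairs. It only shows why the ratio collapses when machine output gets crawled back in faster than careful human translation accumulates:

    # Toy illustration of the feedback loop: invented numbers, not real data.
    # "Signal" = human-translated sentence pairs; "noise" = unedited machine
    # output that gets published and crawled back into the training corpus.
    def simulate(rounds=8, human_pairs=1_000_000, machine_pairs=100_000):
        for r in range(1, rounds + 1):
            ratio = human_pairs / machine_pairs
            print(f"round {r}: signal-to-noise ~ {ratio:.1f}")
            # Each crawl picks up another large batch of auto-translated pages,
            # while the pool of careful human translations grows only slowly.
            machine_pairs += 500_000
            human_pairs += 20_000

    simulate()

On these made-up numbers, the ratio drops by an order of magnitude within a couple of crawls; the point is the direction, not the figures.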
For French or Spanish or Japanese, these badly generated texts are just noise around the margins of all the good translations we already have; a rounding error that spits out the occasional infelicitous phrase. The real trouble is with the languages that the fewest of us know how to read, write, or speak. The languages we're most likely to need machine translation for are exactly those with the fewest natural-language translations--the fewest crawlable websites, period. Translating documents from China is an annoyance, but Slovak, Bengali, or Malay pose tough problems for any algorithm.
By charging for access to the API, Google cuts off the spigot for the spammiest, skeeziest, and too-flat-broke-to-hand-check-the-text users and abusers. Now, Google Translate doesn't have "users"--it has partners, who also have a stake in ensuring that translation is done well and remains high-value. It's an extremely clever solution to the sow-eating-her-own-farrow problem. If only Google had thought of it first--unless this has all been a clever act of jiujitsu on a global scale.
Ray Kurzweil: The Turing Barrier By 2029?
Nataly Kelly, a researcher for Common Sense Advisory, recently interviewed noted futurist Ray Kurzweil about the future of machine translation technology, its limitations, and its stakes:
[Video: Nataly Kelly's interview with Ray Kurzweil (Vimeo 25021517)]
Kurzweil notably skirts the minor-language problem:
We'll get to a point where computers have human levels of language understanding. They'll be able to do the same level of language translation that the best human translators do. I think that's in 2029. That's what I would call a Turing Test level task. Alan Turing based his eponymous test on human language, and in order to pass the Turing Test, you have to know human language at human levels. And presumably, once a computer can do that for one language, it can do that for any language.
That's a big presumption. But at the same time, Kurzweil is exactly right that the route to genuine machine learning and artificial intelligence runs through language. That's what IBM's Watson project is all about--beating humans at Jeopardy is just a fun side effect. And Kurzweil is also right that (as Kelly summarizes) "very few people can actually master more than a handful of languages, and that ultimately, we will expand our intelligence through technologies that enable us to learn other languages more quickly." Machines will not and cannot replace us: The shortcomings of Google Translate show that our future can and must be a cyborg future, where humans and tools aid each other's learning, creating positive rather than negative feedback loops.
In other words, we should not need to be Joyce-like to be Joyceans, or to read his difficult books with pleasure. More importantly, we should not need to be Joyce-like to connect with all speakers and writers everywhere in the world, or to harness their news, their collective intelligence, and their wisdom congealed and concealed at present in their languages.
"Even the best translators can't fully translate literature," Kurzweil points out. "Some things just can't be expressed in another language. Each language has its own personality." In some cases, like Joyce's, the language has more than one.