HTML vs. Flash for Arabic text and video

We recently realized that the rendering of Arabic script overlaying the videos on Mightyverse is incorrect. For example, the phrase:
كيف أوصل هناك؟ currently appears on the website differently in the search results (which are rendered in HTML) vs. the video (rendered in Flash) as seen in the screenshots below:

This was brought to my attention by Samar Moushabeck, an Arabic teacher at the Deerfield Academy in Massachusetts. Since I don’t read or speak Arabic, I had to zoom in to understand the difference.

Correct:

Incorrect:

This is clearly unacceptable. It would be (sort of) like if we wrote the English as:
“Ho w w oul d y ou re com mend I ge t t here”

The letters are correct, but the sentence reads like we are illiterate (which, of course, we are in Arabic, but we’re hoping to improve with the help of Samar and language experts like her).

There are a few possible solutions to this.

1) Move the text from video overlay to below or above the video. This is a technically simple solution. However, the thinking behind the current design is that it is helpful to be able to read the text while focusing on the movement of the speaker’s lips. The farther the text is from the video, the harder it would seem to be for visual learners (which applies to most of us humans).

2) Render the text in HTML and overlay Flash. I want to run screaming from this solution, since I have some experience trying to intermingle HTML and Flash and found it to be time consuming engineering to get it to work correctly across browsers and, even with a lot of work, had to compromise visual design and/or user experience in some cases.

3) Render the text as a graphic, dynamically load into Flash, then overlay within Flash. This seems an awful lot of work to support my theory of more effective learning and a preference for the visual design.

4) HTML5 video, reported to support overlay for captioning seamlessly. While I’ve read that YouTube will continue to use Flash instead of HTML5 video, Mightyverse has short-format video, so the constraints for YouTube may not apply. My favorite quote from a Mark Pilgrim HTML5 video article is “support for the <video> element is still evolving, which is a polite way of saying it doesn’t work yet.” It appears that to support HTML5 video across browsers, we would want to support the new WebM standard (with VP8 video and Vorbis audio) along with the H.264 that we already encode for iPhone and Android.

In any case, we have a few things in the queue (like re-releasing the iPhone app for iOS4) before we can address this on the site, which gives us a little time to explore options. Please leave a comment if you have experience with HTML5 video and/or Arabic text in HTML or Flash and are willing to share some insights.

Na'vi language code

Today, I needed to look up a language code for Na’vi, the language natively spoken on the fictional moon Pandora, created by Paul Frommer for the movie Avatar.  When adding a new language to Mightyverse, we record the ISO codes and use them in the URL of a search and someday we’ll cross-link with amazing language resources like Ethnologue. Usually these codes are easy to find.  Not today.

I worry when the most authoritative reference I can find is Wikipedia, which reports that the ISO 639-2 code is ‘art’.

Language identification moves at infrequent intervals.  It is not like being assigned a port number. I read further that the most recent ISO 639-3 change request list was approved on January 20, 2010 and I didn’t see Na’vi there.  It is likely queued up with the 2010 change requests.

I read that ISO uses the prefix ‘art’ for artificial languages. (The art code is on the 639-2 list and therefore also part of 639-3.) So, the full RFC3066 (really RFC4646) code would be ‘art-nav’, which is also the Wiktionary code. So, until we get the official word in January 2011 or whenever RFC4646 codes get ratified, here’s what I’m going with:

  • ISO 639-1: n/a
  • ISO 639-2: art
  • ISO 639-3: art
  • RFC3066: art-nav

Ruby language

I attended RubyConf in San Francisco last month, which is an annual conference about the Ruby programming language. Yukihiro Matsumoto (“Matz”), the creator of the Ruby language, gave the keynote about the 0.8 true language: a language can’t be good for everyone and every purpose, but we can strive to make it good for 80% of what is needed in a programming language. He talked about domain-specific languages (DSLs) and, while he was talking about programming DSLs, it struck me that we, as humans, commonly invent domain-specific languages that transcend our social cultures, instead encoding a culture that crosses national and linguistic boundaries.

Yuki Sonoda organized a series of talks called East meets West with presentations by Japanese Rubyists. In her talk description, she makes the case for our bridging the Japanese-English language gap between Ruby programmers:

Ruby needs your help. There are many issues. But there are too few developers. 92% of Ruby’s development in this 3 years were done by only 10 developers. 73% were done by only 5 developers. Ruby seems to be a cathedral project rather than a bazaar project.
There must be many reasons for this situation. I think a large reason is the language barrier between English-speaking Ruby world and Japanese-speaking Ruby world. So I will talk about how to solve this problem.
All of the top 10 committers speak Japanese and live in Japan. So they discuss in Japanese. Some of the most important decisions are done in these discussions. But this means that most of Rubyists, who do not speak Japanese, can not understand the discussions. For non-Japanese speakers, there has been no way to understand the most important issues in the development of Ruby.
I want to share the current issues of Ruby. I also want to request help from Rubyists who don’t speak Japanese.

There were two “lightning” tech talks given by Japanese Rubyists, each of whom said that it was their first English-language presentation. I started to think about what kind of vocabulary I would need to give a tech talk in Japanese or even just to understand one.

I approached Matz after his keynote to ask if he would record some phrases about the Ruby language in Japanese. He agreed, and I set out to capture a dozen or so phrases that would never appear in a phrasebook and might be interesting to say to a Japanese Rubyist at a conference.

I approached random people in the hallways and during lunch and asked questions like: if you were to use the word “closure” in a sentence, what would it be? Jim Weirich came up with my favorite: “Closures may be used to implement objects, and objects may be used to implement closures.” Sarah Mei wondered how to read code aloud in Japanese: when would you use a Japanese word, and when is the code pronounced phonetically? She guessed correctly that you would say “object.method” phonetically as obujekuto dotto mesoddo. I was intrigued that what Ruby calls the “shovel” operator (<<) is phonetically derived from the bitshift operator, which has the same symbolic representation in C and Java, and is thus translated as bitto shifuto.
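For readers who don’t speak Ruby, here is a minimal sketch of what Jim’s sentence means in code. The names make_counter and Adder are made up for illustration and have nothing to do with the recorded phrases; the last two lines show the two readings of the << symbol.

```ruby
# A closure used as an object: the lambdas share the captured local `count`,
# which plays the role of private instance state.
def make_counter
  count = 0
  {
    increment: -> { count += 1 },
    value:     -> { count }
  }
end

# An object used as a closure: it keeps its "captured" value in @n and
# responds to call, so it can stand in wherever a block is expected.
class Adder
  def initialize(n)
    @n = n
  end

  def call(x)
    x + @n
  end

  # Lets the object be passed with & like a block.
  def to_proc
    method(:call).to_proc
  end
end

counter = make_counter
counter[:increment].call
counter[:increment].call
counter[:value].call        # => 2

add_five = Adder.new(5)
add_five.call(10)           # => 15
[1, 2, 3].map(&add_five)    # => [6, 7, 8]

# The same << symbol is the "shovel" (append) on an Array
# and a bit shift on an Integer.
[1, 2] << 3                 # => [1, 2, 3]
1 << 3                      # => 8
```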

You can see all of the phrases that Matz recorded on the home page. One of these days, I will make it so there is a direct link. If you program in Ruby and speak English or Japanese, I’d be interested in knowing if there are domain-specific phrases you would like to be able to say. I wonder whether, if I learned enough “code” words along with some basic Japanese, I could actually understand a RubyKaigi talk even before learning how to converse in Japanese.

I wish I remembered the names of everyone who suggested phrases. If you read this, please comment so that I can say thanks! And many thanks to the Japanese engineer who translated the phrases for me and to her colleagues who helped! In my zealous pursuit of my goal, I neglected to keep track of everyone who helped me along the way. Thank you all.

the deterioration of language

“The English of today is not what it used to be, but then again, it never was,” writes Guy Deutscher in The Unfolding of Language. In Chapter 3, he includes a fascinating chronology of a few hundred years of so-called decline:

  • In comparing the English language to that of two generations ago, a reviewer in the Times Literary Supplement reminisced that then “a mistake was a mistake not a sign of free expression.”
  • 1946 George Orwell wrote “the English language is in a bad way” compared to the language of previous generations
  • 1848 linguist August Schleicher dismissed the English of his day as “ground-down,” noting “how rapidly the language of a nation…can sink” and that it was likely to further “sink into mono-syllabicity”
  • 1780 Thomas Sheridan reported a recent decline of the English language “during the reign of Queen Anne [1702-14]… it is probable that English was… spoken in its highest state of perfection”
  • 1712 Jonathan Swift wrote “our Language is extremely imperfect… its daily Improvements are by no means in proportion to its daily Corruptions”

English speakers are not unusual in feeling this way about their language.

Take modern German, for instance, which by common consent is a mere shadow of its former glory two centuries ago, in the Golden Age of Goethe and Schiller. That may well be, but during Goethe’s lifetime those in the know were of a rather different opinion. In 1819, the fairy-tale compiler and linguist Jacob Grimm compared the language of his day to that of previous centuries, and lamented that “six hundred years ago, every common peasant knew – that is to say practised daily – perfections and niceties of the German language of which the best language-teachers nowadays can no longer even dream.”

The thesis of the book is that it is precisely in this destruction of language that the mystery of language creation lies: “all languages change, all the time – the only static languages are dead ones.” (p.55)

We don’t have to wait generations to hear how languages change. It can be witnessed by traveling to different parts of a country or, more dramatically, by listening for differences in dialects of a single language across multiple countries.

I highly recommend the book, which is a little academic at times but full of fascinating historical and modern references that illustrate how languages evolve.

database column limits and utf8 strings

Wolf and I fixed a bug today where we needed to truncate a string of text that we use internally to annotate the database. Now, the annotation is just for our reference, so we limit it to 50 bytes (that’s bytes, mind you, not characters), even though the PostgreSQL database will tell you it is “character varying(50)”.

We use Unicode internally, specifically UTF-8, which is a fabulous and widely used standard. However, it does have a challenging property: a character may be 1-4 bytes long. We were frustrated with what we thought ought to be a simple problem of truncating a string so that it would be no more than 50 bytes. The tricky part, of course, was that the 50th byte might fall in the middle of a character.

To solve the problem, we added a method to the Rails Multibyte::Chars class, which is part of ActiveSupport.  For those who speak Ruby, Rails and RSpec, below is the solution we came up with (first the spec, then the implementation).

The approach was borrowed from the private translate_offset method. The key part is that you can discover whether you’ve chopped a string in the middle of a character by calling chunk.unpack('U*'): the unpack method on String in Ruby raises an ArgumentError when the “U” directive asks it to decode an incomplete UTF-8 sequence as Unicode code points.
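To see that check in isolation, here is a small sketch for a current Ruby, where byteslice cuts on byte offsets (on Ruby 1.8 a plain slice did the same). The helper name complete_utf8? is made up for this illustration and is not part of ActiveSupport.

```ruby
# Hypothetical helper, for illustration only.
# True when the bytes decode as complete UTF-8 characters.
def complete_utf8?(bytes)
  bytes.unpack("U*")
  true
rescue ArgumentError
  # unpack("U*") raises when a multi-byte sequence is cut short.
  false
end

whole   = "こ"                    # one character, three bytes in UTF-8
partial = whole.byteslice(0, 2)   # cut in the middle of that character

whole.unpack("U*")                # => [12371], the code point U+3053
complete_utf8?(whole)             # => true
complete_utf8?(partial)           # => false
```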

describe "Chars#limit_bytes" do
  it 'should return "" on ""' do
    "".mb_chars.limit_bytes(0).should == ""
    "".mb_chars.limit_bytes(1).should == ""
  end

  it 'should truncate single byte character strings as expected' do
    a = "abcd"
    a.mb_chars.limit_bytes(0).should == ''
    a.mb_chars.limit_bytes(1).should == 'a'
    a.mb_chars.limit_bytes(50).should == 'abcd'
  end

  it 'should truncate multi-byte character strings at character boundaries' do
    # Each hiragana character below occupies 3 bytes in UTF-8.
    k = "こんにちは"
    k.mb_chars.limit_bytes(0).should == ''
    k.mb_chars.limit_bytes(1).should == ''
    k.mb_chars.limit_bytes(3).should == 'こ'
    k.mb_chars.limit_bytes(4).should == 'こ'
    k.mb_chars.limit_bytes(5).should == 'こ'
    k.mb_chars.limit_bytes(6).should == 'こん'
    k.mb_chars.limit_bytes(7).should == 'こん'
    k.mb_chars.limit_bytes(50).should == 'こんにちは'
  end
end

module ActiveSupport #:nodoc:
  module Multibyte #:nodoc:
    class Chars
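      # Returns the longest prefix that fits in +limit+ bytes without
      # splitting a multi-byte character. Relies on String#slice being
      # byte-indexed, as it is in Ruby 1.8.
      def limit_bytes(limit)
        # Back off one byte at a time until the cut lands on a boundary.
        limit -= 1 until valid_boundary?(limit)
        @wrapped_string.slice(0, limit).mb_chars
      end

      # True when the first +length+ bytes decode as complete characters.
      def valid_boundary?(length)
        @wrapped_string.slice(0, length).unpack('U*')
        true
      rescue ArgumentError
        # unpack('U*') raises on a truncated trailing sequence.
        false
      end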
    end
  end
end
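On Ruby 1.9 and later, String#slice counts characters rather than bytes, so the patch above would need a byte-oriented call. Here is a minimal standalone sketch of the same back-off idea, assuming a Ruby with String#byteslice (1.9.3 or newer). The top-level limit_bytes function here is hypothetical and separate from the Chars method above.

```ruby
# Hypothetical standalone version for newer Rubies; not part of Rails.
# Returns the longest prefix of str that fits in limit bytes without
# splitting a multi-byte character.
def limit_bytes(str, limit)
  return "" if limit <= 0

  # byteslice cuts on byte offsets, so the tail may be a partial character.
  chunk = str.byteslice(0, limit)

  # Back off one byte at a time until the prefix is valid in its encoding.
  chunk = chunk.byteslice(0, chunk.bytesize - 1) until chunk.valid_encoding?
  chunk
end

limit_bytes("abcd", 1)        # => "a"
limit_bytes("こんにちは", 7)   # => "こん"
```

For valid UTF-8 input the loop runs at most three times, since a character is at most four bytes long.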

We’ve written this up as a Lighthouse ticket in case the Rails folk want to add it to the platform or if other people developing multi-lingual database apps run into the same challenge and look there.

The above code can be used under the MIT License:

The MIT License (MIT)

Copyright (c) 2009 Mightyverse, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

how can I say…

Mightyverse is not a translation web site or a language learning site… sure you can use Mightyverse for those things, but they are not the primary raison d’être, they don’t capture the gestalt. The purpose of Mightyverse is to help each of us convey meaning, to help us communicate across language and culture — not by ignoring differences, but by embracing them.

Many phrases cannot be translated directly. You cannot understand them unless you also understand some of the context and culture where that stream of sound evolved. “10-4 good buddy,” “Where’s the beef?” or “LOL” have complex, whimsical meanings that tie to culture. Each language has such phrases and learning them allows us to connect in surprising ways.

Even when listening to someone who appears to speak the same language, one can discover that direct translations don’t work. I speak English, but am also fluent in geek. I can hear someone say that their mongrels are all tied up and understand that they are concerned about web site server performance and not the mistreatment of animals. Whatever language you learned as a child, it is likely that you also speak a domain-specific language in some aspect of your life.

Humans are driven to invent specific words to resolve problems that we face together, whether it be something as mundane as server performance or as dramatic as global warming.  David Harrison speaks about the value of languages that are currently threatened by extinction. He describes the Yupik of Alaska who have the ability to name, and thereby precisely distinguish, 99 different formations of sea ice. The naming of a thing sharpens our perception of it. Harrison describes language as a technology, and the Yupik language may be one of the most sensitive instruments to detect the signs of climate change and global warming. Perhaps that vocabulary is worth learning.

So when you look at Mightyverse I hope you won’t only see a fun website with entertaining quotes or a tool for learning basic conversational phrases in a foreign language.  I hope you will catch a glimpse of what will happen when it is populated with a wide swath of human language, including dialects and vocabulary too new for the dictionary, languages threatened by extinction, as well as the essential expressions of commonly spoken languages that are just as commonly misunderstood.  We have a million ideas for how we can get from here to there, but we’re taking it one feature at a time, and recording phrases at every opportunity.  We don’t have the community features that we had hoped for at launch, but we do have a forum, twitter and this blog.  We’re interested in hearing what you think.

Helllllloooooo……. !!

So what’s this Mightyverse thing all about?

Well, we travel a lot. And… we like people… a lot.

A favorite peak...

In a normal year, I might spend as much as 10 full 24-hour days or more in the air. When I land, I often find myself in the midst of a new culture for a few days, on assignment as a filmmaker. What that translates to – or more accurately doesn’t translate to – is the need to be able to navigate and connect with people in a language that I’m not native in.

One might argue that English is becoming the language of business everywhere. I kinda prefer to try to connect with people on their terms, or more accurately, in their own language.

So what is someone from the Midwest who has taken all of about two semesters of French and a smattering of Japanese classes to do?

Surveying the options I’ve tried over the past several years:

– cram for days before each trip and try to resemble a sponge…
– hire translators wherever I go (how do you say, “that’s not in the budget” in that language?)
– be a maniac with a phrasebook (note to self: improve charades skills……)
– make friends with people around the world and have them be my personal translators (Gomen nasai (I’m sorry!), Shimizu-san!)

We had an idea: what if there was something that would allow us to make someone laugh in their native language? Or to express gratitude in a way that was genuine and authentic to the area we were in? What if we had a global utility that was always there whenever we needed it – online or on our phones – that could answer the question, “How do I say ________ in any language?”

It’s a big dream. One that we can’t hold by ourselves. We invite you to be part of something truly global and truly human – to help us create a community and utility that doesn’t exist yet: to be a part of Mightyverse. At the center of this is a bridge that we’ve been seeking between language and cultures – an enormous collection of native speaker video phrases that are cross-translated into other languages.

We’re excited you’re here.

For now, we invite you to browse through an early smattering of phrases (more than 20,000) that we’ve pulled together in over 20 languages. You’re bound to find holes – that’s part of the journey. Soon, you’ll be able to help us fill those in. If you have any thoughts or suggestions, we’d love to hear from you. And if you’re in the Bay Area and would like to get more involved, drop us a note – we’d love your help!

Welcome to Mightyverse! There’s much more to come…

Glen Janssens

– co-founder, phrasefarmer