HTML vs. Flash for Arabic text and video

We recently realized that the rendering of Arabic script overlaying the videos on Mightyverse is incorrect. For example, the phrase:
كيف أوصل هناك؟ currently appears on the website differently in the search results (which are rendered in HTML) vs. the video (rendered in Flash) as seen in the screenshots below:

This was brought to my attention by Samar Moushabeck, an Arabic teacher at the Deerfield Academy in Massachusetts. Since I don’t read or speak Arabic, I had to zoom in to understand the difference.

Correct:

Incorrect:

This is clearly unacceptable. It would be (sort of) like if we wrote the English as:
“Ho w w oul d y ou re com mend I ge t t here”

The letters are correct, but the sentence reads like we are illiterate (which, of course, we are in Arabic, but we’re hoping to improve with the help of Samar and language experts like her).

There are a few possible solutions to this.

1) Move the text from the video overlay to below or above the video. This is a technically simple solution. However, the thinking behind the current design is that it is helpful to be able to read the text while focusing on the movement of the speaker’s lips. The farther the text is from the video, the harder it would seem to be for visual learners (which applies to most of us humans).

2) Render the text in HTML and overlay it on the Flash video. I want to run screaming from this solution: I have some experience trying to intermingle HTML and Flash, and found it time-consuming to get working correctly across browsers. Even with a lot of work, I had to compromise visual design and/or user experience in some cases.

3) Render the text as a graphic, dynamically load it into Flash, then overlay it within Flash. This seems like an awful lot of work to support my theory of more effective learning and a preference for the visual design.

4) Use HTML5 video, which is reported to support overlays for captioning seamlessly. While I’ve read that YouTube will continue to use Flash instead of HTML5 video, Mightyverse has short-format video, so the constraints for YouTube may not apply. My favorite quote from a Mark Pilgrim HTML5 video article is “support for the <video> element is still evolving, which is a polite way of saying it doesn’t work yet.” It appears that to support HTML5 video across browsers, we would want to support the new WebM standard (with VP8 video and Vorbis audio) along with the H.264 that we already encode for iPhone and Android.

In any case, we have a few things in the queue (like re-releasing the iPhone app for iOS4) before we can address this on the site, which gives us a little time to explore options. Please leave a comment if you have experience with HTML5 video and/or Arabic text in HTML or Flash and are willing to share some insights.

Na'vi language code

Today, I needed to look up a language code for Na’vi, the language natively spoken on the fictional moon Pandora, created by Paul Frommer for the movie Avatar. When adding a new language to Mightyverse, we record the ISO codes and use them in the URL of a search and someday we’ll cross-link with amazing language resources like Ethnologue. Usually these codes are easy to find. Not today.

I worry when the most authoritative reference I can find is Wikipedia, which reports that the ISO 639-2 code is ‘art’.

Language code assignments change infrequently. It is not like being assigned a port number. I read further that the most recent ISO 639-3 change request list was approved on January 20, 2010, and I didn’t see Na’vi there. It is likely queued up with the 2010 change requests.

I read that ISO uses the code ‘art’ for artificial languages. (The art code is on the 639-2 list and therefore also part of 639-3.) So, the full RFC 3066 (really RFC 4646) code would be ‘art-nav’, which is also the Wiktionary code. So, until we get the official word in January 2011, or whenever RFC 4646 codes get ratified, here’s what I’m going with:

ISO 639-1: n/a
ISO 639-2: art
ISO 639-3: art
RFC 3066: art-nav

Ruby language

I attended RubyConf in San Francisco last month, which is an annual conference about the Ruby programming language. Yukihiro Matsumoto (“Matz”), the creator of the Ruby language, gave the keynote about the 0.8 true language: a language can’t be good for everyone and every purpose, but we can strive to make it good for 80% of what is needed in a programming language. He talked about domain-specific languages (DSLs) and, while he was talking about programming DSLs, it struck me that we, as humans, commonly invent domain-specific languages that transcend our social cultures, instead encoding a culture that crosses national and linguistic boundaries.

Yuki Sonoda organized a series of talks called East meets West with presentations by Japanese Rubyists. In her talk description, she makes the case for our bridging the Japanese-English language gap between Ruby programmers:

Ruby needs your help. There are many issues. But there are too few developers. 92% of Ruby’s development in this 3 years were done by only 10 developers. 73% were done by only 5 developers. Ruby seems to be a cathedral project rather than a bazaar project.
There must be many reasons for this situation. I think a large reason is the language barrier between English-speaking Ruby world and Japanese-speaking Ruby world. So I will talk about how to solve this problem.
All of the top 10 committers speak Japanese and live in Japan. So they discuss in Japanese. Some of the most important decisions are done in these discussions. But this means that most of Rubyists, who do not speak Japanese, can not understand the discussions. For non-Japanese speakers, there has been no way to understand the most important issues in the development of Ruby.
I want to share the current issues of Ruby. I also want to request help from Rubyists who don’t speak Japanese.

There were two “lightning” tech talks given by Japanese Rubyists, who each said that it was their first English-language presentation. I started to think about what kind of vocabulary I would need to give a tech talk in Japanese or even just to understand one.

I approached Matz after his keynote to ask if he would record some phrases about the Ruby language in Japanese. He agreed, and I set out to capture a dozen or so phrases that would never appear in a phrasebook and might be interesting to say to a Japanese Rubyist at a conference.

I approached random people in the hallways and during lunch and asked questions like: if you were to use the word “closure” in a sentence, what would it be? Jim Weirich came up with my favorite: “Closures may be used to implement objects, and objects may be used to implement closures.” Sarah Mei wondered how to read code aloud in Japanese: when would you use a Japanese word, and when is the code pronounced phonetically? She guessed correctly that you would say “object.method” phonetically as obujekuto dotto mesoddo. I was intrigued that what Ruby calls the “shovel” operator (<<) is phonetically derived from the bitshift operator, which has the same symbolic representation in C and Java, and is thus translated as bitto shifuto.

You can see all of the phrases that Matz recorded on the home page. One of these days, I will make it so there is a direct link. If you program in Ruby and speak English or Japanese, I’d be interested in knowing if there are domain-specific phrases you would like to be able to say. I wonder whether, if I learned enough “code” words along with some basic Japanese, I could actually understand a Ruby Kaigi talk even before learning how to converse in Japanese.

I wish I remembered the names of everyone who suggested phrases. If you read this, please comment so that I can say thanks! And many thanks to the Japanese engineer who translated the phrases for me, and to her colleagues who helped! In my zealous pursuit of my goal, I neglected to keep track of everyone who helped me along the way. Thank you all.

the deterioration of language

database column limits and utf8 strings

Wolf and I fixed a bug today where we needed to truncate a string of text that we use internally to annotate the database. Now, the annotation is just for our reference, so we limit it to 50 bytes (that’s bytes, mind you, not characters), even though the PostgreSQL database will tell you it is “character varying(50)”.

We use Unicode internally, specifically UTF-8, which is a fabulous and widely used standard. However, it does have a challenging property: a character may be 1 to 4 bytes long. We were frustrated with what we thought ought to be a simple problem of truncating a string so that it would be no more than 50 bytes. The tricky part, of course, was that the cut at 50 bytes might actually fall in the middle of a character.
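To make the bytes-versus-characters gap concrete, here is a small sketch in plain Ruby. It assumes Ruby 1.9.3 or later (for String#byteslice and String#valid_encoding?) and a UTF-8 source file; the sample strings are only illustrations, not data from our database.

```ruby
# Character count and byte count diverge for non-ASCII text in UTF-8.
ascii    = "abcd"
japanese = "こんにちは"

puts ascii.length       # 4 characters
puts ascii.bytesize     # 4 bytes
puts japanese.length    # 5 characters
puts japanese.bytesize  # 15 bytes (each hiragana takes 3 bytes in UTF-8)

# Cutting at an arbitrary byte offset can split a character in two.
chopped = japanese.byteslice(0, 4) # 3 bytes of "こ" plus 1 byte of "ん"
puts chopped.valid_encoding?       # false
```

A 50-byte column can therefore hold 50 ASCII letters but only 16 whole hiragana, with 2 bytes left over.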

To solve the problem, we added a method to the Rails Multibyte::Chars class, which is part of ActiveSupport. For those who speak Ruby, Rails and RSpec, below is the solution we came up with (first the spec, then the implementation).

The solution we came up with was borrowed from the private translate_offset method. The key interesting part is that you can discover whether you’ve chopped up a string in the middle of a character by calling chunk.unpack('U*'): the unpack method on String in Ruby raises an exception when the “U” directive asks it to decode a malformed UTF-8 sequence, such as one cut off partway through a character, into integer code points.
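Here is that check in isolation, as a rough sketch for Ruby 1.9.3 or later. The helper name decodes_cleanly? is made up for this illustration, and byteslice stands in for the byte-based slice we relied on at the time.

```ruby
# Hypothetical helper: true if the chunk decodes as complete UTF-8 characters.
def decodes_cleanly?(chunk)
  chunk.unpack("U*")  # decode the bytes as UTF-8 code points
  true
rescue ArgumentError  # raised for a malformed (e.g. truncated) sequence
  false
end

text  = "こん"                # two 3-byte characters
whole = text.byteslice(0, 3)  # exactly "こ"
split = text.byteslice(0, 4)  # "こ" plus the first byte of "ん"

puts decodes_cleanly?(whole)  # true
puts decodes_cleanly?(split)  # false
```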

describe "Chars#limit_bytes" do
  it 'should return "" on ""' do
    "".mb_chars.limit_bytes(0).should == ""
    "".mb_chars.limit_bytes(1).should == ""
  end

  it 'should truncate single byte character strings as expected' do
    a = "abcd"
    a.mb_chars.limit_bytes(0).should == ''
    a.mb_chars.limit_bytes(1).should == 'a'
    a.mb_chars.limit_bytes(50).should == 'abcd'
  end

  it 'should truncate multi-byte character strings at character boundaries' do
    k = "こんいちわ"
    k.mb_chars.limit_bytes(0).should == ''
    k.mb_chars.limit_bytes(1).should == ''
    k.mb_chars.limit_bytes(3).should == 'こ'
    k.mb_chars.limit_bytes(4).should == 'こ'
    k.mb_chars.limit_bytes(5).should == 'こ'
    k.mb_chars.limit_bytes(6).should == 'こん'
    k.mb_chars.limit_bytes(7).should == 'こん'
    k.mb_chars.limit_bytes(50).should == 'こんいちわ'
  end
end

module ActiveSupport #:nodoc:
  module Multibyte #:nodoc:
    class Chars
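      # Returns the longest prefix of the string that fits in +limit+ bytes
      # without ending in the middle of a multi-byte character.
      # Note: relies on String#slice counting bytes, as it does on Ruby 1.8.
      def limit_bytes(limit)
        # Back off one byte at a time until the cut lands on a character boundary.
        limit -= 1 until valid_boundary?(limit)
        s = @wrapped_string.slice(0, limit)
        s.mb_chars
      end

      # True if the first +length+ bytes form a complete UTF-8 sequence.
      def valid_boundary?(length)
        chunk = @wrapped_string.slice(0, length)
        begin
          chunk.unpack('U*') # raises if the chunk ends mid-character
          true
        rescue
          false
        end
      end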
    end
  end
end
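For readers on a newer Ruby, where String#slice counts characters rather than bytes, the same idea can be sketched without touching ActiveSupport. This is only an illustration under a few assumptions: Ruby 1.9.3 or later, input that is valid UTF-8 to begin with, and a standalone limit_bytes function that is hypothetical and not part of Rails.

```ruby
# Hypothetical standalone helper (not part of ActiveSupport): returns the
# longest prefix of str that fits in limit bytes without splitting a character.
def limit_bytes(str, limit)
  return "" if limit <= 0
  chunk = str.byteslice(0, limit)
  # Drop trailing bytes until what remains is valid UTF-8 again.
  chunk = chunk.byteslice(0, chunk.bytesize - 1) until chunk.valid_encoding?
  chunk
end

puts limit_bytes("abcd", 50)       # abcd
puts limit_bytes("こんいちわ", 7)  # こん
```

It backs off one byte at a time, just as the method above does, but asks valid_encoding? instead of rescuing an exception from unpack.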

We’ve written this up as a Lighthouse ticket in case the Rails folk want to add it to the platform or if other people developing multi-lingual database apps run into the same challenge and look there.

The above code can be used under the MIT License:

The MIT License (MIT)

Copyright (c) 2009 Mightyverse, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

Mightyblog

Blog by Mightyverse people