Mojibake at Flickr

Russian version

Since mid-August, Flickr has stopped showing the search queries that led visitors to my photos. Strictly speaking, it still shows them, but the Russian words appear as mojibake (кракозябры in Russian).

Detailed statistics are available for Flickr Pro accounts once the user has asked for them. Thanks to these statistics, I found my photos on Wikipedia and on other sites.

This is how it looks on the referrer page:
Part of Flickr referrer statistics page

English search terms are displayed correctly (it would be odd if they weren’t). Search terms 37 and 38 are unreadable. Naturally, if I click them, nothing is found.

Double encoding to UTF-8

I suspected an encoding problem. The encoding of the page itself is not at fault (Russian photo titles are shown correctly); the problem lies somewhere in the depths of Flickr.

Most web pages nowadays use UTF-8, in which a Cyrillic letter takes two bytes instead of one. My guess was that a string already encoded in UTF-8 gets encoded in UTF-8 a second time: each of its bytes is mistaken for a separate single-byte character and encoded again. Naturally, you get gibberish. But this was only a guess.

Encoder & decoder in Java

To prove I’m right, I wrote a small program in Java:

import java.io.IOException;


public class DoubleEncoding {

    public static void main(String[] args) throws IOException {
        // Step 1: encode the original word in UTF-8 and print the bytes as hex
        String so = "океан";
        byte[] bb = so.getBytes("UTF-8");
        for (byte b : bb) {
            System.out.print(Integer.toHexString(b & 0xFF) + ", ");
        }
        System.out.println();

        // Step 2: misread those bytes as Windows-1252, then encode in UTF-8 again
        String sc = new String(bb, "Windows-1252");
        System.out.println(sc);
        bb = sc.getBytes("UTF-8");
        for (byte b : bb) {
            System.out.print(Integer.toHexString(b & 0xFF) + ", ");
        }
        System.out.println();

        // Step 3: reverse conversion (undo the second encoding)
        String se = new String(bb, "UTF-8");
        bb = se.getBytes("Windows-1252");
        String sd = new String(bb, "UTF-8");
        System.out.println(sd);
    }

}

After running it, you get:

d0, be, d0, ba, d0, b5, d0, b0, d0, bd, 
Ð¾ÐºÐµÐ°Ð½
c3, 90, c2, be, c3, 90, c2, ba, c3, 90, c2, b5, c3, 90, c2, b0, c3, 90, c2, bd, 
океан

In step 3, replace the line that assigns se with one of the strings copied from the Flickr stats page, such as:

String se = "Ð¾ÐºÐµÐ°Ð½";

And you’ll get the decoded Russian word:

океан

So my theory is proven.
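Step 3 can also be wrapped in a reusable method. The following is only a sketch under my own assumptions: the class name MojibakeDecoder and the method decode are hypothetical, and the choice to return the input unchanged when it does not look double-encoded is mine. It uses strict encoders and decoders so that ordinary text is left alone instead of being turned into question marks.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MojibakeDecoder {

    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Undoes one round of "UTF-8 bytes misread as Windows-1252".
     * Returns the input unchanged if it cannot be such a string.
     */
    public static String decode(String garbled) {
        try {
            // Map every character back to the single byte it came from.
            // Fails if a character does not exist in Windows-1252.
            ByteBuffer bytes = CP1252.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(garbled));
            // Read those bytes as UTF-8. Fails if they are not valid UTF-8.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes)
                    .toString();
        } catch (CharacterCodingException e) {
            return garbled; // not double-encoded, leave it alone
        }
    }

    public static void main(String[] args) {
        // The mojibake form of the sample word, written with escapes
        // so the source file encoding does not matter.
        String garbled = "\u00D0\u00BE\u00D0\u00BA\u00D0\u00B5\u00D0\u00B0\u00D0\u00BD";
        System.out.println(decode(garbled));
        System.out.println(decode("ocean"));
    }
}
```

Whether the Cyrillic line prints correctly depends on the console encoding, so it is safer to compare the returned strings than to rely on the printed output.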

Flickr technical support

I immediately wrote to Flickr technical support and described the problem in detail. They gave me the runaround, answering as if I couldn’t log in to Flickr. I described the problem in detail once more and attached the program above, which demonstrates what goes wrong. I still have no reply from them, and the issue hasn’t been fixed.

Any language that uses characters outside ASCII is prone to mojibake, including German, Spanish, French and Greek. But everyone speaks English, don’t they?! That’s why no one sees this bug.
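To illustrate that this is not specific to Russian, here is a small sketch that applies the same misreading (UTF-8 bytes read as Windows-1252) to a few words. The class name, the method garble and the sample words are my own choices for illustration. I have not checked how Flickr treats these languages; the sketch only shows what the suspected conversion does to them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OtherLanguages {

    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Simulates the suspected bug: UTF-8 bytes misread as Windows-1252. */
    public static String garble(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    public static void main(String[] args) {
        // English, German, Spanish, French; escapes keep the source ASCII-only.
        String[] samples = {"ocean", "M\u00FCnchen", "ni\u00F1o", "caf\u00E9"};
        for (String s : samples) {
            String g = garble(s);
            String verdict = g.equals(s) ? "unchanged" : "garbled";
            System.out.println(s + " -> " + g + " (" + verdict + ")");
        }
    }
}
```

Pure ASCII words come out unchanged because ASCII characters have the same single-byte form in both encodings, which is why an English-only tester would never notice the problem.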

GUI decoder

I still want to know what the Russian search terms are, so I wrote a more convenient tool to decode mojibake:
Flickr decoder

Using the GUI is much easier: copy a search term from the browser, paste it into the Search term edit box, and read the result in the Decoded edit box below.

This application uses the same algorithm as step 3 in the source code above.
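For readers who would rather build such a window themselves, here is a rough Swing sketch of the idea. It is not the source of decodeFlickr.jar: the class name, the layout and the fallback to the command line when no display is available are all my assumptions. Only the two box labels and the step 3 conversion come from the description above.

```java
import java.awt.GraphicsEnvironment;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.swing.BoxLayout;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.JPanel;
import javax.swing.JTextField;
import javax.swing.SwingUtilities;
import javax.swing.event.DocumentEvent;
import javax.swing.event.DocumentListener;

public class DecoderWindowSketch {

    /** Same conversion as step 3: back to Windows-1252 bytes, then read as UTF-8. */
    public static String decode(String garbled) {
        byte[] raw = garbled.getBytes(Charset.forName("windows-1252"));
        return new String(raw, StandardCharsets.UTF_8);
    }

    private static void createAndShow() {
        JTextField input = new JTextField(30);
        JTextField output = new JTextField(30);
        output.setEditable(false);

        // Re-decode whenever the pasted or typed text changes.
        input.getDocument().addDocumentListener(new DocumentListener() {
            private void refresh() { output.setText(decode(input.getText())); }
            @Override public void insertUpdate(DocumentEvent e) { refresh(); }
            @Override public void removeUpdate(DocumentEvent e) { refresh(); }
            @Override public void changedUpdate(DocumentEvent e) { refresh(); }
        });

        JPanel panel = new JPanel();
        panel.setLayout(new BoxLayout(panel, BoxLayout.Y_AXIS));
        panel.add(new JLabel("Search term"));
        panel.add(input);
        panel.add(new JLabel("Decoded"));
        panel.add(output);

        JFrame frame = new JFrame("Flickr decoder (sketch)");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.add(panel);
        frame.pack();
        frame.setVisible(true);
    }

    public static void main(String[] args) {
        if (GraphicsEnvironment.isHeadless()) {
            // No display available: decode the command-line arguments instead.
            for (String arg : args) {
                System.out.println(decode(arg));
            }
            return;
        }
        SwingUtilities.invokeLater(DecoderWindowSketch::createAndShow);
    }
}
```

Keeping the conversion in a separate static method means it can be checked without opening a window.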

Installation

How do you use it? Download decodeFlickr.jar to your computer and double-click it to run.

Since the application is written in Java, you must have the Java Runtime Environment (JRE) on your computer: download the latest version and install it.

That is a bit of a hassle. On the other hand, a Java app works the same way on Windows, Linux and Mac: everywhere Java is available.

Java Web Start

I wanted to deploy the app using Java Web Start (JNLP), but I failed. The application starts, but for security reasons it has no access to the clipboard. Either the .jar file has to be signed (which is quite expensive), or the app has to request access through ClipboardService. I was too lazy to do either.

One Response to “Mojibake at Flickr”

  1. Cyrillic displays as question marks in Flickr Stats | Programming is my life Says:

    […] last year in August. Before August 2013, the stats showed Russian queries, and then they became a mojibake. But at least it could be […]

