Thursday, January 10, 2013

Link Preview using Rails, AJAX, and Nokogiri Gem

I have been working on code to preview link content the way Facebook, Google Plus, and LinkedIn do, and I have to say, it is not easy to get it right.

Here are some notes before starting:

  • The happy path is to find ready-made meta tags inside the HEAD of the HTML document. Sometimes they are there; a lot of times they are not. On video sharing websites like YouTube or Vimeo, or on news websites that care about these details, you will find metadata to save the day. For example:

<meta property="og:url" content="">
<meta property="og:title" content="Kinetic Scrolling Example [Arabic] [Qt]">
<meta property="og:description" content="Visit my blog entry for more info, and the complete example: Example uploaded on M...">
<meta property="og:type" content="video">
<meta property="og:image" content="">
<meta property="og:video" content=";version=3">
<meta property="og:video:type" content="application/x-shockwave-flash">
<meta property="og:video:width" content="640">
<meta property="og:video:height" content="480">
<meta property="og:site_name" content="YouTube">

  • If these tags exist, your work is done. Otherwise, you will have to search the HTML document itself for text and images to use in the preview. This means more processing and heuristics.
  • A major issue is that Javascript's same-origin security policy prevents fetching the HTML content of another domain, so processing on the client side is not possible. All the fetching and parsing has to happen on the server side, which means loading the server for a simple feature.
  • Another issue is that Ruby's standard library has no robust HTML parser. So it is up to you to write your own or search for an alternative. I saved time and used the Nokogiri gem to parse the HTML and extract the data I need.
  • Note that returning raw HTML via AJAX is not preferred and can easily break the code. You are better off returning a JSON object and rendering it on the client side.
  • One final note is that you should take care of text encoding. For example, I want to support Arabic inside my application, so UTF-8 encoding is important to me.
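As a minimal illustration of the encoding concern (plain Ruby, no Rails involved), bytes fetched over the network often arrive tagged as binary and can be re-tagged as UTF-8 before parsing:

```ruby
# Bytes fetched over the network often arrive tagged as ASCII-8BIT (binary).
raw = "مرحبا".b                      # "hello" in Arabic, as raw bytes
puts raw.encoding                    # => ASCII-8BIT

# Re-tagging the bytes as UTF-8 lets Arabic (and other non-Latin) text
# survive parsing and JSON serialization.
text = raw.force_encoding('UTF-8')
puts text.encoding                   # => UTF-8
puts text.valid_encoding?            # => true
```

Note that `force_encoding` only changes the encoding tag on the existing bytes; it does not transcode them, so it is the right tool when the bytes are already valid UTF-8.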

So here is how the process goes:
  1. Receive pasted URL using Javascript.
  2. Send URL in an AJAX request to your server.
  3. Fetch the HTML content of the URL and parse the useful data.
  4. Send data back to HTML page as a JSON object.
  5. Process the JSON object by Javascript to preview data to user.

1- Receive pasted URL using Javascript:

Javascript does this job. I simply listen for the 'paste' event and then read the text inside the textarea. Of course, it is highly recommended to validate the text first.

$("#post_content").bind('paste', function(e) {
    var el = $(this);
    // The paste event fires before the textarea value is updated,
    // so read the value after a short delay.
    setTimeout(function() {
        var text = $(el).val();
        // send text to server
    }, 100);
});

2- Send URL in an AJAX request to your server:

$("#post_content").bind('paste', function(e) {
    var el = $(this);

    setTimeout(function() {
        var text = $(el).val();
        // send url to server for parsing
        $.ajax('/url/to/server/handler', {
            type: 'POST',
            data: { url: text },
            success: function(data, textStatus, jqXHR) {
                // handle received data
            },
            error: function() { alert("error"); }
        });
    }, 100);
});
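The AJAX request above needs a matching Rails route on the server side. A minimal sketch, assuming a hypothetical PostsController with a preview_link action (the path and names are illustrative, not from the original code):

```ruby
# config/routes.rb -- '/url/to/server/handler' is a placeholder; in a real
# app you would route a concrete path to your handler action, e.g.:
post 'posts/preview_link' => 'posts#preview_link'
```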

3- Fetch the HTML content of the URL and parse the useful data:
4- Send data back to HTML page as a JSON object:

In this step I use the Nokogiri gem to do the dirty work of parsing for me. First, remember to add the gem to the Gemfile.
Note: on Linux, you may need to install libxslt-dev and libxml2-dev before bundling.

gem 'nokogiri' , '~> 1.5.6'

The "param_url" is the URL received on the server side. I pass it to Nokogiri and then work with the document object it returns. The easiest way is to iterate over the meta tags in HEAD and pick out the strings I want. You may need to put in more effort and parse data from the BODY if the meta tags are not helpful.

require 'open-uri'  # needed so Kernel#open can fetch a URL

doc = Nokogiri::HTML(open(param_url), nil, 'UTF-8')
title = ""
description = ""
url = ""
image_url = ""

doc.xpath("//head//meta").each do |meta|
    if meta['property'] == 'og:title'
        title = meta['content']
    elsif meta['property'] == 'og:description' || meta['name'] == 'description'
        description = meta['content']
    elsif meta['property'] == 'og:url'
        url = meta['content']
    elsif meta['property'] == 'og:image'
        image_url = meta['content']
    end
end

# Fall back to the <title> tag, then to the URL itself
if title == ""
    title_node = doc.at_xpath("//head//title")
    if title_node
        title = title_node.text
    else
        title = param_url
    end
end

if description == ""
    # maybe search for content from BODY instead
    description = title
end

if url == ""
    url = param_url
end

render :json => {:title => title, :description => description, :url => url, :image_url => image_url} and return

5- Process the JSON object by Javascript to preview data to user:

Finally, the data is returned to the Javascript on the client side. Be creative with handling the data and presenting it to the user.
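For example, inside the 'success' handler you might turn the JSON object into a small preview snippet. A minimal sketch (the buildPreview helper and its markup are my own illustration, not from the original code):

```javascript
// Build a preview snippet from the JSON object returned by the server.
// The object has the keys sent by the controller:
// title, description, url, image_url.
function buildPreview(data) {
    // Only emit an <img> tag when the server actually found an image.
    var img = data.image_url ? '<img src="' + data.image_url + '"/>' : '';
    return '<div class="link-preview">' +
           img +
           '<a href="' + data.url + '">' + data.title + '</a>' +
           '<p>' + data.description + '</p>' +
           '</div>';
}

// Inside the 'success' handler:
// $('#preview').html(buildPreview(data));
```

In a real application you would also escape these values before injecting them into the page, since the title and description come from an arbitrary remote site.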



  1. You've said "A major issue is that Javascript has security issues preventing the process of fetching HTML content of another domain".

    What do you mean by the security issue?
    Is it the Cross-Domain Policy you are speaking about?

    If that is so, you can use a JSONP request, which will enable you to send requests to different domains.

    1. Well, I read that it has limitations too. I cannot access the content, only execute it. Right?

    2. Well yes, you are right, you cannot access the HTML content directly, since the JSONP callback would receive the raw HTML code as its argument.

      Example: jsonCallback(<DOCTYPE><HTML></HTML>)

      Which of course is a Javascript error.

      So you can send a JSONP request to a proxy server that returns the HTML content as a string, then feed that string into an HTML parser, and you are good to take whatever information you need from the content.

      Or you can even ask the proxy server to extract the content you want and return it as a JSON object.
