Google proposes standard for making Ajax crawlable

seldo · on Oct 8, 2009

Indexing AJAX websites is important, but I don't think much of this proposal.

As far as I can tell, their plan is to create a standard way that AJAX hash links can be translated into real links, which Google will then crawl, expecting that you are running a machine-readable web browser on your server(!) that will render the full page.

Expecting everyone to make a big modification like installing a Java-based browser to their infrastructure is crazy.

A much more sensible solution is the one webmasters should be doing anyway: progressive enhancement. Any AJAX link, when clicked with Javascript disabled, should render the page as it was intended, just more slowly and without animations etc.. Then your site is completely indexable, and as a bonus it's more robust and accessible too.

With a modern MVC framework this is really not that hard to achieve: your AJAX controller should just be spitting out a chunk of HTML, and whether it gets integrated into your template at the server-side or on the client by Javascript should be irrelevant.

unlinkedlist · on Oct 8, 2009

To be fair, the "standard" part of this proposal doesn't really have anything to do with headless browsers. You could just as easily have a typical server-side app that generates these URLs as appropriate.

For example, a common approach to presenting content in an AJAXy kind of way is to just put a path after the hash. E.g.

    http://www.example.com#/products/bungee-cords

It's probably really easy for the web app to figure out the right content to serve up for #/products/bungee-cords, headless browser or not.

I know you'll be saying, "Of course, but apps should be doing that anyway with progressive enhancement." That's all well and good, but that leads to two separate sets of URLs: one that most users see, and one that search engines see. This fixes that.

seldo · on Oct 8, 2009

I don't understand what you're saying. Firstly, web apps can never see anything after a # in a URL: the browser does not send it to them. Javascript can see it, but I don't think that's what you meant.

With progressive enhancement, your URLs should look like http://www.example.com/products/bungee-cords. If clicked they should render the page as appropriate. JavaScript, if enabled, will attach a handler to the links and when it sees you trying to click on /products/bungee-cords will halt that click and instead make an AJAX request for the equivalent content. There should be no need to have # links whatsoever.

boucher · on Oct 8, 2009

That ignores the fact that people want the URL field to reflect the current state of things, so that URLs can be copied out and pasted elsewhere.

It also ignores the fact that there is no way to set the URL field (through JavaScript or any other mechanism) without actually triggering a page change.

revetkn · on Oct 8, 2009

"Expecting everyone to make a big modification like installing a Java-based browser to their infrastructure is crazy."

I agree. Google is (IMHO) doing this because they view GWT as the near-term future of web development and it seems likely this will be baked into the framework sometime soon (see http://code.google.com/p/google-web-toolkit/source/browse/br...). Katharina Probst and Bruce Johnson are both GWT team members, and as far as I can tell Dr. Probst is leading the engineering effort on the GWT side to support this.

It seems to me that if you're not using GWT, this will be far too much effort for you to implement.

simonw · on Oct 8, 2009

That's horrible. Why is everyone so desperate to make everything depend on JavaScript? Especially Google, with GWT and Google Moderator (a great example of a site that could work just fine without any JavaScript at all). I'm all for progressive enhancement and making the user's experience better if they have JavaScript enabled, but hiding everything behind a single GWT script tag and then using grotesque hacks to make the content crawlable is just crazy.

boucher · on Oct 8, 2009

There are large classes of web applications which can not work without javascript. Period.

Nobody is saying everything needs to be that way. But some things do. This is an attempt to make those things work better with search engines.

qeorge · on Oct 8, 2009

Selling web applications that don't work without Javascript is legally dangerous (discriminates against disabled users). There's a lot more to accessibility than making sure your site works without Javascript, but if you're missing this piece its somewhat of a moot point.

There's never been a case that's gone to verdict on the matter, so its still somewhat of a gray area. But the reason they don't go to verdict is because the offending company has always agreed to a large cash settlement. Target's $6MM settlement with the Federation for the Blind is the most recent case.[1]

There's plenty of large websites that drop the ball here (e.g., Reddit), but that doesn't mean its safe to do so. Ignore accessibility at your own risk.

[1]http://www.webstandards.org/2008/08/28/what-the-target-settl...

boucher · on Oct 8, 2009

If you honestly believe that people shouldn't make applications that don't work without JavaScript, then you are without question saying that there are applications which should not be built on the web. I respect that opinion, but I don't agree with it.

ique · on Oct 8, 2009

Why can't blind people have browsers with javascript? I've never thought about it, but text to speech and the like should not require that the browser does not have javascript?

kls · on Oct 9, 2009

As a person with disabilities (100% color blindness) I have to call foul on this one. One of the most out of the box solutions for A11Y is build into the Dojo toolkit, a JavaScript framework. By the flip of a style class an entire application can be adapted to a person with a specific disability. Ajax done correctly makes targeting persons with disabilities simpler not more complex.

http://www.dojotoolkit.org/developer/a11yStatement

coderdude · on Oct 8, 2009

>>This is an attempt to make those things work better with search engines.

Can you give an example of a Web application, which relies on Javascript to function, which also has content that even _should_ be indexed but is hiding behind Ajax?

Does Google need to index Backpack or something?

boucher · on Oct 8, 2009

Well, that's where it gets tricky. Separating the content from the application usually isn't easy.

280 Slides the application doesn't need to be indexed. They don't need to know where the New button is. They don't need to know where the scrollers are.

But shared presentations do need to be indexed. Fortunately, it's not all that difficult to construct the content in such a way that it can be indexed. We give out unique URLs for presentations which aren't # delimited, so we could actually serve the text content alongside the viewable presentation. We don't currently, but it would be a relatively easy change.

coderdude · on Oct 10, 2009

Thanks, I was looking for clarification that what Google is really after is not the innards of the Web applications so much as the content that can be considered "standalonable."

jimmybot · on Oct 8, 2009

Why is Google telling people to run a headless browser just for their crawler? Sounds way too complicated for any small-scale website. Why doesn't Google run the headless browser themselves? It sounds like they are offloading parsing costs onto websites, but then they have a cheating problem they have to deal with.

Somewhat related: It's not AJAX, but I remember hearing Microsoft Research a couple of years ago talk about parsing CSS. They wanted the page as an image so that they could analyze pictorially what was a menu, what was a header, what was main content, etc. It seemed fairly neat for the time, although I think there are much simpler heuristics for when you are after just the main article type content of a page that will work for most blogs/CMS's out there.

(No, they didn't use IE. I asked. Yes, they tried. They said it was slow and crashed too much.)

endergen · on Oct 8, 2009

That's great. Exactly what I'd been thinking about lately.

Avi Bryant: I was just thinking of pinging you about your approach to generating html in Clamato. Are you going to be trying to unify client-side/server-side html generation to solve exactly this problem?

Some thoughts, Google suggests using a headless browser to generate the static version of the dynamic content. This may be the only option for existing code bases. But with server side javascript and by extension Clamato there must be a more elegant solution to generate clean static html as well having a dynamic UI client-side. This seems like something you would have thought a lot about while building The Seaside Framework (http://seaside.st) and now Clamato (http://clamato.net/).

avibryant · on Oct 8, 2009

Server-side javascript is only relevant here if it has access to a working DOM, since that's what any HTML generation or templating is going to be based on. At that point you basically do have a headless browser, whether you call it that or not.

I find Google's proposal goofy; I think I agree with the poster who suggested feeding google static HTML (probably via a sitemap) which references canonical URLs that might be ajaxy, rather than having it crawl an ajax site and do this magic token manipulation to get the static HTML equivalent.

keltex · on Oct 8, 2009

This is really an important idea. For example part of a website I'm working on displays "customer testimonials".

The right way to do this for user experience is to to allow the user to cycle through the testimonials using AJAX. This allows for some nice looking UI and is faster because less data is exchanged with the server.

The right way to do this for SEO is to have each page with a different testimonial on it with it's own custom title, meta description, and SEO friendly URL.

Which way we go is still undecided.

qeorge · on Oct 8, 2009

Why not do both? Make standard links with fully qualified, working URLs, and use Javascript to replace the links' normal behavior with AJAX. This is often called HIJAX [1], and it works just fine for normal users, disabled users, and crawlers alike.

Check out my company's homepage for an example: http://illuminatikarate.com/

You'll notice that when you click to our other pages the content slides in as though its all contained in the original page. But if you have JS disabled, they behave as independent pages. Google has indexed all the pages on the site separately[2], so we're getting the full SEO benefit while also maintaining user experience for our disabled customers. Its a win for everyone.

Ping me if you're having trouble, I'm happy to help.

[1] http://www.domscripting.com/blog/display/41

[2] http://www.google.com/search?q=site%3Ailluminatikarate.com

mp3jeep01 · on Oct 8, 2009

Interesting post, and I do agree, it is important.

Your testimonials example is good, and is definitely a reason to promote the idea, but as with anything, I suppose there are several different angles on this. In the case that you use AJAX to drive a "linear" process (say a series of login screens), there's no advantage to having Step 5 indexed and accessible, without ever beginning at Step 1.

Worst case, we can always fall back to the flash band-aid, keep the same HTML content all on the same page, just place it -10000 off-screen where only the robots can locate it...but this, along with the "right way to do this for SEO" above are labor intensive...

jimmybot · on Oct 8, 2009

Are you sure it's XOR? Anything wrong with doing it through an iframe? You use javascript to control which is displayed and for responsiveness (your cycling), but each testimonial also has its own page.

csytan · on Oct 8, 2009

Have you considered parsing the page using javascript, or selectively returning a partial on an ajax request? I use the first method quite a bit. It's a bit slower to send the whole page, but does not require any server side changes.

Once you get the html, libraries like jQuery make the parsing and replacing very easy.

$(html).find('#testimonial').replaceAll('#testimonial');

unlinkedlist · on Oct 8, 2009

Simpler still:

    $('#testimonial').load('/testimonial-3 #testimonial p');

tjpick · on Oct 8, 2009

I think they've found a very complicated solution. Wouldn't it be much much easier to:

1. give the content behind the ajax call a static url, which can be accessed by non-js clients too

2. add something like <link rel="alternate" media="ajax" type="text/html" href="http://example.com/doc.html#state /> to the head of that static version, which the search engines can then use as a pointer to back the js version

ramanujan · on Oct 8, 2009

Perhaps this is infeasible for some reason, but why can't they just use a (higher performance) version of Selenium or something similar?

1. Use machine learning to identify page elements likely to produce AJAX responses. Not too hard to do this, especially if you actually render the page in (say) Chrome and use the 2D layout in addition to the 1D HTML/CSS as part of your feature set.

2. Use your (souped up, ultra fast) Selenium replacement to play with all those AJAX features.

http://www.reasonablyopinionated.com/2009/01/how-to-test-aja...

...on the third hand, perhaps they're thinking that anyone technically savvy enough to set up an AJAX site (or set up sitemaps or the like) can run this headless browser.

coliveira · on Oct 8, 2009

I think they are trying to make people do work that they should be doing. The headless browser is something that Google needs to apply to their crawling process.

The work of webmasters should be making the URLs available -- Google has to come up with ways to interpret the content. This is similar to asking web servers to provide an HTML version of PDF documents that they are serving.

rimantas · on Oct 8, 2009

Hm. If it is content, then it should work in such a way that it can be accessed no matter what tries to get it: webcrawler, javascript incapable browser or user with a "normal" browser but JS switched off. If user agent is ajax-capable, then use ajax. Google for "hijax".

If it is a webapp, then making it crawlable does not make sense anyway.

qeorge · on Oct 8, 2009

Fundamental problem: the URL fragment after the hash isn't sent to the web server as part of the request. Am I wrong?

coderdude · on Oct 8, 2009

Google will scan your documents for URLs like "/something.php#!some_ajax_action, and will rewrite those URLs without the fragments (a method they describe in the article, if you read it).

qeorge · on Oct 8, 2009

Thanks, I missed that on the first time through. Quite an awkward implementation, IMHO.

monos · on Oct 8, 2009

maybe i'm thinking of the wrong kind of content they want to crawl, but isn't it easier to provide special crawlable, plain html pages (possibly hidden to users or provided as an "accessible" version).