How does Google deal with non-text files?

Google can index most types of pages and files (detailed list).
In general, however, search engines are text-based. This means that in order to be crawled and indexed, your content needs to be in text format. (Google can now index text content contained in Flash files, but other search engines may not.)

This doesn't mean that you can't include rich media content such as Flash, Silverlight, or videos on your site; it just means that any content you embed in these files should also be available in text format or it won't be accessible to search engines. The examples below focus on the most common types of non-text content, but the guidelines are similar for any other types: Provide text equivalents for all non-text files.

This will not only increase Googlebot's ability to successfully crawl and index your content; it will also make your content more accessible. Many people, such as users with visual impairments who rely on screen readers and users on low-bandwidth connections, cannot see images on web pages, so providing text equivalents widens your audience.
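One way to act on this advice is to audit a page for images that have no text equivalent. The sketch below is only an illustration, not a Google tool: the class name, function name, and sample markup are made up for this example, and it uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collects the src of every <img> whose alt text is absent or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # A valueless or missing alt attribute is treated as an empty string.
        alt_text = (attr_map.get("alt") or "").strip()
        if not alt_text:
            self.missing.append(attr_map.get("src", "(no src)"))


def images_missing_alt(html):
    """Return the src values of images that lack a text equivalent."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


# Hypothetical page fragment used only for illustration.
SAMPLE_PAGE = """
<img src="logo.png" alt="Example Co. logo">
<img src="chart.png">
<img src="team.jpg" alt="">
"""

if __name__ == "__main__":
    for src in images_missing_alt(SAMPLE_PAGE):
        print("No text equivalent for:", src)
```

An empty `alt` can be a deliberate choice for purely decorative images, so treat the report as a list to review by hand.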

Google can now discover and index text content in SWF files of all kinds, including self-contained Flash websites and Flash gadgets such as buttons or menus. This includes all textual content visible to the user. Google supports common JavaScript techniques. In addition, we can now find and follow URLs embedded in Flash files. We'll crawl and index this content in the same way that we crawl and index other content on your site—you don't need to take any special action. However, we don't guarantee that we'll crawl or index all the content, Flash or otherwise.
We're continually working to improve our indexing of Flash files, but there are some limitations:
  • We currently do not attach content from external resources that are loaded by your Flash files. If your Flash file loads another file—such as an HTML file, an XML file, or another SWF file—we may index the contents of those files, but we won't consider that content to be part of the content in your Flash files.
  • We're currently unable to index the bidirectional language content (for example, Hebrew or Arabic) in Flash files.
Note that while Google can index the content of Flash files, other search engines may not be able to. Providing text-based equivalents of these files can help other search engines crawl and index your content. In addition, a text-based version of your site will let viewers using older browsers or mobile phones access your content easily.

You could also consider using sIFR (Scalable Inman Flash Replacement). sIFR (an open-source project) lets webmasters replace text elements with Flash equivalents. Using this technique, content and navigation are displayed by an embedded Flash object but, because the content is contained in the HTML source, it can be read by non-Flash users (including search engines).

You can also improve the indexing of your Flash or rich media application by supporting Google's AJAX crawling scheme. This scheme works for JavaScript, and also for Flash and any other browser-side technology.
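As the scheme was publicly documented, an application marks its stateful URLs with a `#!` fragment, and the crawler requests an equivalent URL in which that fragment is moved into an `_escaped_fragment_` query parameter. The sketch below shows that mapping only. The function name and example URLs are hypothetical, and the exact set of characters left unescaped is an assumption, so check the scheme's specification before relying on it.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escaped_fragment_url(url):
    """Map a '#!' URL to the '_escaped_fragment_' form a crawler would request."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        # Not a '#!' URL: nothing to rewrite.
        return url
    # Percent-encode the application state; '=' is kept readable (assumption),
    # while characters such as '&', '#', '+', and spaces are escaped.
    state = quote(parts.fragment[1:], safe="=")
    param = "_escaped_fragment_=" + state
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    print(escaped_fragment_url("http://www.example.com/app#!page=1"))
    print(escaped_fragment_url("http://www.example.com/app?lang=en#!tab=videos&sort=new"))
```

Your server would then answer requests for the rewritten URL with an HTML snapshot of the content that the Flash or JavaScript application displays in that state.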

Silverlight and other rich media formats
Google can crawl and index the text content of Flash files, but we still have problems accessing the content of other rich media formats such as Silverlight. These rich media formats are inherently visual, which can cause some problems for Googlebot. Unlike some Internet spiders, Googlebot can read some rich media files and extract the text and links in them, but the structure and context are missing. Also, rich media designers often include content in the form of graphics, and because Google can't detect words included in graphics, it can miss important keywords. In other words, even if we can crawl your content and it is in our index, it might be missing some text, content, or links.

Googlebot cannot crawl the content of video files, so it's important that you provide information about videos you include. Consider creating a transcript of the video you want to include, or provide a detailed description of the video inside your HTML. If you have video content, you can host it on Google Video, YouTube, or a number of other video hosting providers. Searchers can view Google Video or YouTube videos directly from the Google search results page.
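A transcript is only useful to crawlers if it appears as ordinary text in the page's HTML. The sketch below generates markup that places a transcript next to the video it describes. The function name, class names, URL, and page structure are assumptions made for this example; any layout that keeps the transcript as plain HTML text would serve the same purpose.

```python
from html import escape


def video_with_transcript(video_url, title, transcript_paragraphs):
    """Build an HTML fragment with a video followed by its text transcript."""
    lines = [
        '<section class="video">',
        "  <h2>" + escape(title) + "</h2>",
        '  <video src="' + escape(video_url) + '" controls></video>',
        '  <div class="transcript">',
    ]
    # Each transcript paragraph becomes crawlable text in the page source.
    for paragraph in transcript_paragraphs:
        lines.append("    <p>" + escape(paragraph) + "</p>")
    lines.append("  </div>")
    lines.append("</section>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(video_with_transcript(
        "https://example.com/media/tour.mp4",  # hypothetical URL
        "Product tour",
        ["Welcome & thanks for watching.", "In this video we walk through setup."],
    ))
```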

IFrames are sometimes used to display content on web pages. Content displayed via iFrames may not be indexed and available to appear in Google's search results. We recommend that you avoid the use of iFrames to display content. If you do include iFrames, make sure to provide additional text-based links to the content they display, so that Googlebot can crawl and index this content.
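The last recommendation can be checked mechanically. The sketch below lists iFrame sources that have no matching text link on the same page. It is an illustration built on Python's standard `html.parser`; the names and sample markup are hypothetical, and it compares `src` and `href` values as literal strings without resolving relative URLs.

```python
from html.parser import HTMLParser


class FrameLinkAudit(HTMLParser):
    """Records iframe sources and the targets of ordinary <a href> links."""

    def __init__(self):
        super().__init__()
        self.iframe_sources = []
        self.link_targets = set()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "iframe" and attr_map.get("src"):
            self.iframe_sources.append(attr_map["src"])
        elif tag == "a" and attr_map.get("href"):
            self.link_targets.add(attr_map["href"])


def iframes_without_text_link(html):
    """Return iframe src values that no text link on the page points to."""
    audit = FrameLinkAudit()
    audit.feed(html)
    audit.close()
    return [src for src in audit.iframe_sources if src not in audit.link_targets]


# Hypothetical page fragment used only for illustration.
SAMPLE_PAGE = """
<iframe src="/reports/summary.html"></iframe>
<p><a href="/reports/summary.html">Read the summary report</a></p>
<iframe src="/widgets/map.html"></iframe>
"""

if __name__ == "__main__":
    for src in iframes_without_text_link(SAMPLE_PAGE):
        print("iFrame content has no text link:", src)
```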

Best practices
If you do plan to use rich media on your site, here are some recommendations that can help prevent problems.
  • Try to use rich media only where it is needed. We recommend that you use HTML for content and navigation. This makes your site more Google-friendly, and also makes it accessible to a larger audience including, for example, readers with visual impairments who rely on screen readers, users of old or non-standard browsers, and users on limited or low-bandwidth connections, such as those on a cellphone or other mobile device. An added bonus? Using HTML for navigation will allow users to bookmark content and send direct links in email.

  • Provide text versions of pages. Silverlight is often used as a splash screen on the home page, where the root URL of a website has a rich media intro that links to HTML content deeper into the site. If you use this approach on your website, make sure there is a regular HTML link on that front page to a text-based page where a user (or Googlebot) can navigate throughout your site without the need for rich media.

