YTCrawler Release 1

The YouTube Crawler is a work in progress; the software on this page is pre-release and subject to change. The software comes with no guarantees, and you use it at your own risk. The YouTube Crawler is free software, licensed under the terms of the GNU General Public License.

Contents

  1. Architecture overview
  2. Database clustering
  3. Using the crawler
  4. Understanding video feeds
  5. Setup guide
  6. Code information
  7. Download software

I. Architecture Overview

The YouTube Crawler is a software application that crawls video information from YouTube. The crawler can capture both static information, such as a video's title, author, duration and publishing date, and dynamic information, such as a video's view count, number of comments and rating. Static and dynamic video information is recorded differently: it is sufficient to crawl static data only once, whereas dynamic data should be crawled periodically. For this reason, the YouTube Crawler has three components: the master crawler, the slave crawler and the processing agent.

The crawler uses an Oracle database to store the crawled video information. The following figure illustrates the principles of the design.

Figure: YouTube Crawler architecture

The Master Crawler

The master is responsible for the initial crawling of YouTube videos and for recording the static video information. Video discovery is achieved using any of the standard or category-based YouTube video feeds. For example, standard feeds include the most popular, most recent and most viewed videos; category-based feeds include music, people, games, howto, etc. You can configure any combination of video feeds for the master crawling.

In addition, for a crawling deployment, you can use any number of masters in order to load-balance the crawling activity. If you need to identify the data crawled by an individual master, you can preconfigure each master with a unique identifier that will be stored in the database along with the video information.

The Slave Crawler

The slave is used to crawl dynamic video data. Like the master, multiple instances of the slave can be used at the same time to load-balance the crawling activity, each slave being assigned a unique identifier. For this purpose, each master should be configured with the identifiers of the available slaves, and will automatically assign newly crawled videos to them in a balanced manner.

Tip: You can provide redundancy in the crawled data by assigning the same identifier to two or more slaves.
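
The balanced assignment of videos to slaves can be pictured as a round-robin over the configured slave identifiers. The following Python sketch is only an illustration: the crawler itself is a .NET application, and the function name, the sample video identifiers and the round-robin policy are assumptions, not the crawler's actual code.

```python
from collections import Counter
from itertools import cycle


def assign_slaves(video_ids, slave_ids):
    """Pair each newly crawled video with a slave identifier, round-robin.

    Hypothetical helper: the real master may balance differently.
    """
    if not slave_ids:
        raise ValueError("at least one slave identifier is required")
    slaves = cycle(slave_ids)
    return {video_id: next(slaves) for video_id in video_ids}


# Made-up 11-character video identifiers, used only for the example.
videos = ["vid%08d" % n for n in range(6)]
assignment = assign_slaves(videos, [1, 2, 3])

# Number of videos each slave received.
load = Counter(assignment.values())
```

With six videos and three slave identifiers, each identifier receives two videos. Listing the same identifier for two running slaves, as in the tip above, would make both of them crawl the same share.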

The Processing Agent

This feature is under development.

The agent processes the crawled data live. Such processing will include assigning different crawling schedules to videos that meet certain criteria. For example, in a deployment scenario with two slaves, one can be configured to crawl dynamic information every hour for videos that have at least one view per day, while the second can be configured to crawl dynamic information every six hours for videos that have fewer than one view per day. Using the crawled data, the agent will periodically check the change in view count and assign each video to the corresponding slave.
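
Because the agent is still under development, the following Python sketch only illustrates the two-slave scenario above. The function names, the slave identifiers 1 and 2, and the way the view rate is computed from two snapshots are all assumptions.

```python
from datetime import datetime


def views_per_day(older, newer):
    """Average daily view gain between two (timestamp, views) snapshots."""
    (t0, v0), (t1, v1) = older, newer
    days = (t1 - t0).total_seconds() / 86400.0
    if days <= 0:
        raise ValueError("snapshots must be in chronological order")
    return (v1 - v0) / days


def pick_slave(rate, hourly_slave=1, six_hourly_slave=2, threshold=1.0):
    """Choose the slave for a video from its view rate (hypothetical policy)."""
    return hourly_slave if rate >= threshold else six_hourly_slave


# Six new views over two days: at least one view per day.
busy = views_per_day((datetime(2010, 5, 1), 100), (datetime(2010, 5, 3), 106))
# One new view over four days: fewer than one view per day.
quiet = views_per_day((datetime(2010, 5, 1), 100), (datetime(2010, 5, 5), 101))

busy_slave = pick_slave(busy)
quiet_slave = pick_slave(quiet)
```

In this example the first video would go to the slave that crawls every hour and the second to the slave that crawls every six hours.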

Other tasks of the agent will be to verify the crawled data, eliminate errors and perform additional post-processing.

II. Database Clustering

This is a new feature that will be made available in release 2.

III. Using the Crawler

The master, the slave and the processing agent are implemented within the same software application. However, you can use any number of running instances, on any number of computers, to accommodate the necessary number of masters, slaves and processing agents for your scenario.

Typically, you follow these steps to set up a crawling scenario:

Configure the Connection to the Database

To configure the connection to the Oracle database, enter the name of the database (see the setup section for details), the username and the password. You can test the connection by connecting to the database at any time. However, establishing an initial connection is not necessary, as the software will connect automatically whenever needed.

Figure: Configuring the database connection

Configure the Master and the Slave

Before you start crawling, you must configure the following information:

  • General crawler parameters: maximum number of videos and the number of slaves
  • Database parameters: the names of the database tables you use for crawling
  • Video discovery parameters: whether the base and global video feeds are crawled (see video feeds for more information)
  • Video feeds: specify which of the standard feeds are crawled
  • Master configuration: the crawling period, default video schedule (whether the video is active, i.e. crawled by a slave, or not), master identifier, maximum number of videos for one crawl, the collection of slaves (the master will assign the videos to them in a uniform manner)
  • Slave configuration: the crawling period, a number of crawling filters, slave identifier, number of parallel crawling operations, number of retries in case of errors

Set Up Video Categories

Optionally, you can configure any selection of video categories (see video feeds for more information).

Change Video Feeds

Optionally, you can also modify the order of the video feeds (see video feeds for more information).

Start the Master

Start the master crawler. The master will run periodically, according to the configured crawling period, until stopped.

Start the Slave

Start the slave crawler. The slave will run periodically, according to the configured crawling period, until stopped. To increase the crawling speed, the slave will spawn the configured number of parallel threads to perform the crawling.
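
The effect of the "number of parallel crawling operations" setting can be sketched with a thread pool. This is an illustration in Python, not the slave's actual implementation; `crawl_one` and `fake_crawl` are hypothetical stand-ins for the request that fetches one video's dynamic data.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_assigned_videos(video_ids, crawl_one, parallel_operations=4):
    """Crawl all videos with at most `parallel_operations` threads at once.

    Results are returned in the same order as `video_ids`.
    """
    with ThreadPoolExecutor(max_workers=parallel_operations) as pool:
        return list(pool.map(crawl_one, video_ids))


def fake_crawl(video_id):
    """Placeholder for a real request; returns a made-up snapshot."""
    return {"video": video_id, "views": len(video_id)}


snapshots = crawl_assigned_videos(["aaa", "bbbb", "cc"], fake_crawl, parallel_operations=2)
```

Because crawling is dominated by waiting on network responses, threads are a reasonable fit for this kind of workload.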

View Your Data

This feature may be under future development.

You can use the YouTube Crawler to analyze your data, either in real time during the crawling, or at the end, after the crawling has completed. Currently, you can view the following information:

  • The static information of all videos collected by the master (a number of filters are available to show only the information of interest)
  • The dynamic snapshots for the videos, collected by the slave
  • Statistical information on the number of crawled videos over the crawling period
  • Statistical information on the minimum and maximum view count for crawled videos

Tip: You can automatically locate the dynamic data snapshots for a video by using the 'View snapshots' button.

IV. Understanding Video Feeds

For the master crawling, you have to configure video feeds. The YouTube Crawler uses two main types of video feeds: standard feeds and category feeds. Standard feeds are associated with certain classes of videos, such as the most popular, the most viewed, the most recent, etc. The following table shows the set of standard feeds currently supported by YouTube and by the YouTube Crawler (see here for additional details).

Feed                      Information
Top rated                 URL: http://gdata.youtube.com/feeds/api/standardfeeds/top_rated
                          Description: This feed contains the most highly rated YouTube videos.
Top favorites             URL: http://gdata.youtube.com/feeds/api/standardfeeds/top_favorites
                          Description: This feed contains videos most frequently flagged as favorite videos.
Most viewed               URL: http://gdata.youtube.com/feeds/api/standardfeeds/most_viewed
                          Description: This feed contains the most frequently watched YouTube videos.
Most popular              URL: http://gdata.youtube.com/feeds/api/standardfeeds/most_popular
                          Description: This feed contains the most popular YouTube videos, selected using an algorithm that combines many different signals to determine overall popularity.
Most recent               URL: http://gdata.youtube.com/feeds/api/standardfeeds/most_recent
                          Description: This feed contains the videos most recently submitted to YouTube.
Most discussed            URL: http://gdata.youtube.com/feeds/api/standardfeeds/most_discussed
                          Description: This feed contains the YouTube videos that have received the most comments.
Most responded            URL: http://gdata.youtube.com/feeds/api/standardfeeds/most_responded
                          Description: This feed contains YouTube videos that receive the most video responses.
Recently featured         URL: http://gdata.youtube.com/feeds/api/standardfeeds/recently_featured
                          Description: This feed contains videos recently featured on the YouTube home page or featured videos tab.
Videos for mobile phones  URL: http://gdata.youtube.com/feeds/api/standardfeeds/watch_on_mobile
                          Description: This feed contains videos suitable for playback on mobile devices.

In addition to these standard feeds there exists a global videos feed:

Feed            Information
Global          URL: http://gdata.youtube.com/feeds/api/videos

Category feeds are defined by the user, based on the category keywords the user selects. In practice, these feeds should match actual video categories from YouTube, but you are allowed to define any combination of categories. The following table contains some examples of category feeds.

Category        Information
Movies          URL: http://gdata.youtube.com/feeds/api/videos/-/Movies
Entertainment   URL: http://gdata.youtube.com/feeds/api/videos/-/Entertainment
Music           URL: http://gdata.youtube.com/feeds/api/videos/-/Music

For crawling, you can use any combination of standard and category feeds. For instance, if you select the standard feeds most popular and most recent with the categories Movies and Music, the YouTube Crawler will crawl the following feeds:

Category              Information
Most popular movies   URL: http://gdata.youtube.com/feeds/api/standardfeeds/most_popular/-/Movies
Most popular music    URL: http://gdata.youtube.com/feeds/api/standardfeeds/most_popular/-/Music
Most recent movies    URL: http://gdata.youtube.com/feeds/api/standardfeeds/most_recent/-/Movies
Most recent music     URL: http://gdata.youtube.com/feeds/api/standardfeeds/most_recent/-/Music

In addition, you may also choose to crawl the global feed for the selected standard feeds, which will add the following feeds to the crawling:

Category        Information
Most popular    URL: http://gdata.youtube.com/feeds/api/standardfeeds/most_popular
Most recent     URL: http://gdata.youtube.com/feeds/api/standardfeeds/most_recent

Finally, you may also choose to crawl the base feed for the selected categories, which will add the following feeds to the crawling:

Category        Information
Movies          URL: http://gdata.youtube.com/feeds/api/videos/-/Movies
Music           URL: http://gdata.youtube.com/feeds/api/videos/-/Music
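
The way the selected feeds combine can be sketched in a few lines of Python. The URL patterns are taken from the tables above; the function name and its parameters are hypothetical and do not correspond to actual crawler settings.

```python
STANDARD_BASE = "http://gdata.youtube.com/feeds/api/standardfeeds"
VIDEOS_BASE = "http://gdata.youtube.com/feeds/api/videos"


def build_feed_urls(standard_feeds, categories, include_global=False, include_base=False):
    """Return the list of feed URLs for a selection of feeds and categories."""
    # Every standard feed combined with every category.
    urls = [
        "%s/%s/-/%s" % (STANDARD_BASE, feed, category)
        for feed in standard_feeds
        for category in categories
    ]
    # Global feed: each selected standard feed without a category.
    if include_global:
        urls += ["%s/%s" % (STANDARD_BASE, feed) for feed in standard_feeds]
    # Base feed: each selected category without a standard feed.
    if include_base:
        urls += ["%s/-/%s" % (VIDEOS_BASE, category) for category in categories]
    return urls


urls = build_feed_urls(
    ["most_popular", "most_recent"],
    ["Movies", "Music"],
    include_global=True,
    include_base=True,
)
```

For the example selection this yields eight URLs: the four combined feeds, the two standard feeds on their own and the two category feeds on their own.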

V. Setup Guide

System Requirements

To install and run the YouTube Crawler, you need the following:

  • Operating system: Microsoft Windows XP SP3 or newer
  • Oracle Instant Client for Windows (you can download the client from Oracle, free of charge, subject to accepting the Oracle Technology Network license agreement)
  • Microsoft .NET Framework 3.5 (contact me if you wish to run YouTube Crawler on an earlier version of .NET Framework)

In addition, you need an Oracle database. You can download Oracle 10g Express Edition for Linux x86 or Windows from Oracle, free of charge, subject to accepting the Oracle Technology Network license agreement. For development purposes, you can also download Oracle 11g.

Install and Set Up the Oracle Database

(this page is not a guideline for Oracle, although the process is straightforward most of the time; use the Oracle documentation for this step)

Install and Set Up the Oracle Client

(this page is not a guideline for Oracle, although the process is straightforward and easy most of the time; use the Oracle documentation for this step)

An easy way to connect from a client to an Oracle database is to configure and use TNS names, by modifying (or creating if necessary) the tnsnames.ora file in the Network/Admin folder of your installation. A sample file is usually available in the Samples subfolder. For example, your tnsnames.ora file should look something like this:

my_database =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = database_IP_address)(PORT = database_port))
    (CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = database_name))
  )

Usually, the port is 1521. You use my_database as the database name in the YouTube Crawler.

Create the Database Tables

The YouTube Crawler uses two tables: one for static video data and one for dynamic snapshot data. The master writes information to the videos table; the slave writes information to the snapshot table. The names of the tables can be customized and configured in the crawler. However, the structure of the tables is fixed in this version of the crawler.

The videos table contains the following columns:

Videos Table
Column Name     Type            Description
ID              CHAR (11)       YouTube video identifier
TITLE           VARCHAR2 (256)  Video title
AUTHOR          VARCHAR2 (256)  Video author
DURATION        INTEGER         Video duration in seconds
CATEGORY        VARCHAR2 (256)  Video category
PUBLISHED       DATE            Video published date/time
VIEWS           INTEGER         Number of views at the time of crawling
COMMENTS        INTEGER         Number of comments at the time of crawling
RATING          NUMBER (20,10)  Rating at the time of crawling
CRAWL_MASTER    INTEGER         The identifier of the master
CRAWL_TYPE      INTEGER         The crawling type (see below)
CRAWL_SLAVE     INTEGER         The identifier of the assigned slave
CRAWL_SCHEDULE  INTEGER         The type of slave crawling schedule: active or inactive
CRAWL_STRING    VARCHAR2 (256)  The crawling string (see below)
CRAWL_ERRORS    INTEGER         The number of crawling errors (see below)
CRAWL_ERROR     INTEGER         The crawling error (see below)
CRAWL_FIRST     DATE            The date/time of the first crawling
CRAWL_LAST      DATE            The date/time of the last crawling
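
As a starting point for creating the videos table, the following Python sketch builds a CREATE TABLE statement from the column list above. The table name YT_VIDEOS is a placeholder, and the sketch assumes no keys or constraints, because this page does not specify any; adjust the statement to your needs before running it against your database.

```python
# Column names and Oracle types copied from the videos table above.
VIDEOS_COLUMNS = [
    ("ID", "CHAR(11)"),
    ("TITLE", "VARCHAR2(256)"),
    ("AUTHOR", "VARCHAR2(256)"),
    ("DURATION", "INTEGER"),
    ("CATEGORY", "VARCHAR2(256)"),
    ("PUBLISHED", "DATE"),
    ("VIEWS", "INTEGER"),
    ("COMMENTS", "INTEGER"),
    ("RATING", "NUMBER(20,10)"),
    ("CRAWL_MASTER", "INTEGER"),
    ("CRAWL_TYPE", "INTEGER"),
    ("CRAWL_SLAVE", "INTEGER"),
    ("CRAWL_SCHEDULE", "INTEGER"),
    ("CRAWL_STRING", "VARCHAR2(256)"),
    ("CRAWL_ERRORS", "INTEGER"),
    ("CRAWL_ERROR", "INTEGER"),
    ("CRAWL_FIRST", "DATE"),
    ("CRAWL_LAST", "DATE"),
]


def create_table_sql(table_name, columns):
    """Build a plain CREATE TABLE statement (no keys or constraints)."""
    body = ",\n".join("  %s %s" % (name, sql_type) for name, sql_type in columns)
    return "CREATE TABLE %s (\n%s\n)" % (table_name, body)


# YT_VIDEOS is a hypothetical table name; use the name you configure in the crawler.
videos_sql = create_table_sql("YT_VIDEOS", VIDEOS_COLUMNS)
print(videos_sql)
```

The printed statement can be pasted into an Oracle SQL tool. The same helper can be reused for the snapshot table by passing its column list.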

Note 1: If a video is crawled by a master (or multiple masters) several times, the original information is always kept, except for the CRAWL_LAST field, which is updated with the time of the last crawl.
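
The rule in Note 1 can be sketched as follows, using a Python dictionary in place of the videos table. The function name and the dictionary representation are assumptions made for the example only.

```python
def record_video(table, video_id, static_info, crawl_time):
    """Insert a newly discovered video, or only refresh CRAWL_LAST for a known one."""
    row = table.get(video_id)
    if row is None:
        # First crawl: store everything and set both timestamps.
        table[video_id] = dict(static_info, CRAWL_FIRST=crawl_time, CRAWL_LAST=crawl_time)
    else:
        # Repeated crawl: keep the original information, update the last crawl time.
        row["CRAWL_LAST"] = crawl_time


table = {}
record_video(table, "vid00000001", {"TITLE": "First title", "CRAWL_MASTER": 1}, "2010-05-01")
record_video(table, "vid00000001", {"TITLE": "Other title", "CRAWL_MASTER": 2}, "2010-05-02")
```

After the second call, the row still holds the title and master identifier from the first crawl, and only CRAWL_LAST has changed.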

Note 2: The CRAWL_TYPE field is always zero (0) in the pre-release version of the YouTube Crawler. The value represents feed crawling, which is the only option available at this time. For the release version, the YouTube Crawler will also support crawling related videos, response videos, user subscribed videos, user favorite videos and user playlist videos. The CRAWL_TYPE field will have different values for these types of crawling.

Note 3: In the pre-release version of the YouTube Crawler, which supports only feed crawling (type 0), the CRAWL_STRING field contains the feed URL used to crawl the video. For the release version, the CRAWL_STRING field will also contain the identifier of the video used to crawl related and response videos, or the username used to crawl subscribed, favorite and playlist videos.

Note 4: The CRAWL_ERRORS field indicates the number of video parameters that were missing in the YouTube response.

Note 5: The CRAWL_ERROR field indicates the video parameters that were missing in the YouTube response. The value of this field is a binary map to all video parameters except the video identifier, as indicated by the following table (0 indicates success, 1 indicates error):

Bit from LSB  Missing Video Parameter
0             Title
1             Author
2             Duration
3             Category
4             Published
5             Views
6             Comments
7             Rating
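
When you analyze the videos table, a CRAWL_ERROR value can be decoded back into parameter names by testing each bit. The following Python sketch follows the bit table above; the function name is an assumption.

```python
# Index in this list equals the bit position, counted from the LSB.
VIDEO_PARAMETERS = [
    "Title", "Author", "Duration", "Category",
    "Published", "Views", "Comments", "Rating",
]


def missing_video_parameters(crawl_error):
    """Return the names of the parameters whose bit is set in CRAWL_ERROR."""
    return [
        name
        for bit, name in enumerate(VIDEO_PARAMETERS)
        if (crawl_error >> bit) & 1
    ]


# 0b10000100 has bits 2 and 7 set: duration and rating were missing.
missing = missing_video_parameters(0b10000100)
# The number of set bits corresponds to the CRAWL_ERRORS value.
error_count = len(missing)
```

A value of zero decodes to an empty list, meaning that all parameters were present in the response.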

The snapshot table contains the following columns:

Snapshots Table
Column Name     Type            Description
TIMESTAMP       DATE            The date/time of the crawling
VIDEO           CHAR (11)       YouTube video identifier
VIEWS           INTEGER         Number of views at the time of crawling
COMMENTS        INTEGER         Number of comments at the time of crawling
RATING          NUMBER (20,10)  Rating at the time of crawling
SLAVE           INTEGER         The identifier of the slave
ERRORS          INTEGER         The number of crawling errors (see below)
ERROR           INTEGER         The crawling error (see below)
RETRIES         INTEGER         The number of retries (see below)

Note 1: The ERRORS field indicates the number of video parameters that were missing in the YouTube response.

Note 2: The ERROR field indicates the information that was missing in the YouTube response. The value of this field is a binary map to the missing information, as indicated by the following table (0 indicates success, 1 indicates error):

Bit from LSB  Missing Video Information
0             All information: YouTube did not return a response to the crawling request. When this bit is set to one (1), all other bits are set to zero (0), and the number of errors is set to one (1).
1             Views
2             Comments
3             Rating

Note 3: When any of the video parameters or the video itself was missing in the YouTube response, the slave will automatically retry a limited number of times (by default, 2) to crawl the video during the same crawling session. The RETRIES field indicates the number of retries.
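
The interplay between the ERROR, ERRORS and RETRIES fields can be sketched as a retry loop. This Python example is an illustration under assumptions: `fetch` stands in for the request to YouTube and returns either None (no response) or a dictionary of values, and the function names are hypothetical.

```python
NO_RESPONSE = 1  # bit 0: YouTube returned no response at all
SNAPSHOT_BITS = {1: "views", 2: "comments", 3: "rating"}


def error_bitmap(response):
    """Compute the ERROR value for one response (None means no response)."""
    if response is None:
        return NO_RESPONSE
    bitmap = 0
    for bit, field in SNAPSHOT_BITS.items():
        if response.get(field) is None:
            bitmap |= 1 << bit
    return bitmap


def crawl_with_retries(fetch, video_id, max_retries=2):
    """Fetch a snapshot, retrying while information is missing."""
    retries = 0
    while True:
        response = fetch(video_id)
        error = error_bitmap(response)
        if error == 0 or retries == max_retries:
            break
        retries += 1
    return {
        "VIDEO": video_id,
        "ERROR": error,
        "ERRORS": bin(error).count("1"),  # number of set bits
        "RETRIES": retries,
    }


# Simulated responses: no response, then a missing rating, then a complete answer.
responses = iter([
    None,
    {"views": 10, "comments": 2},
    {"views": 10, "comments": 2, "rating": 4.5},
])
row = crawl_with_retries(lambda video_id: next(responses), "vid00000001")
```

In this simulation the third attempt succeeds, so the row records two retries and no error. If every attempt had failed with no response, the row would record ERROR = 1 and ERRORS = 1 after two retries.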

Install YouTube Crawler

Download and use the setup MSI file to install YouTube Crawler on your computer. The setup program should detect any dependencies that are missing and you will be prompted to install them. Once you complete this step you can start YouTube Crawler and you are good to go.

VI. Code Information

Code information is not yet available for the pre-release version.

VII. Download Software

You can download the software installer for the pre-release version of YouTube Crawler.

Download installer (MSI/2.50 MB)

For the pre-release version, the code is not yet available for download. Contact me if you wish to obtain the open source code from the latest development build.

Last updated: May 26, 2010