October 9, 2008 by Alex Polski

How to develop a good scraper on Perl – Lesson 3

The earlier parts of this series are How to develop a good scraper on Perl – Lesson 1 and How to develop a good scraper on Perl – Lesson 2.

A scraper that runs in a single thread can be too slow when you need to scrape a large site in a short time. Perl's threads module (available when perl is built with thread support) lets you run as many threads as you need simultaneously:

use threads;

#run 'scrape' function in 10 threads
my @threads;
for (1..10) {
  my $thr = threads->create(\&scrape);
  $thr->detach;
  push @threads, $thr;
}
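If the main script needs the workers' results, or simply must not exit before they finish, a joinable variant can be more convenient than detach. This is a minimal sketch, assuming a perl built with thread support. The scrape body here is a hypothetical stand-in that only returns a string:

```perl
use strict;
use warnings;
use threads;

# Hypothetical stand-in for the real scraper: it just reports its worker id.
sub scrape {
    my ($id) = @_;
    return "worker $id finished";
}

# Start 10 workers and keep the thread objects so they can be joined later.
my @threads = map { threads->create(\&scrape, $_) } 1 .. 10;

# join() blocks until the thread ends and hands back what scrape() returned.
my @results = map { $_->join } @threads;

print "$_\n" for @results;
```

Detached threads are killed when the main script exits, so detach only makes sense when something else keeps the script alive.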

By default each thread works on its own copy of every variable. To share a variable between threads, mark it as shared:

use threads::shared;

my $counter :shared = 0;

sub scrape {
  ...
  {
    #lock 'counter' variable and modify its value
    lock $counter;
    $counter++;
  }
  #here 'counter' variable will be unlocked
  ...
}
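To see the locking in action, here is a small self-contained sketch. The worker count and loop size are arbitrary values chosen for the illustration, and the scrape body only increments the counter instead of doing any real work:

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $counter :shared = 0;

sub scrape {
    for (1 .. 1000) {
        # The lock lasts until the end of the enclosing block,
        # which here is one loop iteration.
        lock($counter);
        $counter++;
    }
    return;
}

# Four workers, each incrementing the shared counter 1000 times.
my @workers = map { threads->create(\&scrape) } 1 .. 4;
$_->join for @workers;

print "counter = $counter\n";   # prints "counter = 4000"
```

Without the lock, two threads could read the same old value and one of the increments would be lost.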

It is a good idea to start one additional thread that monitors the others and restarts any that have died:

sub control {
  while (1) {
    for (my $i = 0; $i < 10; $i++) {
      #check if the thread is alive
      if (!defined $threads[$i] || !$threads[$i]->is_running) {
        #if the thread is dead, start it again
        print "Thread $i has stopped! Restarting...\n";
        $threads[$i] = threads->create(\&scrape);
        $threads[$i]->detach if defined $threads[$i];
      }
      sleep(1);
    }
  }
}
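The loop above restarts a worker whenever it stops, whatever the reason. If you want to restart only the workers that crashed, one possible variation is to keep the threads joinable and check error() after join. The sketch below is only an illustration under that assumption: the simulated crash in slot 2, the %attempts and $finished bookkeeping, and the 0.1 second pause are all made up for the example.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Time::HiRes qw(sleep);

my %attempts :shared;        # how many times each slot has been started
my $finished :shared = 0;    # how many workers ended normally

sub scrape {
    my ($slot) = @_;
    my $attempt;
    { lock(%attempts); $attempt = ++$attempts{$slot}; }
    # Simulated failure: slot 2 crashes on its first attempt only.
    die "slot $slot crashed\n" if $slot == 2 && $attempt == 1;
    { lock($finished); $finished++; }
    return;
}

my @threads  = map { threads->create(\&scrape, $_) } 0 .. 2;
my $restarts = 0;

# Supervise until every slot has ended normally.
while (grep { defined } @threads) {
    for my $i (0 .. $#threads) {
        my $thr = $threads[$i];
        next unless defined $thr;
        next if $thr->is_running;
        $thr->join;
        if ($thr->error) {
            # The worker died: start a fresh one in the same slot.
            $threads[$i] = threads->create(\&scrape, $i);
            $restarts++;
        }
        else {
            $threads[$i] = undef;   # finished normally, nothing to restart
        }
    }
    sleep(0.1);
}

print "restarts: $restarts, finished: $finished\n";
```

The crashed thread also prints a "terminated abnormally" warning to STDERR, which is useful for logging.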

The last useful tool is the thread queue. You add URLs to the queue, and your threads take them from it one by one, or wait while the queue is empty:

use Thread::Queue;

#create queue
my $data_queue = Thread::Queue->new();

sub scrape {
  #get url from the queue or wait if the queue is empty
  while (my $params = $data_queue->dequeue) {
    my $url = $params->[0];
    ...
  }
}

#add url to the queue
$data_queue->enqueue(['http://www.example.com/']);
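Putting the pieces together, here is a rough sketch of a small worker pool fed from a queue. It assumes a reasonably recent Thread::Queue that accepts array references directly. The page URLs, the @seen array and the use of one undef per worker as a stop signal are choices made for this example, and the actual fetching is left out:

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my $data_queue = Thread::Queue->new();
my @seen :shared;    # URLs the workers have handled

sub scrape {
    # dequeue blocks while the queue is empty; undef means "no more work".
    while (defined(my $params = $data_queue->dequeue)) {
        my $url = $params->[0];
        # A real scraper would fetch and parse $url here.
        lock(@seen);
        push @seen, $url;
    }
    return;
}

my $workers = 3;
my @threads = map { threads->create(\&scrape) } 1 .. $workers;

# Add some URLs to the queue.
$data_queue->enqueue(["http://www.example.com/page$_"]) for 1 .. 5;

# One undef per worker lets every thread leave its loop.
$data_queue->enqueue(undef) for 1 .. $workers;

$_->join for @threads;
print scalar(@seen), " URLs processed\n";
```

The order in which URLs end up in @seen depends on thread scheduling, so do not rely on it.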

You can find the documentation for all these modules here: threads class, threads tutorial, threads::shared, Thread::Queue.

This entry was posted on Thursday, October 9, 2008 at 1:25 pm and is filed under Scraping.
