October 9, 2008 by Alex Polski
How to develop a good scraper on Perl - Lesson 3
The earlier parts of this series are How to develop a good scraper on Perl - Lesson 1 and How to develop a good scraper on Perl - Lesson 2.
If your scraper runs in a single thread, it can be too slow to scrape a large site in a short time. Perl's threads module lets you run as many worker threads as you want simultaneously:
use threads;
# run the 'scrape' function in 10 threads
my @threads;
for (1..10) {
    my $thr = threads->create(\&scrape);
    # a detached thread cleans up after itself and cannot be joined later
    $thr->detach;
    push @threads, $thr;
}
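If you need the results of the workers rather than fire-and-forget threads, you can keep the threads joinable instead of detaching them. The following is only a minimal sketch: the body of scrape and its $id argument are made up for illustration and stand in for the real scraping code.

```perl
use strict;
use warnings;
use threads;

# Hypothetical worker: a real scraper would fetch and parse pages here.
sub scrape {
    my ($id) = @_;
    return "worker $id finished";
}

# Start 10 joinable threads, passing each one its own number.
my @threads = map { threads->create(\&scrape, $_) } 1 .. 10;

# join waits for a thread to end and hands back its return value.
my @results = map { $_->join } @threads;

print "$_\n" for @results;
```

The trade-off is that every joinable thread has to be joined (or detached) at some point, otherwise Perl warns about unjoined threads when the program exits.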
You can share variables between threads:
use threads::shared;
my $counter :shared = 0;
sub scrape {
    ...
    {
        # lock the 'counter' variable and modify its value
        lock $counter;
        $counter++;
    }
    # here the 'counter' variable is unlocked again
    ...
}
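To see the lock at work, here is a small self-contained sketch. It assumes a made-up scrape that does nothing but count, and it joins the threads (instead of detaching them) so that the main program can read the final value safely.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $counter :shared = 0;

# Hypothetical worker: it only counts "pages", it does not download anything.
sub scrape {
    for (1 .. 1000) {
        # the lock is released at the end of this loop body
        lock($counter);
        $counter++;
    }
    return;
}

# 10 workers, each adding 1000 to the shared counter
my @workers = map { threads->create(\&scrape) } 1 .. 10;
$_->join for @workers;

print "Pages processed: $counter\n";
```

Without the lock, two threads could read the same old value and one of the increments would be lost.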
It’s a good idea to start one additional thread that watches the others and restarts any that have died:
sub control {
    while (1) {
        for (my $i = 0; $i < 10; $i++) {
            # check whether the thread is alive
            if (!defined $threads[$i] || !$threads[$i]->is_running) {
                # the thread is dead, so start it again
                print "A thread has stopped! Restarting...\n";
                $threads[$i] = threads->create(\&scrape);
                $threads[$i]->detach if defined $threads[$i];
            }
            sleep(1);
        }
    }
}
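The same idea can be tried out in a small program that actually terminates. This is just a sketch under a few assumptions of mine: the workers are joinable rather than detached, the supervising loop runs in the main thread for a fixed number of passes, and scrape is a dummy that exits immediately so there is something to restart.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $runs :shared = 0;    # how many times a worker has been started and run

# Hypothetical worker that ends right away.
sub scrape {
    lock($runs);
    $runs++;
    return;
}

my @threads = map { threads->create(\&scrape) } 1 .. 3;

# Two supervising passes instead of an endless loop
for my $pass (1 .. 2) {
    for my $i (0 .. $#threads) {
        next if $threads[$i]->is_running;   # still working, leave it alone
        $threads[$i]->join;                 # reap the finished thread
        print "Thread $i has stopped, restarting...\n";
        $threads[$i] = threads->create(\&scrape);
    }
    sleep(1);
}

# Wait for whatever is still running before exiting
$_->join for @threads;
print "Workers were run $runs times\n";
```

How many restarts happen depends on timing, because a worker may still be running when it is checked.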
And the last tasty thing is thread queues. You just add URLs to the queue, and your threads take them from the queue or wait while the queue is empty:
use Thread::Queue;
# create the queue
my $data_queue = Thread::Queue->new();
sub scrape {
    # get a url from the queue, or wait if the queue is empty
    while (my $params = $data_queue->dequeue) {
        my $url = $params->[0];
        ...
    }
}
# add a url to the queue
$data_queue->enqueue(['http://www.example.com/']);
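One thing the snippet above leaves open is how the workers ever stop, since dequeue blocks forever on an empty queue. A common way, sketched below, is to enqueue one undef per worker as a stop signal. The URLs, the @seen array and the dummy scrape body are placeholders I made up; a real worker would download the page instead of just recording the URL.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my $data_queue = Thread::Queue->new();
my @seen :shared;    # URLs handled by the workers (for illustration only)

# Hypothetical worker: records each URL instead of fetching it.
sub scrape {
    # dequeue blocks while the queue is empty; undef means "stop"
    while (defined(my $params = $data_queue->dequeue)) {
        my $url = $params->[0];
        lock(@seen);
        push @seen, $url;
    }
    return;
}

my @workers = map { threads->create(\&scrape) } 1 .. 3;

# Placeholder URLs
$data_queue->enqueue(["http://www.example.com/page$_"]) for 1 .. 5;

# One stop signal per worker, then wait for all of them
$data_queue->enqueue(undef) for @workers;
$_->join for @workers;

print scalar(@seen), " URLs processed\n";
```

Because the queue is first in, first out, the stop signals are only reached after all the URLs queued before them have been taken.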
You can find the documentation for all these classes here: threads class, threads tutorial, threads::shared, Thread::Queue.