Archive for the ‘Scraping’ Category
January 25, 2009 by Alex Polski
Scrapy framework
Paul Maunders told me about Scrapy, a cool scraping and crawling framework written in Python. I’ve tried it and developed about 20 spiders. It’s really nice, but it would be great if Scrapy’s developers added a few more features:
- Rules for form submitting
- Proxy server support
October 9, 2008 by Alex Polski
How to develop a good scraper on Perl – Lesson 3
The earlier parts of this series are here: How to develop a good scraper on Perl – Lesson 1 and How to develop a good scraper on Perl – Lesson 2.
If your scraper runs in a single thread, scraping a large site can take a very long time. Perl has a set of modules that let you run as many threads as you want simultaneously:
use threads;
#run the 'scrape' function in 10 threads
my @threads;
for (1..10) {
    my $thr = threads->create(\&scrape);
    $thr->detach;
    push @threads, $thr;
}
You can share variables between threads:
use threads::shared;
my $counter :shared = 0;
sub scrape {
    ...
    {
        #lock the 'counter' variable and modify its value
        lock $counter;
        $counter++;
    }
    #here the 'counter' variable is unlocked again
    ...
}
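To see the locking in action without touching the network, here is a minimal sketch. It assumes a Perl built with thread support, and the scrape function here is a hypothetical stand-in that only increments the shared counter; the worker and page counts are arbitrary. Unlike the detached threads above, these workers are joined so the main program waits for them to finish.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Shared counter, visible to every thread
my $counter :shared = 0;

# Hypothetical stand-in for the real scrape(): no network access,
# it only bumps the shared counter once per "page"
sub scrape {
    my ($pages) = @_;
    for (1 .. $pages) {
        # The lock is released at the end of each loop iteration
        lock($counter);
        $counter++;
    }
    return;
}

# Start 4 workers that each "process" 250 pages, then wait for all of them
my @workers = map { threads->create(\&scrape, 250) } 1 .. 4;
$_->join for @workers;

print "Pages processed: $counter\n";
```

Because every increment happens under the lock, the final count should equal the number of workers multiplied by the pages per worker.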
It’s a good idea to start one additional thread that monitors the others and restarts any that have died:
sub control {
    while (1) {
        for (my $i = 0; $i < 10; $i++) {
            #check whether the thread is alive
            if (!defined $threads[$i] || !$threads[$i]->is_running) {
                #if the thread is dead, start it again
                print "Thread $i has stopped! Restarting...\n";
                $threads[$i] = threads->create(\&scrape);
                $threads[$i]->detach if defined $threads[$i];
            }
            sleep(1);
        }
    }
}
And the last tasty thing is thread queues. You just add URLs to the queue, and your threads take them from the queue or wait if it is empty:
use Thread::Queue;
#create the queue
my $data_queue = Thread::Queue->new();
sub scrape {
    #get a url from the queue, or wait if the queue is empty
    while (my $params = $data_queue->dequeue) {
        my $url = $params->[0];
        ...
    }
}
#add a url to the queue
$data_queue->enqueue(['http://www.example.com/']);
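The pieces above can be combined into a small worker pool. This is only a sketch under a few assumptions: Perl is built with thread support, the worker records each URL instead of downloading it, the page URLs are placeholders, and an undef item is used as a home-made stop signal so that the workers can be joined.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my $data_queue = Thread::Queue->new();
my @done :shared;    # URLs the workers have handled

# Hypothetical worker: records each URL instead of downloading it
sub scrape {
    # dequeue blocks while the queue is empty; undef is our stop signal
    while (defined(my $params = $data_queue->dequeue)) {
        my $url = $params->[0];
        lock(@done);
        push @done, $url;
    }
    return;
}

my $worker_count = 3;
my @workers = map { threads->create(\&scrape) } 1 .. $worker_count;

# Placeholder URLs for illustration
$data_queue->enqueue([ "http://www.example.com/page$_" ]) for 1 .. 10;

# One undef per worker tells each of them to leave its loop
$data_queue->enqueue(undef) for 1 .. $worker_count;

$_->join for @workers;
print scalar(@done), " URLs processed\n";
```

Each worker consumes exactly one stop signal, so all of them exit once the real URLs are used up, and join then returns for every thread.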
You can find the documentation for all these classes here: threads class, threads tutorial, threads::shared, Thread::Queue.
September 24, 2008 by Alex Polski
How to develop a good scraper on Perl – Lesson 2
The first part of this series is here: How to develop a good scraper on Perl – Lesson 1.
2. The TreeBuilder class converts an HTML page into a tree structure and lets you perform operations on the tree nodes, such as searching and walking through them. If you add XPath support to TreeBuilder, you get a very powerful tool for HTML parsing. Look at the example below.
use HTML::TreeBuilder::XPath;
#create a treebuilder object and parse the html code from the content variable
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
#find all 'div/h1' nodes in the tree; the div node must have a 'class' attribute
#that matches the '/details\d+/i' regular expression
my $name;
if (my @name_nodes = $tree->findnodes('//div[@class=~/details\d+/i]/h1')) {
    #get the trimmed text value from the first result node
    $name = $name_nodes[0]->as_trimmed_text;
}
You can find the full documentation here: HTML::TreeBuilder (http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/TreeBuilder.pm), HTML::TreeBuilder::XPath
August 29, 2008 by Alex Polski
How to develop a good scraper on Perl – Lesson 1
If you want to develop a good scraper, Perl can be a very good choice. It has everything you need for the job: the Mechanize library, the TreeBuilder class and thread support.
1. Mechanize is a comprehensive library for automating interaction with websites. It simulates a user’s activity, such as clicking on links and submitting forms, and has a lot of other useful features. Let’s look at the code:
use WWW::Mechanize;
#create mechanize object
my $mech = WWW::Mechanize->new();
#set user agent string
$mech->agent('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');
#go to http://www.example.com/
$mech->get('http://www.example.com/');
#click on the link 'Some text'
$mech->follow_link(text => 'Some text');
#fill in and submit the form
$mech->submit_form(
    form_name => 'search',
    fields => { query => 'Some text' },
    button => 'Search Now'
);
Of course, the example above uses only the basic features; you can find the full list in the WWW::Mechanize documentation.