September 24, 2008 by Alex Polski

How to develop a good scraper on Perl - Lesson 2

The beginning of this post is here: How to develop a good scraper on Perl - Lesson 1

2. The HTML::TreeBuilder class converts an HTML page into a tree structure and lets you operate on the tree nodes: searching, walking through the nodes and so on. Add XPath support with HTML::TreeBuilder::XPath and you get a very powerful tool for HTML parsing. Look at the example below.

use HTML::TreeBuilder::XPath;

# create a TreeBuilder object and parse the HTML code held in $content
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

# find every h1 that is a direct child of a div whose 'class' attribute
# matches the regular expression /details\d+/i
my $name;
if (my @name_nodes = $tree->findnodes('//div[@class=~/details\d+/i]/h1')) {
  # take the trimmed text value of the first matching node
  $name = $name_nodes[0]->as_trimmed_text;
}

# release the tree when you are done with it
$tree->delete;
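To make the searching and walking operations mentioned above more concrete, here is a minimal self-contained sketch. The sample HTML, the class names and the variable names are made up for illustration; only the module calls (new_from_content, findnodes, findvalue, look_down, parent, descendants, as_trimmed_text, delete) come from HTML::TreeBuilder::XPath and HTML::Element. Treat it as one possible way to combine them, not as the only way to structure a scraper.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page; the markup and values are invented for this sketch.
my $content = <<'HTML';
<html><body>
  <div class="Details12">
    <h1>  Blue Widget </h1>
    <ul>
      <li class="price">19.99</li>
      <li class="stock">In stock</li>
    </ul>
  </div>
  <div class="footer"><h1>Not a product</h1></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

# Searching with XPath: the first h1 under a div whose class matches the regex.
my ($h1) = $tree->findnodes('//div[@class=~/details\d+/i]/h1');
my $name = $h1 ? $h1->as_trimmed_text : undef;

# findvalue returns the text of the matched nodes directly as a string.
my $price = $tree->findvalue('//li[@class="price"]');

# Searching without XPath: look_down is the plain HTML::Element search method.
my @items = map { $_->as_trimmed_text } $tree->look_down(_tag => 'li');

# Walking the tree: collect the tag name of every element below that div.
my @tags;
if ($h1) {
    my $div = $h1->parent;
    push @tags, $_->tag for $div->descendants;
}

# Free the tree explicitly once the values have been copied out.
$tree->delete;

print "name:  ", (defined $name ? $name : '(not found)'), "\n";
print "price: $price\n";
print "items: ", join(', ', @items), "\n";
print "tags:  ", join(' ', @tags), "\n";
```

The values are copied into plain scalars and arrays before the tree is deleted, so nothing refers to tree nodes afterwards.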

You can find the full documentation here: HTML::TreeBuilder (http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/TreeBuilder.pm), HTML::TreeBuilder::XPath

This entry was posted on Wednesday, September 24, 2008 at 7:51 am and is filed under Scraping.

One Response to “How to develop a good scraper on Perl - Lesson 2”

Mark
Posted on September 24th, 2008 at 10:34 pm

Great! I’m hoping in part 3 you can give us some code examples for using JavaScript::SpiderMonkey to interpret pages that hide content behind JavaScript!
