Pooya


HTML::TreeBuilder

I have found HTML::TreeBuilder to be a useful, easy-to-use Perl module for filtering unwanted tags out of HTML. Here is an example.
I was browsing http://www.hottest-lyrics.com and found it a good place to download the lyrics of my favorite singers. It groups the lyrics by album, and the pages can be downloaded with a recursive wget:


$ wget -r http://www.hottest-lyrics.com/s.css
$ wget -k -E -r -np -b -t 0 -l 50 \
       -U "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)" \
       http://www.hottest-lyrics.com/l/loreena-mckennitt-lyrics-2376.html

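After downloading the lyrics, I noticed that the HTML sources were full of JavaScript and images. So I used the following Perl script to clean them up. It reads file names from standard input, one per line, and rewrites each file in place: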

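#!/usr/bin/perl

use strict;
use warnings;
use HTML::TreeBuilder;

# Read one file name per line from STDIN and clean each file in place.
while (my $file_name = <STDIN>) {
        chomp $file_name;
        my $tree = HTML::TreeBuilder->new;
        $tree->parse_file($file_name)
                or die "Cannot read $file_name: $!\n";

        # Collect and delete every img, script, embed and object element.
        my @tags = $tree->look_down('_tag', 'img');
        push @tags, $tree->look_down('_tag', 'script');
        push @tags, $tree->look_down('_tag', 'embed');
        push @tags, $tree->look_down('_tag', 'object');
        foreach my $t (@tags) {
                $t->delete();
        }
        my $out = $tree->as_HTML;

        # Build the relative path from this file to the local copy of s.css.
        my $styleloc = $file_name;
        $styleloc =~ s/^\.\///;             # drop a leading "./"
        $styleloc =~ s/^\///;               # drop a leading "/"
        $styleloc =~ s/[^\/]*\//..\//g;     # one "../" per directory level
        $styleloc =~ s/[^\/]*$/s.css/;      # replace the file name with s.css
        $styleloc =~ s/^\.\.\///;           # the first level is the mirror directory itself

        # Point the stylesheet link at the local copy instead of the site.
        $out =~ s/http:\/\/www\.hottest-lyrics\.com\/s\.css/$styleloc/gi;

        open my $fout, '>', $file_name
                or die "Cannot write $file_name: $!\n";
        print {$fout} $out;
        close $fout;

        $tree->delete;                      # free the parse tree
        print "$file_name finished\n";
}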
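
The chain of substitutions that builds $styleloc is easier to follow in isolation. Here is a minimal sketch that pulls it out into a helper. The name relative_stylesheet_path is made up for illustration and does not appear in the script above. It also assumes that every file name starts with the mirror directory wget created (www.hottest-lyrics.com) and that s.css sits directly inside that directory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): given the path of a
# mirrored page, return the relative path from that page to the s.css file
# stored at the top of the mirror directory.
sub relative_stylesheet_path {
    my ($file_name) = @_;
    my $path = $file_name;
    $path =~ s{^\./}{};           # drop a leading "./"
    $path =~ s{^/}{};             # drop a leading "/"
    $path =~ s{[^/]*/}{../}g;     # one "../" per directory level
    $path =~ s{[^/]*\z}{s.css};   # replace the file name with s.css
    $path =~ s{^\.\./}{};         # the first level is the mirror directory itself
    return $path;
}

for my $page (
    './www.hottest-lyrics.com/index.html',
    './www.hottest-lyrics.com/l/loreena-mckennitt-lyrics-2376.html',
) {
    printf "%s -> %s\n", $page, relative_stylesheet_path($page);
}
```

Under those assumptions, a page at the top of the mirror gets s.css and a page one directory down (such as the ones under l/) gets ../s.css, which is the value the script substitutes for the absolute stylesheet URL.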




Posted to Programming by pooya at March 7, 2004 07:18 AM



