WWW-Crawler-Mojo

WWW::Crawler::Mojo is a web crawling framework written in Perl on top of the Mojo toolkit, allowing you to write your own crawler rapidly.

This software is considered alpha quality and is not recommended for regular use.

Features

Requirements

Usage

use WWW::Crawler::Mojo;

my $bot = WWW::Crawler::Mojo->new;

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    $scrape->() if (...); # collect URLs from this document
});

$bot->on(refer => sub {
    my ($bot, $enqueue, $job, $context) = @_;
    
    $enqueue->() if (...); # enqueue this job
});

$bot->enqueue('http://example.com/');
$bot->crawl;

Installation

$ cpanm WWW::Crawler::Mojo

Documentation

Examples

Restrict enqueuing URLs by depth.

$bot->on(refer => sub {
    my ($bot, $enqueue, $job, $context) = @_;
    
    $enqueue->() if ($job->depth < 5);
});

Restrict enqueuing URLs by host.

$bot->on(refer => sub {
    my ($bot, $enqueue, $job, $context) = @_;
    
    $enqueue->() if $job->resolved_uri->host eq 'example.com';
});

Restrict enqueuing URLs by the referrer's host.

$bot->on(refer => sub {
    my ($bot, $enqueue, $job, $context) = @_;
    
    $enqueue->() if $job->referrer->resolved_uri->host eq 'example.com';
});

Exclude URLs from enqueuing by path.

$bot->on(refer => sub {
    my ($bot, $enqueue, $job, $context) = @_;
    
    $enqueue->() unless ($job->resolved_uri->path =~ qr{^/foo/});
});

Restrict following URLs by host on the response event.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    $scrape->() if ($job->resolved_uri->host eq 'example.com');
});
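
The conditions shown above can also be combined in a single handler. The following is a minimal sketch, not part of the module: should_enqueue is a hypothetical helper name, and the depth limit, host, and path prefix are simply the illustrative values reused from the examples above. Keeping the decision in a plain sub that takes ordinary values makes it easy to check without running a crawl.

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by WWW::Crawler::Mojo): decides whether
# a URL should be enqueued, given plain values taken from the job.
sub should_enqueue {
    my ($depth, $host, $path) = @_;
    return 0 unless $depth < 5;                          # depth limit
    return 0 unless defined $host
                 && $host eq 'example.com';              # stay on one host
    return 0 if $path =~ m{^/foo/};                      # skip an excluded path
    return 1;
}

# Wiring it into the refer event, using only the accessors shown above:
#
#   $bot->on(refer => sub {
#       my ($bot, $enqueue, $job, $context) = @_;
#       my $uri = $job->resolved_uri;
#       $enqueue->() if should_enqueue($job->depth, $uri->host, $uri->path);
#   });
```

Because each rule returns early, adding or removing a restriction is a one-line change.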

Other examples

Copyright (C) jamadam

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.