[Perl] robots.txt webcrawler
Hello,
This is a program that takes in a domain in the format "www.google.com" and crawls to every new domain that it finds, copying each one's robots.txt file. Note that it doesn't dig very deep, because it only looks at the source of a website's front page for new domains. If it were looking at theHackersozne.forumotion.com, for example, it would scan only that front page and would not look at any page beneath it.
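The front-page scan can be tried without any network access. This is only a minimal sketch, not the exact code of the program: `extract_domains` is a hypothetical helper name, and the sample HTML and hostnames are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the unique hosts found in the absolute
# http/https links of one page, in order of first appearance.
sub extract_domains {
    my ($html) = @_;
    my (@found, %seen);
    while ($html =~ /href="(.*?)"/gi) {
        my $link = lc $1;
        # Relative links such as "/about" have no host and are skipped.
        if ($link =~ m{^https?://([^/?#]+)}) {
            push @found, $1 unless $seen{$1}++;
        }
    }
    return @found;
}

# Made-up front page for illustration.
my $page = <<'HTML';
<a href="http://example.com/">one</a>
<a href="https://example.org/docs/page.html">two</a>
<a href="/about">relative</a>
<a href="http://EXAMPLE.com/again">duplicate</a>
HTML

print join("\n", extract_domains($page)), "\n";
```

Only the host part of each link is kept, so a link deep inside a site still contributes just its domain.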
Theoretically, a robots.txt file tells web crawlers which portions of a website they may and may not index. Nothing actually enforces this; it is only a convention. This is useful because web administrators often list paths there that they don't want to show up in a Google search, which can mean that the information held there is sensitive.
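For anyone who hasn't opened one, here is a small sketch of what a robots.txt file looks like and how its Disallow lines can be pulled out. The sample contents and the helper name `disallowed_paths` are made up for illustration; real files vary, and this sketch ignores which User-agent group each rule belongs to.

```perl
use strict;
use warnings;

# Hypothetical helper: list every path named in a Disallow rule,
# regardless of which User-agent group it belongs to.
sub disallowed_paths {
    my ($robots) = @_;
    my @paths;
    for my $line (split /\r?\n/, $robots) {
        $line =~ s/#.*//;    # strip trailing comments
        # An empty "Disallow:" means "allow everything", so require a path.
        if ($line =~ /^\s*disallow\s*:\s*(\S+)/i) {
            push @paths, $1;
        }
    }
    return @paths;
}

# Made-up robots.txt contents for illustration.
my $sample = <<'ROBOTS';
User-agent: *
Disallow: /admin/
Disallow: /private/reports/   # not for search engines
Allow: /public/
ROBOTS

print "$_\n" for disallowed_paths($sample);
```

The Disallow paths are the interesting part, since they are exactly what the administrator did not want indexed.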
I tested my program on one starting domain, and it saved a robots.txt file for each domain it reached. If you look in one of these files you'll see that site's crawl rules.
Finally, here is the code:
- Code:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my (@domains, %seen);

print "Enter domain to start with: (Ex. \"www.google.com\")\n";
chomp(my $start = <STDIN>);
$start = lc $start;
push @domains, $start;
$seen{$start} = 1;

# @domains grows while we walk it, so use an index loop rather than foreach.
for (my $i = 0; $i < @domains; $i++) {
    my $domain = $domains[$i];

    # Only the front page is scanned for links; get() returns undef on failure.
    my $pageContent = get("http://$domain");
    if (defined $pageContent) {
        $pageContent = lc $pageContent;
        while ($pageContent =~ /href="(.*?)"/g) {
            # Capture the host part of absolute http/https links,
            # with or without a trailing slash.
            if ($1 =~ m{^https?://([^/?#]+)}) {
                push @domains, $1 unless $seen{$1}++;
            }
        }
    }

    # Save robots.txt unchanged: the paths it lists are case-sensitive.
    my $robotsContent = get("http://$domain/robots.txt");
    if ($robotsContent) {
        open(my $fh, '>', "$domain robots.txt") or die "Error: $!\n";
        binmode($fh, ':encoding(UTF-8)');
        print $fh $robotsContent;
        close $fh;
    }
}
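
The loop above relies on @domains growing while it is being walked. Here is a minimal sketch of that queue behaviour, using a made-up in-memory link table instead of real HTTP requests (the hostnames, the `%links` table, and the `crawl_order` name are all assumptions for illustration):

```perl
use strict;
use warnings;

# Made-up link table: each domain maps to the domains its front page links to.
my %links = (
    'a.example' => ['b.example', 'c.example'],
    'b.example' => ['a.example', 'c.example'],
    'c.example' => ['d.example'],
    'd.example' => [],
);

# Visit every domain reachable from $start exactly once, in discovery order.
sub crawl_order {
    my ($start, $links) = @_;
    my @domains = ($start);
    my %seen    = ($start => 1);
    # The list grows while we walk it, so use an index rather than foreach.
    for (my $i = 0; $i < @domains; $i++) {
        for my $next (@{ $links->{ $domains[$i] } || [] }) {
            push @domains, $next unless $seen{$next}++;
        }
    }
    return @domains;
}

print join(' -> ', crawl_order('a.example', \%links)), "\n";
```

Because every domain is recorded the first time it is seen, links that point back to an already queued domain don't cause it to be fetched twice.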