Collect old robots.txt files from the Wayback Machine and download the Disallow paths
Install:
$ go install github.com/ogow/waybackrobots@latest

Usage:
Just give the -domain flag a valid domain and it will start downloading all archived robots.txt responses from 2015 onwards:
$ waybackrobots -domain google.com

Get all responses starting from a custom year:
$ waybackrobots -domain google.com -fd 2020

Sometimes the Wayback Machine API can return a lot of results, which takes a long time to download. To avoid this, the -strat flag can be used.
The -strat flag takes one of the values day, month, or digest; digest is the default.
Usually digest will be the go-to value, but for a domain like google.com that has been archived a lot, the digest filter will still return a lot of results.
In that case we can use the day filter instead, which keeps only one snapshot per day.
Explanation of the collapse filters in use: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server#collapsing
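For reference, a minimal sketch of the kind of CDX query such a strategy maps to. How waybackrobots builds its requests internally is an assumption here; the collapse values follow the CDX server docs linked above, where timestamp:8 keeps one snapshot per day and timestamp:6 one per month.

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Assumed mapping of the -strat values to CDX "collapse" filters.
	collapse := map[string]string{
		"digest": "digest",      // drop adjacent snapshots with identical content
		"day":    "timestamp:8", // keep one snapshot per day (YYYYMMDD prefix)
		"month":  "timestamp:6", // keep one snapshot per month (YYYYMM prefix)
	}

	q := url.Values{}
	q.Set("url", "google.com/robots.txt")
	q.Set("from", "2015")                 // corresponds to the -fd flag
	q.Set("filter", "statuscode:200")     // only successful captures
	q.Set("collapse", collapse["digest"]) // corresponds to the -strat flag
	q.Set("output", "json")

	resp, err := http.Get("https://web.archive.org/cdx/search/cdx?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// Prints one row per surviving capture (urlkey, timestamp, original URL, ...).
	fmt.Println(string(body))
}

Switching the collapse value between digest, timestamp:8, and timestamp:6 reproduces the digest/day/month behaviour compared below.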
Comparing digest with day and month:
$ go run . -domain google.com -strat digest
[i] found 38261 old robots.txt files
$ go run . -domain google.com -strat day
[i] found 473 old robots.txt files
$ go run . -domain google.com -strat month
[i] found 122 old robots.txt files
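Each capture in the CDX listing can then be fetched from the Wayback Machine and its Disallow lines collected. A minimal sketch, assuming a hypothetical timestamp and using the id_ URL modifier to request the raw capture; whether waybackrobots fetches snapshots exactly this way is an assumption.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Hypothetical capture timestamp; the Wayback Machine redirects to the
	// nearest real capture. "id_" asks for the raw content without replay rewriting.
	snapshot := "https://web.archive.org/web/20200101000000id_/https://google.com/robots.txt"

	resp, err := http.Get(snapshot)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print the Disallow paths from the archived robots.txt.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(strings.ToLower(line), "disallow:") {
			fmt.Println(strings.TrimSpace(line[len("disallow:"):]))
		}
	}
}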

Help:
Usage of waybackrobots:
-domain string
which domain to find old robots for
-fd int
choose date from when to get robots from format: 2015 (default 2015)
-strat string
interval to get robots for, possible values: digest, day, month (default "digest")