gATo and fRiEnDs- are am now working on the Tor-Directory Project crawling about 2000 Tor-url and getting some new information about Tor and the sites that reside in the Dark Web. Example I got a good crawl from a site and I went to double check it and now it was down, so are the sites going up and down and online just for a period of time? Are the site not available because of the browser I am using -vs- my crawler. These are some of the answers I will find out.
I expected due to the slowness of Tor to spend a lot of time running these crawls. I have now a script that I can run in about 20hr or less and scrape about 2000 sites. I thought that the slowness of Tor-Dark Web would make this a real time eater but I am wrong. Another thing is the secret Tor sites I found, I now have a fingerprint on them and these sites that hide in secret on top of being in Tor are a real interest to me and others.
The main issue is Tor is not socks-http friendly so setting up the infrastructure was a real learning curve and now I can replicate the installation so as I get more servers online this will become a little easier. Right now I am mapping the sites so I can crawl every page, the good part and bad is I am finding more and more URL that I never thought existed, so the discovery of new URL is a good thing but once again the collection becomes a real bear.
I am putting this into a db to make the search of the collected data a little easier but finding that db programing on the web is well not very user friendly but I have a good partner that is fixing all my mistakes. We will house this new Tor-only website search engine in the clear web so we can keep the speed up and well people are scared to go into Tor, so why not keep everything in the clearWeb for now.
I expect the crawls to get much longer since I now have the urls to crawl every site a little better but the information and mapping out Tor will be and invaluable tool for us. You say how about the hidden wiki, and all those sites that have Tor directory wiki sites. Well they are OK for basic stuff but I am finding new sites I never heard of and the pedophiles are all over Tor so you best beware I am putting a light on your websites and the next part will be to stop you from using Tor as a play ground for your sick crap. Tor is meant for real needs of privacy and protection and I hope my work in this will get these sick bastards to run somewhere else — gATO is watching you in Tor so beware!!!