OS X Tor Web Crawler Project

OS X: curl .onion sites, a how-to guide (Tor Web Crawler Project)

gATO has been looking into mapping the Tor .onion network, crawling it from a to z and from 2 to 7 (all 16 characters of a hidden-service address). I use OS X for most of my work and I wanted to curl an .onion site and check it out. As I dug around I found that if I just check my Vidalia.app it will show me where everything is located. Then the fun begins.

Find your TorBrowser_en-US-6.app, click it and look at the file info (in the Finder, Show Package Contents gets you inside the bundle), then go to TorBrowser_en-US-6.app/Contents/MacOS/:

cd TorBrowser_en-US-6.app/Contents/MacOS/

Once here, this will show you the files:

ls -fGo

total 5976

drwxr-xr-x  7 richardamores      238 Jun  8 07:11 .

drwxr-xr-x  7 richardamores      238 Feb 19 06:54 ..

drwxr-xr-x  3 richardamores      102 Feb 19 06:54 Firefox.app

-rwxr-xr-x  1 richardamores  3045488 Feb 19 06:54 tor

-rwxr-xr-x  1 richardamores     1362 Feb 19 06:54 TorBrowserBundle

drwxr-xr-x  4 richardamores      136 Feb 19 06:54 Vidalia.app

-rw-r--r--  1 richardamores     6435 Jun  8 07:11 VidaliaLog-06.08.2012.txt

Now I fire up the tor application ./tor

Next open up another Terminal box and check to see if Tor port is open and LISTENing on port 9050

netstat -ant | grep 9050 # verify Tor is running
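If Tor is up you should see a line something like this (a sketch of typical OS X netstat output; the exact columns vary by system):

tcp4       0      0  127.0.0.1.9050         *.*                    LISTEN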

Once you can see port 9050 in the LISTEN state, you're ready to use curl. Note that --socks4a makes the proxy (Tor) resolve the hostname, which is exactly what a .onion address needs:

curl -iv --socks4a 127.0.0.1:9050 http://utup22qsb6ebeejs.onion/

curl -iv --socks4a 127.0.0.1:9050 http://nwycvryrozllb42g.onion/

curl -iv --socks4a 127.0.0.1:9050 http://2qd7fja6e772o7yc.onion/

curl -iv --socks4a 127.0.0.1:9050 http://5onwnspjvuk7cwvk.onion/

curl -iv --socks4a 127.0.0.1:9050 http://6sgjmi53igmg7fm7.onion/

curl -iv --socks4a 127.0.0.1:9050 http://6vmgggba6rksjyim.onion/
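Once one request works, a tiny crawler is just a loop. A minimal sketch, assuming Tor is still listening on 9050 and reusing the addresses above:

for u in utup22qsb6ebeejs nwycvryrozllb42g 2qd7fja6e772o7yc; do curl -s --socks4a 127.0.0.1:9050 -I "http://$u.onion/"; done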

Those are a few sites that you can check out. curl is just one of those tools that keeps on giving, and of course if I can get one app to work through Tor on OS X, then I can get other apps to use Tor as a proxy for all my command-line work. Time to have some fun. gATO oUt

Lab Notes

  1. sudo apt-get install tor
  2. sudo /etc/init.d/tor start
  3. netstat -ant | grep 9050 # verify Tor is running
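On the Linux side there is also the torsocks wrapper (if installed), which pushes arbitrary commands through Tor without any curl flags at all; a sketch:

torsocks curl -I http://utup22qsb6ebeejs.onion/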

Here is a good crawler to play with:

<?php
// First try: point curl at Tor as if it were an HTTP proxy.
// Tor's 9050 is a SOCKS port, not an HTTP proxy, so this one fails.
$ch = curl_init('http://google.com');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_PROXY, 'https://127.0.0.1:9050/');
curl_exec($ch);
curl_close($ch);

<?php
// Second try: talk SOCKS5 to Tor instead.
$ch = curl_init('http://google.com');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
// Socks5
curl_setopt($ch, CURLOPT_PROXY, "localhost:9050");
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
curl_exec($ch);
curl_close($ch);

Tor Web Crawler

http://stackoverflow.com/questions/9237477/tor-web-crawler

This did not work at first; my notes said netstat showed it on socks4, not socks5. The real catch is that with plain --socks5 curl resolves the hostname locally, and a .onion name will never resolve through normal DNS:

curl -s --socks5 127.0.0.1:9050 --user-agent "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3" -I http://utup22qsb6ebeejs.onion/
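The --socks5-hostname run further below works for exactly this reason, and the same trick fixes the PHP crawler above: CURLPROXY_SOCKS5 resolves names locally, while the hostname variant hands them to Tor. A minimal sketch, assuming a PHP build whose cURL extension defines CURLPROXY_SOCKS5_HOSTNAME (older builds may not have the constant; the underlying libcurl value is 7):

<?php
// Let Tor resolve the .onion name instead of local DNS
$ch = curl_init('http://utup22qsb6ebeejs.onion/');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // hand the page back instead of printing it
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:9050');
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5_HOSTNAME);
$page = curl_exec($ch);
if ($page === false) {
    echo 'curl error: ' . curl_error($ch) . "\n";
}
curl_close($ch);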

Turn on Tor. Run /Users/gatomalo/Downloads/TorBrowser_en-US-6.app/Contents/MacOS/tor:

cd /Users/gatomalo/Downloads/TorBrowser_en-US-6.app/Contents/MacOS

./tor
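Give it a few seconds; when the client is ready, the log should end with something like this (wording varies by Tor version):

[notice] Bootstrapped 100%: Done.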

Now check for the proxy running on port 9050:

netstat -ant | grep 9050

Now run your network commands thru socks port 9050
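One more way to do that: recent curl builds also honor the ALL_PROXY environment variable with a SOCKS scheme (socks5h:// tells curl to resolve hostnames through the proxy), so something like this should work for every curl invocation in that shell:

export ALL_PROXY=socks5h://127.0.0.1:9050
curl -I http://utup22qsb6ebeejs.onion/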


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

curl -S --socks5-hostname 127.0.0.1:9050 -I http://utup22qsb6ebeejs.onion/

HTTP/1.1 200 OK

Date: Thu, 12 Jul 2012 17:49:49 GMT

Server: Apache/2.2.22 (Ubuntu)

X-Powered-By: PHP/5.3.10-1ubuntu3.2

Set-Cookie: fpsess_fp-a350e65d=8hg0upuuhcpuf4pgvg45l9c2b2; path=/

Expires: Thu, 19 Nov 1981 08:52:00 GMT

Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0

Pragma: no-cache

Vary: Accept-Encoding

Transfer-Encoding: chunked

Content-Type: text/html; charset=utf-8

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>My Hidden Blog</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<!-- start of jsUtils -->
<script type="text/javascript" src="http://utup22qsb6ebeejs.onion/fp-plugins/jquery/res/jquery-1.4.2.min.js"></script>
<script type="text/javascript" src="http://utup22qsb6ebeejs.onion/fp-plugins/jquery/res/jquery-ui-1.8.2.custom.min.js"></script>
<!-- end of jsUtils -->
<!-- FP STD HEADER -->
<meta name="generator" content="FlatPress fp-0.1010.1" />
<link rel="alternate" type="application/rss+xml" title="Get RSS 2.0 Feed" href="http://utup22qsb6ebeejs.onion/?x=feed:rss2" />
<link rel="alternate" type="application/atom+xml" title="Get Atom 1.0 Feed" href="http://utup22qsb6ebeejs.onion/?x=feed:atom" />
<!-- EOF FP STD HEADER -->
<!-- FP STD STYLESHEET -->
<link media="screen,projection,handheld" href="http://utup22qsb6ebeejs.onion/fp-interface/themes/leggero/leggero/res/style.css" type="text/css" rel="stylesheet" />
<link media="print" href="http://utup22qsb6ebeejs.onion/fp-interface/themes/leggero/leggero/res/print.css" type="text/css" rel="stylesheet" />
<!-- FP STD STYLESHEET -->

Some other curl switches =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

--connect-timeout <seconds>
Maximum time in seconds that you allow the connection to the server to take. This only limits the connection phase; once curl has connected this option is of no more use. See also the -m/--max-time option.

If this option is used several times, the last one will be used.

 

-D/--dump-header <file>
Write the protocol headers to the specified file.

This option is handy to use when you want to store the headers that a HTTP site sends to you. Cookies from the headers could then be read in a second curl invocation by using the -b/--cookie option! The -c/--cookie-jar option is however a better way to store cookies.

When used in FTP, the FTP server response lines are considered being "headers" and thus are saved there.

If this option is used several times, the last one will be used.
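For example, through the Tor proxy (a sketch: the second call replays cookies parsed from the dumped headers, as described above):

curl --socks5-hostname 127.0.0.1:9050 -D headers.txt http://utup22qsb6ebeejs.onion/
curl --socks5-hostname 127.0.0.1:9050 -b headers.txt http://utup22qsb6ebeejs.onion/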

 

 

-f/--fail
(HTTP) Fail silently (no output at all) on server errors. This is mostly done to better enable scripts etc to better deal with failed attempts. In normal cases when a HTTP server fails to deliver a document, it returns an HTML document stating so (which often also describes why and more). This flag will prevent curl from outputting that and return error 22.

This method is not fail-safe and there are occasions where non-successful response codes will slip through, especially when authentication is involved (response codes 401 and 407).
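That makes -f handy for crawling: a dead hidden service just yields a non-zero exit code you can test, as in this sketch:

curl -sf --socks4a 127.0.0.1:9050 -I http://utup22qsb6ebeejs.onion/ > /dev/null && echo up || echo down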

 

 

 

--ssl
(FTP, POP3, IMAP, SMTP) Try to use SSL/TLS for the connection. Reverts to a non-secure connection if the server doesn't support SSL/TLS. See also --ftp-ssl-control and --ssl-reqd for different levels of encryption required. (Added in 7.20.0)

This option was formerly known as --ftp-ssl (Added in 7.11.0) and that can still be used but will be removed in a future version.

 

-H/--header <header>
(HTTP) Extra header to use when getting a web page. You may specify any number of extra headers. Note that if you should add a custom header that has the same name as one of the internal ones curl would use, your externally set header will be used instead of the internal one. This allows you to make even trickier stuff than curl would normally do. You should not replace internally set headers without knowing perfectly well what you're doing. Remove an internal header by giving a replacement without content on the right side of the colon, as in: -H "Host:".

curl will make sure that each header you add/replace is sent with the proper end-of-line marker; you should thus not add that as a part of the header content: do not add newlines or carriage returns, they will only mess things up for you.

See also the -A/--user-agent and -e/--referer options.

This option can be used multiple times to add/replace/remove multiple headers.

 

-o/--output <file>
Write output to <file> instead of stdout. If you are using {} or [] to fetch multiple documents, you can use '#' followed by a number in the <file> specifier. That variable will be replaced with the current string for the URL being fetched. Like in:

curl http://{one,two}.site.com -o "file_#1.txt"

or use several variables like:

curl http://{site,host}.host[1-5].com -o "#1_#2"

You may use this option as many times as the number of URLs you have.

See also the --create-dirs option to create the local directories dynamically. Specifying the output as '-' (a single dash) will force the output to be done to stdout.

 

-r/--range <range>
(HTTP/FTP/SFTP/FILE) Retrieve a byte range (i.e. a partial document) from a HTTP/1.1, FTP or SFTP server or a local FILE. Ranges can be specified in a number of ways.

0-499              specifies the first 500 bytes
500-999            specifies the second 500 bytes
-500               specifies the last 500 bytes
9500-              specifies the bytes from offset 9500 and forward
0-0,-1             specifies the first and last byte only (*)(H)
500-700,600-799    specifies 300 bytes from offset 500 (H)
100-199,500-599    specifies two separate 100-byte ranges (*)(H)
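For instance, sampling just the first 500 bytes of a page instead of pulling the whole thing (a sketch):

curl --socks5-hostname 127.0.0.1:9050 -r 0-499 -o sample.html http://utup22qsb6ebeejs.onion/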

 

 

-v/--verbose
Makes the fetching more verbose/talkative. Mostly useful for debugging. A line starting with '>' means "header data" sent by curl, '<' means "header data" received by curl that is hidden in normal cases, and a line starting with '*' means additional info provided by curl.

Note that if you only want HTTP headers in the output, -i/--include might be the option you're looking for.

If you think this option still doesn't give you enough details, consider using --trace or --trace-ascii instead.

This option overrides previous uses of --trace-ascii or --trace.

Use -s/--silent to make curl quiet.
