Monday, 27 May 2013

Scraping a Password Protected Website with cURL

One strategy for quickly populating a webstore with products is to write a scraping script to download data from your supplier’s website. This can be difficult if they require a password login to view product prices, descriptions, etc. There is, however, a solution.

1) Your first step is to obtain a login and password from your supplier or the website you will scrape. You probably already have this.

2) You need to figure out how they process your login information. You can do this by reading the POST variables sent by your browser. The easiest way is a simple Firefox extension called Tamper Data. Fill in the login form and, right before clicking Submit, open Tamper Data (it is under the Tools menu in Firefox if you installed it correctly). Then click “Start Tamper”. Now, submit your data.

You will get a screen that looks like this:

[Screenshot: the results you get from Tamper Data when you log in to the website. The POST variables are circled in red.]

Now you know your POST variables. Let’s move on to the PHP code.
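
Since the script in the next step depends on PHP's cURL extension, it may be worth confirming that the extension is loaded before going further. A minimal sketch; the helper name curlAvailable is made up for this example:

```php
<?php
// Hypothetical helper: true when the cURL extension is loaded
// and curl_init() can be called.
function curlAvailable(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (curlAvailable()) {
    // curl_version() returns an array describing the underlying libcurl.
    echo 'cURL ', curl_version()['version'], " is available\n";
} else {
    echo "cURL is not available\n";
}
```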

3) First, you need to write a script to log in to the website. I have done that for you. You just need to modify the URL and the POST variables to work with the website that you are scraping. This script will only work if you have cURL installed. If cURL is not included in your web hosting package, get a new web host.
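
    function login(){
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/login.asp'); // login URL
        curl_setopt($ch, CURLOPT_POST, 1);
        // Build the POST body as one string with no line breaks or spaces;
        // stray whitespace would be sent as part of the field names.
        $postData = 'txtUserName=brad'
            . '&txtPassword=fakepassword'
            . '&txthdbtn=Login'
            . '&imageField.x=27'
            . '&imageField.y=8';
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
        curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt'); // enables cookies; they are saved here when the handle is closed
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);       // return the response instead of printing it
        $store = curl_exec($ch); // login response; the session cookie is now held by $ch
        return $ch;
    }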

This function logs you in to the website. It returns the cURL session, which you can use later on to scrape the protected content. Note that the POST field names and values are exactly the ones from the Tamper Data results. This is important: the website cannot tell whether a human using a browser or a computer program is accessing its data, as long as the request looks the same.
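
Writing the POST string by hand works for simple values, but a password containing characters such as & or a space would have to be URL-encoded first. A minimal sketch of one way to do that with PHP's http_build_query(); the helper name buildPostData and the sample password are made up for illustration, and the field names are the ones from the example above:

```php
<?php
// Hypothetical helper: turns name => value pairs into a URL-encoded POST body.
function buildPostData(array $fields): string
{
    // PHP_QUERY_RFC1738 encodes spaces as "+", the usual form encoding.
    return http_build_query($fields, '', '&', PHP_QUERY_RFC1738);
}

$postData = buildPostData([
    'txtUserName'  => 'brad',
    'txtPassword'  => 'fake&pass word', // made-up value with characters that need encoding
    'txthdbtn'     => 'Login',
    'imageField.x' => 27,
    'imageField.y' => 8,
]);

// Prints: txtUserName=brad&txtPassword=fake%26pass+word&txthdbtn=Login&imageField.x=27&imageField.y=8
echo $postData, PHP_EOL;
```

The resulting string can be passed to CURLOPT_POSTFIELDS in place of the hand-written one. Passing the array itself to CURLOPT_POSTFIELDS would make cURL send multipart/form-data instead, which is not what a plain login form normally submits.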

4) Now that you have the cURL session, you need to do something with it. You would use this session the same as if you had run

    $ch=curl_init();

except now you have a cURL session that is logged in to the website. One example of how you would use this session is to retrieve all of the data from a webpage. Have a look:

    function downloadUrl($Url, $ch){
        curl_setopt($ch, CURLOPT_URL, $Url);
        curl_setopt($ch, CURLOPT_POST, 0); // switch back from POST to GET
        curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com/");
        curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
        curl_setopt($ch, CURLOPT_HEADER, 0); // leave response headers out of the output
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10); // give up after 10 seconds
        $output = curl_exec($ch);
        return $output; // page body, or false on failure
    }
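
downloadUrl() returns whatever curl_exec() gives back, which is false when the request fails. Below is a minimal sketch of a stricter variant, assuming you would rather get an exception than a silent false. The name downloadUrlChecked and the decision to treat HTTP status 400 and above as a failure are assumptions of this example, not part of the original script:

```php
<?php
// Hypothetical variant of downloadUrl() that reports failures.
function downloadUrlChecked(string $url, $ch): string
{
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPGET, true);        // make sure the request is a GET
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $output = curl_exec($ch);
    if ($output === false) {
        // Transport-level failure: DNS, connection, timeout, bad URL, ...
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        // The server answered, but with an error status.
        throw new RuntimeException("HTTP $status for $url");
    }

    return $output;
}
```

Like the original, this variant leaves the session open so it can be reused for the next page.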

Note that downloadUrl() does not close the cURL session. If you want to keep using your logged-in session, you need to keep it open. Sample usage of this function to download a specific URL and output it to the screen:

    $ch=login();
    $html=downloadUrl('http://www.example.com/page1.asp', $ch);
    echo $html;
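
Because the session stays open, the same handle can fetch several protected pages in a row. A minimal sketch of that idea; the helper name downloadPages is made up for this example, and downloadUrl() is repeated in trimmed form only so the block runs on its own:

```php
<?php
// Trimmed copy of the article's downloadUrl(), repeated for self-containment.
function downloadUrl($Url, $ch)
{
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_POST, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    return curl_exec($ch);
}

// Hypothetical helper: fetches every URL with the same (logged-in) handle
// and returns an array of url => page body (or false if that request failed).
function downloadPages(array $urls, $ch): array
{
    $pages = [];
    foreach ($urls as $url) {
        $pages[$url] = downloadUrl($url, $ch);
    }
    return $pages;
}

// Usage, assuming login() from the article and page URLs of your own:
// $ch    = login();
// $pages = downloadPages(['http://www.example.com/page1.asp'], $ch);
// curl_close($ch); // only once you are completely done with the session
```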

The complete code looks like this:

    <?php
    $ch=login();
    $html=downloadUrl('http://www.example.com/page1.asp', $ch);
    echo $html;

    function downloadUrl($Url, $ch){
        curl_setopt($ch, CURLOPT_URL, $Url);
        curl_setopt($ch, CURLOPT_POST, 0); // switch back from POST to GET
        curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com/");
        curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
        curl_setopt($ch, CURLOPT_HEADER, 0); // leave response headers out of the output
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10); // give up after 10 seconds
        $output = curl_exec($ch);
        return $output; // page body, or false on failure
    }

    function login(){
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/login.asp'); // login URL
        curl_setopt($ch, CURLOPT_POST, 1);
        // Build the POST body as one string with no line breaks or spaces;
        // stray whitespace would be sent as part of the field names.
        $postData = 'txtUserName=brad'
            . '&txtPassword=fakepassword'
            . '&txthdbtn=Login'
            . '&imageField.x=27'
            . '&imageField.y=8';
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
        curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt'); // enables cookies; they are saved here when the handle is closed
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);       // return the response instead of printing it
        $store = curl_exec($ch); // login response; the session cookie is now held by $ch
        return $ch;
    }


Source: http://www.phpcodester.com/2011/01/scraping-a-password-protected-website-with-curl/
