HTML Screen Scraping using C# .Net WebClient
What is Screen Scraping?
Screen scraping means reading the contents of a web page programmatically.
Suppose you go to yahoo.com: what you see is the interface, which includes
buttons, links, images and so on. What you don't see is the target url of each
link, the file name of each image, or the method a button uses to submit its
form, which can be POST or GET. In other words, you don't see the HTML behind
the page. Screen scraping pulls down the HTML of the web page, including every
HTML tag that is used to make up the page.
Why use screen scraping?
The question that comes to mind is why we would ever want the HTML of a web
page. Screen scraping does not stop at pulling out the HTML; it can also
display it. In other words, you can pull the HTML from any web page and display
that web page inside your own page. This lets you use it in place of frames,
and the good thing about screen scraping is that the result is ordinary HTML
that works in every browser, which unfortunately is not true of frames.
Also, sometimes you go to a website that has many links labelled image1,
image2, image3 and so on. To see those images you have to click each link, and
the image opens in the parent window or in a new window. By using screen
scraping you can pull all the images from a particular web page and display them
on your own page.
Displaying a web page on your own page using Screen Scraping :
Let's see a small code snippet which you can use to display any page on your own
page. First make a small interface as I have made below. As you can see,
the interface is quite simple. It has a button which says "Display WebPages
below" and a label; believe it or not, the web page will be displayed in place
of the label. All the code is written in the Button Click event. Below
you can see the "Button Click Code".
C# Button Click Code :
private void Button1_Click(object sender, System.EventArgs e)
{
    WebClient webClient = new WebClient();
    const string strUrl = "http://www.yahoo.com/";
    byte[] reqHTML;
    reqHTML = webClient.DownloadData(strUrl);
    UTF8Encoding objUTF8 = new UTF8Encoding();
    lblWebpage.Text = objUTF8.GetString(reqHTML);
}
Explanation of the Code Snippet in C#:
As you can see, the code is only a few lines long. This is because the Microsoft
.NET Framework has a very strong set of class libraries that make the task
easier for the developer. If you were trying to achieve the same result in
classic ASP you would probably have to write a lot more code. I guess that's
good news for all the coders out there in the programming world.
In the first line I made an object of the WebClient class. The WebClient class
provides common methods for sending data to or receiving data from any local,
intranet, or Internet resource identified by a URI.
In the next line we define a constant string strUrl
which holds the url of the web page we wish to use in our example.
Then we declare a byte array reqHTML which will hold the bytes
transferred from the web page.
The next line downloads the data in the form of bytes and puts them in the
reqHTML byte array.
The UTF8Encoding class represents
the UTF-8 encoding of Unicode characters.
In the last line we use the GetString method of the UTF8Encoding class to
convert the bytes into a string, and finally we assign the result to the Text
property of the label.
This code gets the www.yahoo.com homepage. Because the label is filled with the
HTML of the yahoo page and the browser renders that HTML, the whole yahoo page
is displayed.
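The download-and-decode steps described above can also be separated so that the decoding can be tried without a network connection. The following is only a minimal sketch: the class name PageFetcher and the method names DecodeUtf8 and FetchHtml are hypothetical and do not come from the original page code. It also assumes the page is served as UTF-8; a page in another character set would come out garbled.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper class, introduced only for this illustration.
public static class PageFetcher
{
    // Decodes raw response bytes as UTF-8 text.
    // This is the same step as objUTF8.GetString(reqHTML) in the button click code.
    public static string DecodeUtf8(byte[] rawBytes)
    {
        return new UTF8Encoding().GetString(rawBytes);
    }

    // Downloads the page at the given url and returns its HTML as a string.
    // DownloadData throws a WebException if the request fails.
    public static string FetchHtml(string url)
    {
        using (WebClient client = new WebClient())
        {
            byte[] rawBytes = client.DownloadData(url);
            return DecodeUtf8(rawBytes);
        }
    }
}
```

In the page, the last line of the click handler would then become something like lblWebpage.Text = PageFetcher.FetchHtml(strUrl).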
The Generated HTML :
For those curious people who want to see what HTML was generated when the
request was made: you can easily view it by viewing the source of the yahoo
page. In Internet Explorer go to View -> Source. Notepad will open with the
complete HTML generated for the page. Let's see a small screen shot of the HTML
generated when we visit yahoo.com. As you can see, the HTML generated is quite
complex. Wouldn't it be really cool if you could extract all the links from the
generated source? Let's try to do that :)
Extracting Urls :
The first thing you need in order to extract all the urls from the web page is
a regular expression. I am not saying you cannot do this without a regular
expression. You can, but it will be much harder.
Regular Expression for Extracting Urls :
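First you need to import the System.Text.RegularExpressions namespace. Next
you need a regular expression that can extract all urls from the
generated HTML. There are many ready-made regular expressions which
you can view at http://www.regexlib.com/
. Your regular expression would look like this:
Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]*))");
This says: find every place in the web page source that starts with "href",
followed by optional whitespace, an "=" sign and more optional whitespace, and
then capture the value that follows. The value is either everything between a
pair of double quotes or, if there are no quotes, everything up to the next
whitespace character. The captured value is stored in a named group, called
url here.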
User Interface in Visual Studio .Net:
I am keeping the user interface pretty simple. It consists of a textbox, a
datagrid and a button. The datagrid will be used to display all the extracted
urls. Here is a screen shot of the User Interface.
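Before wiring the extraction into the page, it can help to try the regular expression on its own against a small piece of HTML. The sketch below is an illustration under stated assumptions, not the page code: the class name LinkExtractor, the method name ExtractUrls and the group name url are all hypothetical. It also uses a slightly stricter pattern for unquoted values, stopping at ">" as well as at whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, introduced only for this illustration.
public static class LinkExtractor
{
    // Matches href="value" or href=value. The group name "url" is an
    // assumption of this sketch. Unquoted values stop at whitespace or '>'.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*(?:\"(?<url>[^\"]*)\"|(?<url>[^\\s>]+))",
        RegexOptions.IgnoreCase);

    // Returns the value of every href attribute found in the given HTML,
    // in the order in which they appear.
    public static List<string> ExtractUrls(string html)
    {
        List<string> urls = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            // Take only the captured value, not the whole href="..." match.
            urls.Add(match.Groups["url"].Value);
        }
        return urls;
    }
}
```

For example, calling LinkExtractor.ExtractUrls on the string <a href="http://a.example/x">A</a> <a href=page2.html>B</a> yields the two values http://a.example/x and page2.html. A regular expression like this is a convenience and not a full HTML parser, so unusual markup (single-quoted attributes, for instance) may need extra handling.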
The Code:
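Okay, the code is implemented in the button click event. But before that, let's
see the important declarations. You also need to include the following
namespaces:
using System.Collections;              // ArrayList
using System.IO;                       // StreamWriter, if you plan to write to a file
using System.Net;                      // WebClient
using System.Text;                     // UTF8Encoding
using System.Text.RegularExpressions;  // Regex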
// declares the button
protected System.Web.UI.WebControls.Button Button1;
// declares a byte array for the downloaded page
private byte[] aRequestHTML;
// declares a string for the decoded HTML
private string myString = null;
// declares the datagrid
protected System.Web.UI.WebControls.DataGrid DataGrid1;
// declares the textbox
protected System.Web.UI.WebControls.TextBox TextBox1;
// declares the label
protected System.Web.UI.WebControls.Label Label1;
// creates the arraylist that will hold the extracted urls
private ArrayList a = new ArrayList();
Okay, now let's see the button click code that does the actual work.
private void Button1_Click(object sender, System.EventArgs e)
{
    // make an object of the WebClient class
    WebClient objWebClient = new WebClient();
    // get the HTML, as raw bytes, from the url written in the textbox
    aRequestHTML = objWebClient.DownloadData(TextBox1.Text);
    // create a UTF-8 encoding object
    UTF8Encoding utf8 = new UTF8Encoding();
    // decode the bytes we got in aRequestHTML into a string
    myString = utf8.GetString(aRequestHTML);
    // regular expression that matches href attributes; the named group
    // "url" captures the attribute value
    Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]*))");
    // get all the matches for the regular expression
    MatchCollection mcl = r.Matches(myString);
    foreach (Match ml in mcl)
    {
        // add only the captured url, not the whole href="..." match
        a.Add(ml.Groups["url"].Value);
    }
    // assign the arraylist to the datasource
    DataGrid1.DataSource = a;
    // bind the data to the datagrid
    DataGrid1.DataBind();
    // write the extracted urls to the file named test.txt, one per line
    using (StreamWriter sw = new StreamWriter(Server.MapPath("test.txt")))
    {
        foreach (string url in a)
        {
            sw.WriteLine(url);
        }
    }
}
The MatchCollection mcl holds all the matches, and you can iterate through the collection to get every extracted url. Once you enter the url in the textbox and press the button, the datagrid will be populated with the extracted urls. Here is a screen shot of the datagrid. The screen shot shows only a few of the extracted urls; there are at least 50 of them.
Final Note:
As you can see, it's simple to extract urls from any web page. You can also
make the column in the datagrid a hyperlink column so that you can browse to
each extracted url.
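One thing to keep in mind for a hyperlink column is that many href values are relative (for example story.html or /about), so they only work as links once they are combined with the address of the page they came from. The following is a minimal sketch of that step using the Uri class. The class name UrlResolver and the method name ToAbsolute are hypothetical, and the example addresses are placeholders.

```csharp
using System;

// Hypothetical helper class, introduced only for this illustration.
public static class UrlResolver
{
    // Combines the address of the scraped page with an extracted href value.
    // Returns the absolute url, or null if the two cannot be combined.
    // Throws UriFormatException if pageUrl itself is not a valid absolute url.
    public static string ToAbsolute(string pageUrl, string href)
    {
        Uri baseUri = new Uri(pageUrl);
        Uri absolute;
        if (Uri.TryCreate(baseUri, href, out absolute))
        {
            return absolute.AbsoluteUri;
        }
        return null;
    }
}
```

In the button click code, pageUrl would be the text entered in the textbox and href would be each extracted value before it is added to the arraylist.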