I was asked to demonstrate the basics of extracting (scraping) data from web pages, and I thought, why not make something suitable for my fellow BHTs? So here it is: a program written in VB.NET (VS2008) that extracts the search results from google.com.
http://balliya007.fileave.com/google%20scraper.zip
(It contains the source code as well as the exe file, which is in the bin->debug folder.)
The program window has three parts:
1 - Your search term
2 - A web browser control, which visits the search page and extracts the links
3 - All the extracted links
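If you cannot open the project, the three parts listed above can be laid out in code roughly like this. This is only a sketch: the class name, control names, sizes and positions are my own assumptions and are not taken from the original project, which was built in the designer.

```vbnet
Imports System
Imports System.Drawing
Imports System.Windows.Forms

' Hypothetical form: all names and sizes here are assumptions for illustration.
Public Class ScraperFormSketch
    Inherits Form

    Private ReadOnly searchBox As New TextBox()      ' 1 - the search term
    Private ReadOnly searchButton As New Button()    ' starts the search
    Private ReadOnly browser As New WebBrowser()     ' 2 - visits the search page
    Private ReadOnly linksBox As New RichTextBox()   ' 3 - shows the extracted links

    Public Sub New()
        Me.Text = "Google scraper (sketch)"
        Me.ClientSize = New Size(800, 600)

        ' SetBounds takes x, y, width, height
        searchBox.SetBounds(10, 10, 600, 24)
        searchButton.SetBounds(620, 10, 170, 24)
        searchButton.Text = "Search"
        browser.SetBounds(10, 44, 780, 340)
        linksBox.SetBounds(10, 394, 780, 196)

        Me.Controls.AddRange(New Control() {searchBox, searchButton, browser, linksBox})
    End Sub

    <STAThread()> _
    Public Shared Sub Main()
        Application.EnableVisualStyles()
        Application.Run(New ScraperFormSketch())
    End Sub
End Class
```

This only builds the window; the button and browser events are not wired up here.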
HOW I DID IT:
1) Perform a search: If you look carefully, Google uses the "q=" parameter to receive the search term. Type "http://google.com/search?q=" followed by any search term you want. For example:
"http://google.com/search?q=hubpages"
You will see that Google opens the search results page directly.
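A search term typed by a user can contain spaces or characters like "&", which would break the "q=" parameter if pasted in as-is. Here is a small sketch of building the URL safely. BuildSearchUrl is a hypothetical helper name of mine; Uri.EscapeDataString is the standard .NET method for escaping a value that goes into a URL.

```vbnet
Imports System

Module SearchUrlSketch
    ' Hypothetical helper (not part of the original program):
    ' appends the escaped search term to the "q=" parameter.
    Function BuildSearchUrl(ByVal term As String) As String
        Return "http://google.com/search?q=" & Uri.EscapeDataString(term)
    End Function

    Sub Main()
        Console.WriteLine(BuildSearchUrl("hubpages"))
        ' prints http://google.com/search?q=hubpages
        Console.WriteLine(BuildSearchUrl("cats & dogs"))
        ' prints http://google.com/search?q=cats%20%26%20dogs
    End Sub
End Module
```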
2) Extracting: VB.NET has a WebBrowser control, which wraps the COM (Component Object Model) component of IE (Internet Explorer). It is basically an embedded browser (an IE object) that can be used for browsing operations.
When the user enters a search term and clicks the button, the browser control navigates to the search URL.
Then we try to collect all the URLs on the search page. Have a look at the source of the search page. In Firefox, go to View -> Page Source.
You will see that each result URL is contained in an "a" tag inside an "h3" tag. This is what we want.
Now we use the following code to extract all the URLs contained within these tags:
Code:
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    ' Escape the search term so that spaces and characters like "&" do not break the URL
    WebBrowser1.Navigate("http://www.google.com/search?q=" & Uri.EscapeDataString(TextBox1.Text))
End Sub

Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
    ' Collect every h3 element on the loaded page
    Dim htmlele As HtmlElementCollection
    htmlele = WebBrowser1.Document.GetElementsByTagName("h3")
    For Each htm As HtmlElement In htmlele
        ' Inside each h3, find the "a" elements and print their href
        Dim chld As HtmlElementCollection = htm.GetElementsByTagName("a")
        For Each ch As HtmlElement In chld
            RichTextBox1.AppendText(ch.GetAttribute("href") & vbCrLf)
        Next
    Next
End Sub
When the user enters a search term and presses the button, the browser object visits the search page. When the page has finished loading, the code works in the following order:
1) Find all the elements with the tag name "h3":
htmlele = WebBrowser1.Document.GetElementsByTagName("h3")
2) Now get all the child elements with the tag "a" inside each of these parent elements. Then, one by one, extract the "href" attribute of each "a" tag:
For Each htm As HtmlElement In htmlele
    Dim chld As HtmlElementCollection = htm.GetElementsByTagName("a")
    For Each ch As HtmlElement In chld
        RichTextBox1.AppendText(ch.GetAttribute("href") & vbCrLf)
    Next
Next
3) Store all the URLs in a rich text box:
RichTextBox1.AppendText(ch.GetAttribute("href") & vbCrLf)
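One thing to watch out for with the three steps above: DocumentCompleted can fire more than once for a single page (once per frame), so the same links may be appended several times. Below is a sketch of one common way to handle this, assuming the page keeps the "a" inside "h3" layout described here. The names LinkCollectorForm, browser, output and seenLinks are hypothetical, and comparing e.Url with the browser's Url is only a heuristic, not a guaranteed test.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Windows.Forms

' Sketch only: class and field names are assumptions for illustration.
Public Class LinkCollectorForm
    Inherits Form

    Private WithEvents browser As New WebBrowser()
    Private ReadOnly output As New RichTextBox()
    Private ReadOnly seenLinks As New List(Of String)()

    Public Sub New()
        browser.Dock = DockStyle.Fill
        output.Dock = DockStyle.Bottom
        Me.Controls.Add(browser)
        Me.Controls.Add(output)
        browser.Navigate("http://www.google.com/search?q=hubpages")
    End Sub

    Private Sub browser_DocumentCompleted(ByVal sender As Object, ByVal e As WebBrowserDocumentCompletedEventArgs) Handles browser.DocumentCompleted
        ' Heuristic: skip completion events raised for frames inside the page
        If Not e.Url.Equals(browser.Url) Then Return

        For Each heading As HtmlElement In browser.Document.GetElementsByTagName("h3")
            For Each link As HtmlElement In heading.GetElementsByTagName("a")
                Dim href As String = link.GetAttribute("href")
                ' Append each link only the first time it is seen
                If Not String.IsNullOrEmpty(href) AndAlso Not seenLinks.Contains(href) Then
                    seenLinks.Add(href)
                    output.AppendText(href & Environment.NewLine)
                End If
            Next
        Next
    End Sub

    <STAThread()> _
    Public Shared Sub Main()
        Application.Run(New LinkCollectorForm())
    End Sub
End Class
```

Because seenLinks is kept for the lifetime of the form, links from later result pages are still added, while repeats are skipped.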
TADAAAAA: You have extracted all the URLs. Now try this: in the browser, click on the next page and see that it automatically extracts the next 10 results, and so on.
Next time I will teach you how to extract a number of results chosen by the user, and then we will discuss concepts such as proxies, multithreading, etc.

