
Thread: google search scraper - a basic tutorial

  1. #1
    Junior SEO Specialist
    Join Date
    Jan 2011
    Posts
    130
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Default google search scraper - a basic tutorial

    I was asked to demonstrate the basics of extracting (scraping) data from web pages, so I thought, why not make something useful for my fellow BHTs? So here it is : a program written in VB.NET (VS2008) that extracts the search results from google.com.

    http://balliya007.fileave.com/google%20scraper.zip

    (The zip has the source code as well as the exe file, which is in the bin -> Debug folder.)



    1 - Your search term
    2 - a web browser control which visits the search page and extracts the links
    3 - all the extracted links


    HOW DID I:

    1) Perform a search : Google takes the search term in the "q=" parameter of its search URL. Type "http://google.com/search?q=" followed by any search term you want. For example :

    "http://google.com/search?q=hubpages"

    Google opens the search results page for that term directly.
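
    Search terms with spaces or symbols need to be escaped before they go into the URL. Here is a minimal sketch of that step. BuildSearchUrl is a hypothetical helper name of mine, it is not part of the downloadable program :

    ```vbnet
    Imports System

    Module SearchUrlSketch
        ' Hypothetical helper (not in the original program) : builds the search URL for a term.
        Function BuildSearchUrl(ByVal term As String) As String
            ' Uri.EscapeDataString percent-encodes spaces, "&", "#" and similar characters
            Return "http://www.google.com/search?q=" & Uri.EscapeDataString(term)
        End Function

        Sub Main()
            Console.WriteLine(BuildSearchUrl("hubpages"))
            ' prints http://www.google.com/search?q=hubpages
            Console.WriteLine(BuildSearchUrl("vb.net web scraper"))
            ' prints http://www.google.com/search?q=vb.net%20web%20scraper
        End Sub
    End Module
    ```

    Without the escaping, a term like "cats & dogs" would be cut off at the "&", because "&" separates parameters in a URL.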

    2) Extracting : In VB.NET the WebBrowser control wraps the COM (Component Object Model) browser component of IE (Internet Explorer). It is basically a built-in browser object which can be used for browsing operations.

    When a user enters the search term and clicks the button, the browser control navigates to the results page for that search term.

    Then we collect all the URLs in the search page. Try viewing the source of the search page. In Firefox go to : View -> Page Source

    You will see that each result URL is the href of an <a> tag, and that this <a> tag sits inside an <h3> tag. This is what we want.

    Now we use the following code to extract all the URLs contained within these tags :


    Code:
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        ' Load the results page for the search term; escape the term so spaces and symbols survive in the URL
        WebBrowser1.Navigate("http://www.google.com/search?q=" & Uri.EscapeDataString(TextBox1.Text))
    End Sub

    Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
        ' Runs every time a page has finished loading
        Dim htmlele As HtmlElementCollection
        ' Each search result title is an <h3> element
        htmlele = WebBrowser1.Document.GetElementsByTagName("h3")
        For Each htm As HtmlElement In htmlele
            ' Get the <a> tags inside this <h3>
            Dim chld As HtmlElementCollection = htm.GetElementsByTagName("a")
            For Each ch As HtmlElement In chld
                ' Append the link target to the rich text box, one URL per line
                RichTextBox1.AppendText(ch.GetAttribute("href") & vbCrLf)
            Next
        Next
    End Sub

    When the user enters a search term and presses the button, the browser object visits the search page. When the page has finished loading, the code walks through it in the following order :

    1) Find all the elements with the tag name H3 :

    htmlele = WebBrowser1.Document.GetElementsByTagName("h3")


    2) Now get all the child elements with the tag "A" inside each of these parent elements. Then, one by one, extract the "href" attribute of each "a" tag :


    For Each htm As HtmlElement In htmlele
        Dim chld As HtmlElementCollection = htm.GetElementsByTagName("a")
        For Each ch As HtmlElement In chld
            RichTextBox1.AppendText(ch.GetAttribute("href") & vbCrLf)
        Next
    Next


    3) Append each URL to the rich text box, one per line :

    RichTextBox1.AppendText(ch.GetAttribute("href") & vbCrLf)
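
    The three steps above can also be wrapped in one reusable function. This is only a sketch : ExtractResultLinks is a hypothetical name, and skipping empty and duplicate links is my own addition, not something the downloadable program does.

    ```vbnet
    Imports System
    Imports System.Collections.Generic
    Imports System.Windows.Forms

    Module LinkExtractionSketch
        ' Hypothetical helper : returns the href of every <a> tag found inside an <h3> tag.
        Function ExtractResultLinks(ByVal doc As HtmlDocument) As List(Of String)
            Dim links As New List(Of String)
            ' Document is Nothing until a page has been loaded
            If doc Is Nothing Then Return links
            For Each heading As HtmlElement In doc.GetElementsByTagName("h3")
                For Each anchor As HtmlElement In heading.GetElementsByTagName("a")
                    Dim href As String = anchor.GetAttribute("href")
                    ' Skip empty hrefs and links that were already collected
                    If Not String.IsNullOrEmpty(href) AndAlso Not links.Contains(href) Then
                        links.Add(href)
                    End If
                Next
            Next
            Return links
        End Function
    End Module
    ```

    Inside the DocumentCompleted handler you would then call ExtractResultLinks(WebBrowser1.Document) and append each returned string to RichTextBox1. Keeping the extraction apart from the display makes it easier to save the links to a file later.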



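    TADAAAAA : You have extracted all the URLs. Now try this : in the browser, click on the next page link and see that it automatically extracts the next 10 results, and so on. This works because the DocumentCompleted event fires again every time a page finishes loading.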
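
    One thing to watch out for : DocumentCompleted can also fire for frames inside a page, which may give duplicate output. Below is a rough, self-contained sketch of a form that only reacts to the top-level page. The names ScraperFormSketch, Browser and Output are made up for this example, and the frame check is my assumption about what is needed, so treat it as a starting point only.

    ```vbnet
    Imports System
    Imports System.Windows.Forms

    Public Class ScraperFormSketch
        Inherits Form

        Private WithEvents Browser As New WebBrowser()
        Private Output As New RichTextBox()

        Public Sub New()
            Browser.Dock = DockStyle.Fill
            Output.Dock = DockStyle.Bottom
            Controls.Add(Browser)
            Controls.Add(Output)
        End Sub

        Protected Overrides Sub OnLoad(ByVal e As EventArgs)
            MyBase.OnLoad(e)
            ' Same example search as in the tutorial
            Browser.Navigate("http://www.google.com/search?q=hubpages")
        End Sub

        Private Sub Browser_DocumentCompleted(ByVal sender As Object, ByVal e As WebBrowserDocumentCompletedEventArgs) Handles Browser.DocumentCompleted
            ' Ignore completions that belong to a frame instead of the top-level page
            If Not e.Url.Equals(Browser.Url) Then Return
            For Each heading As HtmlElement In Browser.Document.GetElementsByTagName("h3")
                For Each anchor As HtmlElement In heading.GetElementsByTagName("a")
                    Output.AppendText(anchor.GetAttribute("href") & vbCrLf)
                Next
            Next
        End Sub

        <STAThread()> _
        Public Shared Sub Main()
            Application.Run(New ScraperFormSketch())
        End Sub
    End Class
    ```

    Clicking the next page link inside the browser control triggers the same handler again, so the new results are appended below the old ones.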

    Next time I will show you how to extract a number of results chosen by the user, and then we will discuss the concepts of proxies, multithreading etc.


  3. #2
    Junior SEO Specialist
    Join Date
    Jan 2011
    Posts
    130
    Thanks
    0
    Thanked 0 Times in 0 Posts


    P.S. : Let's keep this code and tutorial on this forum only. Please do not copy-paste it and pass it on to others. It's a humble request. We should have something different from the "others".

  4. #3
    Registered Member
    Join Date
    Jan 2011
    Posts
    81
    Thanks
    0
    Thanked 0 Times in 0 Posts


    Hah - this is a coincidence: today I started making a bot with Ubot Studio that scrapes links from Google Search, and I was just working out that I had to scrape the href to get the URLs. :0

    Thanks for sharing. I have too little knowledge of VB.NET at the moment to be certain, but after reading your post I rather suspect that Ubot is written in VB.NET.

  5. #4
    Junior SEO Specialist
    Join Date
    Jan 2011
    Posts
    130
    Thanks
    0
    Thanked 0 Times in 0 Posts


    I have absolutely no idea which programming tool was used to create Ubot, and I have not tried it either. I find VB.NET a lot easier to follow, and with the WatiN module I was able to create an application that posts articles to 6 web 2.0 sites. This is kickass. Plus, I have also made an article spinner using VB.NET.

  6. #5
    Noobie
    Join Date
    Jan 2011
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts


    I see this message :
    "user Account Excceded Bandwidth

    This account is not valid."
    Please reupload a working link. Thanks.

  7. #6
    Noobie
    Join Date
    Jan 2011
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts


    Hi Vicky,
    Nice share. I am struggling to learn VB.NET and having lots of problems. Could you share some of your other code, so that we can all become better VB.NET programmers like yourself? Or maybe you can point us to a website that already has a large amount of sample VB.NET code.

    Thanks

  8. #7
    Junior SEO Specialist
    Join Date
    Jan 2011
    Posts
    130
    Thanks
    0
    Thanked 0 Times in 0 Posts


    ahhh sorry about that:

    http://www.mediafire.com/?uugxl191kvpy4w9

    here you go.

  9. #8
    Junior SEO Specialist
    Join Date
    Jan 2011
    Posts
    130
    Thanks
    0
    Thanked 0 Times in 0 Posts


    Hi, thanks for your responses.

    @tommytx : you can start like I did. I was familiar with C++ before starting with .NET. Anyway, start with Sams Teach Yourself VB.NET in 24 Hours. It will take you approx. 3-4 days to get acquainted with it. Next, go for Sams Teach Yourself VB.NET in 21 Days. After this, go for the Wiley or Wrox publications on VB.NET. These are very detailed, so study the Sams books first. It should take you around 2-3 months to cover them all.

    P.S. A little googling will get you all the books mentioned above.

  10. #9
    Junior SEO Specialist
    Join Date
    Dec 2010
    Posts
    168
    Thanks
    0
    Thanked 0 Times in 0 Posts


    @Vicky007

    How do you stop the scraper once it has started extracting? There is no stop feature.

    If you could make the program window resizable, so that the columns can be adjusted, it would be perfect.

  11. #10
    Noobie
    Join Date
    Jan 2011
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts


    Hi Vicky, how can the code be changed so that it gets the next pages without clicking the "next page" link?
