Hi all…

Do you also believe in the “wisdom of crowds”???

Well I do, and that’s why before buying products online or booking movie tickets I always go through the reviews. You need to apply your own wisdom too, of course, but it’s always good to listen. Reviews are a rich source of information, and one of the best things we can do is mine the sentiment from them; then you can take an informed call.

Or you could always just toss a coin. Kidding…

So, are you interested in doing Sentiment Analysis (I will show it in my next post) on reviews of the Oscar-winning movie “Spotlight”???

Wondering how to get the reviews???

Well, don’t worry. Let’s see how we can get them from IMDB using a technique called “Web Crawling”. This technique lets you retrieve specific elements from the web pages of a site. So let’s get started…

Step 1: Install and load the required packages

install.packages("RCurl")
install.packages("XML")
install.packages("httr")
install.packages("rvest")
install.packages("httpuv")

library(RCurl)
library(XML)
library(httr)
library(rvest)
library(httpuv)

Step 2: Specify the URL of the reviews page

Specify the URL; the getURL function will then download the entire content of that page into the active R session.

URL="http://www.imdb.com/title/tt1895587/reviews?ref_=tt_ql_3"
d=getURL(URL)
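Before parsing anything, it never hurts to take a quick peek at what came back. Assuming the request succeeded, d should be one long HTML string:

#d should be one long HTML string; a very short result
#usually means the request failed or was redirected
nchar(d)
substr(d,1,200)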

Step 3: Retain relevant content

Create a parse tree from the document. The content is an HTML document and we will need the data under some specific tags, so to make those tags easily identifiable within R we create a parse tree.

doc=htmlParse(d)

Now we are interested in extracting the reviews, which are spread over numbered navigation links. The links are embedded in ‘a’ tags, and the path is specified in the href attribute. This is how we extract them:

list=getNodeSet(doc,"//a")
list[1:3]
list_href=sapply(list,function(x)xmlGetAttr(x,"href"))
View(list_href)

[Screenshot: a sample of the extracted href values]

So here are some of the links extracted from the entire downloaded document. We need the ones that contain reviews.

page_link=grep("reviews\\?",list_href)
requiredpagelinks=list_href[page_link]
requiredpagelinks=unique(requiredpagelinks)
View(requiredpagelinks)

The grep command gives us the indices of the review navigation links within list_href, and we fetch them using those indices. Have a look at the result:

[Screenshot: the unique review-page links in requiredpagelinks]
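If grep is new to you, here is a tiny standalone illustration with a made-up vector; it returns the positions of the elements that match the pattern:

#toy example: grep returns the indices of matching elements
x=c("/title/tt1895587/reviews?start=10","/help/","/title/tt1895587/reviews?start=20")
grep("reviews\\?",x)   #returns 1 3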

Here comes the actual extraction of reviews from all the selected navigation links. However, the list you saw above doesn’t contain complete URLs, so we initialize the following variables to build them. The base must point to the movie’s own title page, so that pasting a relative review link onto it yields a valid full URL.

crawlCandidate="reviews\\?"
base="http://www.imdb.com/title/tt1895587/"

Initialize the number of pages to visit; let us visit 20 such pages. If you go through the code in the loop closely, it is exactly the same as what we did so far. It’s just that we are repeating the same steps for 20 pages here.

num=20
#initialize the variables
doclist=list()
anchorlist=vector()

j=0
while(j<num)
{
    if(j==0)
    {
        #first page: the reviews URL we already downloaded
        doclist[[j+1]]=getURL(URL)
    }
    else
    {
        #subsequent pages: paste the base URL onto a relative review link
        doclist[[j+1]]=getURL(paste(base,anchorlist[j+1],sep=""))
    }
    #parse the page and collect any further review links it contains
    doc=htmlParse(doclist[[j+1]])
    anchor=getNodeSet(doc,"//a")
    anchor=sapply(anchor,function(x)xmlGetAttr(x,"href"))
    anchor=anchor[grep(crawlCandidate,anchor)]
    anchorlist=c(anchorlist,anchor)
    anchorlist=unique(anchorlist)
    j=j+1
}
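Before moving on, a quick sanity check never hurts. Assuming the crawl found enough navigation links, doclist should now hold one HTML document per page:

#doclist should hold one raw HTML string per crawled page
length(doclist)
#anchorlist holds every unique review link discovered along the way
length(anchorlist)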

When the loop finishes, you will find the complete contents of all 20 pages in the doclist object. Then, in the following loop, we pick the much-awaited reviews, which are embedded in <p> tags nested under <div> tags; xmlValue from the XML package pulls the text out of each matched node. (Remember, this tag arrangement can vary from site to site, so you should find out for yourself which tags the text sits under. You can do this by:

Right-click on one of the review pages -> Inspect)

Look at the following image to understand it better:

[Screenshot: the browser’s Inspect pane on a review page]

reviews=c()
for(i in 1:num)
{
    doc=htmlParse(doclist[[i]])
    #reviews sit in <p> tags nested under <div> tags on these pages
    l=getNodeSet(doc,"//div/p")
    #xmlValue extracts the text content of each node
    l1=sapply(l,xmlValue)
    #keep only substantial paragraphs; short ones are page boilerplate
    r=l1[nchar(l1)>200]
    reviews=c(reviews,r)
}

Therefore, we have extracted 110 reviews from the 20 pages, stored in the “reviews” variable.
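Since //div/p is a fairly broad selector, a little optional cleanup can help before analysis. The snippet below (assuming R 3.2.0 or later, which introduced trimws) strips stray whitespace and drops duplicate paragraphs:

#optional cleanup: trim whitespace and de-duplicate
reviews=unique(trimws(reviews))
length(reviews)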

[Screenshot: the extracted reviews vector]

Let us have a look at a few such reviews.
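You can also preview them straight in the console, for instance by printing the first 200 characters of the first three:

#peek at the first three reviews, truncated for readability
substr(head(reviews,3),1,200)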

[Screenshot: a few of the extracted reviews]

Now that you have the reviews in hand, you can continue with your sentiment analysis exercise.
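If you want to pick the exercise up again in a later session, one simple option is to save the vector to disk first (the file name below is just an example):

#save the reviews for the sentiment analysis post; reload with readRDS()
saveRDS(reviews,"spotlight_reviews.rds")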

That’s all, Folks!!!

We will meet next time with the Sentiment Analysis exercise. Till then, Happy Learning…
