Do you also believe in the “wisdom of the crowd”?
Well, I do. That’s why, before buying a product online or booking tickets for a movie, I always check the reviews. You still need to apply your own wisdom, but it’s always good to listen. Reviews are a rich source of information, and one of the best things we can do is mine the sentiment in them before taking a call.
Or you can toss a coin also. Kidding…
So, are you interested in doing sentiment analysis (I will show it in my next post) on reviews of the Oscar-winning movie “Spotlight”?
Wondering how to get the reviews?
Well, don’t worry. Let’s see how we can get them from IMDb using a technique called “web crawling”. This technique lets you retrieve specific elements from the web pages of a site. So let’s get started…
Step 1: Install and load the required packages
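A minimal sketch of this step, assuming the two packages are RCurl (which provides getURL) and XML (assumed here for building the parse tree later on). The CRAN mirror address is just one common choice:

```r
# Assumption: RCurl supplies getURL(); XML supplies htmlParse() and xpathSApply()
required <- c("RCurl", "XML")

# Install only the packages that are not already present
to_install <- setdiff(required, rownames(installed.packages()))
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}

# Attach both packages to the current session
library(RCurl)
library(XML)
```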
Step 2: Specify the URL of the reviews page
Specify the URL; the getURL function then downloads the entire content of that URL into the active R session.
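Something along these lines should work. The URL below is a hypothetical placeholder (replace it with the address of the reviews page you want), and fetch_page is a helper name made up for this sketch:

```r
library(RCurl)

# Hypothetical placeholder: substitute the real address of the reviews page
url <- "https://www.imdb.com/title/<title-id>/reviews"

# Small wrapper around getURL(); followlocation = TRUE follows HTTP redirects
fetch_page <- function(u) {
  getURL(u, followlocation = TRUE)
}

# Uncomment once `url` points at a real page:
# doc <- fetch_page(url)
```

The result is a single character string holding the raw HTML of the page.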
Step 3: Retain relevant content
Create a parse tree from the document. The content is an HTML document and we need the data under some specific tags, so we create a parse tree to make those tags easy to identify within R.
Now we are interested in extracting the reviews, which are spread over numbered navigation links. The links are embedded in ‘a’ tags and the path is specified in the href attribute. This is how we extract them.
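As a rough illustration, assuming the XML package's htmlParse is used for this step, here is the idea on a tiny made-up HTML string instead of the downloaded page:

```r
library(XML)

# Made-up stand-in for the HTML text that getURL() would return
html_text <- paste0(
  "<html><body>",
  "<a href='reviews?start=10'>2</a>",
  "<div><p>Great film.</p></div>",
  "</body></html>"
)

# asText = TRUE tells htmlParse() the argument is HTML content, not a file name
doc_tree <- htmlParse(html_text, asText = TRUE)

# The tree can now be queried by tag, for example all <p> nodes
paragraphs <- xpathSApply(doc_tree, "//p", xmlValue)
```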
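A small sketch of the link extraction, assuming the XML package and an XPath query over ‘a’ tags. The HTML is a made-up stand-in for the real page, and list_href is the name used later in this post:

```r
library(XML)

# Made-up page with two review navigation links and one unrelated link
html_text <- paste0(
  "<html><body>",
  "<a href='/help/'>Help</a>",
  "<a href='reviews?start=0'>1</a>",
  "<a href='reviews?start=10'>2</a>",
  "</body></html>"
)
doc_tree <- htmlParse(html_text, asText = TRUE)

# Select every <a> that has an href attribute and read that attribute
list_href <- unname(xpathSApply(doc_tree, "//a[@href]", xmlGetAttr, "href"))
```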
So here are some of the links extracted from the entire downloaded document. We need the ones that contain reviews.
The grep command gives us the indices of the review navigation links in list_href, and we fetch the links using those indices. Have a look at the result:
Here comes the actual extraction of reviews from all the selected navigation links. However, the list you saw above doesn’t contain complete URLs. We initialize the following variables to turn them into complete URLs.
Initialize the number of pages to view. Let us visit 20 such pages. If you look closely at the code in the loop, it is exactly what we have done so far; we are just repeating the same steps for 20 pages.
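To make the filtering concrete, here is a sketch on a made-up list_href. The pattern "reviews\\?start=" is an assumption about what the review links look like; check it against the links you actually extracted:

```r
# Made-up example of extracted links (tt0000000 is a dummy title id)
list_href <- c("/", "/title/tt0000000/", "reviews?start=0",
               "reviews?start=10", "/help/")

# grep() returns the positions of the elements that match the pattern;
# the "?" is escaped because it is a special character in regular expressions
idx <- grep("reviews\\?start=", list_href)

# Use those positions to keep only the review navigation links
review_links <- list_href[idx]
```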
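One way to complete the URLs is to paste a base address in front of each relative link. Both base_url and the links below are hypothetical values chosen for illustration:

```r
# Hypothetical base address (tt0000000 is a dummy title id)
base_url <- "https://www.imdb.com/title/tt0000000/"

# Made-up relative links, standing in for the ones selected with grep
review_links <- c("reviews?start=0", "reviews?start=10")

# paste0() is vectorised, so every link gets the same prefix
page_urls <- paste0(base_url, review_links)
```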
#initialize the variables
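A minimal sketch of such a download loop, written as a function so it can be tried without touching the network. download_pages, fetch and pause are names invented for this sketch, and the pause between requests is an added courtesy rather than something the steps above require:

```r
# Download each URL in turn and collect the page contents in a list.
# `fetch` defaults to RCurl::getURL; pass a different function to test offline.
download_pages <- function(urls,
                           fetch = function(u) RCurl::getURL(u, followlocation = TRUE),
                           pause = 1) {
  doclist <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    doclist[[i]] <- fetch(urls[i])
    Sys.sleep(pause)  # wait a little between requests
  }
  doclist
}

# Usage, assuming page_urls holds the complete URLs and num is 20:
# doclist <- download_pages(page_urls[1:num])
```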
When the loop finishes, you will find the complete contents of the 20 pages in the doclist object. Then, in the following loop, we pick the much-awaited reviews, which are embedded in a <div> tag followed by a <p> tag. Remember that this tag arrangement can vary from site to site, so you should find out for yourself which tags the text is written in. You can do that with your browser:
Right-click on one of the review pages -> Inspect
Look at the following image to understand it better:
for(i in 1:num)
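A sketch of what such a loop might look like, assuming the XML package and the XPath "//div/p" for the tag arrangement described above. The two pages in doclist are made-up stand-ins for the downloaded HTML:

```r
library(XML)

# Made-up stand-ins for the downloaded pages
doclist <- list(
  "<html><body><div><p> Loved it. </p><p>Gripping story.</p></div></body></html>",
  "<html><body><div><p>Well acted.</p></div></body></html>"
)
num <- length(doclist)

reviews <- character(0)
for (i in 1:num) {
  # Parse the page text into a tree
  tree <- htmlParse(doclist[[i]], asText = TRUE)
  # Take the text of every <p> that sits directly inside a <div>
  page_reviews <- as.character(unlist(xpathSApply(tree, "//div/p", xmlValue)))
  # Strip surrounding whitespace and append to the running collection
  reviews <- c(reviews, trimws(page_reviews))
}
```

On a real page the XPath will almost certainly need to be narrowed, for example by a class or id on the <div>, so that only review paragraphs are picked up.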
So we have extracted 110 reviews from the 20 pages and stored them in the “reviews” variable.
Let us have a look at a few such reviews.
Now that you have the reviews in hand, you can continue with your sentiment analysis exercise.
That’s all, folks!
We will meet next time with the sentiment analysis exercise. Till then, happy learning…