Making Inferences About Population Parameters : Part 2

Hi everyone…

This post is a continuation of my earlier post, where I covered the basic concepts and terminology used in estimating population parameters. If you are reading this for the first time, I recommend having a look at that post first.

So let us move ahead. Here we will discuss how to estimate the population mean from the sample mean (a point estimate) in two situations:

(a) when the population standard deviation is known (using the z-statistic).
(b) when the population standard deviation is unknown (using the t-statistic).

Case 1: When population standard deviation is known

Problem at hand : Suppose a large cellular phone company wants to meet the needs of cell phone users, and for that it hires a research company to estimate the number of texts sent per month by Americans in the 18-24 year age category.

The research company studies the phone records of 85 randomly sampled Americans in this age category and computes a sample monthly mean of 1300 texts. Now we want to estimate the population mean.

Solution: If the cellular company uses the sample mean of 1300 texts as an estimate of the population mean, the sample mean is being used as a point estimate. So let us look at the z-score formula used to estimate the population mean. Here the level of confidence is 95%, i.e. alpha is 0.05.

z = (x-μ)/σ

But since we are using a point estimate, we use the standard error instead of the standard deviation directly. The standard error describes the typical error or uncertainty associated with the estimate.

Therefore the formula stands as:

z = (x-μ)/SE
z = (x-μ)/(σ/√n)

Rearranging this formula we get,

μ = x – (z *(σ/√n))

Since we took our level of confidence as 95%, the permissible level of error (α) is 5%. Because a sample mean can be greater than or less than the population mean, z can be negative or positive, so we split α equally between the two shaded tails on either side of the center point 0. You can refer to the following diagram.

Therefore we can re-write the above expression as:

μ = x ± (Zα/2*(σ/√n))  equation(a)

[Figure: confidence interval diagram with the two shaded α/2 tails on either side of 0]

The value of Zα/2, i.e. Z0.025, is found by looking in the standard normal table under 0.5000 - 0.0250 = 0.4750. This area corresponds to a z-value of 1.96.
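If you have R handy, the same critical value can be read off without the table:

qnorm(0.975)   # upper-tail critical value of the standard normal for alpha/2 = 0.025, ~1.96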

Also, suppose past history and similar studies indicate that population standard deviation is about 160.

Therefore,

x = 1300
Zα/2 = 1.96
σ = 160
n = 85

Putting all these values in the above equation we get,

μ = 1300 ± (1.96*(160/√85))

Solving it we will get,
1265.99 ≤  μ  ≤ 1334.01 (This is our Confidence Interval)

So, the cellular telephone company researcher is 95% confident that the average number of texts per month by an American in the 18-24 year age category is between 1265.99 and 1334.01
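For readers who like to verify things in R (the language used elsewhere on this blog), here is a minimal sketch that reproduces the interval from the numbers above:

x_bar = 1300        # sample mean
sigma = 160         # population standard deviation (assumed known)
n = 85              # sample size
z = qnorm(0.975)    # z value for alpha/2 = 0.025, ~1.96
se = sigma/sqrt(n)  # standard error of the mean
c(lower = x_bar - z*se, upper = x_bar + z*se)   # roughly 1266 to 1334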

One more important thing: if your population is infinitely large, or your sample size is less than 5% of the population, you can use equation(a) as it is; but if the sample is more than 5% of the population, you should always add one factor to the SE term, called the “finite correction factor”. Equation(a) is then modified as:

μ = x ± (Zα/2*(σ/√n)*(√((N-n)/(N-1)))) equation(b)

where, N = size of the population.

In our example above, the population size was unknown and the sample size appears to be well under 5% of the population, so the correction was not needed.

The finite correction factor reduces the width of the confidence interval, and the narrower the confidence interval, the more precise your estimate of the population parameter.
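To see the correction in action, here is a small sketch; the population size N = 1000 is a made-up number used only for illustration, since N was unknown in our example:

x_bar = 1300; sigma = 160; n = 85
N = 1000                                  # hypothetical population size, not from the example
z = qnorm(0.975)
se = sigma/sqrt(n)
fpc = sqrt((N - n)/(N - 1))               # finite correction factor, always < 1
c(margin_without_fpc = z*se, margin_with_fpc = z*se*fpc)   # the corrected margin is smaller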

Now, let us move on to the next case.

 

Case 2: When population standard deviation is unknown

Suppose you are conducting your study for the first time and you do not have a value for the population standard deviation from past studies. So how do you estimate the population parameter (here, the mean)?

Just as we used the sample mean as a point estimate, we can use the sample standard deviation as another point estimate. Everything else in equation(a) remains the same, except that instead of the z-statistic we use the t-statistic. One assumption for using the t-statistic is that the population should be normally distributed.

The formula for t-statistic is :

t = (x-μ)/(s/√n)

where,

x = sample mean
μ = population mean(to be estimated)
s = sample standard deviation
n = sample size

As we had the z-table to find the values of z associated with some value of α, we also have a t-table here. To find a t-value associated with some area under the t-curve, we need to understand one more term called “degrees of freedom”.

Degrees of freedom: It refers to the number of independent observations for a source of variation minus the number of independent parameters estimated in computing the variation.

Here, one independent parameter, population mean, μ, is being estimated by x (sample mean) in computing s. Thus, the degrees of freedom formula is n independent observations minus one independent parameter being estimated (n-1).

Therefore, the above equation can be re-written in terms of confidence interval as:

μ = x ± (tα/2,n-1*(s/√n)) equation(c)

Let us look at the following problem statement to understand how we can find population mean using t-statistic.

Problem: In the aerospace industry some companies allow their employees to accumulate extra working hours beyond their 40-hour week. These extra hours are sometimes referred to as comp time. Suppose a researcher wants to estimate the average amount of comp time accumulated per week by managers in the aerospace industry. The researcher randomly samples 18 managers and measures the amount of extra time they work during a specific week. He constructs a 90% confidence interval to estimate the average amount of extra time per week worked by a manager in the aerospace industry.

Solution:
Sample size, n = 18
Degrees of freedom, n-1 = 17
α = 0.10, α/2 =0.05
x = 13.56 hours
s = 7.80 hours

From the t-table we get

t0.05,17=1.740

Putting all the above values in equation(c) we get,

μ = 13.56 ± (1.740*(7.80/√18))
μ = 13.56 ± 3.20

10.36 ≤ μ ≤ 16.76

This means the researcher can be 90% confident that the average number of extra hours per week worked by a manager in the aerospace industry lies within 10.36 and 16.76 hours.
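Again, a minimal R sketch that reproduces this interval, using qt() for the critical value:

x_bar = 13.56; s = 7.80; n = 18
t_crit = qt(0.95, df = n - 1)   # t value for alpha/2 = 0.05 and 17 degrees of freedom, ~1.740
se = s/sqrt(n)
c(lower = x_bar - t_crit*se, upper = x_bar + t_crit*se)   # roughly 10.36 to 16.76 hours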

With this we come to the end of our post. I hope it was of some value to you. Please feel free to suggest or comment on this post. The give and take of knowledge always helps one to constantly improve.

Till we meet next time, Happy Learning!!!

References : Applied Business Statistics by Ken Black
Openintro.org

Making Inferences About Population Parameters : Part 1

Hi all…

This time I thought of writing about “Inferential Statistics”.

Well, this is a very interesting topic which every aspiring Data Scientist must know about.

The ability to estimate population parameters (say mean, standard deviation, variance, etc.) or to test hypotheses about population parameters using sample statistics is one of the main applications of statistics in improving business decision making. Whether estimating parameters or testing hypotheses about them, the inferential process consists of taking a random sample from a group or body (the population), analyzing data from the sample, and reaching conclusions about the population using the sample data.

But before diving into the main concepts let us understand some basic terminologies:

  1. Parameter : a descriptive measure of the population is called a “parameter”. Parameters are generally denoted by Greek letters. Examples of parameters are the population mean (μ), population variance (σ²), and population standard deviation (σ).
  2. Statistic : a descriptive measure of the sample is called a “statistic“. Statistics are usually denoted by Roman letters. Examples of statistics are the sample mean (x̄), sample variance (s²), and sample standard deviation (s).
  3. z-score: a z-score represents the number of standard deviations a value (x) lies above or below the mean of a set of numbers when the data are normally distributed. Using z-scores translates a value’s raw distance from the mean into units of standard deviations. Remember, z-scores are also used for standardization when the data come from different scales or units of measurement. For example, consider the following table:
                          SAT    ACT
    Mean                 1500     21
    Standard deviation    300      5

    The above table shows the mean and standard deviation for total scores on the SAT and ACT examinations. The distributions of SAT and ACT scores are both normal.

    Suppose Ann scored 1800 on her SAT and Tom scored 24 on his ACT. Who performed better???

    Since Ann’s and Tom’s scores come from two different distributions, we cannot compare them directly; we need some common ground for comparison. The z-score helps us here. Let us look at the formula for the z-score first:

    z-score=(x-μ)/σ
    z-score of Ann = (1800-1500)/300 = 1
    z-score of Tom = (24-21)/5 = 0.6

    Now we can see that Ann is 1 s.d. above the average on the SAT and Tom is 0.6 s.d. above the average on the ACT. Therefore, Ann did better relative to everyone else than Tom did, so her score was better.

  4. Point estimate: a point estimate is a statistic (mean, median, s.d., etc.) from a sample that is used to estimate a population parameter. A point estimate is only as good as the representativeness of its sample. If other random samples were taken from the population, the point estimates derived from those samples would vary. Because of this variation in sample statistics, estimating a population parameter with an interval estimate is often preferable to using a point estimate. That interval is called a confidence interval.
  5. Confidence Interval: it is a range of values within which the analyst can declare, with some confidence, that the population parameter lies. Using only a point estimate is like fishing in a murky lake with a spear, and using a confidence interval is like fishing with a net. We can throw a spear where we saw a fish, but we will probably miss; if we toss a net in that area, we have a good chance of catching the fish. Later in your study you might hear the term 95% confidence interval. What it actually means is that if you drew 100 random samples from a population and built a confidence interval from each sample, about 95% of those intervals would contain the actual mean μ.
  6. Standard error (SE): the standard deviation associated with an estimate is called the standard error. It describes the typical error or uncertainty associated with the estimate. Computing the SE for the sample mean: given n independent observations from a population with standard deviation σ, the standard error of the sample mean is SE = σ/sqrt(n). A reliable way to ensure sample observations are independent is to take a simple random sample of less than 10% of the population. There is one subtle issue in the above equation: the population standard deviation is usually unknown. However, we can substitute the sample standard deviation as a point estimate when the sample size is at least 30 and the population distribution is not skewed. (A quick R check of the z-scores from item 3 and this SE formula appears right after this list.)
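As promised in item 6, here is a minimal R sketch checking the Ann/Tom z-scores and the standard error formula; the sample size n = 100 in the SE part is a made-up number, used only for illustration:

# z-scores for Ann (SAT) and Tom (ACT) using the table in item 3
z_ann = (1800 - 1500)/300   # 1.0
z_tom = (24 - 21)/5         # 0.6
c(Ann = z_ann, Tom = z_tom)

# standard error of a sample mean of SAT scores, SE = sigma/sqrt(n)
n = 100                     # hypothetical sample size, not from the post
300/sqrt(n)                 # 30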

Now let us look at the following chart to understand which statistic to use in which scenarios:

[Figure: flowchart for choosing between the z-statistic and the t-statistic]

From the diagram we can see that when the population standard deviation is known we can use the z-statistic to build the confidence interval for the population mean; otherwise we use the t-statistic.

In this post I am keeping it to the basics. In the next post, I will discuss in detail how to compute the confidence interval for the above-mentioned scenarios.

Till then, Happy Learning….

References : Applied Business Statistics by Ken Black
Openintro.org

Web Crawling Using R

Hi all…

Do you also believe in “Wisdom of crowd”???

Well I do, and that’s why before buying products online or booking tickets for a movie I always go through their reviews. Of course you need to apply your own wisdom too, but it’s always good to listen. Reviews are a rich source of information, and one of the best things we can do is mine sentiments from them and then take a call.

Or you can toss a coin also. Kidding…

So are you interested in doing Sentiment Analysis (will show it in my next post) on reviews about Oscar winning movie “Spotlight” ???

Wondering how to get the reviews???

Well don’t worry. Let’s see how we can get them from IMDB using a technique called  “Web Crawling”.  This technique lets you retrieve specific elements from web pages of a site. So let’s get started…

Step 1: Install and load the required packages

install.packages("RCurl")
install.packages("XML")
install.packages("httr")
install.packages("rvest")
install.packages("httpuv")

library(RCurl)
library(XML)
library(httr)
library(rvest)
library(httpuv)

Step 2: Specify the URL of the reviews page

Specify the URL and the getURL function will download the entire content of that URL into the active R session.

URL="http://www.imdb.com/title/tt1895587/reviews?ref_=tt_ql_3"
d=getURL(URL)

Step 3: Retain relevant content

Create a parse tree from the document. The content is an HTML document and we will need data under some specific tags, so to make those tags easily identifiable within R we create a parse tree.

doc=htmlParse(d)

Now we are interested in extracting the reviews, which are spread over numbered navigation links. The links are embedded in ‘a’ tags and the path is specified in the href attribute. This is how we extract them:

list=getNodeSet(doc,"//a")
list[1:3]
list_href=sapply(list,function(x)xmlGetAttr(x,"href"))
View(list_href)

[Screenshot: a sample of the extracted href values]

These are some of the links extracted from the downloaded document. We need the ones which contain reviews.

page_link=grep("reviews\\?",list_href)
requiredpagelinks=list_href[page_link]
requiredpagelinks=unique(requiredpagelinks)
View(requiredpagelinks)

The grep command gives us the indices of the review navigation links within list_href, and we fetch them using those indices. Have a look at the result:

[Screenshot: the unique review page links]

Here comes the actual extraction of reviews from all the selected navigation links. However, the list you saw above does not contain complete URLs, so we initialize the following variables to build them:

crawlCandidate="reviews\\?"
# base is assumed to be the title page of Spotlight (tt1895587), so that the
# relative "reviews?..." links collected below resolve to valid URLs
base="http://www.imdb.com/title/tt1895587/"

Initialize the number of pages to visit. Let us visit 20 such pages. If you go through the code in the loop closely, it is exactly what we did so far; we are just repeating the same steps for 20 pages.

num=20
#initialize the variables
doclist=list()      # holds the raw HTML of each visited page
anchorlist=vector() # holds the review page links discovered so far

j=0
while(j<num)
{
  if(j==0)
  {
    # first page: download the review URL we started with
    doclist[j+1]=getURL(URL)
  }
  else
  {
    # subsequent pages: append the next discovered link to the base URL
    doclist[j+1]=getURL(paste(base,anchorlist[j+1],sep=""))
  }
  # parse the page and collect all review navigation links it contains
  doc=htmlParse(doclist[[j+1]])
  anchor=getNodeSet(doc,"//a")
  anchor=sapply(anchor,function(x)xmlGetAttr(x,"href"))
  anchor=anchor[grep(crawlCandidate,anchor)]
  anchorlist=c(anchorlist,anchor)
  anchorlist=unique(anchorlist)
  j=j+1
}

When the loop finishes you will find the complete contents of the 20 pages in the doclist object. Then, in the following loop, we pick the much-awaited reviews embedded in a <div> followed by a <p> tag. Remember, this tag arrangement can vary from site to site; you should find out for yourself which tags the text is written in. You can do it by:

Right Click on one of the review pages -> Inspect

Look at the following image to understand it better:

[Screenshot: the browser’s Inspect view showing the review text inside div/p tags]

reviews=c()
for(i in 1:num)
{
  doc=htmlParse(doclist[[i]])
  l=getNodeSet(doc,"//div/p")   # paragraphs sitting inside a div
  l1=sapply(l,xmlValue)         # extract the text content of each node
  r=l1[nchar(l1)>200]           # keep only long paragraphs, i.e. actual reviews
  reviews=c(reviews,r)
}

Therefore, we have extracted 110 reviews from the 20 pages, stored in the “reviews” variable.


Let us have a look at few such reviews.

[Screenshot: a few of the extracted reviews]

Now that you have the reviews in hand, you can continue with your sentiment analysis exercise.
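If you want to carry these reviews over to the sentiment analysis exercise, one simple option is to write them to a file (the file name below is just an example):

write.csv(data.frame(Reviews = reviews), "spotlight_reviews.csv", row.names = FALSE)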

That’s all, Folks!!!

We will meet next time with Sentiment Analysis exercise. Till then, Happy Learning…

 

 

Document Classification Using R

Hi…

In this post, I would like to share the concepts of Document Classification, its need and applications, and how we can do it in R. So let’s get started…

Have you ever wondered how “gmail or yahoo” very intelligently decides which messages to put in your inbox and which should be directed to the spam folder?

Or, suppose you work in a library where you are given 100 e-books to be filed into their respective catalogs, and suppose these e-books belong to 10 different subjects. Would you read every e-book (say, only the 1st page) and then file it into its catalog category?

What if you could just provide the content of these files to a black-box engine that smartly classifies the 100 e-books into the categories/subjects defined by you?

This is what Document Classification does.

You give the engine a couple of statements with proper category tags, and the engine learns which statement comes from which category depending on the unique words in each statement.

For example :

1. If a document belongs to astronomy, then you may find words like space, satellite, comets, stars, galaxy etc.

2. If a document belongs to mathematics, then words like partial differentiation, algebra, theorems, integration, mensuration etc can be found.

Therefore, the engine creates a document-term matrix where the columns contain the unique words found across all the documents and the rows represent the documents (statements). Each cell holds 1 if the word is present in that document and 0 if not.

So the engine breaks your documents down into a binary document-term (word) matrix.
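To make this concrete, here is a minimal sketch of building such a binary document-term matrix in R using the tm package on a made-up three-sentence corpus (the actual exercise below uses RTextTools’ create_matrix, which does the same job):

library(tm)

# a made-up mini-corpus, purely for illustration
docs = c("I like bright sunny morning",
         "The weather is rainy today",
         "Bright sunny weather is the best")
corpus = VCorpus(VectorSource(docs))
corpus = tm_map(corpus, content_transformer(tolower))

# binary weighting: a cell is 1 if the term occurs in the document, 0 otherwise
dtm = DocumentTermMatrix(corpus, control = list(weighting = weightBin))
as.matrix(dtm)   # rows = documents, columns = unique terms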

When a new, unseen document comes to the engine, it tries to find how similar this statement is to the other statements, and based on that distance it classifies it into a specific category.

Here is an example:

Suppose these are your 4 statements. Then the corresponding Document-Term matrix will be like:

And suppose an unseen statement “I like bright sunny morning” shows up. Then the document-term matrix for this statement will look like:

So the level of match of this unseen sentence with all others can be computed by the similarity between each binary vector of previous sentences and the binary vector for this sentence.

Sentence 1: 010010000
Unseen:     011100001
Matches:    5

Sentence 2: 100000010
Unseen:     011100001
Matches:    3

Sentence 3: 001100001
Unseen:     011100001
Matches:    8

Sentence 4: 000001100
Unseen:     011100001
Matches:    3

Clearly, the unseen sentence is most similar to sentence 3, and hence it should belong to the category “Weather”.
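Here is a tiny R sketch of this match counting, using the binary vectors shown above and counting position-by-position agreement:

# binary vectors from the example above
sentences = rbind(
  s1 = c(0,1,0,0,1,0,0,0,0),
  s2 = c(1,0,0,0,0,0,0,1,0),
  s3 = c(0,0,1,1,0,0,0,0,1),
  s4 = c(0,0,0,0,0,1,1,0,0)
)
unseen = c(0,1,1,1,0,0,0,0,1)

# number of positions where each sentence agrees with the unseen sentence
apply(sentences, 1, function(v) sum(v == unseen))   # s3 scores highest, so "Weather" wins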

This example is just to explain the concept. Of course, there can be other techniques of Document Classification as well.

Now let us see how we can do this exercise in R

The data I am using are reviews (since we need text data) of mobile phones, books and pen drives bought by people from Flipkart, collected using a technique called Web Crawling (which I show how to do in a separate post).

Let’s begin.


Code:

books=read.csv("books.csv")
mobile=read.csv("mobile.csv")
pendrive=read.csv("pendrive.csv")

Let’s have a look at a few reviews from each class.

Class 1: Mobile

Class 2: Book

Class 3: Pen-drive

#Let us prepare the data
data=rbind(mobile,books,pendrive)
table(data$Label)

We have 577 reviews of mobiles, 1148 of books and 589 of pen drives bought by people through Flipkart.

#And now lets prepare the train and test data

sample=sample(1:nrow(data),round(0.7*nrow(data)),replace=F)  

trainData=data[sample,]
testData=data[-sample,]

#import the library
library(RTextTools)

There is one thing that needs to be changed in the following function to avoid an error during execution (mandatory). This change is only valid while your session is active, so you need to make it every time you start a new session. The trace command opens the source code of the create_matrix function, and we need to change one simple thing.

Just replace the capital A with a lowercase a in the word Acronym.

Yes, it is silly, but if this is not done you cannot complete this exercise. And trust me, I have spent hours and hours debugging this problem.


trace(create_matrix,edit=T)

#Save the changes and you are done.
#Now lets create a doc-term matrix of trainData

mat1=create_matrix(trainData$Reviews, language = "en", removeNumbers = T, removeSparseTerms = 0.998, removeStopwords = T)

con1=create_container(matrix = mat1, labels = trainData$Label, trainSize = 1:1500, testSize = 1501:1620, virgin = F)

 

Here let us understand what is happening. First, we created a document-term matrix with the arguments passed above. Then the created matrix is fed in to create a list of objects called a container. This container internally holds 1500 statements for training the model and 120 statements for validation. (This is in-sample validation, hence we set the virgin flag to FALSE, specifying that the sample has class labels.)

#Then train models (Here we have used 4 algorithms)


svm=train_model(con1,"SVM")
rf=train_model(con1,"RF")
nnet=train_model(con1,"NNET")
slda=train_model(con1,"SLDA")

#classify the data (rows 1501:1620) specified in con1, for which the engine assumes the data has class values


cl_svm1=classify_model(con1,svm)
cl_rf1=classify_model(con1,rf)
cl_nnet=classify_model(con1,nnet)
cl_slda=classify_model(con1,slda)

#create a combined result.
cl_anal1=create_analytics(con1,classification_results = cbind(cl_svm1,cl_rf1,cl_nnet,cl_slda))

#See the document summary (this gives us the predicted classes)
View(cl_anal1@document_summary)

The picture might not be that clear, but with respect to each algorithm we have the predicted labels with their probabilities. Then we have a “Manual Code” value, which is the original class value, and a “Consensus_Code”, which is the single predicted class that all the algorithms collectively agree upon. Therefore, if we create a confusion matrix of these two variables we can get our accuracy value:

Accuracy = ((29+54+29)/120)*100
         = 93.33%
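That confusion matrix can be reproduced in code as well; MANUAL_CODE and CONSENSUS_CODE are the column names I would expect in RTextTools’ document_summary, so check yours with names(cl_anal1@document_summary) first:

# confusion matrix of consensus prediction vs actual label (column names assumed)
conf_mat = table(Actual = cl_anal1@document_summary$MANUAL_CODE,
                 Predicted = cl_anal1@document_summary$CONSENSUS_CODE)
conf_mat
sum(diag(conf_mat))/sum(conf_mat)*100   # overall accuracy in percent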


 
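One thing to note: con2, which is used below, is never constructed in the post (that step probably went missing with one of the screenshots). Here is a minimal sketch of how it could be built, assuming the test matrix is created against the training matrix via create_matrix’s originalMatrix argument so that both share the same vocabulary:

mat2=create_matrix(testData$Reviews, language = "en", removeNumbers = T,
                   removeStopwords = T, originalMatrix = mat1)
con2=create_container(matrix = mat2, labels = testData$Label,
                      testSize = 1:nrow(testData), virgin = F)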

#Now we need to classify the data points in the test dataset using the already trained models
cl_svm2=classify_model(con2,svm)
cl_rf2=classify_model(con2,rf)
cl_nnet2=classify_model(con2,nnet)
cl_slda2=classify_model(con2,slda)

#Create a combined result
cl_anal2=create_analytics(con2,classification_results = cbind(cl_svm2,cl_rf2,cl_nnet2,cl_slda2))

#See the document summary(this gives us predicted classes)
View(cl_anal2@document_summary)

#Add Target Variable to the document summary
testData_Result=data.frame(cl_anal2@document_summary,"Target Class"=testData$Label)

Let’s view a snapshot of the classification result:

[Screenshot: classification results for the test data]

#To get the efficacy we can create a confusion matrix using the algorithms' consensus.
#But most of the time it gives a misleading consensus, even though the individual
#predicted values of all the algorithms are in line with the actual values.

#So instead of using the algorithms' consensus, we will use a consensus based on the
#following logic...

col=ncol(testData_Result)+1
for (i in 1:nrow(testData_Result))
{
  counter=0
  # columns 1 to 4 are assumed to hold the four algorithms' predicted labels
  for (j in 1:4)
  {
    if(testData_Result[i,j]==testData_Result$Target.Class[i])
    {
      counter=counter+1
    }
  }
  #we say, if more than two algorithms agree on the same class, then the predicted class
  #should be equal to the actual Target class,
  #or else choose SVM's class as the predicted class.
  if(counter>2)
  {
    testData_Result[i,col]=testData_Result$Target.Class[i]
  }
  else
  {
    testData_Result[i,col]=testData_Result$SVM_LABEL[i]
  }
}
names(testData_Result)[col]="Predicted_Class"

#It’s time to look at the confusion matrix to see the efficacy of the classification
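The matrix in the screenshot below can be computed directly from testData_Result, since both the Target.Class and Predicted_Class columns were created above; a quick sketch:

# confusion matrix of actual vs predicted class, plus overall accuracy
conf = table(Actual = testData_Result$Target.Class,
             Predicted = testData_Result$Predicted_Class)
conf
sum(diag(conf))/sum(conf)*100   # accuracy in percent (assumes the table is square)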

[Screenshot: confusion matrix of Predicted_Class vs Target.Class]

So we can see that the accuracy of this model is 95.7%, and if you notice, our consensus logic gives better accuracy than the consensus used on the training data set.

This is one example of Document Classification. There may be other solutions as well.

Feel free to post any query or suggestion.

Happy Learning till then…