Document Classification Using R


In this post, I would like to share the concept of Document Classification, why it is needed, where it is applied, and how we can do it in R. So let’s get started…

Have you ever wondered how Gmail or Yahoo decides which messages go to your inbox and which are directed to the spam folder?

Or suppose you work in a library where you are given 100 e-books to be filed into their respective catalogs, and these e-books belong to 10 different subjects. Would you read every e-book (say, only the first page) and then file it under its catalog category?

What if you could just provide the content of these files to a black-box engine that smartly classifies the 100 e-books into the categories/subjects defined by you?

This is what Document Classification does.

You give the engine a set of statements with proper category tags, and the engine learns which statement comes from which category based on the distinctive words in each statement.

For example:

1. If a document belongs to astronomy, then you may find words like space, satellite, comets, stars, galaxy etc.

2. If a document belongs to mathematics, then words like partial differentiation, algebra, theorems, integration, mensuration etc. can be found.

Therefore, the engine creates a document-term matrix in which the columns are the unique words found across all the documents and the rows represent the documents (statements). Each cell holds 1 if the word is present in that document and 0 if not.

So the engine breaks your documents into a binary document-term (word) matrix.
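To make the idea concrete, here is a minimal base-R sketch of building such a binary document-term matrix. The three sentences and the helper name binary_dtm are made up for illustration; they are not the statements used later in this post, and a real engine would do more cleaning (stop words, punctuation, stemming).

```r
# Hypothetical toy documents (not the statements from this post)
docs <- c("the stars and the galaxy",
          "algebra and integration",
          "stars in the night sky")

# Build a binary document-term matrix: rows = documents, columns = unique words
binary_dtm <- function(texts) {
  # lower-case each text and split it on anything that is not a letter
  tokens <- strsplit(tolower(texts), "[^a-z]+")
  tokens <- lapply(tokens, function(w) unique(w[w != ""]))
  vocab  <- sort(unique(unlist(tokens)))
  # one row per document: 1 if the word occurs in it, 0 otherwise
  m <- do.call(rbind, lapply(tokens, function(w) as.integer(vocab %in% w)))
  dimnames(m) <- list(paste0("doc", seq_along(texts)), vocab)
  m
}

dtm <- binary_dtm(docs)
print(dtm)
```

Each row of dtm is the binary vector of one document, which is the form the matching step works on.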

When a new, unseen document comes to the engine, it measures how similar this statement is to the statements it already knows and, based on that distance, classifies it into a specific category.

Here is an example:

Suppose these are your 4 statements. Then the corresponding Document-Term matrix will be like:

And suppose an unseen statement “I like bright sunny morning” shows up. Then the document-term matrix for this statement will look like:

So the level of match between this unseen sentence and each of the others can be computed from the similarity between the binary vector of each previous sentence and the binary vector of this sentence. Here, a match is a position where both vectors hold the same value.

Sentence 1: 010010000
Unseen:     011100001
Matches:    5

Sentence 2: 100000010
Unseen:     011100001
Matches:    3

Sentence 3: 001100001
Unseen:     011100001
Matches:    8

Sentence 4: 000001100
Unseen:     011100001
Matches:    3

Clearly, we can see that this sentence is very similar to sentence 3 and hence it should belong to category “Weather”.
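The comparison can be sketched in a few lines of base R. The binary vectors are copied from the example; the helper names to_bits and count_matches are illustrative, and counting the positions that agree is only one possible similarity measure.

```r
# Binary vectors copied from the example above
known <- list(
  "Sentence 1" = "010010000",
  "Sentence 2" = "100000010",
  "Sentence 3" = "001100001",
  "Sentence 4" = "000001100"
)
unseen <- "011100001"

# Turn a string like "0101" into a vector of single characters
to_bits <- function(s) strsplit(s, "")[[1]]

# Count the positions where two vectors hold the same value
count_matches <- function(a, b) sum(to_bits(a) == to_bits(b))

# Compare the unseen vector against every known sentence
matches <- sapply(known, count_matches, b = unseen)

# The closest sentence; the unseen statement inherits its category
best <- names(matches)[which.max(matches)]
print(best)
```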

This example is just to explain the concept. Of course, there can be other techniques of Document Classification as well.

Now let us see how we can do this exercise in R.

The data I am using are reviews (since we need text data) of mobile phones, books and pen-drives bought by people from Flipkart. They were collected with a technique called web crawling (which I will cover in a separate post).

Let’s begin.



Let’s have a look at a few reviews from each class.

Class 1: Mobile

Class 2: Book

Class 3: Pen-drive

#Let us prepare the data
data=rbind(mobile,books,pendrive)
table(data$Label)

We have 577 reviews of mobiles, 1148 of books and 589 of pen-drives bought by people through Flipkart.

#And now let us prepare the train and test data
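One way to do such a split, as a sketch: shuffle the row numbers and keep a fixed share for training. The 70/30 ratio and the seed are assumptions, not values stated in this post (70% of the 2314 reviews would give about 1620 training rows), and the small data frame below is only a stand-in so that the snippet runs on its own.

```r
# Hypothetical stand-in for the combined review data (same column names)
data <- data.frame(
  Reviews = paste("review text", 1:10),
  Label   = rep(c("Mobile", "Book"), each = 5),
  stringsAsFactors = FALSE
)

set.seed(42)                      # arbitrary seed, only for reproducibility
train_fraction <- 0.7             # assumed split ratio, not stated in the post
n_train   <- round(train_fraction * nrow(data))
train_idx <- sample(nrow(data), n_train)

trainData <- data[train_idx, ]    # rows used to fit the models
testData  <- data[-train_idx, ]   # held-out rows for the final check
```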



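#import the library (RTextTools provides create_matrix, create_container, train_model, classify_model and create_analytics)
library(RTextTools)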

There is one thing that must be changed in the following function to avoid an error during execution. The change lasts only as long as your session is active, so you need to make it again every time you start a new session. The trace command opens the source code of the create_matrix function, and we need to change one simple thing.

Just replace A by a in the word Acronym.

Yes, it is silly, but if this is not done you cannot complete this exercise. And trust me, I have spent hours and hours debugging this problem.
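A minimal sketch of that step, assuming the RTextTools package is installed. The guard and the variable name can_edit are additions for illustration, so that the editor is only opened in an interactive session.

```r
# Only try to open an editor in an interactive session with RTextTools available
can_edit <- interactive() && requireNamespace("RTextTools", quietly = TRUE)

if (can_edit) {
  library(RTextTools)
  # Opens the source of create_matrix in an editor;
  # change "Acronym" to "acronym" there
  trace("create_matrix", edit = TRUE)
}
```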


#Save the changes and you are done.
#Now let us create a doc-term matrix of trainData

mat1=create_matrix(trainData$Reviews,language = "en",removeNumbers = TRUE,removeSparseTerms=0.998, removeStopwords = TRUE)

con1=create_container(matrix = mat1, labels = trainData$Label, trainSize = 1:1500, testSize = 1501:1620, virgin = FALSE)


Let us understand what is happening here. First, we created a Document-Term matrix with the arguments passed above. Then that matrix is fed in to create a list of objects called a container. Internally, this container holds 1500 statements for training the model and 120 statements for validation. (This is in-sample validation, hence we set the virgin flag to FALSE, specifying that the sample has class labels.)

#Then train models (Here we have used 4 algorithms)

svm=train_model(con1,"SVM")
rf=train_model(con1,"RF")
nnet=train_model(con1,"NNET")
slda=train_model(con1,"SLDA")

#classify data (1501:1620) specified in con1, for which the engine assumes the data has class values

cl_svm1=classify_model(con1,svm)
cl_rf1=classify_model(con1,rf)
cl_nnet=classify_model(con1,nnet)
cl_slda=classify_model(con1,slda)

#create a combined result.
cl_anal1=create_analytics(con1,classification_results = cbind(cl_svm1,cl_rf1,cl_nnet,cl_slda))

#See the document summary (this gives us predicted classes)
View(cl_anal1@document_summary)

The picture might not be that clear. Here, for each algorithm, we have the predicted labels with their probabilities. Then we have a “Manual Code” value, which is the original class value, and a “Consensus_Code”, which is the single predicted class that all the algorithms collectively decide upon. Therefore, if we create a confusion matrix of these 2 variables, we can get our accuracy value:

Accuracy = ((29+54+29)/120)*100 = 93.33%
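The same arithmetic can be checked in R. The first part uses only the figures quoted above; the helper accuracy_pct and the four toy labels after it are illustrative assumptions that show how accuracy falls out of a confusion matrix in general.

```r
# Correctly classified documents per class (the diagonal of the
# confusion matrix) and the size of the validation block, from the text
correct  <- c(29, 54, 29)
total    <- 120
accuracy <- 100 * sum(correct) / total
print(round(accuracy, 2))

# Generic helper: accuracy (in %) from actual and predicted labels
accuracy_pct <- function(actual, predicted) {
  lv <- union(as.character(actual), as.character(predicted))
  # square confusion matrix with the same classes on both axes
  cm <- table(Actual    = factor(actual, levels = lv),
              Predicted = factor(predicted, levels = lv))
  100 * sum(diag(cm)) / sum(cm)
}

# Hypothetical toy labels, only to exercise the helper
toy_actual    <- c("Book", "Book", "Mobile", "Pen-drive")
toy_predicted <- c("Book", "Mobile", "Mobile", "Pen-drive")
toy_accuracy  <- accuracy_pct(toy_actual, toy_predicted)
```

With the document summary, one would pass the manual-code column as actual and the consensus-code column as predicted.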


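#Now we need to classify the data points in the test dataset using the already trained models
cl_svm2=classify_model(con2,svm)
cl_rf2=classify_model(con2,rf)
cl_nnet2=classify_model(con2,nnet)
cl_slda2=classify_model(con2,slda)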

#Create a combined result
cl_anal2=create_analytics(con2,classification_results = cbind(cl_svm2,cl_rf2,cl_nnet2,cl_slda2))

#See the document summary(this gives us predicted classes)

#Add Target Variable to the document summary
testData_Result=data.frame(cl_anal2@document_summary,"Target Class"=testData$Label)

Let’s view a snapshot of the classification result…

#To get the efficacy we can create a confusion matrix using the algorithms’ consensus. But
#most of the time it gives a misleading consensus even though you can see that the individual
#predicted values of all algorithms are in line with the actual values.

#So instead of using the algorithms’ consensus, we will use a consensus based on the
#following logic…

for (i in 1:nrow(testData_Result))
for (j in 1:4)
#we say, if more than two algorithms agree on the same class, then the predicted class should
#be equal to the actual Target class.
#Otherwise, choose SVM’s class as the predicted class.
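One reading of the logic described in these comments is a plain majority vote over the four predicted labels, with SVM as the fallback. The sketch below follows that reading and may differ from the loop that was actually run; it does not look at the Target class. The data frame, its column names and the function vote are hypothetical and are not the columns of the document summary.

```r
# Hypothetical predictions from four algorithms for five documents
preds <- data.frame(
  svm  = c("Book", "Mobile", "Book",      "Pen-drive", "Mobile"),
  rf   = c("Book", "Mobile", "Mobile",    "Pen-drive", "Book"),
  nnet = c("Book", "Book",   "Mobile",    "Pen-drive", "Pen-drive"),
  slda = c("Book", "Mobile", "Pen-drive", "Mobile",    "Book"),
  stringsAsFactors = FALSE
)

# Take the class chosen by more than two algorithms,
# otherwise fall back to SVM's class
vote <- function(row) {
  counts <- table(row)
  if (max(counts) > 2) names(counts)[which.max(counts)] else row[["svm"]]
}

# Apply the vote to every document (row)
preds$predicted <- apply(preds[, c("svm", "rf", "nnet", "slda")], 1, vote)
print(preds)
```

The predicted column can then be compared with the actual labels in a confusion matrix.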

#It’s time to look at the confusion matrix to see the efficacy of the classification


So we can see that the accuracy of this model is 95.7%. Notice that our consensus logic gives better accuracy than the consensus used on the training data set (93.33%).

This is one example of Document Classification. There may be other solutions as well.

Feel free to post any query or suggestion.

Happy Learning till then…