SVM Tutorial: How to classify text in R

In this tutorial I will show you how to classify text with SVM in R.


The main steps to classify text in R are:

  1. Create a new RStudio project
  2. Install the required packages
  3. Read the data
  4. Prepare the data
  5. Create and train the SVM model
  6. Predict with new data

Step 1: Create a new RStudio Project

To begin with, you will need to download and install the RStudio development environment.

Once you have installed it, you can create a new project by clicking on "Project: (None)" at the top right of the screen:


Create a new project in R Studio

This will open the following wizard, which is pretty straightforward:


Select "New Directory"


We will create an empty project


Name your project and you are done

Now that the project is created, we will add a new R Script:


You can save this script under any name you wish, for instance "Main".


Saving our first script

Step 2: Install the required packages

To easily classify text with SVM, we will use the RTextTools package.

In RStudio, on the right side, you can see a tab named "Packages". Select it and then click "Install R packages".


RStudio lists all installed packages

This will open a popup where you need to enter the name of the package, RTextTools.


Be sure to check "Install dependencies"

Once it is installed, it will appear on the package list. Check it to load it in the environment.


The RTextTools package now appears on the list
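
If you prefer the console to the RStudio dialogs, the same installation can be done from a script. This is only a minimal sketch: it assumes the package is available from the CRAN mirror configured on your machine, which may not be true for every R version.

```r
# Install RTextTools and its dependencies only if it is not installed yet
if (!requireNamespace("RTextTools", quietly = TRUE)) {
  install.packages("RTextTools", dependencies = TRUE)
}

# Load the package into the current session
# (same effect as ticking its box in the "Packages" tab)
library(RTextTools)
```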

Step 3: Read the data

For this tutorial we will use a very simple data set (click to download).
With just a few lines of R, we load the data in memory:

# Load the data from the CSV file
dataDirectory <- "D:/" # put your own folder here
data <- read.csv(paste(dataDirectory, 'sunnyData.csv', sep=""), header = TRUE)
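
If you do not have the file at hand, you can try the loading step on a stand-in file. The sketch below writes a tiny CSV with the same two columns into a temporary folder and reads it back. The rows and the names demoDirectory, demoFile and demoData are made up for illustration and are not the real data set. It also uses file.path, which inserts the folder separator for you.

```r
# Hypothetical stand-in rows, NOT the real sunnyData.csv
demoDirectory <- tempdir()
demoFile <- file.path(demoDirectory, "demoSunnyData.csv")
writeLines(c("Text,IsSunny",
             "sunny,1",
             "rainy,-1",
             "sunny sunny rainy,1"),
           demoFile)

# header = TRUE tells read.csv that the first line holds the column names
demoData <- read.csv(demoFile, header = TRUE, stringsAsFactors = FALSE)

# Show the structure: two columns, Text and IsSunny
str(demoData)
```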

Step 4: Prepare the data

The data has two columns: Text and IsSunny.


Our simple data set

We will need to convert it to a Document Term Matrix.

To understand what a document term matrix is, or to learn more about the data set, you can read: How to prepare your data for text classification?
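
To give a rough idea of what such a matrix contains, here is a small hand-made version in base R. It is only a sketch on made-up sentences: each row is a document, each column is a word, and each cell is a word count. The names docs, vocabulary and toyMatrix are mine, and a real tool usually applies extra preprocessing (lowercasing, stemming and so on) that is left out here.

```r
# Made-up example documents
docs <- c("sunny sunny rainy", "rainy", "sunny")

# Split each document into words (whitespace-separated)
tokens <- strsplit(docs, "\\s+")

# The vocabulary is the sorted set of distinct words
vocabulary <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each document
counts <- vapply(tokens,
                 function(words) as.integer(table(factor(words, levels = vocabulary))),
                 integer(length(vocabulary)))

# Documents in rows, terms in columns
toyMatrix <- t(counts)
dimnames(toyMatrix) <- list(Docs = as.character(seq_along(docs)),
                            Terms = vocabulary)
toyMatrix
```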

Using RTextTools

The RTextTools package provides a powerful way to generate a document term matrix with the create_matrix function:

# Create the document term matrix
dtMatrix <- create_matrix(data["Text"])

Typing the name of the matrix in the console shows us some interesting facts:


For instance, the sparsity can help us decide whether we should use a linear kernel.
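
Sparsity is simply the share of cells in the matrix that are equal to zero. The following sketch computes it by hand on a made-up count matrix (toyCounts is a hypothetical name, not part of the tutorial data). Text data is typically very sparse, which is one reason a linear kernel is often tried first.

```r
# Made-up 3 x 4 count matrix (documents in rows, terms in columns)
toyCounts <- matrix(c(2, 0, 0, 1,
                      0, 0, 1, 0,
                      0, 3, 0, 0),
                    nrow = 3, byrow = TRUE)

# Sparsity = number of zero cells / total number of cells
sparsity <- sum(toyCounts == 0) / length(toyCounts)
sparsity
```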

Step 5: Create and train the SVM model

In order to train an SVM model with RTextTools, we need to put the document term matrix inside a container. In the container's configuration, we indicate that the whole data set (rows 1 to 11) will be the training set.

# Configure the training data
container <- create_container(dtMatrix, data$IsSunny, trainSize=1:11, virgin=FALSE)

# Train an SVM model
model <- train_model(container, "SVM", kernel="linear", cost=1)

The code above trains a new SVM model with a linear kernel.

Note:

  • both create_container and train_model are RTextTools functions.
    Under the hood, RTextTools uses the e1071 package, which is an R wrapper around libsvm;
  • the virgin=FALSE argument tells RTextTools not to save an analytics_virgin-class object inside the container. This parameter does not interest us now but is required by the function.
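
Since RTextTools relies on e1071, it can help to see what a direct call to that package looks like. Here is a minimal sketch, assuming the e1071 package is installed. The count vectors and the names x, y, toyModel and newSentence are invented for illustration, and this is not necessarily the exact call RTextTools makes internally.

```r
library(e1071)

# Made-up word counts: one row per sentence, one column per vocabulary word
x <- rbind(c(1, 0), c(2, 0), c(2, 1),   # more "rainy" than "sunny"
           c(0, 1), c(0, 2), c(1, 2))   # more "sunny" than "rainy"
colnames(x) <- c("rainy", "sunny")
y <- factor(c(-1, -1, -1, 1, 1, 1))

# Linear-kernel SVM with the same cost as in the tutorial;
# scale = FALSE keeps the raw counts as features
toyModel <- svm(x, y, kernel = "linear", cost = 1, scale = FALSE)

# Predict the class of a new count vector (0 "rainy", 3 "sunny")
newSentence <- matrix(c(0, 3), nrow = 1,
                      dimnames = list(NULL, colnames(x)))
predict(toyModel, newSentence)
```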

Step 6: Predict with new data

Now that our model is trained, we can use it to make new predictions!

We will create new sentences which were not in the training data:

# new data
predictionData <- list("sunny sunny sunny rainy rainy", "rainy sunny rainy rainy", "hello", "", "this is another rainy world")

Before continuing, let's check the new sentences:

  • "sunny sunny sunny rainy rainy"
    This sentence talks more about sunny weather than rainy weather. We expect it to be classified as sunny (+1).
  • "rainy sunny rainy rainy"
    This sentence talks more about rainy weather than sunny weather. We expect it to be classified as rainy (-1).
  • ""
    This sentence has no words. It should return either +1 or -1, depending on the decision boundary.
  • "hello"
    This sentence has a word which was not present in the training set. It will be equivalent to "".
  • "this is another rainy world"
    This sentence has several words not in the training set, and the word rainy. It is equivalent to the sentence "rainy" and should be classified as rainy (-1).

We create a document term matrix for the test data:

# create a prediction document term matrix
predMatrix <- create_matrix(predictionData, originalMatrix=dtMatrix)

Notice that this time we provided the originalMatrix as a parameter. This is because we want the new matrix to use the same vocabulary as the training matrix.
Without this indication, the function will create a document term matrix using all the words of the test data (rainy, sunny, hello, this, is, another, world). It means that each sentence will be represented by a vector containing 7 values (one for each word)!

Such a matrix won't be compatible with the model we trained earlier because it expects vectors containing 2 values (one for rainy, one for sunny).
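
To make the idea of a shared vocabulary concrete, here is a base R sketch that maps new sentences onto a fixed two-word vocabulary. The helper toVector and the variable names are hypothetical and this is not how create_matrix is implemented. It only shows that unknown words are dropped and that every sentence ends up as a vector of the same length.

```r
# Vocabulary known at training time
trainingVocabulary <- c("rainy", "sunny")

newSentences <- c("sunny sunny sunny rainy rainy", "hello", "",
                  "this is another rainy world")

# Count only the words that belong to the given vocabulary;
# words outside the vocabulary are ignored
toVector <- function(sentence, vocabulary) {
  words <- strsplit(sentence, "\\s+")[[1]]
  as.integer(table(factor(words, levels = vocabulary)))
}

# One row per sentence, one column per vocabulary word
vectors <- t(vapply(newSentences, toVector,
                    integer(length(trainingVocabulary)),
                    vocabulary = trainingVocabulary, USE.NAMES = FALSE))
colnames(vectors) <- trainingVocabulary
vectors
```

In this sketch "hello" and "" both become a row of zeros, which is why the two sentences are treated as equivalent.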

We now create the container:

# create the corresponding container
predSize <- length(predictionData)
predictionContainer <- create_container(predMatrix, labels=rep(0,predSize), testSize=1:predSize, virgin=FALSE)

Two things are different:

  • we use a zero vector for labels, because we want to predict them
  • we specified testSize instead of trainSize so that the data will be used for testing

Finally, we can make predictions:

# predict
results <- classify_model(predictionContainer, model)
results

Predictions and their probabilities

As expected, the first sentence has been classified as sunny, and the second and last ones as rainy.
We can also see that the third and fourth sentences ("hello" and "") have been classified as rainy, but the probability is only 52%, which means our model is not very confident in these two predictions.
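
If you want to spot such uncertain predictions automatically, you can filter the results on the probability column. The sketch below works on a hypothetical data frame: I am assuming the columns are named SVM_LABEL and SVM_PROB, only the two 0.52 values come from this tutorial, and the other probabilities and the 0.6 threshold are invented for illustration.

```r
# Hypothetical results table; column names are an assumption and
# every probability except the two 0.52 values is made up
toyResults <- data.frame(
  SVM_LABEL = factor(c(1, -1, -1, -1, -1)),
  SVM_PROB  = c(0.90, 0.85, 0.52, 0.52, 0.80)
)

# Flag predictions whose probability is below a chosen threshold
threshold <- 0.6   # arbitrary cut-off, adjust it for your own data
toyResults$LowConfidence <- toyResults$SVM_PROB < threshold

# Keep only the rows the model is unsure about
toyResults[toyResults$LowConfidence, ]
```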

Summary of this SVM Tutorial

Congratulations! You have trained an SVM model and used it to make predictions on unseen data. We only saw a small part of what is possible with RTextTools.

In the following articles, I will give you more details about text classification and we will see an example using real-life data. Stay tuned!

If you are interested in learning how to classify text with other languages, you can read:

You can also get all the code from this article:

I am passionate about machine learning and Support Vector Machines. When I am not writing this blog, you can find me on Kaggle participating in competitions.

81 thoughts on “SVM Tutorial: How to classify text in R”

  1. Carlos Alex Gulo

    Congratulations for your post. By the way, I got an error message:
    predMatrix <- create_matrix(predictionData, originalMatrix=dtMatrix)
    Error in if (attr(weighting, "Acronym") == "tf-idf") weight <- 1e-09 :
    argument is of length zero

    Could you please, help me to solve it?

    Thank you!

    1. Alexandre KOWALCZYK Post author

      Thanks for your comment Carlos. You can download the project on GitHub, it should work correctly. Then you can modify it step by step to find out what causes this error for you.

  2. Ciarán O'Kelly

    Same problem here I'm afraid, at exactly the same point. Latest R with both e1071 and RTextTools loaded and following the GitHub script.

    That said, your work on SVM here is just great so thanks very much.

  3. Ciarán O'Kelly

    Arch Linux, so that's different to you. I also found this, which is strange. I'm trying to figure the package out now.....well, with a three year old on my lap...

    Have a lovely Christmas!

    1. Alexandre KOWALCZYK Post author

      Hi everyone. I installed Linux and tried it myself. I was able to reproduce the error.
      The problem comes from the fact that in version 0.6 of the tm package the object tm::weightTfIdf has a property called "acronym", but it was called "Acronym" in version 0.5.

      You can try the following if you are using an R version < 3.1:
      remove.packages("tm", lib="~/R/x86_64-pc-linux-gnu-library/3.1")
      packageurl <- "http://cran.r-project.org/src/contrib/Archive/tm/tm_0.5-10.tar.gz"
      install.packages(packageurl, contriburl=NULL, type="source")

      If you are using the latest R version and don't want to use an older one, you can perform the modification yourself in RTextTools like this person.

      Other than that, I submitted an issue to the creator of RTextTools so that he can fix the problem. Maybe you could comment on it so that he corrects it.

      I hope this helps.

      1. Tarfa

        Thank u, but I tried to download tm version 0.5.1 also 0.5.10 but the error persists, my os is windows 7, and R version is 3.2.0

  4. cryptexcode

    Thanks for the nice post. I am having a problem.
    Running this
    "predictionContainer <- create_container(predMatrix, labels=rep(0,predSize), testSize=1:predSize, virgin=FALSE)"

    returns this "Error in x$nrow : $ operator is invalid for atomic vectors"

    1. Alexandre KOWALCZYK Post author

      Most likely this is because you are providing an object of the class table which does not support the $ operator. See this link for more details.

      1. cryptexcode

        Hi, Solved the issue. 🙂

        I am beginner with R, so became confused with binary packages and source packages. Thats why installed older versions and faced the problem.

        Downloaded RTextTools 1.4.2 source packages. Modified Acronym to acronym, opened it in RStudio, build binary and installed the package and it worked 🙂

        1. Arun Kumar

          Hi,
          I am facing the same problem, I want to change the "Acronym" to "acronym" but coulnd't find the create_matrix.R file in my installed package. There are only RTextTools.rdb/rdx file. Please let me know how to modify these files. Can you please share the link to get the R files for the same. I am using Windows platform. PL help

          1. Alexandre KOWALCZYK Post author

            Hello Arun. You might need to try to reinstall the package or try using a different version to see if the file is there. Regards.

          2. thesoul

            Hello Arun,

            To edit the files pls load the library using

            library(RTextTools)

            then run the following

            trace("create_matrix",edit=T)

            on line 42 it contains

            if (attr(weighting, "Acronym") == "tf-idf")

            change Acronym to acronym and save the file.

            This worked for me.

  5. Chris

    Hi Alexandre,

    thank you for your tutorial. I have trained the SVM model and now I would like to save it in PMML format to call the model in Java-code. I have experience of saving SVM models from e1071 pacckage but it doesnt work for the model from RTextTools package.

    Could you please give me advise how this problem can be solved?

    Thank you.

    Best regads,
    Chris

  6. Mark Nasila

    Greetings Alexandre,

    i have been browsing the web searching for posts on training SVMs using R. i am fitting an svm successfully and even using the predict function to score a validation data-set. I however get erroneous values when i try get a classification table using the table(..) statement. i.e i obtain actual probabilities/scores instead of the classes am predicting. how do i go about fixing this error?

    1. Alexandre KOWALCZYK Post author

      Hello Mark. Sorry but I can't help you without a more detailed example. You can post your question with a reproducible example on stackoverflow and I am sure someone will have a solution for you (use the tags R and svm). Best regards.

  7. Sonya S.

    Hi Alexandre, thanks for this very helpful SVM tutorial! I've been doing some other research on using SVM for text classification and I'm unable to figure out a way to extract the key features (e.g. words) that are most representative of each category? So, for instance, if i have text which is comprised of fruits and vegetables (e.g. tomato, avocado, banana, apple, celery, potato, etc.), is there a way to see which words fall into which category? Thanks!

    1. Alexandre KOWALCZYK Post author

      Hello Sonya. What you can do is to categorize manually each text talking about vegetables, and then you compute the information gain for each word of your vocabulary. As a result you will have a value indicating for each word of your vocabulary how it adds information to the fact the text belongs to the vegetable category. There is also another approach based on Chi-squared.

  8. sainath kumar

    IF you want to edit Create_matrix function use this code and follow steps to change the capital letter to small for "Acronym"...

    trace("create_matrix",edit=TRUE)

  9. Vasudev

    Sainath, after executing the line you given its giving a prompt with Function and open and closed bracket, no source was there to change the "Acronym"...can you please guide

  10. Iris G

    Hello,

    I have been using the create_matrix function with no problems at all to classify tweets content; however, when changing the ngramLength=1 parameter to ngramLength=2, I got the error:

    Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
    'i, j, v' different lengths
    In addition: Warning messages:
    1: In mclapply(unname(content(x)), termFreq, control) :
    all scheduled cores encountered errors in user code
    2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
    NAs introduced by coercion

    Has any of your experienced this?

  11. Masud Rahman

    Thanks for the tutorial. However, I am getting this error for this line (predMatrix <- create_matrix(predictionData,originalMatrix=dtMatrix) )
    Error in if (attr(weighting, "Acronym") == "tf-idf") weight <- 1e-09 :
    argument is of length zero

    1. Gerald Logor

      In the R console:
      trace("create_matrix",edit=T)

      Edit the source function "create_matrix", on Line 42 (at the current writing time), and change "Acronym" to "acronym". (Big A to small a)

      Hope it helps

  12. satish

    thanks for tutorial. (a)kindly give us an example with code stubs for svm with multi-class (more than 2 classes) for text classification. (b) how do we fine tune svm with parameters like cost/gama and others ... we have seen that changing these values changes impacts probability

      1. Meghna

        Regarding the Multiclass thing, I came across the Voting Strategy concept as well and the ways to make coupling, can you please share an example regarding this. (I came across this link )

  13. ste_hollywood

    Hi, thank you for this tutorial. But, is there a way to obtain all the probabilities with classify_model? Thank you very much.

    1. Alexandre KOWALCZYK Post author

      Hi. What do you mean by all the probabilities? It looks to me that we have them with classify_model.

  14. satish

    R text classification sometimes gives below error - what could be reason?

    Error in validObject(.Object) :
    invalid class "matrix.csr" object: ra has too few, or too many elements
    Calls: create_container ... new -> initialize -> initialize -> .local -> validOb
    ject
    Execution halted

  15. satish

    Hi, Alexandre - regarding error

    R text classification sometimes gives below error - what could be reason?

    Error in validObject(.Object) :
    invalid class "matrix.csr" object: ra has too few, or too many elements
    Calls: create_container ... new -> initialize -> initialize -> .local -> validOb
    ject
    Execution halted

    I have noticed this error occurs if nothing matches test data from training set. If ateast one word matches then error does not occur.

  16. satish

    Hi Alexandre - below is code (taken from your site and test data - just change path)

    library(RTextTools)

    directory <-"D:\\textminingORG\\nlp\\vocabulary\\"
    dataText<-read.csv(paste(directory,"Vocabulary_Categorizationv3.txt",sep=""),header= TRUE)
    dataMatrix<-create_matrix(dataText["text"])
    container<-create_container(dataMatrix,dataText$isred,trainSize=1:9,virgin=FALSE)

    model <- train_model(container,"SVM",kernel="linear",cost=1,gamma=0.5)

    predictionData <- list("dangerous be","donated many","generous mostly","TV");

    # create a prediction document term matrix
    trace("create_matrix",edit=T)
    predMatrix <- create_matrix(predictionData, originalMatrix=dataMatrix)

    # create the corresponding container
    predSize = length(predictionData)
    predictionContainer <- create_container(predMatrix, labels=rep(0,predSize), testSize=1:predSize, virgin=FALSE)

    # predict
    results <- classify_model(predictionContainer, model)
    results
    plot(results)

    ---------------------------------
    Test data (Vocabulary_Categorizationv3.txt)

    Text,income,isred
    action,1000,0
    animal_trade,2500,1
    arms,500,1
    arrest,200,1
    attack,10,1
    bank_fraud,2000,1
    betting,50,1
    black_money,300,1
    blackmail,200,1
    bogus,90,1

    -------------------------------------------

    predictionData <- list("dangerous be","donated many","generous mostly","TV");
    gives error as nothing matches with training data

    predictionData <- list("attack", "dangerous be","donated many","generous mostly","TV");
    works

  17. satish

    this error comes for "unseen data". Also I guess we need to create termdocument matrix instead of normal matrix -capture word frequency.

  18. Petya

    Hi, Alex! I've been trying to use the text classification in RTextTools but I am not able to create a matrix. I came across your tutorial which seems very well-explained but even with your data I run into the same problem. The data is read into R because I get the same output when I read the file and type in data. When I run the next command to create dtMatrix, nothing happens. When I aftwerwards type in dtMatrix, the Console returns NULL. I don't understand...

    1. Alexandre KOWALCZYK Post author

      Hello Petya. I am afraid I don't have enough explanation to help you with this one. Try asking on stackoverflow . There you will be able to add a code example and you will be likely to receive a quick answer. Best regards.

  19. Petya

    It turned out that the problem occured because I did not have the latest version of R. Once I updated, the problem was gone. Thanks anyway!

  20. ayita

    Hi,Alex.I get this error when running
    predictionContainer<- create_container(predMatrix,labels=rep(0,predSize),testSize=1:predSize, virgin=FALSE);
    Error in validObject(.Object) :
    invalid class “matrix.csr” object: ra has too few, or too many elements.
    what could be reason?

    1. Alexandre KOWALCZYK Post author

      Hello ayita. This is because when you make predictions you should generate a matrix built from the words of the document-term matrix used at training time. If you try to make predictions with words not in this dictionary, it won't work. Keep in mind that when making predictions you need to give the model a vector, and that this vector should have the same shape as the one used at training time.
      For instance, if I trained with 5 words, I have a vector [0,1,0,1,0] when two of those words are in the sentence. If I get a new sentence with no word from the original dictionary, the vector should be [0,0,0,0,0].
      I hope it helps you.

  21. Iris G

    Hello all,

    Does anyone know how to get the decision values in RTextTools results after applying the classify_model function? I thought I could add the decision.values parameter in the classify_model function to get these values, but nothing.

  22. Iris G

    Never mind. I figured out I can use the predict function from the e1071 package. Use the model and the prediction matrix obtained from RTextTools and use them in the predict function

  23. ayita

    Hello Alex
    when i try to run the above code you give
    but in
    inspect(predMatrix)
    <<DocumentTermMatrix (documents: 3, terms: 2)>>
    Non-/sparse entries: 0/6
    Sparsity : 100%
    Maximal term length: 5
    Weighting : term frequency (tf)

    Terms
    Docs raini sunni
    1 0 0
    2 0 0
    3 0 0
    why it be both 0? it's correct?

  25. Alexandre KOWALCZYK Post author

    If your 3 examples do not contain the word sunny or rainy, it is normal that both are 0.

  26. Barry McDonnell

    Alexandre,

    Great post. Got it working after some initial errors and then some Googling. I have a question though. I would like to add a third variable, say 'cloudy'. How would that be added in?

    Subsequently, I want to run this on a whole text document with 100s of words. What changes would need to be done in order to accommodate a large text file?

    Barry

    1. Alexandre KOWALCZYK Post author

      It should work with more words in the vocabulary. If you want to add a new word "cloudy", it will just add a dimension to the vector.
      However, if you want to predict another class "IsCloudy", it becomes a multi-class classification problem. For a huge text file you can use Liblinear, which dramatically speeds up the computation. You can also use methods to reduce the size of the vocabulary (remove common words, use stemming, etc.).

  27. Ali

    Hi Alexandre,

    How can I introduce gamma values to cross_validate? As I tried as given below, I could manage to try different values of cost parameter. But from the documentation I could not find gamma parameter.

    Thanks in advance.

    vector = 2^c(-8,-4,-2,0,2,4,8)
    best_c = 0
    best_acc = 0
    for (var in vector) {
    for (var2 in vector2) {

    svm best_acc){ best_acc <- svm$meanAccuracy; best_c <- var}
    }
    }

    best_c
    best_acc

  28. samra

    when i pass this command
    predictionContainer <- create_container(predMatrix, labels=rep(0,predSize), testSize=1:predSize, virgin=FALSE)
    i get the error
    too few or too many objects

  29. Pingback: RTextTools – Résoudre l’erreur de la fonction create_matrix() – Blog scientifique de Jean-Charles RISCH

  30. Koen

    Hi Alexandre,

    Thank you for this great tutorial! I've used what I learned from this tutorial to train an SVM to classify over 15.000 job titles from our CRM system to several categories. The insights we've gained from the results have proven very valuable!

    Many thanks!

  31. Pranav Lal

    Hi Alexandre,

    Thanks for writing. Its a great article. I have a post return customer comment data for the damaged products they received from an online retailer. The comment could be something like "Wen received the phone there was scratches on the metallic panel of the phone ..plus wen ivremove the back panel it was sticky.". On receiving the item, the warehouse team does a physical examination of the item and mark the defect in the system in the form of a drop-down comment phrase (And mark no issues, in case they dont find an issue). I have to compare the two texts to determine the % of returned items which actually were defective. Can it be done using the methodology mentioned by you?

    Any help would be highly appreciated.

    Thanks

    1. Alexandre KOWALCZYK Post author

      If you train a classifier to classify comments which report issues, I guess you can build a system classifying the first and the second comment and if they disagree perform a manual verification.

  32. Raed

    Thanks Alexandre
    a great work from 2014 until know everyone are using your tutorial thanks man
    I try the code and get this error
    > predMatrix <- create_matrix(predictionData, originalMatrix=dtMatrix)
    Error in if (attr(weighting, "Acronym") == "tf-idf") weight <- 1e-09 :
    argument is of length zero

  33. datashots

    Hi Alex!

    I'm getting the following error when I try to train the model with my own dataset (multi-class):

    Error in `[.matrix.coo`(x, rw, cl) : Subscripts out of bound

    Do you know what would be the problem?
    Thanks!

  34. Hany

    Can you share sample for Python if you have it?, i found SKLearn library, but we have limitation that it didn't accept text labels so i have to normalize it before building model.

  35. subhash

    Hii,
    I am new to R and Machine Learning, I am getting this error

    Error in predict.svm(model, container@classification_matrix, prob = TRUE, :
    test data does not match model !

  36. Pingback: R: Text classification using SMOTE and SVM – evolvingprogrammer

  37. Robert

    hi, great tutorial i would like to use other classify-text like trees, random forest or neuronal networks in R. But I cant change the matrix to data-frame because is too big ( a lot of features). So i would like to use document-term-matrix or dfmSparce like your tutorial.

    Thanks, Robert.

    In Spanish, similar meaning (translated):
    Hello, very good tutorial! Given the way you used Rtools, I would like to know whether another classification algorithm can be used in the same way. Because I have so many features, I tried to convert the matrix to a data frame, but I can't because it is too big.

    I get the following error when I try to convert from dfmSparse to a data frame:
    Error in asMethod(object) :
    Cholmod error 'out of memory' at file ../Core/cholmod_memory.c, line 147

    In your example you pass the matrix directly to the machine learning algorithm. I would like to know how to apply this to other algorithms; I would be very happy if you could show me an example.

