SVM Tutorial: How to classify text in R

In this tutorial I will show you how to classify text with SVM in R.

The main steps to classify text in R are:

  1. Create a new RStudio project
  2. Install the required packages
  3. Read the data
  4. Prepare the data
  5. Create and train the SVM model
  6. Predict with new data

Step 1: Create a new RStudio Project

To begin with, you will need to download and install the RStudio development environment.

Once you have installed it, you can create a new project by clicking "Project: (None)" at the top right of the screen:

Create a new project in RStudio

This will open the following wizard, which is pretty straightforward:

  1. Select "New Directory".
  2. Choose to create an empty project.
  3. Name your project, and you are done.

Now that the project is created, we will add a new R script.

You can save this script under any name you wish, for instance "Main".

Saving our first script

Step 2: Install the required packages

To easily classify text with SVM, we will use the RTextTools package.

In RStudio, on the right side, you can see a tab named "Packages". Select it and then click "Install R packages".

RStudio lists all installed packages

This will open a popup in which you enter the name of the package, RTextTools.

Be sure to check "Install dependencies"

Once it is installed, it will appear on the package list. Check it to load it in the environment.

The RTextTools package now appears on the list
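
If you prefer the console to the RStudio dialogs, the same installation can be sketched in a few lines. The helper name ensure_package is made up for this example; requireNamespace and install.packages are standard R functions. The sketch assumes RTextTools is available from the repository your R installation is configured to use.

```r
# Hypothetical helper (not part of RTextTools): install a package only if it is missing
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # dependencies = TRUE mirrors the "Install dependencies" checkbox
    install.packages(pkg, dependencies = TRUE)
  }
  # TRUE when the package can be loaded afterwards
  requireNamespace(pkg, quietly = TRUE)
}

# Harmless demonstration with a package that ships with R
ensure_package("stats")
```

Calling ensure_package("RTextTools") followed by library(RTextTools) should then have the same effect as installing the package and checking it in the list.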

Step 3: Read the data

For this tutorial we will use a very simple data set, stored in the file sunnyData.csv.
With just a few lines of R, we load the data into memory:

# Load the data from the csv file
dataDirectory <- "D:/" # put your own folder here
data <- read.csv(paste(dataDirectory, 'sunnyData.csv', sep=""), header = TRUE)
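
If you do not have the file at hand, the loading step can still be tried out. The sketch below writes a small stand-in CSV with the same two columns (Text and IsSunny) to a temporary folder and reads it back. The rows are invented for illustration and are not the tutorial's data; file.path is used instead of paste so that the folder name does not need a trailing slash.

```r
# Stand-in rows, invented for illustration (NOT the tutorial's actual data set)
standIn <- data.frame(
  Text    = c("sunny", "rainy", "sunny sunny", "rainy rainy"),
  IsSunny = c(1, -1, 1, -1)
)

# Write the stand-in file to a temporary folder
dataDirectory <- tempdir()
csvPath <- file.path(dataDirectory, "sunnyData.csv")
write.csv(standIn, csvPath, row.names = FALSE)

# Read it back the same way the tutorial does
data <- read.csv(csvPath, header = TRUE, stringsAsFactors = FALSE)

# Quick sanity checks before going further
str(data)
table(data$IsSunny)
```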

Step 4: Prepare the data

The data has two columns: Text and IsSunny.

Our simple data set

We will need to convert it to a Document Term Matrix.

To understand what a document term matrix is, or to learn more about the data set, you can read "How to prepare your data for text classification?".
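
To make the idea concrete, here is a minimal base-R sketch of a document term matrix built by hand: one row per document, one column per term, and each cell counting how often the term occurs in the document. The three sentences are invented for illustration, and this is not how create_matrix is implemented (it applies additional preprocessing).

```r
# Three invented documents
docs <- c("sunny sunny rainy", "rainy", "sunny")

# Split each document into words
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary is the sorted set of distinct words
vocabulary <- sort(unique(unlist(tokens)))

# Count each vocabulary term in each document (one row per document)
dtm <- t(vapply(tokens, function(words) {
  as.integer(table(factor(words, levels = vocabulary)))
}, integer(length(vocabulary))))
rownames(dtm) <- docs
colnames(dtm) <- vocabulary
dtm
```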

Using RTextTools

The RTextTools package provides a powerful way to generate a document term matrix with the create_matrix function:

# Create the document term matrix
dtMatrix <- create_matrix(data["Text"])

Typing the name of the matrix in the console prints some summary statistics about it.

For instance, the sparsity (the proportion of zero entries) can help us decide whether we should use a linear kernel.
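
The sparsity figure can be reproduced by hand as the number of zero cells divided by the total number of cells. A minimal sketch on a small invented count matrix (the numbers and term names below are made up, not taken from the tutorial's data):

```r
# Invented 3 x 4 count matrix: rows are documents, columns are terms
counts <- matrix(
  c(2, 0, 0, 1,
    0, 1, 0, 0,
    0, 0, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("cloud", "rainy", "sunny", "wind"))
)

# Sparsity = number of zero cells / total number of cells
sparsity <- sum(counts == 0) / length(counts)
sparsity
```

A text matrix with many documents and a large vocabulary is typically made up mostly of zeros, which is one reason a linear kernel is a common first choice for text.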

Step 5: Create and train the SVM model

In order to train an SVM model with RTextTools, we need to put the document term matrix inside a container. In the container's configuration, we indicate that the whole data set will be the training set (trainSize=1:11 covers all 11 rows).

# Configure the training data
container <- create_container(dtMatrix, data$IsSunny, trainSize=1:11, virgin=FALSE)

# Train an SVM model
model <- train_model(container, "SVM", kernel="linear", cost=1)

The code above trains a new SVM model with a linear kernel.

Note:

  • both create_container and train_model are RTextTools functions.
    Under the hood, RTextTools uses the e1071 package, which is an R wrapper around libsvm;
  • the virgin=FALSE argument tells RTextTools not to save an analytics_virgin-class object inside the container. This parameter is not of interest to us for now, but the function requires it.
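
Since RTextTools delegates to e1071, the same kind of model can be sketched by calling the svm function of e1071 directly. This assumes the e1071 package is installed. The tiny count matrix is invented for illustration, and the call is not claimed to match exactly what train_model does internally.

```r
library(e1071)

# Invented word counts: one row per sentence, columns are the two terms
x <- matrix(
  c(0, 1,
    1, 0,
    0, 2,
    2, 0,
    1, 3,
    3, 1),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("rainy", "sunny"))
)

# Labels: 1 for sunny sentences, -1 for rainy ones
y <- factor(c(1, -1, 1, -1, 1, -1))

# Linear kernel and cost = 1, as in the train_model call above
fit <- svm(x, y, kernel = "linear", cost = 1)

# Predict two unseen count vectors
newX <- matrix(
  c(2, 3,
    3, 1),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("rainy", "sunny"))
)
predictions <- predict(fit, newX)
predictions
```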

Step 6: Predict with new data

Now that our model is trained, we can use it to make new predictions!

We will create new sentences which were not in the training data:

# new data
predictionData <- list("sunny sunny sunny rainy rainy", "rainy sunny rainy rainy", "hello", "", "this is another rainy world")

Before continuing, let's check the new sentences:

  • "sunny sunny sunny rainy rainy"
    This sentence mentions sunny more often than rainy. We expect it to be classified as sunny (+1).
  • "rainy sunny rainy rainy"
    This sentence mentions rainy more often than sunny. We expect it to be classified as rainy (-1).
  • ""
    This sentence contains no words. It should be classified as either +1 or -1, depending on where the decision boundary lies.
  • "hello"
    This sentence contains a single word, which was not present in the training set. It is therefore equivalent to "".
  • "this is another rainy world"
    This sentence contains several words that are not in the training set, plus the word rainy. It is equivalent to the sentence "rainy" and should be classified as rainy (-1).
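
These expectations can be checked by counting the two training words in each sentence. The sketch below uses only base R and assumes, as the tutorial does, that the training vocabulary consists of just rainy and sunny. The helper name countKnown is made up for this example.

```r
sentences <- c("sunny sunny sunny rainy rainy", "rainy sunny rainy rainy",
               "hello", "", "this is another rainy world")

# Assumed training vocabulary (the two words the model knows)
vocabulary <- c("rainy", "sunny")

# Hypothetical helper: count each known word in a sentence, ignoring unknown words
countKnown <- function(sentence) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  vapply(vocabulary, function(term) sum(words == term), integer(1))
}

# One row per sentence, one column per known word
counts <- t(vapply(sentences, countKnown, integer(length(vocabulary))))
counts
```

The rows for "hello" and "" both contain only zeros, which is why the two sentences are expected to receive the same prediction.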

We create a document term matrix for the test data:

# create a prediction document term matrix
predMatrix <- create_matrix(predictionData, originalMatrix=dtMatrix)

Notice that this time we provided the originalMatrix as a parameter. This is because we want the new matrix to use the same vocabulary as the training matrix.
Without this indication, the function would create a document term matrix using all the words of the test data (rainy, sunny, hello, this, is, another, world). Each sentence would then be represented by a vector containing 7 values (one for each word)!

Such a matrix would not be compatible with the model we trained earlier, because it expects vectors containing 2 values (one for rainy, one for sunny).
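
The mismatch is easy to see by comparing the two vocabularies. The sketch below uses plain whitespace tokenization in base R, which only approximates what create_matrix does (its preprocessing options are ignored here), and assumes the training vocabulary is rainy and sunny.

```r
# Assumed training vocabulary
trainVocabulary <- c("rainy", "sunny")

newSentences <- c("sunny sunny sunny rainy rainy", "rainy sunny rainy rainy",
                  "hello", "", "this is another rainy world")

# Vocabulary the new sentences would produce on their own
newVocabulary <- sort(unique(unlist(strsplit(newSentences, " ", fixed = TRUE))))
newVocabulary
length(newVocabulary)

# Words the trained model has never seen
unseenWords <- setdiff(newVocabulary, trainVocabulary)
unseenWords

# Restricting to the training vocabulary leaves the two expected columns
sharedWords <- intersect(trainVocabulary, newVocabulary)
sharedWords
```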

We now create the container:

# create the corresponding container
predSize <- length(predictionData)
predictionContainer <- create_container(predMatrix, labels=rep(0,predSize), testSize=1:predSize, virgin=FALSE)

Two things are different:

  • we use a zero vector for the labels, because they are what we want to predict
  • we specified testSize instead of trainSize, so that the data will be used for testing

Finally, we can make predictions:

# predict
results <- classify_model(predictionContainer, model)
results
Predictions and their probabilities

As expected, the first sentence has been classified as sunny, and the second and last ones as rainy.
We can also see that the third and fourth sentences ("hello" and "") have been classified as rainy, but with a probability of only 52%, which means our model is not very confident about these two predictions.
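
Low-confidence predictions like these can be flagged automatically. The sketch below works on a hypothetical results table: the column names SVM_LABEL and SVM_PROB are assumed to match what classify_model returns, only the 0.52 values come from the tutorial, and the other probabilities and the 0.6 cut-off are invented for illustration.

```r
# Hypothetical results table (only the 0.52 values come from the tutorial)
results <- data.frame(
  SVM_LABEL = factor(c(1, -1, -1, -1, -1)),
  SVM_PROB  = c(0.90, 0.85, 0.52, 0.52, 0.80)
)

# Hypothetical cut-off below which a prediction is treated as uncertain
threshold <- 0.6

# Mark each prediction and list the uncertain ones
results$Confident <- results$SVM_PROB >= threshold
results[!results$Confident, ]
```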

Summary of this SVM Tutorial

Congratulations! You have trained an SVM model and used it to make predictions on unseen data. We have only seen a small part of what is possible with RTextTools.

99 thoughts on “SVM Tutorial: How to classify text in R”

  1. Hi I'm trying to use my SVM model to predict as the last step. But an error occurs...can't figure out what went wrong. Many thanks for any hints!

    # CREATE THE DOCUMENT-TERM MATRIX
    doc_matrix <- create_matrix(sl_texts$campaign_text, language="english", removeNumbers=TRUE,
    stemWords=TRUE, removeSparseTerms=.998)
    # Creating a container
    container <- create_container(doc_matrix, sl_texts$music_subcategory, trainSize=1:200,
    testSize=201:270, virgin=FALSE)

    SVM <- train_model(container,"SVM")

    #Test predict on 400
    predit_400 pred_400_matrix size = length(pred_400_matrix)
    > prediction400Container results <- classify_model(prediction400Container, SVM)
    Error in classify_model(prediction400Container, SVM) :
    object 'results_table' not found

  2. Hello Alexandre,

    This has been a great tutorial and has helped me a lot in getting my work done. However, there is an issue with my code. I have created multiple models to use as a multi-class classifier. Each model gives the correct label but shows a relatively high probability for all other inputs, even when they don't contain any of the terms. I tried to recreate the problem with sample code, but in the code below even the labels are not predicted correctly. Please shed some light on how SVM gives probabilities.

  3. Thanks Alexandre for the tutorial and discussion. I got the create_matrix error as well and corrected as per the discussion thread, however it seems I am the only one getting an error in the result line.

    > results_svm <- classify_model(predictionContainer, model_svm)
    Error in predict.svm(model, container@classification_matrix, prob = TRUE, :
    test data does not match model !

    Appreciate your advice 🙂

  4. Hello Alexandre

    Thanks for the tutorial. I would like to understand, how SVM plots the hyperplane if I have two independent categorical variables and a binary dependent variable?
