How to classify text using SVM in C#

SVM Tutorial: Classify text in C#

In this tutorial I will show you how to classify text with SVM in C#.

The main steps to classify text in C# are:

  1. Create a new project
  2. Install the SVM package with NuGet
  3. Prepare the data
  4. Read the data
  5. Generate a problem
  6. Train the model
  7. Predict

Step 1: Create the Project

Create a new Console application.

Step 2: Install the SVM package with NuGet

In the Solution Explorer, right-click on "References" and click on "Manage NuGet Packages..."

Select "Online" and in the search box type "SVM".

You should now see the package. Click on Install, and that's it!

There are several libsvm implementations in C#. We will use this one because it is the most up to date and is easy to download via NuGet.

Step 3: Prepare the data

Every time you want to classify text, you need to prepare your data. As this is a language-agnostic process, I created a separate page for it: How to prepare your data for text classification? Check it out before reading the rest of this SVM tutorial!

Step 4: Read the data

The document-term matrix is saved as a CSV file.
It can easily be read in C#.
To do this we will use another NuGet package called CsvReader.

            // Requires: using System.Linq; using System.Collections.Generic;
            // Each CSV row holds one sentence ("Text") and its label ("IsSunny").
            const string dataFilePath = @"D:\sunnyData.csv";
            var dataTable = DataTable.New.ReadCsv(dataFilePath);
            List<string> x = dataTable.Rows.Select(row => row["Text"]).ToList();
            double[] y = dataTable.Rows.Select(row => double.Parse(row["IsSunny"])).ToArray();

We have loaded all the sentences into the x variable and all the classes (-1 or +1) into the y variable.

The following code generates the vocabulary:

var vocabulary = x.SelectMany(GetWords).Distinct().OrderBy(word => word).ToList();
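
The GetWords helper is not shown in this tutorial. Here is a minimal sketch of what it could look like, assuming it simply lowercases a sentence and splits it on whitespace and common punctuation. The class name TextHelpers and the separator list are assumptions made for illustration, not the original implementation.

```csharp
using System;
using System.Collections.Generic;

public static class TextHelpers
{
    // Hypothetical separators; adjust them to match how the data was prepared.
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

    // Splits a sentence into lowercase words, dropping empty entries.
    public static IEnumerable<string> GetWords(string text)
    {
        return text
            .ToLowerInvariant()
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With this helper in place, the vocabulary line would read x.SelectMany(TextHelpers.GetWords) instead of x.SelectMany(GetWords), unless the method lives in the same class as the calling code.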

Step 5: Generate a problem

Using the data, we are now able to generate an svm_problem.

This is an in-memory representation of the document-term matrix.

            var problemBuilder = new TextClassificationProblemBuilder();
            var problem = problemBuilder.CreateProblem(x, y, vocabulary.ToList());
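
To make the idea of an in-memory document-term matrix more concrete, here is a minimal sketch of how one sentence could be turned into a sparse row of (index, count) pairs. It is an illustration only and not the source of TextClassificationProblemBuilder: the class name SparseRowSketch, the whitespace tokenization, the lowercase vocabulary, and the use of raw word counts as values are all assumptions. The 1-based, ascending indices follow the usual libsvm sparse format.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SparseRowSketch
{
    // Returns (index, count) pairs for the vocabulary words found in the sentence.
    // Indices are 1-based and ascending; words outside the vocabulary are ignored.
    public static List<KeyValuePair<int, double>> ToSparseRow(
        string sentence, IList<string> vocabulary)
    {
        // Count how many times each word occurs in the sentence.
        var counts = sentence
            .ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(word => word)
            .ToDictionary(group => group.Key, group => (double)group.Count());

        // Keep only the non-zero entries, in vocabulary order.
        var row = new List<KeyValuePair<int, double>>();
        for (int i = 0; i < vocabulary.Count; i++)
        {
            double count;
            if (counts.TryGetValue(vocabulary[i], out count))
            {
                row.Add(new KeyValuePair<int, double>(i + 1, count));
            }
        }
        return row;
    }
}
```

Storing only the non-zero entries matters for text, because each sentence uses a tiny fraction of the vocabulary.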

Step 6: Create and train an SVM model

const int C = 1;
var model = new C_SVC(problem, KernelHelper.LinearKernel(), C);

When the C_SVC constructor is called, it immediately calls the Train() method.
We use a linear kernel because linear kernels work particularly well with textual data.
The C value is fixed for now, but it should be optimized for better results.
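
One common way to optimize C is to try values on a logarithmic grid and keep the one that gives the best score, for example the cross-validation accuracy. The sketch below shows only the search loop. The evaluate delegate is a hypothetical placeholder for whatever code trains a model with a given C and returns its score, and the default grid (2^-5 to 2^15 in steps of 2^2) is a range often suggested for libsvm, not something specific to this tutorial.

```csharp
using System;

public static class ParameterSearchSketch
{
    // Tries C = 2^minExponent, 2^(minExponent + step), ... up to 2^maxExponent
    // and returns the value for which evaluate(C) is highest.
    public static double FindBestC(
        Func<double, double> evaluate,
        int minExponent = -5, int maxExponent = 15, int step = 2)
    {
        double bestC = Math.Pow(2, minExponent);
        double bestScore = double.NegativeInfinity;

        for (int exponent = minExponent; exponent <= maxExponent; exponent += step)
        {
            double c = Math.Pow(2, exponent);
            double score = evaluate(c); // e.g. cross-validation accuracy for this C
            if (score > bestScore)
            {
                bestScore = score;
                bestC = c;
            }
        }
        return bestC;
    }
}
```

The returned value would then replace the constant C passed to the C_SVC constructor.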

Step 7: Predict

Once the model is trained, it can be used to make predictions. The main method for that is the Predict method which takes an array of svm_node as a parameter.

            string userInput;
            _predictionDictionary = new Dictionary<int, string> { { -1, "Rainy" }, { 1, "Sunny" } };
            do
            {
                // Read a sentence and convert it to the same representation used for training.
                userInput = Console.ReadLine();
                var newX = TextClassificationProblemBuilder.CreateNode(userInput, vocabulary);

                var predictedY = model.Predict(newX);
                Console.WriteLine("The prediction is {0}", _predictionDictionary[(int)predictedY]);
                Console.WriteLine(new string('=', 50));
            } while (userInput != "quit");

Summary of this SVM Tutorial

Congratulations! You have trained an SVM model and used it to make predictions on unknown data.

51 thoughts on “How to classify text using SVM in C#”

  1. Hi Alexandre,
    I want to classify tweets as positive or negative based on emoticons as sentiment labels.
    Happy face :), :-), :D, :-)), etc. can be mapped to positive sentiment, while sad face :(, :-(, etc. can be mapped to negative sentiment.
    How can I build a classifier model using libsvm?
    Thank you!

    • Hi. You would need to label each tweet as being positive or not, then train your SVM to classify. But what you are trying to do seems too simple for that. There is only a limited set of emoticons, so you could just write a function checking whether the "positive" emoticons are in the tweet, and you would not need machine learning at all.
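
      A minimal sketch of such a function is shown below. The emoticon lists are only illustrative and come from the examples in the question; the class and method names are hypothetical.

```csharp
using System;
using System.Linq;

public static class EmoticonSentimentSketch
{
    // Illustrative lists only; extend them to cover the emoticons in your data.
    private static readonly string[] Positive = { ":)", ":-)", ":D", ":-))" };
    private static readonly string[] Negative = { ":(", ":-(" };

    // Returns +1 for positive, -1 for negative, and 0 when the tweet contains
    // no known emoticon or contains both kinds.
    public static int Classify(string tweet)
    {
        bool hasPositive = Positive.Any(e => tweet.Contains(e));
        bool hasNegative = Negative.Any(e => tweet.Contains(e));

        if (hasPositive == hasNegative) return 0;
        return hasPositive ? 1 : -1;
    }
}
```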

  2. Hi Alexandre,
    I've finished reading your tutorials on SVM. Great job, clear and punctual (although I had some doubts I raised in specific tutorials).

    I would like to ask you if you have any suggestions about my task.
    The task is to build a model that predicts the labels of a deterministic finite automaton (accepting or rejecting, 0 or 1, namely whether a string ends in an accepting or rejecting state).
    (A sample is of the form 01010001, for example, or 1100 if the alphabet is binary. But the alphabet can change and have 4 symbols, for example [0 1 2 3]. In the latter case a sample string can be 32201100 or 3333333332221110 ......)

    1) I must use SVM in C++. In your experience, what is the best library to use (libsvm, which is in C++, or Torch, or here in C, or another)?

    2) The samples in the training set are not all the same length. How can you handle this situation?

    3) Do you think the data should be encoded in some way?

    4) In general, the data are not linearly separable. What type of kernel is appropriate?

    Any response, even a partial one, will be greatly appreciated, since I'm a beginner in SVM.

    Compliments again, and thanks for sharing your knowledge.

    • Hello Nick,

      With SVM you will be able to classify the data. So if the task is "Given a string, predict if it is in an accepting or rejecting state", it might be possible to use it.
      1) I recommend libsvm, but other libraries might suit your needs better. You have to read the documentation of each library to see if it provides functionality that is useful for you.
      2) You cannot use samples which are not of the same length. One possibility would be to automatically pad all samples so that they have the same length as the longest one. You have to figure out what the best approach is depending on your domain. You could, for instance, start all the strings with a special character. The problem is the same as "how to deal with missing data". In some cases people take an average, in others the most frequent value; it really depends on the problem.
      3) I do not know enough about the domain to answer this question.
      4) When data is not linearly separable, a Gaussian kernel is often recommended.
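
      As an illustration of the idea in point 2 of giving all samples the same length, here is a minimal sketch that extends every string to the length of the longest one. The padding symbol '#' and the choice to pad on the right are assumptions; pick whatever makes sense for the domain.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PaddingSketch
{
    // Pads each sample on the right so that all samples have the same length
    // as the longest one. padSymbol should not be part of the alphabet.
    public static List<string> PadToSameLength(
        IEnumerable<string> samples, char padSymbol = '#')
    {
        var list = samples.ToList();
        int maxLength = list.Count == 0 ? 0 : list.Max(s => s.Length);
        return list.Select(s => s.PadRight(maxLength, padSymbol)).ToList();
    }
}
```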

  3. Very nice guide, but can you tell me what 'GetWords' is in this line?
    var vocabulary = x.SelectMany(GetWords).Distinct().OrderBy(word => word).ToList();

