How to prepare your data for text classification?

Understand how data is represented

For text classification, you often begin with some text you want to classify. For instance, you have quotes and want to find the ones about love. Or you have emails and you want to separate spam from legitimate emails.

SVM is a supervised learning algorithm. This means you will need to manually label some data with what you think is the correct class. Then you train an SVM model with it. Finally, you can use the model to predict the class of unlabeled data.

In order to train an SVM model for text classification, you will need to prepare your data:

  1. Label the data
  2. Generate a vocabulary
  3. Create a document-term matrix.

If you don't have any data yet, you can download one of the available datasets provided for free by the community.

However, you might be surprised by what you will find inside the files:

[Figure: a LIBSVM dataset. Caption: Datasets are not very easy to read]

We want to classify text, but there are only numbers in this file!

A (very) simple dataset for text classification

To understand better how data is represented, I will give you a simple example.
We will try to classify some text about the weather using a support vector machine.

Our goal is to predict whether the text is about sunny or rainy weather.
To simplify further, each sentence is written using only two words: sunny and rainy.

[Figure: Text classification: unlabeled data]

For each line, we write

  • +1 if we think it should be classified as "sunny"
  • -1 if we think it should be classified as "rainy".
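
The labeling step can be sketched in a few lines of Python. The sentences below are hypothetical (the original examples were shown as an image), and the names `sentences`, `labels` and `labeled` are chosen only for illustration.

```python
# Hypothetical sentences built only from the words "sunny" and "rainy".
sentences = [
    "sunny sunny sunny",
    "rainy rainy rainy",
    "sunny sunny rainy",
    "rainy rainy sunny",
]

# Labels chosen by hand: +1 for "sunny", -1 for "rainy".
labels = [+1, -1, +1, -1]

# Pair each label with its sentence to form the initial dataset.
labeled = list(zip(labels, sentences))

for label, text in labeled:
    print(f"{label:+d} {text}")
```

The labels are still assigned by a person. The code only stores that decision next to each sentence.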

[Figure: Text classification: initial dataset]

Then we will look at each word in our dataset, and generate a vocabulary.

We simply write each distinct word once, in alphabetical order.
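
As a minimal sketch, assuming sentences are split on whitespace, the vocabulary could be built like this. The function name `build_vocabulary` is hypothetical, and indices start at 1 so that "sunny" gets index 2.

```python
def build_vocabulary(sentences):
    """Map each distinct word to a 1-based index, in alphabetical order."""
    # Collect every distinct word across all sentences, then sort them.
    words = sorted({word for sentence in sentences for word in sentence.split()})
    # Number the words starting from 1.
    return {word: index for index, word in enumerate(words, start=1)}


vocabulary = build_vocabulary(["sunny sunny rainy", "rainy rainy sunny"])
print(vocabulary)  # {'rainy': 1, 'sunny': 2}
```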

[Figure: Text classification: vocabulary]

Finally, for each sentence, we write one row of the document-term matrix.
We begin the line with the class of the sentence (+1 or -1).
Then we write the index of the word in our vocabulary. For the word sunny it is 2.
We add a colon, and then the number of times the word appears in the sentence.
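
Here is a minimal sketch of how one such line could be produced. The function name `to_libsvm_line` and the example sentence are hypothetical, and the vocabulary is assumed to be a plain dictionary from word to 1-based index. Indices are written in ascending order, which is what the LIBSVM file format expects.

```python
from collections import Counter


def to_libsvm_line(label, sentence, vocabulary):
    """Format one labeled sentence as '<label> <index>:<count> ...'."""
    # Count how many times each word appears in the sentence.
    counts = Counter(sentence.split())
    # Keep only known words and sort the pairs by vocabulary index.
    features = sorted(
        (vocabulary[word], count)
        for word, count in counts.items()
        if word in vocabulary
    )
    pairs = " ".join(f"{index}:{count}" for index, count in features)
    # "+d" writes the sign explicitly, giving +1 or -1.
    return f"{label:+d} {pairs}".rstrip()


vocabulary = {"rainy": 1, "sunny": 2}
print(to_libsvm_line(+1, "sunny rainy sunny", vocabulary))  # +1 1:1 2:2
```

Words that are missing from the vocabulary are silently skipped in this sketch. Whether that is the right behaviour depends on your application.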

[Figure: Text classification: document-term matrix]

As you can see, we now have a dataset which looks like the example dataset found earlier.
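
Because each index corresponds to exactly one word of the vocabulary, a line in this format can also be read back into word counts. The sketch below assumes integer counts and a vocabulary stored as a dictionary from word to index. The function name `decode_libsvm_line` is hypothetical.

```python
def decode_libsvm_line(line, vocabulary):
    """Turn '<label> <index>:<count> ...' back into (label, {word: count})."""
    # Invert the vocabulary so we can look words up by index.
    index_to_word = {index: word for word, index in vocabulary.items()}
    label, *pairs = line.split()
    counts = {}
    for pair in pairs:
        index, value = pair.split(":")
        counts[index_to_word[int(index)]] = int(value)
    return int(label), counts


vocabulary = {"rainy": 1, "sunny": 2}
print(decode_libsvm_line("+1 1:1 2:2", vocabulary))  # (1, {'rainy': 1, 'sunny': 2})
```

Note that the word order of the original sentence is not stored in this representation, so only the counts can be recovered.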



18 thoughts on “How to prepare your data for text classification?”

  1. uthsavi

    Where should the `vocabulary.txt` file that was created be loaded? And after the classification prediction, can we convert the labeled numbers back into text?

    1. Alexandre KOWALCZYK Post author

      There are two tutorials explaining how to use this file: one for C#, the other for R. As each index has a direct link with a word, you can easily convert the indices back in order to see which vectors were created from the provided text.

  2. Sandeep

    I want to classify squares and rectangles. Can you give me an idea of how to build a dataset?

    1. Alexandre KOWALCZYK Post author

      This is a pretty broad question. If this is for image classification, you should start collecting images and labelling them manually. Or if you want to classify using the length and width of the figure, then you need a dataset of these elements. It is up to you to create your own features.

      1. Rish

        Hi, I want to classify my images using SVM (and C#). I have labelled the training images using EmguCv. But how do I do recognition and classification using SVM? I am a complete beginner in Emgu and SVM. Thanks in advance.

        1. Alexandre KOWALCZYK Post author

          Hello. You can start by looking for tutorials about image classification with SVM on Google, and for tutorials on how to use EmguCv. Regards.

  3. patricia

    Thanks for all the information you shared Alexandre.
    What if I want to classify records into more than two categories (+1, -1). For example, I have a file with questions that could be any of 5 types: Awareness, Intent, Favorability, Preference and Attitudes. Each has of course a template with certain words. e.g. Awareness: Which of the following brands have you heard of... are you familiar with...


    Hi sir, I am doing my final-year bachelor's project, which is based on text classification using SVM. I want to know whether SVM can be implemented without forming the matrix. My second question is whether storing our dataset in CSV file format is necessary. Sir, kindly reply when you read this.


    1. Alexandre KOWALCZYK Post author

      Hello. Thank you for your comment. You can't really use SVMs without using vectors and matrices. Storing the dataset can be done in any format as long as you are able to read it and transform it before feeding it to the model.

  5. Irfan Khalid

    Sir, what about tf-idf? Can we use this technique to build the term-document matrix?

  6. Fawaz

    Thank you for the information
    I plan to use SVM to classify a Java Script
    Please can you help me to find my way for that
    - how to make dataset
    - how to train SVM
    Thank you

  7. do683

    Let's say I have a series of text files that I want to classify. Is there an easy way to label a text file as one of two different categories without turning it into a .csv and doing the labeling in Excel??

  8. Ron Rudd

    Hello Alex,

    I want to be able to create a document-term matrix from my dataset, but want to be able to identify these words as factors (weighted). Can you point me to a simple method to accomplish this? Many thanks in advance.

    1. Alexandre KOWALCZYK Post author

      Sorry I do not understand what you are trying to do. You wish to find which words are more important?

  9. Wahab Khan

    Hello Alexandre,
    First of all, I appreciate your work and your continuous support for all users.
    I am planning to use SVM for a part-of-speech tagging task. I have already achieved the same task successfully with various models such as CRF, RNN and HMM.
    Now I am planning to accomplish the same task with SVM.
    I am confused about how to prepare the data, more specifically the document-term matrix, from my training data.

    I have 50 documents in which the text is organized sentence by sentence, and each word in a sentence has its corresponding POS tag/class label. The data in the 50 documents is tagged with 35 unique tags, so for SVM I have 35 classes instead of two. My training data format is like below:

    word1/NNP word2/PSP word3/NNP word4/VBI word5/PSP word6/NN word7/VBF word8/CC ... and so on.
