How do I combine text and numerical features in a training set for machine learning?
I am trying to predict the number of likes on a post in a social network based on both numerical features and text features. I now have a dataframe with the required features, but I don't know what to do with the posts' text data. Should I vectorize it, or do something else, in order to get a suitable training matrix? I am going to use LinearSVC from sklearn for the analysis.
There are a lot of different ways you can transform your text features into numerical ones.
One of the most common ways is the Bag of Words approach, where you transform each text into an array holding the number of occurrences of each word in the vocabulary.
If you are using scikit-learn, I recommend reading the Text feature extraction section of its User Guide.
Also look at the NLTK toolkit for more complex ways to process your text data.