Shiva Kumar's Blog

Start small, Think Big

Document classification using Python and Scikit

Recently I have been working on an interesting problem: automatically classifying the documents we receive. Doing it manually would be a big pain, since there are thousands of documents to classify every day, so why not automate it with machine learning? I tried several other libraries and algorithms, but the combination of CountVectorizer, TfidfTransformer and LinearSVC gave the most precise results. I will explain what those terms mean later.


First we need to import a few classes from sklearn:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import load_files
from sklearn.svm import LinearSVC

Next we settle on the categories we want our documents to be classified into, and build a dataset for each category to use as training data.



train_data = load_files('home/train')  # one subfolder per category, path as described below
docs_trains = [open(f).read() for f in train_data.filenames]

We need to load the list of documents, which I have stored under the home/train path, one folder per category. Each folder holds 50+ text documents, e.g. 50 text documents in the fashion folder, and so on. The more training data you have, the more precise the results will be. I wrote a small script that reads the URL and category from a CSV file and builds the folders accordingly. For extracting the text from each link I used the boilerpipe library, which worked pretty well.
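The folder-building step can be sketched roughly like this. It is only a minimal illustration, not my original script: the CSV layout (URL, then category, no header row), the numbered file names, the "sports" category and the `extract_text` callable are all assumptions made for the example. The callable stands in for boilerpipe, whose API is not shown here.

```python
import csv
import tempfile
from pathlib import Path


def build_dataset(csv_path, out_dir, extract_text):
    """Write one text file per CSV row into out_dir/<category>/.

    extract_text is a hypothetical callable (url -> str) standing in for
    the text-extraction library; it is not a real boilerpipe function.
    """
    out_dir = Path(out_dir)
    counts = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Assumed layout: each row is "url,category" with no header row.
        for url, category in csv.reader(handle):
            folder = out_dir / category
            folder.mkdir(parents=True, exist_ok=True)
            counts[category] = counts.get(category, 0) + 1
            target = folder / f"{counts[category]:04d}.txt"
            target.write_text(extract_text(url), encoding="utf-8")
    return counts


# Demo with made-up rows and a stand-in extractor that needs no network access.
workdir = Path(tempfile.mkdtemp())
csv_file = workdir / "links.csv"
csv_file.write_text(
    "http://example.com/a,fashion\n"
    "http://example.com/b,sports\n"
    "http://example.com/c,fashion\n",
    encoding="utf-8",
)
summary = build_dataset(csv_file, workdir / "train", lambda url: f"text from {url}")
print(summary)
```

The result is one subfolder per category, which is the layout load_files expects.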


Train files

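text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', LinearSVC())])
trained_cl = text_clf.fit(docs_trains, train_data.target)

# content holds the text of the document to classify.
# predict() returns an array of label indices, so take the first element.
classified_category = train_data.target_names[trained_cl.predict([content])[0]]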

The algorithm returns classified_category, the category that best matches the given content. Pretty cool, ain’t it? For more benchmarks and accuracy details about the algorithms, streamhacker has nailed it down.
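To try the training and prediction steps without any files on disk, here is a minimal self-contained sketch. The six training sentences, the "sports" category and the `classify` helper are made up for illustration. A real run would use the documents loaded from the category folders, and with so little data the predictions only show the mechanics.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up training data standing in for the files loaded from disk.
target_names = ["fashion", "sports"]
docs_train = [
    "summer dress collection on the runway",
    "designer shoes and handbags trend",
    "new fabric trend for dress designers",
    "football team wins the league match",
    "tennis player wins the final match",
    "the team scored a late goal in the match",
]
targets = [0, 0, 0, 1, 1, 1]  # indices into target_names

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC()),
])
trained_cl = text_clf.fit(docs_train, targets)


def classify(content):
    # predict() takes a list of documents and returns an array of label indices.
    label = trained_cl.predict([content])[0]
    return target_names[label]


fashion_guess = classify("new designer dress and shoes collection")
sports_guess = classify("football team wins the match")
print(fashion_guess, sports_guess)
```

Because the pipeline is fitted as a single object, the same vectorizer vocabulary and IDF weights learned from the training documents are reused automatically at prediction time.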

The Pipeline sequentially applies a list of transforms followed by a final estimator, so we do not have to work with each transform separately, and we can plug in different transforms and estimators as needed. CountVectorizer converts a collection of text documents to a matrix of token counts. TfidfTransformer rescales those counts: TF is term frequency, and TF-IDF is term frequency times inverse document frequency, which tells you how important a word is to a document within a collection. load_files is used to load text files with categories as subfolder names. LinearSVC (Linear Support Vector Classifier) fits the data you provide and returns a “best fit” hyperplane that divides, or categorizes, your data.
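To see what the two transforms actually produce, here is a small sketch on a made-up two-document corpus. The sentences are assumptions for illustration only. It prints the token-count matrix and then checks that a word found in only one document gets a higher TF-IDF weight than a word found in both.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["red dress red shoes", "blue dress"]  # made-up two-document corpus

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each token to its column index in the matrix.
vocab = vectorizer.vocabulary_
terms = sorted(vocab, key=vocab.get)
print(terms)             # ['blue', 'dress', 'red', 'shoes']
print(counts.toarray())  # rows: [0 1 2 1] and [1 1 0 0]

tfidf = TfidfTransformer().fit_transform(counts).toarray()

# "dress" occurs in both documents while "blue" occurs in only one,
# so in the second document "blue" ends up with the larger weight.
blue_weight = tfidf[1, vocab["blue"]]
dress_weight = tfidf[1, vocab["dress"]]
print(blue_weight > dress_weight)  # True
```

With the default settings each TF-IDF row is also scaled to unit length, which keeps long and short documents comparable before they reach the classifier.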


