30 05 2015
Document classification using Python and Scikit
Recently I have been working on an interesting problem: classifying incoming documents automatically. Doing it manually would be a big pain, since there are thousands of documents we want to classify on a daily basis, so why not automate it with machine learning? Even though I tried many other libraries and algorithms, the combination of CountVectorizer, TfidfTransformer and LinearSVC gave the most precise results. I will explain later what those terms are.
First we need to import a few classes from sklearn:
import os

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import load_files
from sklearn.svm import LinearSVC
These are the categories we want our documents to be classified into. We also need to build a dataset for each category, which will be our training data:
CATEGORIES = ["fashion", "electronics", "books", "entertainment"]
train_data = load_files(os.getenv('HOME', '/home/') + '/train/', categories=CATEGORIES)
docs_trains = [open(f).read() for f in train_data.filenames]
We load the list of documents from the home/train path, with one subfolder per category. I have 50+ text documents in each folder, e.g. 50 text documents in the fashion folder and so on. The more training data, the more precise the results you will get. I wrote a small script which reads the URL and category from a csv and builds the folders accordingly. For text extraction from the links, I used the boilerpipe library, which worked pretty well.
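As a rough idea of what that script looks like, here is a minimal sketch. The function name, the csv layout (one url,category pair per row), and the pluggable fetch_text callback are all my assumptions, not the original script; in practice fetch_text would wrap the boilerpipe extraction.

```python
import csv
import os


def build_train_folders(csv_path, train_root, fetch_text):
    """Read (url, category) rows from a csv and write one text file
    per url into train_root/<category>/, the layout load_files expects:

        train/
            fashion/0.txt, 1.txt, ...
            electronics/0.txt, ...
    """
    counts = {}  # files written so far, per category
    with open(csv_path) as f:
        for url, category in csv.reader(f):
            folder = os.path.join(train_root, category)
            os.makedirs(folder, exist_ok=True)
            n = counts.get(category, 0)
            with open(os.path.join(folder, '%d.txt' % n), 'w') as out:
                out.write(fetch_text(url))  # e.g. boilerpipe extraction
            counts[category] = n + 1
```

Keeping the extraction behind a callback means you can swap boilerpipe for any other extractor without touching the folder-building logic.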
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC())])
trained_cl = text_clf.fit(docs_trains, train_data.target)
classified_category = train_data.target_names[trained_cl.predict([content])[0]]

Note that predict returns an array of labels, so we take its first element before looking up the category name.
The classifier returns classified_category, which matches the category of the given content. Pretty cool, ain’t it? For more benchmarks and accuracy details about the different algorithms, streamhacker has nailed it down: http://streamhacker.com/2012/11/22/text-classification-sentiment-analysis-nltk-scikitlearn/
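If you want to put a number on the accuracy yourself, cross-validation is the usual approach. This is a sketch with a tiny made-up dataset standing in for the real train folders (in a recent scikit-learn; older versions had cross_val_score in sklearn.cross_validation instead of sklearn.model_selection):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Toy stand-in for docs_trains / train_data.target.
docs = ["summer dress and sandals", "silk scarf new collection",
        "leather handbag on sale", "wireless headphones review",
        "new smartphone battery life", "laptop with fast ssd"] * 3
labels = [0, 0, 0, 1, 1, 1] * 3  # 0 = fashion, 1 = electronics

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC())])

# Mean accuracy over 3 stratified folds.
scores = cross_val_score(text_clf, docs, labels, cv=3)
print(scores.mean())
```

On the real dataset you would pass docs_trains and train_data.target instead, and the score gives you a quick way to compare LinearSVC against, say, MultinomialNB before settling on one.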
The Pipeline sequentially applies a list of transforms followed by a final estimator, rather than making you run each transform separately, and we can plug in different transforms and estimators at a time. CountVectorizer converts a collection of text documents into a matrix of token counts. TfidfTransformer: TF is term frequency, while Tf-idf is term frequency times inverse document frequency; it tells you how important a word in a document is relative to the whole collection. load_files loads text files with categories taken from the subfolder names. LinearSVC (Linear Support Vector Classifier) fits the data you provide, finding a “best fit” boundary between the categories, which is then used to classify new documents.