TF-IDF for a large number of documents (>100k)

I'm computing TF-IDF over a very large corpus (about 100k documents) and it is giving me memory errors. Is there an implementation that works well with this many documents? I also want to use my own stopword list. It worked on 50k documents; is there a limit on the number of documents I can use in this calculation (with the sklearn implementation)?

    def tf_idf(self, df):
        df_clean, corpus = self.CleanText(df)
        tfidf = TfidfVectorizer().fit(corpus)
        count_tokens = tfidf.get_feature_names_out()
        article_vect = tfidf.transform(corpus)
        tf_idf_DF = pd.DataFrame(data=article_vect.toarray(), columns=count_tokens)
        tf_idf_DF = pd.DataFrame(tf_idf_DF.sum(axis=0).sort_values(ascending=False))
        return tf_idf_DF

The error: MemoryError: Unable to allocate 65.3 GiB for an array with shape (96671, 90622) and data type float64

Thanks in advance.


2 Answers

TfidfVectorizer has a lot of parameters. You should set max_df=0.9, min_df=0.1 and max_features=500, then grid-search these parameters for the best solution.

Without setting these parameters, you get a huge sparse matrix of shape (96671, 90622), and converting it to a dense array is what causes the memory error.

Welcome to NLP.

As @NickODell said, the memory error only occurs when you convert the sparse matrix into a dense matrix. The solution is to do everything you need using only the sparse matrix:

    def tf_idf(self, df):
        df_clean, corpus = self.CleanText(df)
        tfidf = TfidfVectorizer().fit(corpus)
        count_tokens = tfidf.get_feature_names_out()
        article_vect = tfidf.transform(corpus)
        # The following line is the solution:
        tf_idf_DF = pd.DataFrame(data=article_vect.tocsr().sum(axis=0), columns=count_tokens)
        tf_idf_DF = tf_idf_DF.T.sort_values(ascending=False, by=[0])
        tf_idf_DF['word'] = tf_idf_DF.index
        tf_idf_DF['tf-idf'] = tf_idf_DF[0]
        tf_idf_DF = tf_idf_DF.reset_index().drop(['index', 0], axis=1)
        return tf_idf_DF

And that's the solution.
