Fitting a classifier: object of type 'int' has no len()

I am working on LDA topic modeling, whose goal is to produce a set of topics from a given collection of documents. Each document can therefore belong to several topics.

I also want to evaluate the model I have built. One approach is to use a classification method such as SVM, and that is how I intend to evaluate the generated model.

I have come across two different ways of building an LDA model.

Approach 1:

# generate LDA model
id2word = corpora.Dictionary(texts)

# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA models.
lda = ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10,
                               update_every=1, chunksize=10000, passes=1,gamma_threshold=0.00, minimum_probability=0.00)

With this approach I cannot use fit_transform.

Approach 2:

tf_vectorizer = CountVectorizer(max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)

lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda_x=lda.fit_transform(tf) 

First, the LDA model from the first approach (gensim) has no fit_transform method. I do not know why, and I do not understand the difference between the two approaches.
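To make the difference concrete: sklearn's fit_transform returns one dense row of topic weights per document, while indexing the gensim model (as in `lda[mm]` further down) yields, per document, a list of (topic_id, probability) pairs. Below is a minimal sketch of converting the second form into the first. The helper name `to_dense_rows` and the sample values are made up for illustration and are not part of either library.

```python
def to_dense_rows(sparse_docs, num_topics):
    """Turn per-document lists of (topic_id, probability) pairs
    into fixed-length rows, one row per document."""
    rows = []
    for doc_topics in sparse_docs:
        row = [0.0] * num_topics          # topics not listed get weight 0.0
        for topic_id, prob in doc_topics:
            row[topic_id] = prob
        rows.append(row)
    return rows


# Hypothetical output in the shape gensim produces for two documents.
sample = [[(0, 0.9), (2, 0.1)], [(1, 1.0)]]
dense = to_dense_rows(sample, num_topics=3)
print(dense)
```

With the real model, the input would presumably be `lda[mm]` and `num_topics` would be 10, as in the code in this post.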

In any case, I need to pass the LDA model created with the first approach to the SVM. (I posted both approaches here because, as far as I can tell, the second one raises no error thanks to fit_transform, but I have to use the first one.)

Final code:

import os
from gensim.models import ldamodel
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC


tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = {'a'}

# Create p_stemmer of class PorterStemmer
lines=[]
p_stemmer = PorterStemmer()
lisOfFiles=[x[2] for x in os.walk("data")]

fullPath = [x[0] for x in os.walk("data")]
for j in lisOfFiles[2]:
    with open(os.path.join(fullPath[2],j)) as f:
        a=f.read()
        lines.append(a)

for j in lisOfFiles[3]:
    with open(os.path.join(fullPath[3],j)) as f:
        a=f.read()
        lines.append(a)

for j in lisOfFiles[4]:
    with open(os.path.join(fullPath[4],j)) as f:
        a=f.read()
        lines.append(a)

# compile sample documents into a list
doc_set = lines
# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)

# generate LDA model
id2word = corpora.Dictionary(texts)

# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA models.
lda = ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=10,
                               update_every=1, chunksize=10000, passes=1,gamma_threshold=0.00, minimum_probability=0.00)

# Assigns the topics to the documents in corpus

dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]


#creating the labels
lda_corpus = lda[mm]
label_y=[]
for i in lda_corpus:
    new_y = []
    for l in i:
        sorted_labels = sorted(i, key=lambda z: z[0], reverse=True)
        if l[1] > 0.005:
            new_y.append(l[0])
        label_y.append(new_y)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_df=2,min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(lda, label_y)

As you can see in my code, I used the first approach for certain reasons, but the last line raises an error (object of type 'int' has no len()). It seems the classifier cannot accept the LDA model created this way (I assume because I did not use fit_transform here). How can I fix this error in my code?
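Separately from the error, the label loop in the code above appends to `label_y` inside the inner loop, so it produces one entry per topic rather than one per document. Here is a minimal sketch of building exactly one label list per document with the same 0.005 threshold. The function names and the sample data are hypothetical.

```python
def labels_from_topics(doc_topics, threshold=0.005):
    """Return the topic ids whose probability exceeds the threshold."""
    return [topic_id for topic_id, prob in doc_topics if prob > threshold]


def build_labels(lda_corpus, threshold=0.005):
    """Build one label list per document, in document order."""
    return [labels_from_topics(doc, threshold) for doc in lda_corpus]


# Made-up topic distributions for two documents.
sample_corpus = [
    [(0, 0.70), (1, 0.001), (2, 0.299)],
    [(0, 0.002), (1, 0.998)],
]
label_y = build_labels(sample_corpus)
print(label_y)
```

Since the pipeline starts with a CountVectorizer, which works on raw strings, the first argument to `classifier.fit` should probably be the documents (`doc_set`) rather than `lda`. Label lists of varying length would also likely need to be binarized first, for example with sklearn's `MultiLabelBinarizer`. I have not verified either change against this data.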

Traceback:

/home/saria/tfwithpython3.6/bin/python /home/saria/PycharmProjects/TfidfLDA/test4.py
Using TensorFlow backend.
Traceback (most recent call last):
  File "/home/saria/PycharmProjects/TfidfLDA/test4.py", line 92, in <module>
    classifier.fit(lda, label_y)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 268, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/pipeline.py", line 234, in _fit
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 839, in fit_transform
    self.fixed_vocabulary_)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 760, in _count_vocab
    for doc in raw_documents:
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 1054, in __getitem__
    return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 922, in get_document_topics
    gamma, phis = self.inference([bow], collect_sstats=per_word_topics)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/gensim/models/ldamodel.py", line 429, in inference
    if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
TypeError: object of type 'int' has no len()

Process finished with exit code 1
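
One plausible reading of the traceback: CountVectorizer runs `for doc in raw_documents` over the object passed as X, which here is the LDA model itself. When an object defines `__getitem__` but not `__iter__`, Python iterates it by calling `__getitem__(0)`, `__getitem__(1)`, and so on, so the model receives an int where it expects a bag-of-words list. The toy class below is a made-up stand-in and not gensim code. It only reproduces the mechanism.

```python
class FakeTopicModel:
    """Stand-in for a model that is indexed with a bag-of-words list."""

    def __getitem__(self, bow):
        # Like the line in the traceback, this assumes bow supports len().
        if len(bow) > 0:
            return list(bow)
        return []


model = FakeTopicModel()

# Intended use: index the model with a bag-of-words document.
ok = model[[(0, 1), (3, 2)]]

# Accidental use: iterating the model falls back to __getitem__(0).
try:
    for doc in model:
        pass
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If this reading is right, the error comes from passing `lda` as the training data and not from the absence of fit_transform.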