Learning Word Vectors for Sentiment Analysis
Abstract

Unsupervised vector-based approaches to semantics can model rich lexical meanings, but they largely fail to capture sentiment information that is central to many word meanings and important for a wide range of NLP tasks. We present a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term-document information as well as rich sentiment content.

[...]

This causes words expressing similar sentiment to have similar representations. We can efficiently learn parameters for the joint objective function using alternating maximization.

3.1 Capturing Semantic Similarities

We build a probabilistic model of a document using a continuous mixture distribution over words indexed by a multi-dimensional random variable θ. We assume words in a document are conditionally independent given the mixture variable θ. We assign a probability to a document d using a joint distribution over the document and θ. The model assumes each word w_i ∈ d is conditionally independent of the other words given θ. The probability of a document is thus

    p(d) = \int p(d, \theta) \, d\theta = \int p(\theta) \prod_{i=1}^{N} p(w_i \mid \theta) \, d\theta,    (1)

where N is the number of words in d and w_i is the ith word in d. We use a Gaussian prior on θ.

We define the conditional distribution p(w_i | θ) using a log-linear model with parameters R and b. The energy function uses a word representation matrix R ∈ R^{β × |V|} where each word w (represented as a one-hot vector) in the vocabulary V has a β-dimensional vector representation φ_w = Rw corresponding to that word's column in R. The random variable θ is also a β-dimensional vector, θ ∈ R^β, which weights each of the β dimensions of words' representation vectors. We additionally introduce a bias b_w for each word to capture differences in overall word frequencies. The energy assigned to a word w given these model parameters is

    E(w; \theta, \phi_w, b_w) = -\theta^T \phi_w - b_w.    (2)

To obtain the distribution p(w | θ) we use a softmax,

    p(w \mid \theta; R, b) = \frac{\exp(-E(w; \theta, \phi_w, b_w))}{\sum_{w' \in V} \exp(-E(w'; \theta, \phi_{w'}, b_{w'}))}    (3)

                           = \frac{\exp(\theta^T \phi_w + b_w)}{\sum_{w' \in V} \exp(\theta^T \phi_{w'} + b_{w'})}.    (4)

The number of terms in the denominator's summation grows linearly in |V|, making exact computation of the distribution possible. For a given θ, a word w's occurrence probability is related to how closely its representation vector φ_w matches the scaling direction of θ. This idea is similar to the word vector inner product used in the log-bilinear language model of Mnih and Hinton (2007).

Equation 1 resembles the probabilistic model of LDA (Blei et al., 2003), which models documents as mixtures of latent topics. One could view the entries of a word vector φ as that word's association strength with respect to each latent topic dimension. The random variable θ then defines a weighting over topics. However, our model does not attempt to model individual topics, but instead directly models word probabilities conditioned on the topic mixture variable θ. Because of the log-linear formulation of the conditional distribution, θ is a vector in R^β and not restricted to the unit simplex as it is in LDA.

We now derive maximum likelihood learning for this model when given a set of unlabeled documents D. In maximum likelihood learning we maximize the probability of the observed data given the model parameters. We assume documents d_k ∈ D are i.i.d. samples. Thus the learning problem becomes

    \max_{R, b} \; p(D; R, b) = \prod_{d_k \in D} \int p(\theta) \prod_{i=1}^{N_k} p(w_i \mid \theta; R, b) \, d\theta.    (5)

Using maximum a posteriori (MAP) estimates for θ, we approximate this learning problem as

    \max_{R, b} \; \prod_{d_k \in D} p(\hat{\theta}_k) \prod_{i=1}^{N_k} p(w_i \mid \hat{\theta}_k; R, b),    (6)

where \hat{\theta}_k denotes the MAP estimate of θ for d_k. We introduce a Frobenius norm regularization term for the word representation matrix R. The word biases b are not regularized, reflecting the fact that we want the biases to capture whatever overall word frequency statistics are present in the data.
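Before taking that step, the conditional distribution of Equations 2-4 can be made concrete; because the vocabulary is small, it is computed exactly. The following is a minimal NumPy sketch (our illustration, not the authors' code; the dimensions, random initialization, and function name are assumptions):

```python
import numpy as np

def word_distribution(R, b, theta):
    """p(w | theta; R, b) over the whole vocabulary (Equations 2-4).

    R     : (beta, |V|) word representation matrix; column w is phi_w
    b     : (|V|,) per-word bias vector
    theta : (beta,) document-level mixture variable
    """
    # Negative energy theta^T phi_w + b_w for every word (Equation 2).
    logits = theta @ R + b
    # Softmax over the vocabulary (Equations 3-4); shifting by the max
    # leaves the distribution unchanged but avoids overflow.
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

# Toy example: beta = 5 dimensions, |V| = 8 words.
rng = np.random.default_rng(0)
R = rng.normal(size=(5, 8))
b = rng.normal(size=8)
theta = rng.normal(size=5)
p = word_distribution(R, b, theta)
assert np.isclose(p.sum(), 1.0)
```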
By taking the logarithm and simplifying we obtain the final objective

    \nu \|R\|_F^2 + \sum_{d_k \in D} \left[ \lambda \|\hat{\theta}_k\|_2^2 + \sum_{i=1}^{N_k} \log p(w_i \mid \hat{\theta}_k; R, b) \right],    (7)

which is maximized with respect to R and b. The hyper-parameters in the model are the regularization weights (λ and ν), and the word vector dimensionality β.

3.2 Capturing Word Sentiment

The model presented so far does not explicitly capture sentiment information. Applying this algorithm to documents will produce representations where words that occur together in documents have similar representations. However, this unsupervised approach has no explicit way of capturing which words are predictive of sentiment as opposed to content-related. Much previous work in natural language processing achieves better representations by learning from multiple tasks (Collobert and Weston, 2008; Finkel and Manning, 2009). Following this theme we introduce a second task to utilize labeled documents to improve our model's word representations.

Sentiment is a complex, multi-dimensional concept. Depending on which aspects of sentiment we wish to capture, we can give some body of text a sentiment label s which can be categorical, continuous, or multi-dimensional. To leverage such labels, we introduce an objective that the word vectors of our model should predict the sentiment label using some appropriate predictor,

    \hat{s} = f(\phi_w).    (8)

Using an appropriate predictor function f(x) we map a word vector φ_w to a predicted sentiment label ŝ. We can then improve our word vector φ_w to better predict the sentiment labels of contexts in which that word occurs.

For simplicity we consider the case where the sentiment label s is a scalar continuous value representing sentiment polarity of a document. This captures the case of many online reviews where documents are associated with a label on a star rating scale. We linearly map such star values to the interval s ∈ [0, 1] and treat them as a probability of positive sentiment polarity. Using this formulation, we employ a logistic regression as our predictor f(x). We use w's vector representation φ_w and regression weights ψ to express this as

    p(s = 1 \mid w; R, \psi) = \sigma(\psi^T \phi_w + b_c),    (9)

where σ(x) is the logistic function and ψ ∈ R^β is the logistic regression weight vector. We additionally introduce a scalar bias b_c for the classifier.

The logistic regression weights ψ and b_c define a linear hyperplane in the word vector space, where a word vector's positive sentiment probability depends on where it lies with respect to this hyperplane. Learning over a collection of documents results in words residing different distances from this hyperplane based on the average polarity of documents in which the words occur.

Given a set of labeled documents D where s_k is the sentiment label for document d_k, we wish to maximize the probability of document labels given the documents. We assume documents in the collection and words within a document are i.i.d. samples. By maximizing the log-objective we obtain

    \max_{R, \psi, b_c} \; \sum_{k=1}^{|D|} \sum_{i=1}^{N_k} \log p(s_k \mid w_i; R, \psi, b_c).    (10)

The conditional probability p(s_k | w_i; R, ψ, b_c) is easily obtained from Equation 9.

3.3 Learning

The full learning objective maximizes a sum of the two objectives presented. This produces a final objective function of

    \nu \|R\|_F^2 + \sum_{d_k \in D} \left[ \lambda \|\hat{\theta}_k\|_2^2 + \sum_{i=1}^{N_k} \log p(w_i \mid \hat{\theta}_k; R, b) \right] + \sum_{k=1}^{|D|} \frac{1}{|S_k|} \sum_{i=1}^{N_k} \log p(s_k \mid w_i; R, \psi, b_c).    (11)

|S_k| denotes the number of documents in the dataset with the same rounded value of s_k (i.e., s_k < 0.5 and s_k ≥ 0.5). We introduce the weighting 1/|S_k| to combat the well-known imbalance in ratings present in review collections. This weighting prevents the overall distribution of document ratings from affecting the estimate of document ratings in which a particular word occurs.
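To make the sentiment term of Equation 11 concrete, the sketch below (ours, not the authors' implementation) scores every word of a document with the logistic predictor of Equation 9 and accumulates the weighted label log-likelihood. Treating the continuous label s_k as a probability gives the cross-entropy form used here; all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sentiment_term(R, psi, b_c, word_ids, s_k, inv_size_k):
    """Weighted label log-likelihood for one document (Equation 11).

    R          : (beta, |V|) word representation matrix
    psi, b_c   : regression weights and scalar bias from Equation 9
    word_ids   : vocabulary indices of the words in document d_k
    s_k        : label in [0, 1], treated as p(positive polarity)
    inv_size_k : the 1/|S_k| weight against rating imbalance
    """
    phi = R[:, word_ids]              # (beta, N_k) word vectors
    p_pos = sigmoid(psi @ phi + b_c)  # p(s = 1 | w_i) for each word
    log_lik = s_k * np.log(p_pos) + (1.0 - s_k) * np.log(1.0 - p_pos)
    return inv_size_k * log_lik.sum()
```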
The hyper-parameters of the model are the regularization weights (λ and ν), and the word vector dimensionality β.

Maximizing the objective function with respect to R, b, ψ, and b_c is a non-convex problem. We use alternating maximization, which first optimizes the word representations (R, b, ψ, and b_c) while leaving the MAP estimates θ̂ fixed. Then we find the new MAP estimate for each document while leaving the word representations fixed, and continue this process until convergence. The optimization algorithm quickly finds a global solution for each θ̂_k because we have a low-dimensional, convex problem in each θ̂_k. Because the MAP estimation problems for different documents are independent, we can solve them on separate machines in parallel. This facilitates scaling the model to document collections with hundreds of thousands of documents.
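The alternating scheme can be sketched in a few lines. The gradient-ascent sketch below is our illustration under the objective of Equation 7: it covers only the unsupervised semantic component, omits the sentiment term for brevity, treats the regularizers as penalties, and uses illustrative step sizes and iteration counts:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def map_step(theta, counts, R, b, lam, lr=0.5, steps=50):
    """Gradient ascent on the convex, low-dimensional MAP problem for one
    document: sum_w counts_w * log p(w | theta) - lam * ||theta||^2."""
    n = counts.sum()
    for _ in range(steps):
        p = softmax(theta @ R + b)
        grad = R @ (counts - n * p) - 2.0 * lam * theta
        theta = theta + lr * grad / n
    return theta

def word_param_step(all_counts, thetas, R, b, nu, lr=0.5, steps=10):
    """Update R and b with every theta_k held fixed."""
    for _ in range(steps):
        grad_R = -2.0 * nu * R
        grad_b = np.zeros_like(b)
        for counts, theta in zip(all_counts, thetas):
            resid = counts - counts.sum() * softmax(theta @ R + b)
            grad_R += np.outer(theta, resid)
            grad_b += resid
        R = R + lr * grad_R / len(all_counts)
        b = b + lr * grad_b / len(all_counts)
    return R, b

def alternating_maximization(all_counts, beta, lam=1.0, nu=1e-3, iters=5):
    """all_counts: list of (|V|,) term-count vectors, one per document."""
    vocab = all_counts[0].shape[0]
    rng = np.random.default_rng(0)
    R = 0.01 * rng.normal(size=(beta, vocab))
    b = np.zeros(vocab)
    thetas = [np.zeros(beta) for _ in all_counts]
    for _ in range(iters):
        R, b = word_param_step(all_counts, thetas, R, b, nu)
        # The per-document MAP problems are independent, so this loop
        # could run in parallel across machines.
        thetas = [map_step(t, c, R, b, lam)
                  for t, c in zip(thetas, all_counts)]
    return R, b, thetas
```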
4 Experiments

We evaluate our model with document-level and sentence-level categorization tasks in the domain of online movie reviews. For document categorization, we compare our method to previously published results on a standard dataset, and introduce a new dataset for the task. In both tasks we compare our model's word representations with several bag of words weighting methods, and alternative approaches to word vector induction.

4.1 Word Representation Learning

We induce word representations with our model using 25,000 movie reviews from IMDB. Because some movies receive substantially more reviews than others, we limited ourselves to including at most 30 reviews from any movie in the collection. We build a fixed dictionary of the 5,000 most frequent tokens, but ignore the 50 most frequent terms from the original full vocabulary. Traditional stop word removal was not used because certain stop words (e.g., negating words) are indicative of sentiment. Stemming was not applied because the model learns similar representations for words of the same stem when the data suggests it. Additionally, because certain non-word tokens (e.g., "!" and ":-)") are indicative of sentiment we allow them in our vocabulary. Ratings on IMDB are given as star values (∈ {1, 2, ..., 10}), which we linearly map to [0, 1] to use as document labels when training our model.

The semantic component of our model does not require document labels. We train a variant of our model which uses 50,000 unlabeled reviews in addition to the labeled set of 25,000 reviews. The unlabeled set of reviews contains neutral reviews as well as those which are polarized, as found in the labeled set. Training the model with additional unlabeled data captures a common scenario where the amount of labeled data is small relative to the amount of unlabeled data available. For all word vector models, we use 50-dimensional vectors.

As a qualitative assessment of word representations, we visualize the words most similar to a query word using vector similarity of the learned representations. Given a query word w and another word w', we obtain their vector representations φ_w and φ_{w'}, and evaluate their cosine similarity as S(\phi_w, \phi_{w'}) = \frac{\phi_w^T \phi_{w'}}{\|\phi_w\| \cdot \|\phi_{w'}\|}. By assessing the similarity of w with all other words w', we can find the words deemed most similar by the model.

Table 1 shows the most similar words to given query words using our model's word representations as well as those of LSA. All of these vectors capture broad semantic similarities. However, both versions of our model seem to do better than LSA in avoiding accidental distributional similarities (e.g., screwball and grant as similar to romantic). A comparison of the two versions of our model also begins to highlight the importance of adding sentiment information. In general, words indicative of sentiment tend to have high similarity with words of the same sentiment polarity, so even the purely unsupervised model's results look promising. However, they also show more genre and content effects. For example, the sentiment-enriched vectors for ghastly are truly semantic alternatives to that word, whereas the vectors without sentiment also contain some content words that tend to have ghastly predicated of them. Of course this is only an impressionistic analysis of a few cases, but it is helpful in understanding why the sentiment-enriched model proves superior at the sentiment classification results we report next.

Query: melancholy
  Our model (Sentiment + Semantic): bittersweet, heartbreaking, happiness, tenderness, compassionate
  Our model (Semantic only):        thoughtful, warmth, layer, gentle, loneliness
  LSA:                              poetic, lyrical, poetry, profound, vivid

Query: ghastly
  Our model (Sentiment + Semantic): embarrassingly, trite, severely, grotesque, unsuspecting
  Our model (Semantic only):        predators, hideous, atrocious, appalling, lame
  LSA:                              hideous, inept, baffled, smack, passable

Query: lackluster
  Our model (Sentiment + Semantic): uninspired, [?], unimaginative, uninspired, awful
  Our model (Semantic only):        laughable, [?], amateurish, cliched, insipid
  LSA:                              unconvincing, flat, bland, forgettable, mediocre

Query: romantic
  Our model (Sentiment + Semantic): romance, love, sweet, beautiful, relationship
  Our model (Semantic only):        romance, charming, delightful, sweet, chemistry
  LSA:                              romance, screwball, grant, comedies, comedy

Table 1: Similarity of learned word vectors. Each target word is given with its five most similar words using cosine similarity of the vectors determined by each model. The full version of our model (left) captures both lexical similarity as well as similarity of sentiment strength and orientation. Our unsupervised semantic component (center) and LSA (right) capture semantic relations.

4.2 Other Word Representations

For comparison, we implemented several alternative vector space models that are conceptually similar to our own, as discussed in section 2:

Latent Semantic Analysis (LSA; Deerwester et al., 1990) We apply truncated SVD to a tf.idf weighted, cosine normalized count matrix, which is a standard weighting and smoothing scheme for VSM induction (Turney and Pantel, 2010).

Latent Dirichlet Allocation (LDA; Blei et al., 2003) We use the method described in section 2 for inducing word representations from the topic matrix. To train the 50-topic LDA model we use code released by Blei et al. (2003). We use the same 5,000 term vocabulary for LDA as is used for training word vector models. We leave the LDA hyper-parameters at their default values, though some work suggests optimizing over priors for LDA is important (Wallach et al., 2009).

Weighting Variants We evaluate both binary (b) term frequency weighting with smoothed delta idf (Δt') and no idf (n) because these variants worked well in previous experiments in sentiment (Martineau and Finin, 2009; Pang et al., 2002). In all cases, we use cosine normalization (c). Paltoglou and Thelwall (2010) perform an extensive analysis of such weighting variants for sentiment tasks.

4.3 Document Polarity Classification

Our first evaluation task is document-level sentiment polarity classification. A classifier must predict whether a given review is positive or negative given the review text.

Given a document's bag of words vector v, we obtain features from our model using a matrix-vector product Rv, where v can have arbitrary tf.idf weighting. We do not cosine normalize v, instead applying cosine normalization to the final feature vector Rv. This procedure is also used to obtain features from the LDA and LSA word vectors. In preliminary experiments, we found 'bnn' weighting to work best for v when generating document features via the product Rv. In all experiments, we use this weighting to get multi-word representations from word vectors.
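The feature construction just described amounts to one matrix-vector product plus a normalization. A minimal sketch (ours; the binary weighting below stands in for the 'bnn' scheme, and the epsilon guard is an illustrative detail):

```python
import numpy as np

def document_features(R, counts):
    """Map a bag-of-words count vector to a normalized feature vector.

    R      : (beta, |V|) learned word representation matrix
    counts : (|V|,) raw term counts for one document
    """
    v = (counts > 0).astype(float)           # binary 'bnn'-style weighting
    f = R @ v                                # the matrix-vector product Rv
    return f / (np.linalg.norm(f) + 1e-12)   # cosine normalize Rv itself
```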
4.3.1 Pang and Lee Movie Review Dataset

The polarity dataset version 2.0 introduced by Pang and Lee (2004) consists of 2,000 movie reviews, where each is associated with a binary sentiment polarity label. (The dataset is available at http://www.cs.cornell.edu/people/pabo/movie-review-data.) We report 10-fold cross validation results using the authors' published folds to make our results comparable with others in the literature. We use a linear support vector machine (SVM) classifier trained with LIBLINEAR (Fan et al., 2008), and set the SVM regularization parameter to the same value used by Pang and Lee (2004).

Table 2 shows the classification performance of our method, other VSMs we implemented, and previously reported results from the literature. Bag of words vectors are denoted by their weighting notation. Features from word vector learners are denoted by the learner name. As a control we trained versions of our model with only the unsupervised semantic component, and the full model (semantic and sentiment). We also include results for a version of our full model trained with 50,000 additional unlabeled examples. Finally, to test whether our models' representations complement a standard bag of words, we evaluate performance of the two feature representations concatenated.

Features                                               | PL04  | Our Dataset | Subjectivity
Bag of Words (bnc)                                     | 85.45 | 87.80       | 87.77
Bag of Words (bΔt'c)                                   | 85.80 | 88.23       | 85.65
LDA                                                    | 66.70 | 67.42       | 66.65
LSA                                                    | 84.55 | 83.96       | 82.82
Our Semantic Only                                      | 87.10 | 87.30       | 86.65
Our Full                                               | 84.65 | 87.44       | 86.19
Our Full, Additional Unlabeled                         | 87.05 | 87.99       | 87.22
Our Semantic + Bag of Words (bnc)                      | 88.30 | 88.28       | 88.58
Our Full + Bag of Words (bnc)                          | 87.85 | 88.33       | 88.45
Our Full, Add'l Unlabeled + Bag of Words (bnc)         | 88.90 | 88.89       | 88.13
Bag of Words SVM (Pang and Lee, 2004)                  | 87.15 | N/A         | 90.00
Contextual Valence Shifters (Kennedy and Inkpen, 2006) | 86.20 | N/A         | N/A
tf.Δidf Weighting (Martineau and Finin, 2009)          | 88.10 | N/A         | N/A
Appraisal Taxonomy (Whitelaw et al., 2005)             | 90.20 | N/A         | N/A

Table 2: Classification accuracy on three tasks. From left to right the datasets are: a collection of 2,000 movie reviews often used as a benchmark of sentiment classification (Pang and Lee, 2004), 50,000 reviews we gathered from IMDB, and the sentence subjectivity dataset also released by Pang and Lee (2004). All tasks are balanced two-class problems.
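As a rough sketch of the classification protocol (our illustration, not the authors' setup: scikit-learn's LinearSVC wraps the same LIBLINEAR library, but the regularization constant C and the fold count here are placeholders for the settings described above):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC  # LIBLINEAR-backed linear SVM

def cv_accuracy(features, labels, C=1.0, folds=10):
    """Mean 10-fold cross-validated accuracy for one feature set."""
    return cross_val_score(LinearSVC(C=C), features, labels, cv=folds).mean()

# Concatenated representation ("Our Full + Bag of Words" rows in Table 2):
# combined = np.hstack([model_features, bag_of_words_features])
```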
Our method's features clearly outperform those of other VSMs, and perform best when combined with the original bag of words representation. The variant of our model trained with additional unlabeled data performed best, suggesting the model can effectively utilize large amounts of unlabeled data along with labeled examples. Our method performs competitively with previously reported results in spite of our restriction to a vocabulary of only 5,000 words.

We extracted the movie title associated with each review and found that 1,299 of the 2,000 reviews in the dataset have at least one other review of the same movie in the dataset. Of 406 movies with multiple reviews, 249 have the same polarity label for all of their reviews. Overall, these facts suggest that, relative to the size of the dataset, there are highly correlated examples with correlated labels. This is a natural and expected property of this kind of document collection, but it can have a substantial impact on performance in datasets of this scale. In the random folds distributed by the authors, approximately 50% of reviews in each validation fold's test set have a review of the same movie with the same label in the training set. Because the dataset is small, a learner may perform well by memorizing the association between label and words unique to a particular movie (e.g., character names or plot terms).

We introduce a substantially larger dataset, which uses disjoint sets of movies for training and testing. These steps minimize the ability of a learner to rely on idiosyncratic word-class associations, thereby focusing attention on genuine sentiment features.

4.3.2 IMDB Review Dataset

We constructed a collection of 50,000 reviews from IMDB, allowing no more than 30 reviews per movie. The constructed dataset contains an even number of positive and negative reviews, so randomly guessing yields 50% accuracy. Following previous work on polarity classification, we consider only highly polarized reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. In the interest of providing a benchmark for future work in this area, we release this dataset to the public. (Dataset and further details are available online at http://www.andrew-maas.net/data/sentiment.)

We evenly divided the dataset into training and test sets. The training set is the same 25,000 labeled reviews used to induce word vectors with our model. We evaluate classifier performance after cross-validating classifier parameters on the training set, again using a linear SVM in all cases. Table 2 shows classification performance on our subset of IMDB reviews. Our model showed superior performance to other approaches, and performed best when concatenated with bag of words representation. Again the variant of our model which utilized extra unlabeled data during training performed best. Differences in accuracy are small, but, because our test set contains 25,000 examples, the variance of the performance estimate is quite low. For example, an accuracy increase of 0.1% corresponds to correctly classifying an additional 25 reviews.

4.4 Subjectivity Detection

As a second evaluation task we performed sentence-level subjectivity classification. In this task, a classifier is trained to decide whether a given sentence is subjective, expressing the writer's opinions, or objective, expressing purely facts. We used the dataset of Pang and Lee (2004), which contains subjective sentences from movie review summaries and objective sentences from movie plot summaries. This task is substantially different from the review classification task because it uses sentences as opposed to entire documents, and the target concept is subjectivity instead of opinion polarity. We randomly split the 10,000 examples into 10 folds and report 10-fold cross validation accuracy using the SVM training protocol of Pang and Lee (2004).

Table 2 shows classification accuracies from the sentence subjectivity experiment. Our model again provided superior features when compared against other VSMs. Improvement over the bag-of-words baseline is obtained by concatenating the two feature vectors.

5 Discussion

We presented a vector space model that learns word representations capturing semantic and sentiment information. The model's probabilistic foundation gives a theoretically justified technique for word vector induction as an alternative to the overwhelming number of matrix factorization-based techniques commonly used. Our model is parametrized as a log-bilinear model following recent success in using similar techniques for language models (Bengio et al., 2003; Collobert and Weston, 2008; Mnih and Hinton, 2007), and it is related to probabilistic latent topic models (Blei et al., 2003; Steyvers and Griffiths, 2006). We parametrize the topical component of our model in a manner that aims to capture word representations instead of latent topics. In our experiments, our method performed better than LDA, which models latent topics directly.
We extended the unsupervised model to incorporate sentiment information and showed how this extended model can leverage the abundance of sentiment-labeled texts available online to yield word representations that capture both sentiment and semantic relations. We demonstrated the utility of such representations on two tasks of sentiment classification, using existing datasets as well as a larger one that we release for future research. These tasks involve relatively simple sentiment information, but the model is highly flexible in this regard; it can be used to characterize a wide variety of annotations, and thus is broadly applicable in the growing areas of sentiment analysis and retrieval.

Acknowledgments

This work is supported by the DARPA Deep Learning program under contract number FA8650-10-C-7020, an NSF Graduate Fellowship awarded to AM, and ONR grant No. N00014-10-1-0109 to CP.

References

C. O. Alm, D. Roth, and R. Sproat. 2005. Emotions from text: machine learning for text-based emotion prediction. In Proceedings of HLT/EMNLP, pages 579-586.

A. Andreevskaia and S. Bergler. 2006. Mining WordNet for fuzzy sentiment: sentiment tag extraction from WordNet glosses. In Proceedings of the European ACL, pages 209-216.

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, August.

D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, May.

J. Boyd-Graber and P. Resnik. 2010. Holistic sentiment analysis across languages: multilingual supervised latent Dirichlet allocation. In Proceedings of EMNLP, pages 45-55.

R. Collobert and J. Weston. 2008. A unified architecture for natural language processing. In Proceedings of the ICML, pages 160-167.

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, September.

R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin. 2008. LIBLINEAR: a library for large linear classification. The Journal of Machine Learning Research, 9:1871-1874, August.

J. R. Finkel and C. D. Manning. 2009. Joint parsing and named entity recognition. In Proceedings of NAACL, pages 326-334.

A. B. Goldberg and J. Zhu. 2006. Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization. In TextGraphs: HLT/NAACL Workshop on Graph-based Algorithms for Natural Language Processing, pages 45-52.

T. Jay. 2000. Why We Curse: A Neuro-Psycho-Social Theory of Speech. John Benjamins, Philadelphia/Amsterdam.

D. Kaplan. 1999. What is meaning? Explorations in the theory of Meaning as Use. Brief version, draft 1. MS, UCLA.

A. Kennedy and D. Inkpen. 2006. Sentiment classification of movie reviews using contextual valence shifters. Computational Intelligence, 22:110-125.

F. Li, M. Huang, and X. Zhu. 2010. Sentiment analysis with global topics and local dependency. In Proceedings of AAAI, pages 1371-1376.

C. Lin and Y. He. 2009. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 375-384.

J. Martineau and T. Finin. 2009. Delta TFIDF: an improved feature space for sentiment analysis. In Proceedings of the 3rd AAAI International Conference on Weblogs and Social Media, pages 258-261.

A. Mnih and G. E. Hinton. 2007. Three new graphical models for statistical language modelling. In Proceedings of the ICML, pages 641-648.

G. Paltoglou and M. Thelwall. 2010. A study of information retrieval weighting schemes for sentiment analysis. In Proceedings of the ACL, pages 1386-1395.

B. Pang and L. Lee. 2004. A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the ACL, pages 271-278.

B. Pang and L. Lee. 2005. Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL, pages 115-124.
B. Pang, L. Lee, and S. Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of EMNLP, pages 79-86.

C. Potts. 2007. The expressive dimension. Theoretical Linguistics, 33:165-197.

B. Snyder and R. Barzilay. 2007. Multiple aspect ranking using the good grief algorithm. In Proceedings of NAACL, pages 300-307.

M. Steyvers and T. L. Griffiths. 2006. Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning.

J. Turian, L. Ratinov, and Y. Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the ACL, pages 384-394.

P. D. Turney and P. Pantel. 2010. From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research, 37:141-188.

H. Wallach, D. Mimno, and A. McCallum. 2009. Rethinking LDA: why priors matter. In Proceedings of NIPS, pages 1973-1981.

C. Whitelaw, N. Garg, and S. Argamon. 2005. Using appraisal groups for sentiment analysis. In Proceedings of CIKM, pages 625-631.

T. Wilson, J. Wiebe, and R. Hwa. 2004. Just how mad are you? Finding strong and weak opinion clauses. In Proceedings of AAAI, pages 761-769.