Understanding and clustering hashtags according to their word distributions using NLP in Machine Learning



Hashtags (single tokens often composed of natural language n-grams or abbreviations, prefixed with the character ‘#’) are ubiquitous on social networking services, particularly in short textual documents (a.k.a. posts). Authors use hashtags to diverse ends, many of which can be seen as labels for classical NLP tasks: disambiguation (chips #futurism vs. chips #junkfood); identification of named entities (#sf49ers); sentiment (#dislike); and topic annotation (#yoga). Hashtag prediction is the task of mapping text to its accompanying hashtags. In this work we propose a novel model for hashtag prediction, and show that this task is also a useful surrogate for learning good representations of text. Finally, given a detailed hashtag-based query, we predict whether its sentiment is positive or negative using Support Vector Machine (SVM) and random forest algorithms.
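The final step described above (labelling a hashtag-based query as positive or negative with SVM and random forest) might look roughly like the sketch below. It is only an illustration under stated assumptions: the posts and labels are hand-written toy data, TF-IDF is an assumed choice of features, and the variable names are hypothetical rather than taken from the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, hand-written posts (hypothetical data, not from the project datasets).
posts = [
    "love this sunny morning #happy",
    "great workout today #yoga #love",
    "best concert ever #awesome",
    "what a wonderful trip #happy",
    "terrible service at the cafe #dislike",
    "worst commute ever #angry",
    "hate waiting in line #dislike",
    "awful weather again #sad",
]
labels = ["positive"] * 4 + ["negative"] * 4


def make_vectorizer():
    # Keep the leading '#' so hashtags stay distinct from ordinary words.
    return TfidfVectorizer(token_pattern=r"#?\w+", lowercase=True)


# One pipeline per algorithm named in the abstract.
svm_model = make_pipeline(make_vectorizer(), LinearSVC())
rf_model = make_pipeline(
    make_vectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
svm_model.fit(posts, labels)
rf_model.fit(posts, labels)

query = "love this wonderful concert #happy"
svm_pred = svm_model.predict([query])[0]
rf_pred = rf_model.predict([query])[0]
print("SVM:", svm_pred, "| Random forest:", rf_pred)
```

With a real dataset, the two models would normally be compared on held-out data rather than on a single query.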


Existing System:

In the existing system, we show that our method outperforms existing unsupervised (word2vec) and supervised (WSABIE (Weston et al., 2011)) embedding methods, and other baselines, at the hashtag prediction task. We then probe our model’s generality by transferring its learned representations to the task of personalized document recommendation: for each of M users, given N previous positive interactions with documents (likes, clicks, etc.), predict the (N + 1)-th document the user will positively interact with. To perform well on this task, the representation should capture the user’s interest in textual content. We find that representations trained on hashtag prediction outperform representations from unsupervised learning, and that our convolutional architecture performs better than WSABIE trained on the same hashtag task.
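To make the recommendation task concrete, here is a minimal sketch of one way to rank candidates for the next document. It assumes that every document already has an embedding vector and that a user profile is simply the mean of the N documents the user liked; both the averaging scheme and the tiny 2-dimensional vectors are illustrative assumptions, not the method of the cited work.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(history, candidates):
    """Order candidate document names by similarity to the user's profile."""
    profile = mean_vector(history)
    return sorted(
        candidates,
        key=lambda name: cosine(profile, candidates[name]),
        reverse=True,
    )


# Hypothetical embeddings of N = 2 documents the user interacted with.
history = [[0.9, 0.1], [0.8, 0.3]]
# Hypothetical candidates for the next document.
candidates = {"doc_yoga": [0.7, 0.2], "doc_junkfood": [0.1, 0.9]}

ranking = rank_candidates(history, candidates)
print(ranking)  # ['doc_yoga', 'doc_junkfood']
```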



  1. It ignores word order information, and so may have less modeling power than our approach.
  2. The text can hold acronyms like “tfb”, concatenated phrases like “ilikeitwhen” or it can contain spelling mistakes.
  3. Due to Twitter slang particularities, even the most popular terms can be cryptic to users, and even more so to automatic text processing applications.
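The noisy text described in the list above is usually dealt with during pre-processing. The sketch below shows one possible approach using only the standard library: extracting hashtags, splitting camel-cased tags, and expanding acronyms from a lookup table. The `SLANG` table and its single expansion are placeholders for illustration, not a verified slang lexicon, and an all-lowercase concatenation such as “ilikeitwhen” is deliberately left untouched because splitting it would need a dictionary-based segmenter.

```python
import re

# Hypothetical lookup table; the expansion is a placeholder, not a verified meaning.
SLANG = {"tfb": "time for bed"}

HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(post):
    """Return the hashtags in a post, lower-cased and without the '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(post)]


def split_camel_case(tag):
    """Split 'ILikeItWhen' on capital letters; all-lowercase tags stay whole."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", tag)
    return [p.lower() for p in parts] or [tag.lower()]


def normalize(post):
    """Drop URLs and hashtags, lower-case, tokenize, and expand known slang."""
    text = re.sub(r"https?://\S+", " ", post)
    text = HASHTAG_RE.sub(" ", text)
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    expanded = []
    for tok in tokens:
        expanded.extend(SLANG.get(tok, tok).split())
    return expanded


print(extract_hashtags("Game night #SF49ers #Yoga"))  # ['sf49ers', 'yoga']
print(split_camel_case("ILikeItWhen"))                # ['i', 'like', 'it', 'when']
print(normalize("tfb now, ilikeitwhen it rains #Sleepy"))
# ['time', 'for', 'bed', 'now', 'ilikeitwhen', 'it', 'rains']
```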


Proposed System:

In the proposed work, we use three types of datasets: Twitter, Flickr, and YouTube. A query is then entered as hashtag test data, and the system predicts similar hashtags along with a detailed description. Unsupervised word embedding methods train with a reconstruction objective in which the embeddings are used to predict the original text. For example, word2vec tries to predict all the words in the document, given the embeddings of surrounding words. We argue that hashtag prediction provides a more direct form of supervision: the tags are a labeling by the author of the salient aspects of the text. Hence, predicting them may provide stronger semantic guidance than unsupervised learning alone. The abundance of hashtags in real posts provides a huge labeled dataset for learning potentially sophisticated models.
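The difference between the two kinds of training signal can be sketched with plain Python. The first helper builds unsupervised pairs in the style described above (surrounding words as input, one word as the target); the second builds supervised pairs whose target is a hashtag chosen by the author. The function names and the example post are hypothetical, and no actual embedding model is trained here.

```python
import re


def context_pairs(tokens, window=2):
    """Unsupervised pairs: (surrounding words, word to predict)."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


def hashtag_pairs(post):
    """Supervised pairs: (words of the post, hashtag chosen by the author)."""
    lowered = post.lower()
    tags = re.findall(r"#(\w+)", lowered)
    words = re.findall(r"[a-z0-9']+", re.sub(r"#\w+", " ", lowered))
    return [(words, tag) for tag in tags]


post = "crispy chips with dinner #junkfood #yum"
supervised = hashtag_pairs(post)
unsupervised = context_pairs(supervised[0][0], window=1)

for words, tag in supervised:
    print(words, "->", "#" + tag)
for context, target in unsupervised:
    print(context, "->", target)
```

Each post yields one supervised pair per hashtag, so the author's own labels come for free with the data.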



  1. The results of the clustering show that it is possible to identify semantically related hashtags.
  2. For each cluster we extract the top terms, i.e. the most frequent terms in the virtual documents of the cluster.
  3. These top terms are the most representative for the cluster, and fulfill their role as explanatory terms.
  4. We also extract top hashtags within a cluster; they are obtained by ranking all the hashtags in the cluster by an importance score.
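The clustering steps listed above can be sketched as follows. This is an illustration under several assumptions: a hashtag's virtual document is taken to be the concatenation of all posts that use it, TF-IDF with k-means stands in for whichever clustering method the project uses, and the importance score is approximated by how many posts use the hashtag. The posts are toy data.

```python
import re
from collections import Counter, defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

posts = [
    "morning stretch and breathing #yoga",
    "stretch class after work #yoga",
    "breathing and stretch routine #meditation",
    "calm breathing before sleep #meditation",
    "greasy burger and fries #junkfood",
    "fries and cola for lunch #junkfood",
    "burger with extra fries #fastfood",
    "cola and burger night #fastfood",
]

# Assumption: virtual document of a hashtag = all words of the posts using it.
virtual_docs = defaultdict(list)
tag_counts = Counter()
for post in posts:
    lowered = post.lower()
    tags = re.findall(r"#(\w+)", lowered)
    words = [
        w for w in re.findall(r"[a-z]+", re.sub(r"#\w+", " ", lowered))
        if w not in ENGLISH_STOP_WORDS
    ]
    for tag in tags:
        virtual_docs[tag].extend(words)
        tag_counts[tag] += 1

hashtags = sorted(virtual_docs)
docs = [" ".join(virtual_docs[tag]) for tag in hashtags]

# Cluster the virtual documents by their word distributions.
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

clusters = defaultdict(list)
for tag, label in zip(hashtags, labels):
    clusters[int(label)].append(tag)


def top_terms(tags, k=2):
    """Most frequent terms over the virtual documents of a cluster."""
    counts = Counter(w for tag in tags for w in virtual_docs[tag])
    return [w for w, _ in counts.most_common(k)]


def top_hashtags(tags, k=2):
    """Rank hashtags by a stand-in importance score (number of posts)."""
    return sorted(tags, key=lambda tag: (-tag_counts[tag], tag))[:k]


for label, tags in sorted(clusters.items()):
    print(label, top_hashtags(tags), top_terms(tags))
```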



System Architecture:

[Figure: Understanding Hashtags according to their word distributions using NLP]


Hardware and Software Requirements:


  1. OS – Windows 7, 8, or 10 (32 or 64 bit)
  2. RAM – 4GB



  1. Python IDLE
  2. Anaconda – Jupyter Notebook


Python Package:

  1. NumPy – Numerical Python
  2. Pandas – For Reading the Data
  3. Scikit-Learn
  4. NLTK – For Pre-processing
  5. Algorithm Packages (Support Vector Machine and Random Forest)
  6. Matplotlib
  7. Seaborn



  1. Dataset Collection
  2. Pre-processing
  3. Clustering
  4. Statistical Analysis

pantech team

Agile Project Expert


Course Preview
  • Instructor pantech team
  • Duration 15 Hrs
  • Access 3 Months
