Quora is a knowledge-sharing website where users can ask and answer questions with the option of anonymity. We investigate the problem of author identification for Quora answers using deep learning techniques from Natural Language Processing.
We hope to achieve significant precision on the task of identifying users from their writings, with the end goal of recognizing the authors of anonymous answers on Quora. Previous work indicates that writing style harbors essential cues about authors, and we believe that deep learning is a powerful tool for extracting such features to distinguish between the various writing styles that people have. The work has applications to several other tasks such as forensic linguistics, email spam detection, and identity tracing in cyber forensics.
There has been a fair amount of interest in author identification in previous NLP work, with most of it focusing on manually engineered features:

1. Comparing Frequency and Style-Based Features for Twitter Author Identification: examines author identification in short texts, focusing on messages retrieved from Twitter. To determine the most effective feature set for recognizing authors, it looks at bag-of-words and style-marker features and uses SVMs for the classification task.
2. A Comparative Study of Language Models for Book and Author Recognition: evaluates similarity between documents and authors, and showed that syntactic features are less successful than function words for author attribution.
3. A Survey of Modern Authorship Attribution Methods: discusses how this scientific field has developed substantially over the past few decades, taking advantage of research advances in areas such as machine learning, information retrieval, and natural language processing.
To generate a smaller version of the problem we deal with, we select 200 writers from the list of top Quora writers and use RSS feeds to generate a dataset containing exactly 50 answers per user.
Figure 1: Dataset details

Quora Top Writers are a group of people with recognized expertise, knowledge and authenticity. Some people in the group are experts in specific fields, while others are simply great writers with a talent for describing the human condition and the world around us. Selecting these people ensures high-quality content in the form of a large number of long answers per author.
Figure 2: Data statistics. (a) Word length vs. frequency. (b) Average answer length distribution.

We observe that most of the authors in the dataset average fewer than 150 words per answer, though a few authors write very long answers. Also, not surprisingly, the word length vs. frequency curve follows a power law.
2 Technical Approach and Models
For every author, the labels of a fraction of the answers are held out to test our final model, and the remaining answers are used for training on the task of author attribution. We used both machine learning models on engineered features and deep learning models for the task.

2.1 Model 1: Style marker features
The following style marker features, commonly used in previous author-identification work, were used for the classification task (a sketch of extracting them follows the list):

1. Number of words in the answer
2. Fraction of words that are punctuation
3. Average word length
4. Standard deviation of word length
5. Number of sentences in the answer
6. Average sentence length
7. Number of digits in the answer
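As a minimal sketch (not the authors' exact implementation; the tokenization choices here are assumptions), these seven features could be computed as follows:

    import string
    import numpy as np
    from nltk.tokenize import sent_tokenize, word_tokenize

    def style_marker_features(answer):
        """Compute the seven style-marker features for a single answer."""
        tokens = word_tokenize(answer)
        # Treat single-character punctuation tokens as punctuation.
        n_punct = sum(t in string.punctuation for t in tokens)
        words = [t for t in tokens if t not in string.punctuation]
        word_lengths = [len(w) for w in words] or [0]
        sentences = sent_tokenize(answer)
        return np.array([
            len(words),                            # 1. number of words
            n_punct / max(len(tokens), 1),         # 2. fraction of punctuation tokens
            np.mean(word_lengths),                 # 3. average word length
            np.std(word_lengths),                  # 4. std. dev. of word length
            len(sentences),                        # 5. number of sentences
            len(words) / max(len(sentences), 1),   # 6. average sentence length in words
            sum(c.isdigit() for c in answer),      # 7. number of digits
        ])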
2.2 Model 2: Word frequency model
Each answer is modeled by a feature vector whose length equals the size of the vocabulary set and whose entries are the counts of each vocabulary word in that answer. The vocabulary set is varied by incrementally adding more tokens in order of decreasing frequency; using the complete vocabulary works best.
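Such count vectors can be built, for example, with scikit-learn's CountVectorizer, whose max_features cutoff mirrors the frequency-ordered vocabulary growth described above (train_answers and test_answers are assumed lists of answer strings, not names from the report):

    from sklearn.feature_extraction.text import CountVectorizer

    # max_features=None keeps the complete vocabulary (which worked best);
    # max_features=k would keep only the k most frequent tokens.
    vectorizer = CountVectorizer(max_features=None)
    X_train = vectorizer.fit_transform(train_answers)  # (n_answers, vocab_size) counts
    X_test = vectorizer.transform(test_answers)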
2.3 Model 3: LSTM with mean pooling
The LSTM model is a recurrent neural network with memory units, allowing each cell to remember or forget its previous state as needed.

Notation: $x_t$ is the input to the memory cell layer at time $t$; $W_i$, $W_f$, $W_c$, $W_o$, $U_i$, $U_f$, $U_c$, $U_o$ and $V_o$ are weight matrices; $b_i$, $b_f$, $b_c$ and $b_o$ are bias vectors.
Figure 3: LSTM unit

Memory unit update: First, we compute the values for $i_t$, the input gate, and $\tilde{C}_t$, the candidate value for the states of the memory cells at time $t$:

    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)            (1)
    \tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)     (2)

Second, we compute the value for $f_t$, the activation of the memory cells' forget gates at time $t$:

    f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)            (3)

Given the input gate activation $i_t$, the forget gate activation $f_t$ and the candidate state value $\tilde{C}_t$, we can compute $C_t$, the memory cells' new state at time $t$:

    C_t = i_t * \tilde{C}_t + f_t * C_{t-1}              (4)

With the new state of the memory cells, we can compute the value of their output gates and, subsequently, their outputs:

    o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)            (5)
    h_t = o_t * \tanh(C_t)                               (6)
Figure 4: Mean Pooling with LSTMs
Mean pooling: The model is composed of a single LSTM layer followed by an average pooling and a logistic regression layer, as illustrated in Figure 4. From an input sequence $x_0, x_1, x_2, \ldots, x_n$, the memory cells in the LSTM layer produce a representation sequence $h_0, h_1, h_2, \ldots, h_n$. This representation sequence is then averaged over all time steps, resulting in the representation $h$. Finally, this representation is fed to a logistic regression layer whose target is the class label associated with the input sequence.
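A minimal Keras sketch of this architecture, not the original implementation; the vocabulary size, hidden dimension and optimizer below are assumptions, since the report does not specify them:

    import tensorflow as tf
    from tensorflow.keras import layers, models

    NUM_AUTHORS = 198   # classes present in the dataset
    VOCAB_SIZE = 20000  # assumption: cutoff not reported
    EMBED_DIM = 300
    HIDDEN_DIM = 128    # assumption: hidden size not reported

    model = models.Sequential([
        # Randomly initialized embeddings (reported to beat pre-trained vectors here).
        layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
        # Single LSTM layer emitting h_0 ... h_n for every time step.
        layers.LSTM(HIDDEN_DIM, return_sequences=True),
        # Mean pooling over time steps produces the fixed-size representation h.
        layers.GlobalAveragePooling1D(),
        # "Logistic regression" output layer: dense softmax over authors.
        layers.Dense(NUM_AUTHORS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)])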
The dataset of 200 authors with 50 answers per author was split 80:10:10 into training, validation and test sets respectively.
t-SNE is a tool for visualizing high-dimensional data: it converts similarities between data points into joint probabilities and minimizes the KL divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data. We visualize the two sets of baseline features using t-SNE, displaying how well the features separate the various classes. Each scatter point represents an answer, and the color represents the author of that answer.
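A sketch of producing such a plot with scikit-learn and matplotlib, where X is the (n_answers, n_features) feature matrix and y the author index used only for coloring (both names and the perplexity value are assumptions):

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    coords = TSNE(n_components=2, perplexity=30).fit_transform(X)
    plt.scatter(coords[:, 0], coords[:, 1], c=y, cmap="tab20", s=8)
    plt.title("t-SNE of answer features, colored by author")
    plt.show()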
Figure 5: t-SNE visualizations. (a) Style markers. (b) Unigram features.
The unigram features clearly do a much better job of separating the data than the style markers, confirming the findings in [1, 2].

3.2 Results
We use top-k accuracy to evaluate the performance of our models, i.e. the author prediction is considered correct if the true author belongs to the top k predictions of the model for a given answer. For reference, random guessing gives a top-1 accuracy of 0.505% (198 classes).
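The metric itself reduces to a few lines; here probs is an assumed (n_answers, n_authors) score matrix and labels the true author ids:

    import numpy as np

    def top_k_accuracy(probs, labels, k):
        """Fraction of answers whose true author is among the k highest-scoring predictions."""
        top_k = np.argsort(probs, axis=1)[:, -k:]  # indices of the k largest scores per row
        return float(np.mean([labels[i] in top_k[i] for i in range(len(labels))]))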
3.2.1 Traditional Machine Learning Methods
Various classifiers, including random forests, multinomial Naive Bayes, AdaBoost and gradient boosting, were tested on the feature sets. The best performance was achieved using random forests; multinomial Naive Bayes also achieved reasonable accuracy.

Accuracy (%)   Training   Validation   Test
Top-1          45.51      6.71         6.63
Top-5          74.14      17.8         17.91
Top-10         84.70      27.12        27.22

Table 1: Results with Model 1 features for the best model (random forests)
Accuracy (%)   Training   Validation   Test
Top-1          97.44      33.83        32.34
Top-5          98.44      55.85        55.15
Top-10         98.72      65.16        66.27

Table 2: Results with Model 2 features for the best model (random forests)
Random forests capture the co-occurrence of words; with the word-frequency feature vector they thus give a significant improvement over both random guessing and the style-based features.
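A sketch of the classifier comparison with scikit-learn; the hyperparameters and the X/y splits are assumptions, as the report does not list them:

    from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.naive_bayes import MultinomialNB

    classifiers = {
        "Random Forest": RandomForestClassifier(n_estimators=100),
        "Multinomial NB": MultinomialNB(),  # requires non-negative features (counts are)
        "AdaBoost": AdaBoostClassifier(),
        "Gradient Boosting": GradientBoostingClassifier(),
    }
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        print(name, clf.score(X_val, y_val))  # top-1 validation accuracy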
3.2.2 Deep Learning Model - LSTM with mean pooling
Training LSTMs on entire answers is computationally expensive, so each answer was split into smaller chunks and the model was trained to predict the author of each chunk rather than of each answer (a chunking sketch follows the comments below).

Accuracy (%)   Training   Validation   Test
Top-1          51.32      20.26        20.12
Top-5          75.6       37.78        36.96
Top-10         83.98      53.51        53.42

Table 3: Results for the LSTM model with chunk size 50 words

Comments:
1. Random word-vector initialization performed better than pre-trained word vectors from the Wikipedia dataset. The reason may be that, for this specific task, words that are semantically very similar (e.g. synonyms) might still need to be separated in the word vector space by the way different people use them.
2. A direct relation is observed between average answer length and accuracy, indicating that author attribution is easier for longer answers. Authors with an average answer length > 400 words have at least 70% accuracy (see Figure 6).
Figure 6: Correlation between accuracy and answer length

3. The performance is much better than random guessing. It is, however, lower than the random forest model, which was trained on full answers, whereas the LSTM was trained on chunked answers.
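The chunking described above can be as simple as the following sketch (whether a trailing short chunk was kept is not stated; here it is kept):

    def chunk_answer(words, chunk_size=50):
        """Split one tokenized answer into fixed-size chunks; each chunk keeps the author label."""
        return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]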
Although computationally expensive, training on entire answers should improve results compared to training on individual chunks. The model was prone to overfitting to specific words in the training data that disambiguate between the authors, which leads to poor generalization. Applying a softmax over the hidden units at each time step before mean pooling, rather than pooling the raw hidden-layer outputs, might help overcome this, since the softmax normalizes the hidden-layer outputs to sum to 1. Using pre-trained word vectors from the Wikipedia dataset led to worse performance than random initialization; however, our dataset was relatively small, and results should improve if the word vectors are first trained on the author-attribution task over a much larger dataset and then used to initialize the model.
References

[1] Green, R. M., & Sheppard, J. W. (2013). Comparing frequency- and style-based features for Twitter author identification. In The Twenty-Sixth International FLAIRS Conference.
[2] Uzuner, Özlem, & Katz, Boris (2005). A comparative study of language models for book and author recognition. In Natural Language Processing - IJCNLP 2005 (pp. 969-980). Springer Berlin Heidelberg.
[3] Stamatatos, Efstathios (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538-556.
[4] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
[5] Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation.
[6] Graves, Alex (2012). Supervised sequence labeling with recurrent neural networks. Vol. 385. Springer.