“... Our models should ultimately be able to learn abstractions that are not specific to the structure of any language but that can generalise to languages with different properties. ...”
“of any language” ... ex nihilo? Only given a sufficiently large, well-structured corpus, that is, one already analysed according to a grammatical architecture; otherwise, I would say, rather naïve, with quite a few ill-considered reductionisms, which are not so rare in the #linguistique worldview of an Anglo-Saxon.
Build an Abstractive Text Summarizer in 94 Lines of #tensorflow !! (Tutorial 6)
This tutorial is the sixth in a series that will help you build an abstractive text summarizer using TensorFlow; today we build an abstractive text summarizer in TensorFlow in an optimized way. We will go through one of the most optimized models built for this task. The model was written by dongjun-Lee (this is the link to his model); I have used his model on different datasets (in different languages) and it produced truly amazing results, so I would truly like to thank him for his effort. I have made multiple modifications to the model to enable it to run seamlessly on Google Colab (link to my model), and I have hosted the data onto (...)
Beam Search & Attention for text summarization made easy (Tutorial 5)
This tutorial is the fifth in a series that will help you build an abstractive text summarizer using TensorFlow; today we discuss some useful modifications to the core RNN seq2seq model we covered in the last tutorial. These modifications are Beam Search and the Attention model. About the series: this is a series of tutorials that will help you build an abstractive text summarizer using TensorFlow with multiple approaches. You don't need to download the data, nor do you need to run the code locally on your device: the data is hosted on Google Drive (you can simply copy it to your own Google Drive; learn more here), and the code for this series, written in Jupyter notebooks to run on Google Colab, can be found here. We have covered so far (code for this series can be found here) 0. (...)
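To make the beam-search idea concrete, here is a minimal, framework-free sketch (not the tutorial's actual code): instead of greedily taking the single best token at each decoding step, it keeps the `beam_width` highest-scoring partial summaries. The `step_fn` callback, which returns candidate next tokens with their probabilities, is a hypothetical stand-in for the trained decoder.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """step_fn(seq) -> iterable of (token, prob) candidates for the next token."""
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:   # finished hypotheses carry over unchanged
                candidates.append((seq, score))
                continue
            for token, prob in step_fn(seq):
                candidates.append((seq + [token], score + math.log(prob)))
        # keep only the beam_width highest-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams[0][0]
```

With beam_width=1 this degenerates to greedy decoding; widening the beam trades compute for better summaries.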
What is seq2seq for text summarization and why (Tutorial 3)
This tutorial is the third in a series that will help you build an abstractive text summarizer using TensorFlow; today we discuss the main building block of the text summarization task, beginning with the RNN, why we use it rather than a plain neural network, until finally reaching the seq2seq model. About the series: this is a series of tutorials that will help you build an abstractive text summarizer using TensorFlow with multiple approaches. You don't need to download the data, nor do you need to run the code locally on your device: the data is hosted on Google Drive (you can simply copy it to your own Google Drive; learn more here), and the code for this series, written in Jupyter notebooks to run on Google Colab, can be found here. We have covered so far (code for this series can (...)
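As a rough illustration of the encoder-decoder shape the tutorial builds toward (a minimal sketch, not the series' code; vocabulary and layer sizes are arbitrary assumptions), an LSTM encoder compresses the article into a state vector that seeds an LSTM decoder generating the summary token by token:

```python
from tensorflow.keras import layers, Model

vocab_size, embed_dim, hidden_dim = 5000, 128, 256

# Encoder: read the article tokens and compress them into a final state.
enc_in = layers.Input(shape=(None,), name="article_tokens")
enc_emb = layers.Embedding(vocab_size, embed_dim)(enc_in)
_, state_h, state_c = layers.LSTM(hidden_dim, return_state=True)(enc_emb)

# Decoder: generate the summary one token at a time, seeded with that state.
dec_in = layers.Input(shape=(None,), name="summary_tokens")
dec_emb = layers.Embedding(vocab_size, embed_dim)(dec_in)
dec_seq, _, _ = layers.LSTM(hidden_dim, return_sequences=True,
                            return_state=True)(dec_emb,
                                               initial_state=[state_h, state_c])
probs = layers.Dense(vocab_size, activation="softmax")(dec_seq)

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```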
Chars2vec: character-based language model for handling real world texts with spelling errors and…
This paper describes our open-source character-based language model, chars2vec. The model was developed with the Keras library (TensorFlow backend) and is now available for Python 2.7 and 3.0+. Introduction: creating and using word embeddings is the mainstream approach for handling most #nlp tasks. Each word is matched with a numeric vector, which is then used in some way wherever the word appears in text. Some simple models use one-hot word embeddings, or initialise words with random vectors or with integer numbers. The drawback of such models is obvious: these word-vectorisation methods do not represent any semantic connections between words. There are other language models, called (...)
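Usage is compact; the sketch below follows the pattern from the chars2vec README as I recall it (model names such as 'eng_50' through 'eng_300' are the published pretrained sizes), so treat the exact calls as an assumption to verify against the repository:

```python
import chars2vec

# Load a pretrained English model; the suffix is the embedding dimension.
c2v_model = chars2vec.load_model('eng_50')

# Because the model reads characters, misspellings land near the correct form.
words = ['language', 'lanquage', 'natural', 'natuaral']
embeddings = c2v_model.vectorize_words(words)  # array of shape (4, 50)
```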
My Hackathon Experiences
My experiments in the world of hackathons started out of boredom with office work and have turned into a rich collection of experiences. You can find the solutions on my GitHub. RBL Bank hackathon (3 days). Problem statement: build a data science solution with customer data. Offline | Prize pool: 2 lakhs. What I liked: great food arrangement; comfy workplace. Could have been better: we were forced to use the provided API for fetching data; the API didn't work for 1.5 days; the API had transaction data for only one user, covering just a few months, so no machine learning was possible over it; no guidance was provided on what to do with so little data; team presentations were private with the jury. Coinberg hackathon (2 days). Problem statement: cryptocurrency: arbitrage trading | sentiment analysis | portfolio management | trend (...)
Enriching Word Vectors with Subword Information [PAPER SUMMARY]
Enriching Word Vectors with Subword Information [Google Colab Implementation & Paper Summary]. About the authors: this paper was published by a group of researchers from FAIR (Facebook AI Research); the original authors are Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov. The ready-to-run code for this paper is available here on Google Colab. The basic idea behind word vectors: for most Natural Language Processing tasks, such as text classification, text summarization and text generation, we need to perform various computations in order to achieve maximum precision on these tasks, and to perform these computations we need a numerical representation for the components of language: words, sentences and syllables. We assign multi-dimensional (...)
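The paper's model underlies fastText, where each word vector is the sum of its character n-gram vectors, so rare and even unseen words still get sensible embeddings. A minimal sketch using Gensim's FastText implementation (the toy corpus and hyperparameters here are illustrative assumptions):

```python
from gensim.models import FastText

sentences = [["the", "quick", "brown", "fox"],
             ["whereabouts", "currently", "unknown"]]

# Each word is the sum of its character n-grams (min_n..max_n), so
# out-of-vocabulary words can still be embedded from their pieces.
model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=6, epochs=10)

vec = model.wv["foxes"]  # never seen in training, built from its n-grams
```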
What is this?
BPEmb is a collection of pre-trained subword #embeddings in 275 #languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.
Subwords allow guessing the meaning of unknown / out-of-vocabulary words. E.g., the suffix -shire in Melfordshire indicates a location.
Byte-Pair Encoding gives a subword segmentation that is often good enough, without requiring tokenization or morphological analysis. In this case the BPE segmentation might be something like melf ord shire.
Pre-trained byte-pair embeddings work surprisingly well, while requiring no tokenization and being much smaller than alternatives: an 11 MB BPEmb English model matches the results of the 6 GB FastText model in our evaluation.
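A quick sketch of how BPEmb is typically used via its Python package (the calls follow the project README as I recall it, and the exact segmentation of the toy word is an assumption):

```python
from bpemb import BPEmb

# 50-dimensional English embeddings over a 10,000-subword BPE vocabulary
bpemb_en = BPEmb(lang="en", vs=10000, dim=50)

pieces = bpemb_en.encode("melfordshire")   # e.g. ['▁mel', 'ford', 'shire']
vectors = bpemb_en.embed("melfordshire")   # one 50-d vector per subword piece
```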
A Bossy Sort of Voice – gender bias in Harry Potter
Here’s what I take from this data:
Harry’s unique descriptors are often about his observations and thoughts. This makes complete sense: he is the main character in all of the books, and Rowling tells the story largely through him.
Ron’s unique descriptors are often about his behavior. Irritation, bellowing, blurting, grumpy. Because that’s Ron’s personality.
Hermione’s unique descriptors often don’t establish her as the “greatest witch of her age”, more knowledgeable and clever than her exceptional friends. Instead, her unique words slot her squarely as a traditional female character. Especially in Book 7 when she’s owning some really bad witches and wizards.
Chatbots are conversational interfaces meant to assist individuals in interacting with larger organizations. If you’re looking for help while browsing a website, you might end up texting with a chatbot; they’re a more dynamic problem-solver than a simple FAQ webpage. If you’re looking to speak with a customer service representative, a chatbot might try to help you out on its own, or it could direct you to the proper department within an organization, where a human operator can provide assistance. As this technology becomes more and more integrated into daily commerce, it’s worth asking: what barriers does the technology face today, what improvements are being made, and what can we expect of human-robot discourse in the future? Limitations: it won’t take you all too long, under a (...)
OpenAI’s GPT-2: the model, the hype, and the controversy
if we had an open-source model that could generate unlimited human-quality text with a specific message or theme, could that be bad?
I think the answer is yes. It’s true that humans can already write fake news articles, and that governments already recruit thousands of people to write biased comments tailored towards their agenda. But an automated system could: (1) enable bad actors, who don’t have the resources to hire thousands of people, to wage large-scale disinformation campaigns; and (2) drastically increase the scale of the disinformation campaigns already being run by state actors. These campaigns work because humans are heavily influenced by the number of people around them who share a certain viewpoint, even if the viewpoint doesn’t make sense. Increasing the scale should correspondingly increase the influence that governments and companies have over what we believe.
To combat this, we’ll need to start researching detection methods for AI-generated text.
How do Your Favorite Books Compare in a #vr World?
Visualizing semantic relationships using spatial embeddings. Semantic spatial embedding in VR? Exploring content (the traditional way): if you could search your own library of books digitally, how would you do it? A simple text search could work, but that approach typically only answers the question “where are these terms mentioned in my books?” There are some questions that would be hard for a text search to answer: How would I find content where certain keywords may not be present? How do all my books relate to each other? Which of the books that I have not read yet would be a good place to start? These questions try to access the “semantics”, or the “meaning”, within content regardless of the actual words used. Beyond the simple text search: Unicon (my most excellent employer) authorized me to spend (...)
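The core mechanic behind such semantic exploration, independent of the VR layer, is comparing documents by embedding similarity rather than keyword overlap. A minimal sketch using spaCy's bundled word vectors (the model name, toy "books", and scoring are illustrative assumptions, not the article's code):

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # the medium model ships with word vectors

books = {
    "Dune": "desert planet spice politics ecology empire",
    "Neuromancer": "hackers artificial intelligence cyberspace heist",
}
doc_vecs = {title: nlp(text).vector for title, text in books.items()}

def search(query, k=2):
    """Rank books by cosine similarity between the query and each book vector."""
    q = nlp(query).vector
    sims = {t: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8))
            for t, v in doc_vecs.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]

print(search("computers and AI"))  # Neuromancer should rank first
```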
Are U.S. newspapers biased against Palestinians? Analysis of 100,000 headlines in top dailies says, Yes – Mondoweiss
A study released last month by 416Labs, a Toronto-based consulting and research firm, supports the view that mainstream U.S. newspapers consistently portray Palestine in a more negative light than Israel, privilege Israeli sources, and omit key facts helpful to understanding the Israeli occupation, including those expressed by Palestinian sources.
The largest of its kind, the study is based on a sentiment and n-gram analysis of nearly a hundred thousand headlines in five mainstream newspapers dating back to 1967. The newspapers are the top five U.S. dailies: The New York Times, Washington Post, Wall Street Journal, Chicago Tribune, and the Los Angeles Times.
Headlines spanning five decades were put into two datasets, one comprising 17,492 Palestinian-centric headlines, and another comprising 82,102 Israeli-centric headlines. Using Natural Language Processing techniques, authors of the study assessed the degree to which the sentiment of the headlines could be classified as positive, negative, or neutral. They also examined the frequency of using certain words that evoke a particular view or perception.
Key findings of the study are:
Since 1967, use of the word “occupation” has declined by 85% in the Israeli dataset of headlines, and by 65% in the Palestinian dataset;
Since 1967, mentions of Palestinian refugees have declined by an overall 93%;
Israeli sources are nearly 250% more likely to be quoted than Palestinians;
Headlines centering Israel were published four times as often as those centering Palestine;
Words connoting violence, such as “terror,” appear three times as often as the word “occupation” in the Palestinian dataset;
Explicit recognition that Israeli settlements and settlers are illegal rarely appears in either dataset;
Since 1967, mentions of “East Jerusalem,” distinguishing that part of the city occupied by Israel in 1967 from the rest of the city, appeared only a total of 132 times;
The Los Angeles Times has portrayed Palestinians most negatively, followed by The Wall Street Journal, Chicago Tribune, Washington Post, and lastly The New York Times;
Coverage of the conflict has declined dramatically in the second half of the fifty-year period.
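The study's exact tooling isn't specified here, but the kind of headline sentiment classification it describes can be sketched with NLTK's VADER analyzer (the example headlines and the ±0.05 thresholds are illustrative assumptions):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

headlines = [
    "Violence erupts after protest turns deadly",
    "Peace talks resume with cautious optimism",
]
for h in headlines:
    compound = sia.polarity_scores(h)["compound"]  # ranges from -1 to 1
    label = ("positive" if compound > 0.05
             else "negative" if compound < -0.05 else "neutral")
    print(f"{label:>8}  {h}")
```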
Word Embeddings in #nlp and its Applications
Word embeddings are a form of word representation that bridges human understanding of language and that of a machine: distributed representations of text in an n-dimensional space. They are essential for solving most NLP problems. Domain adaptation is a technique that allows Machine Learning and Transfer Learning models to handle niche datasets that are all written in the same language but are still linguistically distinct. For example, legal documents, customer survey responses, and news articles are all unique datasets that need to be analyzed differently. One task in the common spam-filtering problem involves adapting a model from one user (the source distribution) to a new one who receives significantly different emails (the target (...)
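For a concrete sense of what "distributed representation" means in practice, here is a minimal Gensim Word2Vec sketch (the tiny corpus and hyperparameters are illustrative assumptions): every word gets a dense vector learned from its context window, and words appearing in similar contexts end up with nearby vectors.

```python
from gensim.models import Word2Vec

corpus = [
    ["patient", "was", "prescribed", "antibiotics"],
    ["doctor", "examined", "the", "patient"],
    ["court", "upheld", "the", "contract"],
]

# sg=1 selects skip-gram: predict surrounding words from the center word.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

print(model.wv["patient"].shape)          # (100,): a dense vector
print(model.wv.most_similar("patient"))   # nearest words in the embedding space
```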
Summarization With Wine Reviews Using #spacy
“You don’t need a silver fork to eat good food.” Introduction: in this article, I will explore the Wine Reviews dataset, which contains 130k wine reviews, and at the end I will build a simple text summarizer for the reviews; the summaries can also be used as review titles. I will use spaCy as the natural language processing library for this project. Objective of this project: to build a model that can create relevant summaries for wine reviews. The dataset contains above 130k reviews and is hosted on Kaggle. What is text summarization? Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged (...)
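A common spaCy-based approach to this kind of extractive summarization, sketched below (a minimal sketch under assumptions about model name and scoring, not necessarily the article's exact method), scores sentences by the frequencies of their non-stop-word lemmas and keeps the top few:

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def summarize(text, n_sentences=1):
    doc = nlp(text)
    # Score words by frequency, ignoring stop words and punctuation.
    freqs = Counter(tok.lemma_.lower() for tok in doc
                    if tok.is_alpha and not tok.is_stop)
    # Score each sentence as the sum of its word scores.
    scored = {sent: sum(freqs.get(tok.lemma_.lower(), 0) for tok in sent)
              for sent in doc.sents}
    top = sorted(scored, key=scored.get, reverse=True)[:n_sentences]
    # Re-emit the chosen sentences in their original order.
    return " ".join(s.text for s in sorted(top, key=lambda s: s.start))

review = ("A rich, bold red with notes of cherry and plum. The tannins are "
          "soft. Pairs well with steak. Overall a superb value wine.")
print(summarize(review))
```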
Five reasons why #language apps are still behind the curve
Natural Language Processing has undoubtedly made tremendous strides over the last few years. One should expect the next five or ten years to see a Cambrian explosion of sorts in the sector, not just in language-focused technology like #translation and language learning, but across the continuum of modern technology. That being said, we are still a little out of step on some critical issues. This is not to say there are not teams of brilliant and motivated people on a definite road to solving these problems, but the solutions have not hit the public yet. 1. English-centric development: there is evidence of this in a very pervasive application, Google Translate. For anyone familiar with multiple languages, it is unfortunately clear that the majority of translation perfection has been through Anglo (...)
How Text Analytics Will Empower #healthcare Providers
Since Big Data has proven its usefulness in retail, marketing, and other areas, healthcare managers are now thinking about how to reap the benefits of this technology for their own problems. Artificial intelligence in the form of Natural Language Processing (NLP) can improve critical aspects of the patient-doctor relationship and can even go beyond it, simplifying the process of insurance payment. The expected advancement comes from making clinical documentation more accessible through automatic indexing, thus adding searchability. Another growth direction is the automatic voice-to-text feature, which will enable the creation of automated digital records while allowing medical staff to focus their attention on patients instead of writing. This is solving the problem that more than (...)
Review of Microsoft’s Desi Chatbot, Ruuh
In our day-to-day life, it’s easy to take our ability to converse for granted. Thanks to the enormous amount of data we collect every second of our lives through our sensory organs, we possess a huge store of experience that makes our ability to converse fluid, and the emotional factor we add to our conversation is what makes it “human”. We all converse with our friends, families, and (sometimes) strangers, and easily understand the context of the discussion, the emotion, hidden meanings, etc. As simple as it may be for humans, it definitely isn’t the case for computers. With the growing reach of Artificial Intelligence, we are getting surrounded by virtual assistants like Siri, Alexa, Cortana, etc. and tons of #chatbots, and have transitioned into the “Age of Chatbots”. Let me draw the line clearly: A (...)
Various Optimisation Techniques and their Impact on Generation of Word Embeddings
Shameless plug: we are a machine learning data annotation platform that makes it super easy for you to build ML datasets. Just upload data, invite your team and build datasets super quick. Welcome to the third part of this five-part tutorial series on Machine Learning and its applications. Check out Dataturks, a data annotation tool to make your ML life simpler and smoother. Word embeddings are vector representations assigned to words that have similar contextual usages. What is the use of word embeddings, you might ask? Well, if I am talking about Messi, you immediately know that the context is football. How did that happen? Our brains have associative memories, and we associate Messi with football. To achieve the same thing, that is, to group similar words, we use embeddings. Embeddings, (...)
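Two classic training optimisations an article like this typically compares, hierarchical softmax and negative sampling, can be toggled directly in Gensim's Word2Vec; a minimal sketch (the toy corpus and sizes are illustrative assumptions):

```python
from gensim.models import Word2Vec

sentences = [["messi", "plays", "football"],
             ["ronaldo", "plays", "football", "too"]]

# Negative sampling: per training step, update only a few sampled "noise"
# words instead of the full output layer (here, negative=5 noise words).
w2v_neg = Word2Vec(sentences, vector_size=50, min_count=1, sg=1,
                   hs=0, negative=5)

# Hierarchical softmax: replace the flat softmax over the vocabulary with
# a binary Huffman tree, making each update O(log |V|) instead of O(|V|).
w2v_hs = Word2Vec(sentences, vector_size=50, min_count=1, sg=1,
                  hs=1, negative=0)
```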
Introducing the First Natural Language to #sql #api
Talk to your data! You can view the API documentation here: http://docs.dhignite.com/. What if, instead of running complex SQL scripts, you could simply ask your database a question? What were my sales yesterday? What were my top-selling products in June? This natural search capability has become more common over the past year as companies such as ThoughtSpot, Salesforce, and Tableau all develop similar technologies. This trend of “data democratization” is forcing us to leverage existing technology in new ways. The idea that data should be accessible to the average end user, not confined to data analysts or forced through a complex queuing system where time and convenience often rule the day, is our new expectation. Data should be accessible beyond the dashboards and beyond analysts. That (...)
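To illustrate the interaction pattern only: the sketch below is a hypothetical request shape, not the dhignite API (its real endpoint, parameters, and authentication are defined in the documentation linked above). Conceptually, a natural-language question goes in and a SQL statement or result set comes back.

```python
import requests

# Hypothetical endpoint and payload, for illustration only; consult the
# linked dhignite docs for the real API contract.
response = requests.post(
    "https://api.example.com/v1/nl2sql",          # placeholder URL
    json={"question": "What were my sales yesterday?"},
    timeout=10,
)
print(response.json())  # e.g. {"sql": "SELECT SUM(amount) FROM sales WHERE ..."}
```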
Understanding Word Embeddings
Understanding how different words are related based on the context in which they are used is an easy task for us humans. Take articles as an example: we read many articles on different topics, and in almost every article where an author is trying to teach a new concept, the author uses examples already known to the reader. In a similar way, computers also need a way to learn about a topic and to understand how different words are related. Let me begin with the concept of language, these amazing different languages we have; we use them to communicate with each other and to share different ideas. But how do we explain a language in a better way? Some time back I was reading the book “Sapiens”, where the (...)