technology:natural language processing

  • WordNet — A Lexical Database for English

    « WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing. » #knowledge #graph #thesaurus #text
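    The kind of navigation WordNet enables can be sketched with a tiny hand-built hypernym chain; this toy graph is illustrative only, not real WordNet data (which NLTK exposes as nltk.corpus.wordnet):

```python
# Toy sketch of WordNet-style navigation: a tiny hand-built hypernym graph.
# The real WordNet has ~117k synsets linked by many relation types.
hypernyms = {
    "dog.n.01": "canine.n.01",
    "canine.n.01": "carnivore.n.01",
    "carnivore.n.01": "mammal.n.01",
    "mammal.n.01": "animal.n.01",
}

def hypernym_path(synset):
    """Walk the hypernym (is-a) chain from a synset up to a root concept."""
    path = [synset]
    while synset in hypernyms:
        synset = hypernyms[synset]
        path.append(synset)
    return path

print(hypernym_path("dog.n.01"))
```

    Traversals like this are what make WordNet useful for tasks such as measuring semantic similarity between words.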

  • XDL Framework: Delivering powerful Performance for Large-scale Deep Learning Applications

    The Alibaba tech team open-sourced its self-developed deep learning framework, which goes where others have failed.

    Deep learning AI technologies have brought remarkable breakthroughs to fields including speech recognition, computer vision, and natural language processing, with many of these developments benefiting from the prevalence of open-source deep learning frameworks like TensorFlow, PyTorch, and MXNet. Nevertheless, efforts to bring deep learning to large-scale, industry-level scenarios like advertising, online recommendation, and search have largely failed due to the inadequacy of available frameworks. Whereas most open-source frameworks are designed for low-dimensional, continuous data such as images and speech, a majority of Internet applications deal with (...)

    #artificial-intelligence #data-analysis #machine-learning #deep-learning #hackernoon-top-story

  • Enriching Word Vectors with Subword Information [PAPER SUMMARY]

    Enriching Word Vectors with Subword Information [Google Colab Implementation & Paper Summary]

    About the authors: this paper was published by a group of researchers from FAIR (Facebook AI Research): Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov. Ready-to-run code for the paper is available on Google Colab.

    The basic idea behind word vectors: for most natural language processing tasks, like text classification, text summarization and text generation, we need to perform various computations in order to achieve maximum precision. To perform these computations, we need a numerical representation for the various components of language, such as words, sentences and syllables. We assign multi-dimensional (...)

    #deep-learning #artificial-intelligence #nlp #machine-learning #data-science
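    The paper's core idea, representing a word by its character n-grams with boundary markers < and >, can be sketched in a few lines; the function name here is mine, not from the paper's code:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, as in the subword-embedding paper."""
    w = "<" + word + ">"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    grams.add(w)  # the full word is also kept as its own feature
    return grams

# The paper's own example: "where" with n=3 yields <wh, whe, her, ere, re>
print(sorted(char_ngrams("where", 3, 3)))
```

    A word's vector is then the sum of the vectors of its n-grams, which is what lets the model produce embeddings for words it never saw in training.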

  • BPEmb : Subword Embeddings in 275 Languages
    Benjamin Heinzerling and Michael Strube

    What is this?

    BPEmb is a collection of pre-trained subword #embeddings in 275 #languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

    Subwords allow guessing the meaning of unknown / out-of-vocabulary words. E.g., the suffix -shire in Melfordshire indicates a location.
    Byte-Pair Encoding gives a subword segmentation that is often good enough, without requiring tokenization or morphological analysis. In this case the BPE segmentation might be something like melf ord shire.
    Pre-trained byte-pair embeddings work surprisingly well, while requiring no tokenization and being much smaller than alternatives: an 11 MB BPEmb English model matches the results of the 6 GB FastText model in our evaluation.
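    The greedy merge procedure behind such a segmentation can be sketched as follows; the merge table here is hand-picked for illustration, whereas a real BPEmb model ships the merge operations it learned from Wikipedia:

```python
def bpe_segment(word, merges):
    """Greedily apply learned BPE merge operations to segment a word (toy sketch)."""
    symbols = list(word)
    for a, b in merges:  # merges are applied in the order they were learned
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Hypothetical merge table chosen to reproduce the example above.
merges = [("s", "h"), ("sh", "i"), ("shi", "r"), ("shir", "e"),
          ("o", "r"), ("or", "d"), ("m", "e"), ("me", "l"), ("mel", "f")]
print(bpe_segment("melfordshire", merges))
```

    Because every word decomposes into subwords from a fixed vocabulary, the model never encounters an out-of-vocabulary token.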


  • What Kagglers are using for Text Classification

    Advanced NLP techniques for deep learning

    With the problem of image classification more or less solved by deep learning, text classification is the next developing theme. For those who don't know, text classification is a common task in natural language processing which transforms a sequence of text of indefinite length into a category of text. How could you use that?

    To find the sentiment of a review;
    To find toxic comments on a platform like Facebook;
    To find insincere questions on Quora (a current ongoing competition on Kaggle);
    To find fake reviews on websites;
    To predict whether a text advert will get clicked.

    And much more. The whole internet is filled with text, and categorizing that information algorithmically will give us incremental benefits, to say the least, in the field of AI. Here (...)

    #machine-learning #artificial-intelligence #ai #data-science #deep-learning

  • Are U.S. newspapers biased against Palestinians? Analysis of 100,000 headlines in top dailies says, Yes – Mondoweiss

    A study released last month by 416Labs, a Toronto-based consulting and research firm, supports the view that mainstream U.S. newspapers consistently portray Palestine in a more negative light than Israel, privilege Israeli sources, and omit key facts helpful to understanding the Israeli occupation, including those expressed by Palestinian sources.

    The largest study of its kind, it is based on a sentiment and n-gram analysis of nearly one hundred thousand headlines in five mainstream newspapers dating back to 1967. The newspapers are the top five U.S. dailies: The New York Times, The Washington Post, The Wall Street Journal, the Chicago Tribune, and the Los Angeles Times.

    Headlines spanning five decades were put into two datasets, one comprising 17,492 Palestinian-centric headlines and another comprising 82,102 Israeli-centric headlines. Using natural language processing techniques, the study's authors assessed whether the sentiment of each headline could be classified as positive, negative, or neutral. They also examined how frequently the headlines used certain words that evoke a particular view or perception.

    Key findings of the study are:

    Since 1967, use of the word “occupation” has declined by 85% in the Israeli dataset of headlines, and by 65% in the Palestinian dataset;
    Since 1967, mentions of Palestinian refugees have declined by an overall 93%;
    Israeli sources are nearly 250% more likely to be quoted than Palestinian sources;
    Headlines centering Israel were published four times more often than those centering Palestine;
    Words connoting violence, such as “terror,” appear three times as often as the word “occupation” in the Palestinian dataset;
    Explicit recognition that Israeli settlements and settlers are illegal rarely appears in either dataset;
    Since 1967, mentions of “East Jerusalem,” distinguishing that part of the city occupied by Israel in 1967 from the rest of the city, appeared only a total of 132 times;
    The Los Angeles Times has portrayed Palestinians most negatively, followed by The Wall Street Journal, Chicago Tribune, Washington Post, and lastly The New York Times;
    Coverage of the conflict has declined dramatically in the second half of the fifty-year period.
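    The n-gram side of such an analysis amounts to counting contiguous word sequences across headlines; a minimal sketch with invented sample headlines (the study itself used nearly 100,000 real ones):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample headlines for illustration only.
headlines = [
    "israeli forces enter west bank",
    "palestinian refugees face crisis",
    "west bank settlement expands",
]
counts = Counter()
for h in headlines:
    counts.update(ngrams(h.split(), 2))  # bigram frequencies across the corpus
print(counts.most_common(1))
```

    Tracking how the frequency of specific n-grams (e.g. “occupation” or “East Jerusalem”) changes over time is what yields findings like those listed above.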

  • Text summarizer using deep learning made easy

    In this series we will discuss a truly exciting natural language processing topic: using deep learning techniques to summarize text. The code for this series is open source and is provided in Jupyter notebook format, so it can run on Google Colab without the need for a powerful GPU. All the data is open source too, and you don't have to download it locally: you can connect Google Colab with Google Drive and put your data directly onto Google Drive. Read this blog to learn more about using Google Colab with Google Drive.

    To summarize text you have two main approaches (I truly like how it is explained in this blog). The first is the extractive method, which chooses specific main words from the input to generate the output. This model tends to work, but (...)

    #machine-learning #seq2seq #text-summarization #artificial-intelligence #ai
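    The extractive approach mentioned above can be sketched as a toy frequency-based sentence ranker; this is a simplification for illustration, not the notebook's actual model:

```python
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Rank sentences by average word frequency and keep the top ones (toy extractive method)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freqs = Counter(w.strip(".").lower() for w in text.split())

    def score(sentence):
        words = sentence.lower().split()
        return sum(freqs[w] for w in words) / len(words)

    return sorted(sentences, key=score, reverse=True)[:n_sentences]

sample = ("Deep learning models summarize long text well. "
          "Cats sleep a lot. Deep learning needs data.")
print(extractive_summary(sample))
```

    The abstractive (seq2seq) method discussed in the series instead generates new sentences, which is much harder but produces more natural summaries.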

  • Summarization With Wine Reviews Using #spacy

    “You don’t need a silver fork to eat good food.”

    Introduction: Wine Reviews

    In this article, I will explore the Wine Reviews dataset. It contains 130k reviews and is hosted on Kaggle. At the end of this article, I will build a simple text summarizer for the reviews; the summarized reviews can also be used as review titles. I will use spaCy as the natural language processing library for this project.

    Objective of This Project

    The objective of this project is to build a model that can create relevant summaries for reviews written on wine.

    What Is Text Summarization?

    Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged (...)

    #nltk #machine-learning #nlp #hackernoon

  • Five reasons why #language apps are still behind the curve

    Natural language processing has undoubtedly made tremendous strides over the last few years. One should expect the next five or ten years to see a Cambrian explosion of sorts in the sector, not just in language-focused technology like #translation and language learning, but across the continuum of modern technology. That being said, we are still a little out of step on some critical issues. This is not to say there are not teams of brilliant and motivated people on a definite road to solving these problems, but the solutions have not hit the public yet.

    1. English-Centric Development

    There is evidence of this in a very pervasive application, Google Translate. For anyone familiar with multiple languages, it's unfortunately clear the majority of translation perfection has been through Anglo (...)

    #naturallanguageprocessing #nlp #machine-translation

  • How Text Analytics Will Empower #healthcare Providers

    Since Big Data has proven its usability in retail, marketing, and other areas, healthcare managers are now thinking about how to reap the benefits of this technology for their own problems. Artificial intelligence in the form of natural language processing (NLP) can improve critical aspects of the patient-doctor relationship and can even go beyond this, simplifying the process of insurance payment.

    The expected advancement comes from making clinical documentation more accessible through automatic indexing, thus adding searchability. Another growth direction is the automatic voice-to-text feature, which will enable the creation of automated digital records while allowing medical staff to focus their attention on patients instead of writing. This is solving the problem that more than (...)

    #nlp #text-analytics #big-data #text-analysis

  • Text Summarization Using #keras Models

    Learn how to summarize text in this article by Rajdeep Dua, who currently leads the developer relations team at Salesforce India, and Manpreet Singh Ghotra, who is currently working at Salesforce developing a machine learning platform/APIs.

    Text summarization is a method in natural language processing (NLP) for generating a short and precise summary of a reference document. Producing a summary of a large document manually is a very difficult task, and summarization of text using machine learning techniques is still an active research topic. Before discussing text summarization and how we do it, here is a definition of summary: a summary is a text output generated from one or more texts that conveys relevant information from the original text in a shorter form. The goal of (...)

    #artificial-intelligence #machine-learning #keras-models #deep-learning

  • Amazon, AI and Medical Records: Do the Benefits Outweigh the Risks? - Knowledge Wharton

    Last month, Amazon unveiled a service based on AI and machine-learning technology that could comb through patient medical records and extract valuable insights. It was seen as a game changer that could alleviate the administrative burden of doctors, introduce new treatments, empower patients and potentially lower health care costs. But it also carries risks to patient data privacy that calls for appropriate regulation, according to Wharton and other experts.

    Branded Comprehend Medical, the Amazon Web Services offering aims “to understand and analyze the information that is often trapped in free-form, unstructured medical text, such as hospital admission notes or patient medical histories.” Essentially, it is a natural language processing service that pores through medical text for insights into disease conditions, medications and treatment outcomes from patient notes and other electronic health records.

    The new service is Amazon’s latest foray into the health care sector. In June, the company paid $1 billion to buy online pharmacy PillPack, a Boston-based startup that specializes in packing monthly supplies of medicines to chronically ill patients. In January, Amazon teamed up with Berkshire Hathaway and JPMorgan Chase to form a health care alliance that aims to lower costs and improve the quality of medical care for their employees.

    “Health care, like everything else, is becoming more of an information-based industry, and data is the gold standard — and Amazon knows as well as anyone how to handle and analyze data,” said Robert Field, Wharton lecturer in health care management who is also professor of health management and policy at Drexel University. “It’s a $3.5 trillion industry and 18% of our economy, so who wouldn’t want a piece of that?”

    AI offers “enormous” promise when it comes to bringing in new and improved treatments for patient conditions, such as in the area of radiology, added Hempstead. Machine learning also potentially enables the continual improvement of treatment models, such as identifying people who could participate in clinical trials. Moreover, Amazon’s service could “empower a consumer to be more in charge of their own health and maybe be more active consumer of medical services that might be beneficial to their health,” she said.

    On the flip side, it also could enable insurers to refuse to enroll patients that they might see as too risky, Hempstead said. Insurers are already accessing medical data and using technology in pricing their products for specific markets, and the Amazon service might make it easier for them to have access to such data, she noted.

    #Santé_publique #Données_médicales #Amazon #Intelligence_artificielle

  • #Art, #Information, and #Mapping

    Pratt Manhattan Center, 144 West 14th Street, New York, NY, room 213, adjacent to the gallery

    In conjunction with the exhibition You Are Here NYC: Art, Information, and Mapping which presents data-based maps of NYC, by artists and information designers, that address an increasingly relevant question: in what forms can data visualization become art, and how can artists make data visible? Curated by Katharine Harmon, author of You Are Here–NYC: Mapping the Soul of the City, with Jessie Braden.

    Commentary:

    They all have things to say about the intersection of data, cartography, design and information: the how, the why, the notion of art in all of this. It casts a wide net and can go deep. The two standout speakers are the Canadian Jer Thorp and MIT's Sarah Williams. Ekene Ijeoma's interventions are in a more poetic vein, and Doug McCune is interesting insofar as he comes from a world of pure data and is now learning to cast bronze statues.


    Doug McCune
    Data Artist

    I’m an Oakland artist who embraces data exploration and map making in an attempt to come to terms with the chaos of urban environments. I experiment heavily with 3D printing and laser cutting to bring digital forms into physical space. I’m a programmer by trade, an amateur cartographer, and a big believer in using data to understand the world.


    Ekene Ijeoma

    is a Nigerian–American artist, designer, fellow at The Kennedy Center and Urban Design Forum and visiting professor at the School of the Art Institute of Chicago.


    Data & Art Miscellanea from Jer Thorp

    When text becomes data it opens up a phenomenal amount of possibility for insight and creative exploration. The problem is that most natural language processing (NLP) tools are hard to use unless you have a good foundation in programming to begin with. We use a lot of NLP in our work at The Office for Creative Research and I’ve often wondered what it would mean to make a language tool designed for open-ended exploration.

    He’s featured in GOOD’s GOOD 100 "tackling pressing global issues,” Adweek’s Creative 100 “visual artist whose imagination and intellect will inspire you,” and GDUSA’s People to Watch “who embody the spirit of the creative community.”


    Sarah Williams

    is currently an Associate Professor of Technology and Urban Planning. She also is Director of the Civic Data Design Lab at MIT’s School of Architecture and Planning. The Civic Data Design Lab works with data, maps, and mobile technologies to develop interactive design and communication strategies that expose urban policy issues to broader audiences.


  • The Biggest Misconceptions about Artificial Intelligence

    Knowledge@Wharton: Interest in artificial intelligence has picked up dramatically in recent times. What is driving this hype? What are some of the biggest prevailing misconceptions about AI and how would you separate the hype from reality?

    Apoorv Saxena: There are multiple factors driving strong interest in AI recently. First is significant gains in dealing with long-standing problems in AI. These are mostly problems of image and speech understanding. For example, now computers are able to transcribe human speech better than humans. Understanding speech has been worked on for almost 20 to 30 years, and only recently have we seen significant gains in that area. The same thing is true of image understanding, and also of specific parts of human language understanding such as translation.

    Such progress has been made possible by applying an old technique called deep learning and running it on highly distributed and scalable computing infrastructure. This, combined with the availability of large amounts of data to train these algorithms and of easy-to-use tools to build AI models, is the major factor driving interest in AI.

    It is natural for people to project the recent successes in specific domains into the future. Some are even projecting the present into domains where deep learning has not been very effective, and that creates a lot of misconceptions and also hype. AI is still pretty bad at learning new concepts and at extending that learning to new contexts.

    For example, AI systems still require a tremendous amount of data to train. Humans do not need to look at 40,000 images of cats to identify a cat. A human child can look at two cats and figure out what a cat and a dog are, and distinguish between them. So today’s AI systems are nowhere close to replicating how the human mind learns. That will be a challenge for the foreseeable future.

    Even though everything here is clean, the last sentence is striking: « That will be a challenge for the foreseeable future ». It is not a matter of giving up on the understanding/creation of concepts by computers, but of allowing time to get there tomorrow. In World Without Mind, Franklin Foer writes at length about the desire of Google's leaders to build a computer that would be an improved human brain. But what about emotions, feelings, and our physical relationship to the world?

    As I mentioned, in narrow domains such as speech recognition AI is now more sophisticated than the best humans, while in more general domains that require reasoning, context understanding and goal seeking, AI can’t even compete with a five-year-old child. I think AI systems have still not figured out how to do unsupervised learning well, how to train on a very limited amount of data, or how to train without a lot of human intervention. That is going to be the main thing that remains difficult. None of the recent research has shown a lot of progress here.

    Knowledge@Wharton: In addition to machine learning, you also referred a couple of times to deep learning. For many of our readers who are not experts in AI, could you explain how deep learning differs from machine learning? What are some of the biggest breakthroughs in deep learning?

    Saxena: Machine learning is much broader than deep learning. Machine learning is essentially a computer learning patterns from data and using the learned patterns to make predictions on new data. Deep learning is a specific machine learning technique.

    Deep learning is modeled on how human brains supposedly learn, using neural networks: layered networks of neurons that learn patterns from data and make predictions. Just as humans use different levels of conceptualization to understand a complex problem, each layer of neurons abstracts out a specific feature or concept in a hierarchical way to understand complex patterns. The beauty of deep learning is that, unlike other machine learning techniques whose prediction performance plateaus when you feed in more training data, deep learning performance continues to improve with more data. Deep learning has also been applied to very different sets of problems and shown good performance, which is typically not possible with other techniques. All this makes deep learning special, especially for problems where you can easily throw in more data and computing power.
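    The layered abstraction described here can be illustrated with a bare-bones forward pass through two stacked fully connected layers; this is a pure-Python sketch with made-up weights, not a real trained network:

```python
import math

def dense_layer(inputs, weights, biases):
    """One fully connected layer with a sigmoid nonlinearity (pure-Python sketch)."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(x * w for x, w in zip(inputs, w_row)) + b
        outputs.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid squashes to (0, 1)
    return outputs

# Two stacked layers: each layer re-represents the previous layer's output.
x = [0.5, -1.0]
h = dense_layer(x, weights=[[1.0, 0.5], [-0.5, 1.0]], biases=[0.0, 0.1])
y = dense_layer(h, weights=[[1.0, 1.0]], biases=[-1.0])
print(y)
```

    Training consists of adjusting the weights and biases so the final output matches known examples, which is where the large data requirements come from.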

    Knowledge@Wharton: The other area of AI that gets a lot of attention is natural language processing, often involving intelligent assistants, like Siri from Apple, Alexa from Amazon, or Cortana from Microsoft. How are chatbots evolving, and what is the future of the chatbot?

    Saxena: This is a huge area of investment for all of the big players, as you mentioned. This is generating a lot of interest, for two reasons. It is the most natural way for people to interact with machines, by just talking to them and the machines understanding. This has led to a fundamental shift in how computers and humans interact. Almost everybody believes this will be the next big thing.

    Still, early versions of this technology have been very disappointing. The reason is that natural language understanding or processing is extremely tough. You can’t use just one technique or deep learning model, for example, as you can for image understanding or speech understanding and solve everything. Natural language understanding inherently is different. Understanding natural language or conversation requires huge amounts of human knowledge and background knowledge. Because there’s so much context associated with language, unless you teach your agent all of the human knowledge, it falls short in understanding even basic stuff.

    From competition to the age of vectorialism:

    Knowledge@Wharton: That sounds incredible. Now, a number of big companies are active in AI — especially Google, Microsoft, Amazon, Apple in the U.S., or in China you have Baidu, Alibaba and Tencent. What opportunities exist in AI for startups and smaller companies? How can they add value? How do you see them fitting into the broader AI ecosystem?

    Saxena: I see value for both big and small companies. A lot of the investments by the big players in this space are in building platforms where others can build AI applications. Almost every player in the AI space, including Google, has created platforms on which others can build applications. This is similar to what they did for Android or mobile platforms. Once the platform is built, others can build applications. So clearly that is where the focus is. Clearly there is a big opportunity for startups to build applications using some of the open source tools created by these big players.

    The second area where startups will continue to play is with what we call vertical domains. So a big part of the advances in AI will come through a combination of good algorithms with proprietary data. Even though the Googles of the world and other big players have some of the best engineering talent and also the algorithms, they don’t have data. So for example, a company that has proprietary health care data can build a health care AI startup and compete with the big players. The same thing is true of industries such as finance or retail.

    #Intelligence_artificielle #vectorialisme #deep_learning #Google

  • [1710.10777] Understanding Hidden Memories of Recurrent Neural Networks

    Recurrent neural networks (RNNs) have been successfully applied to various natural language processing (NLP) tasks and achieved better results than conventional methods. However, the lack of understanding of the mechanisms behind their effectiveness limits further improvements on their architectures. In this paper, we present a visual analytics method for understanding and comparing RNN models for NLP tasks. We propose a technique to explain the function of individual hidden state units based on their expected response to input texts. We then co-cluster hidden state units and words based on the expected response and visualize co-clustering results as memory chips and word clouds to provide more structured knowledge on RNNs’ hidden states. We also propose a glyph-based sequence visualization based on aggregate information to analyze the behavior of an RNN’s hidden state at the sentence-level. The usability and effectiveness of our method are demonstrated through case studies and reviews from domain experts.

    #langues #langage #mots #terminologie #grammaire

  • Narrative perspective of Neural Networks #8

    Algolit explores neural networks to see how their process can be made legible, visible, understandable. We follow up on the Deep Learning and Natural Language Processing course from Stanford University, by Richard Socher (please make sure you watch the first five videos of the course). Impressions and algoliterary experiments are shared throughout the day. Find notes of previous ones: (...)

    #Algolit / #Workshop, #Netnative_literature, #Algorithm

  • Narrative perspective of Neural Networks #7

    Algolit explores neural networks to see how their process can be made legible, visible, understandable. We follow up on the Deep Learning and Natural Language Processing course from Stanford University, by Richard Socher (please make sure you watch the first five videos of the course). Impressions and algoliterary experiments are shared throughout the day. Find notes of previous ones: (...)

    #Algolit / #Workshop, #Netnative_literature, #Algorithm

  • CASM : Centre for the Analytics of Social media (UK)

    It produces new political, social and policy insight and understanding through social media research, combining new technologies with the social sciences, including study of the dark net.

    Research areas: data dashboards (big data), software development for early detection of emerging events, e-health, security, ethics, privacy, digital democracy, and public responses to announcements, speeches and events.

    CASM, in collaboration with a wide network of experts and leaders in the field, combines natural language processing, machine learning, statistics, data visualization, grounded theory and ethnography in order to develop large-scale social media analysis as a valid instrument of research that is ethical, reliable, and usable.

  • Extensible system for analysing and manipulating natural language (#Javascript)

    Rather than being a do-all library for Natural Language Processing (such as NLTK or OpenNLP), retext aims to be useful for more practical use cases (such as censoring profane words or decoding emoticons, but the possibilities are endless) instead of more academic goals (research purposes). retext is inherently modular—it uses plugins (similar to rework for CSS) instead of providing everything out of the box (such as Natural). This makes retext a viable tool for use on the web.

    With plugins for sentiment analysis or syllable counting (all of it in English).

    via @mattdesl (twitter)

    #texte #analyse #langage_naturel

  • Embedly makes your content more engaging and easier to share | Embedly

    Get the world’s most powerful tool for embedding videos, photos, and rich media into websites.
    Natural language processing and text analysis to retrieve elements and text from articles for smarter use in websites and apps

    Use the elements—colors, text, keywords, and entities—that you want from articles. Discard the rest automatically.
    Easy image processing to resize and optimize images for better, faster display on websites and apps

    Make the images you use look great—and display quickly—on any screen, every time.

    Card Generator
    Generate an interactive and responsive Card to post on your site.

    Embed Button
    Make your content easier to share and embed on other sites with the Embed Button.

    The Bookmarklet
    Quickly generate a Card from any web page with the bookmarklet.

    I wonder how oEmbed compares to a web service like this.

  • Coursera

    Education for Everyone.

    We offer courses from the top universities, for free.
    Learn from world-class professors, watch high quality lectures, achieve mastery via interactive exercises, and collaborate with a global community of students.

    For example:
    Natural Language Processing

    In this class, you will learn fundamental algorithms and mathematical models for processing natural language, and how these can be used to solve practical problems.

    #éducation #programmation