technology:data mining

  • When a small piece of my code predicted the fate of US Presidential Elections.

    I completed my Master’s degree back in 2015 with an emphasis on Opinion Mining (Sentiment Analysis; if you haven’t heard of it, sit tight, it’s going to be interesting). I then started working as a web developer and lost touch with Data Mining, the field of my thesis. One fine Saturday evening, I was relaxing on my couch, following the buzz around the 2016 US Presidential Election campaign on the internet (the election itself was yet to be held). I wasn’t much into world politics at the time, but the topic itself intrigued me. I sat for a while and thought: there is so much data on the internet about the campaigns and pre-election activity that I could make use of, and then dive deeper into it to figure out what exactly was happening in the larger picture. So my path (...)

    #artificial-intelligence #trump #machine-learning #sentiment-analysis #data-mining

  • 10 Essential Computer Skills for Data Mining

    Data mining extracts valid information from gigantic data sets and transforms it into potentially useful and ultimately understandable patterns for further use. It involves not only data processing and management but also the intelligent methods of machine learning, statistics, and database systems, as Wikipedia defines it. Data mining is also a key technology in the field of data science, which Glassdoor’s list of the 50 Best Jobs in America ranked as the №1 best job in the USA from 2016 to 2018. Moreover, compared with 1,700 listed job openings in 2016, the number of openings has increased significantly, by 160% in two years. It can be foreseen that the demand for data scientists, and for people with data analysis skills, will keep (...)

    #data-mining-skills #computer-skills #data-mining #big-data #data-science

  • Facebook and NYU School of Medicine launch research collaboration to improve MRI – Facebook Code

    How lovely, the flowery language of public-relations experts...

    Using AI, it may be possible to capture less data and therefore scan faster, while preserving or even enhancing the rich information content of magnetic resonance images. The key is to train artificial neural networks to recognize the underlying structure of the images in order to fill in views omitted from the accelerated scan. This approach is similar to how humans process sensory information. When we experience the world, our brains often receive an incomplete picture — as in the case of obscured or dimly lit objects — that we need to turn into actionable information. Early work performed at NYU School of Medicine shows that artificial neural networks can accomplish a similar task, generating high-quality images from far less data than was previously thought to be necessary.

    In practice, reconstructing images from partial information poses an exceedingly hard problem. Neural networks must be able to effectively bridge the gaps in scanning data without sacrificing accuracy. A few missing or incorrectly modeled pixels could mean the difference between an all-clear scan and one in which radiologists find a torn ligament or a possible tumor. Conversely, capturing previously inaccessible information in an image can quite literally save lives.
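
    The accelerated-scan idea can be made concrete with a toy sketch. The numbers below are invented for illustration: an image stands in for an MRI slice, sampling every 4th row of its frequency-domain representation (k-space) mimics a 4x-accelerated scan, and a naive zero-filled inverse FFT plays the role that a trained neural network would fill far better in the actual fastMRI work.

```python
import numpy as np

# Toy illustration of accelerated MRI: the scanner samples the image in
# the frequency domain (k-space); skipping rows shortens the scan but
# leaves gaps that a reconstruction method must bridge.
rng = np.random.default_rng(0)
image = rng.random((64, 64))          # stand-in for a ground-truth slice

kspace = np.fft.fft2(image)           # fully sampled k-space
mask = np.zeros_like(kspace, dtype=bool)
mask[::4, :] = True                   # keep every 4th row: 4x "acceleration"

undersampled = np.where(mask, kspace, 0)
zero_filled = np.abs(np.fft.ifft2(undersampled))  # naive reconstruction

# The zero-filled result is aliased and blurry; in the fastMRI setting a
# trained network replaces this step to recover the missing detail.
error = np.mean((zero_filled - image) ** 2)
print(f"zero-filled reconstruction MSE: {error:.4f}")
```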

    Advancing the AI and medical communities
    Unlike other AI-related projects, which use medical images as a starting point and then attempt to derive anatomical or diagnostic information from them (in emulation of human observers), this collaboration focuses on applying the strengths of machine learning to reconstruct the most high-value images in entirely new ways. With the goal of radically changing the way medical images are acquired in the first place, our aim is not simply enhanced data mining with AI, but rather the generation of fundamentally new capabilities for medical visualization to benefit human health.

    In the interest of advancing the state of the art in medical imaging as quickly as possible, we plan to open-source this work to allow the wider research community to build on our developments. As the project progresses, Facebook will share the AI models, baselines, and evaluation metrics associated with this research, and NYU School of Medicine will open-source the image data set. This will help ensure the work’s reproducibility and accelerate adoption of resulting methods in clinical practice.

    What’s next
    Though this project will initially focus on MRI technology, its long-term impact could extend to many other medical imaging applications. For example, the improvements afforded by AI have the potential to revolutionize CT scans as well. Advanced image reconstruction might enable ultra-low-dose CT scans suitable for vulnerable populations, such as pediatric patients. Such improvements would not only help transform the experience and effectiveness of medical imaging, but they’d also help equalize access to an indispensable element of medical care.

    We believe the fastMRI project will demonstrate how domain-specific experts from different fields and industries can work together to produce the kind of open research that will make a far-reaching and lasting positive impact in the world.

    #Resonance_magnetique #Intelligence_artificielle #Facebook #Neuromarketing

  • For Sale: Survey #Data on Millions of High School Students - The New York Times

    Consumers’ personal details are collected in countless ways these days, from Instagram clicks, dating profiles and fitness apps. While many of those efforts are aimed at adults, the recruiting methods for some student recognition programs give a peek into the widespread and opaque world of data mining for millions of minors — and how students’ profiles may be used to target them for educational and noneducational offers. MyCollegeOptions, for instance, says it may give student loan services, test prep and other companies access to student data.

    #données_personnelles #à_vendre

  • Cambridge Analytica demonstrates that Facebook needs to give researchers more access.

    In a 2013 paper, psychologist Michal Kosinski and collaborators from the University of Cambridge in the United Kingdom warned that “the predictability of individual attributes from digital records of behavior may have considerable negative implications,” posing a threat to “well-being, freedom, or even life.” This warning followed their striking findings about how accurately a person’s attributes (from political leanings to intelligence to sexual orientation) could be inferred from nothing but their Facebook likes. Kosinski and his colleagues had gained access to this information through Facebook users’ voluntary participation, recruited by offering them the results of a personality quiz, a method that can drive viral engagement. Of course, one person’s warning may be another’s inspiration.

    Kosinski’s original research really was an important scientific finding. The paper has been cited more than 1,000 times and the dataset has spawned many other studies. But the potential uses for it go far beyond academic research. In the past few days, the Guardian and the New York Times have published a number of new stories about Cambridge Analytica, the data mining and analytics firm best known for aiding President Trump’s campaign and the pro-Brexit campaign. This trove of reporting shows how Cambridge Analytica allegedly relied on the psychologist Aleksandr Kogan (who also goes by Aleksandr Spectre), a colleague of the original researchers at Cambridge, to gain access to profiles of around 50 million Facebook users.

    According to the Guardian’s and New York Times’ reporting, the data that was used to build these models came from a rough duplicate of the personality quiz method used legitimately for scientific research. Kogan, a lecturer in another department, reportedly approached Kosinski and his Cambridge colleagues in the Psychometric Centre to discuss commercializing the research. To his credit, Kosinski declined. However, Kogan built an app named thisismydigitallife for his own startup, Global Science Research, which collected the same sorts of data. GSR paid Mechanical Turk workers (contrary to the terms of Mechanical Turk) to take a psychological quiz and provide access to their Facebook profiles. In 2014, under a contract with Cambridge Analytica’s parent company, SCL, that data was harvested and used to build a model of 50 million U.S. Facebook users that allegedly included 5,000 data points on each user.

    So if the Facebook API allowed Kogan access to this data, what did he do wrong? This is where things get murky, but bear with us. It appears that Kogan deceitfully used his dual roles as a researcher and an entrepreneur to move data between an academic context and a commercial context, although the exact method is unclear. The Guardian claims that Kogan “had a licence from Facebook to collect profile data, but it was for research purposes only” and “[Kogan’s] permission from Facebook to harvest profiles in large quantities was specifically restricted to academic use.” Transferring the data this way would already be a violation of the terms of Facebook’s API policies, which barred use of the data outside of Facebook for commercial purposes, but we are unfamiliar with Facebook offering a “license” or special “permission” for researchers to collect greater amounts of data via the API.

    Regardless, it does appear that the amount of data thisismydigitallife was vacuuming up triggered a security review at Facebook and an automatic shutdown of its API access. Relying on Wylie’s narrative, the Guardian claims that Kogan “spoke to an engineer” and resumed access:

    “Facebook could see it was happening,” says Wylie. “Their security protocols were triggered because Kogan’s apps were pulling this enormous amount of data, but apparently Kogan told them it was for academic use. So they were like, ‘Fine’.”

    Kogan claims that he had a close working relationship with Facebook and that it was familiar with his research agendas and tools.

    A great deal of research confirms that most people don’t pay attention to permissions and privacy policies for the apps they download and the services they use—and the notices are often too vague or convoluted to clearly understand anyway. How many Facebook users give third parties access to their profile so that they can get a visualization of the words they use most, or to find out which Star Wars character they are? It isn’t surprising that Kosinski’s original recruitment method—a personality quiz that provided you with a psychological profile of yourself based on a common five-factor model—resulted in more than 50,000 volunteers providing access to their Facebook data. Indeed, Kosinski later co-authored a paper detailing how to use viral marketing techniques to recruit study participants, and he has written about the ethical dynamics of utilizing friend data.

    #Facebook #Cambridge_analytica #Recherche

  • A Statistical Guide for the Ethically Perplexed - CRC Press Book

    To explore...


    Includes extensive discussions of U.S. federal court decisions where probabilistic and statistical reasoning was paramount
    Clarifies a variety of statistical and probabilistic paradoxes and fallacies
    Provides probabilistic tools to help readers understand the context that informs decision making in medical situations such as screening
    Distinguishes between the notions of specific and general causation and explains how specific causation can be legally argued
    Discusses the importance of cross-validation and the problem of making legitimate inferences based on culling and data mining
    Explores the darker side of psychometrics, including forced sterilization, immigration restriction, and racial purity laws
    Offers further reading and other supplements online


    Lauded for their contributions to statistics, psychology, and psychometrics, the authors make statistical methods relevant to readers’ day-to-day lives by including real historical situations that demonstrate the role of statistics in reasoning and decision making. The historical vignettes encompass the English case of Sally Clark, breast cancer screening, risk and gambling, the Federal Rules of Evidence, “high-stakes” testing, regulatory issues in medicine, difficulties with observational studies, ethics in human experiments, health statistics, and much more. In addition to these topics, seven U.S. Supreme Court decisions reflect the influence of statistical and psychometric reasoning and interpretation/misinterpretation.

    Exploring the intersection of ethics and statistics, this comprehensive guide assists readers in becoming critical and ethical consumers and producers of statistical reasoning and analyses. It will help them reason correctly and use statistics in an ethical manner.

    #data #statistiques #visualisation #ethique

  • Data Visualization

    About this #cours: Learn the general concepts of #data_mining along with basic methodologies and applications. Then dive into one subfield of data mining: pattern discovery. Learn in-depth concepts, methods, and applications of pattern discovery in data mining. We will also introduce methods for pattern-based classification and some interesting applications of pattern discovery. This course provides you with the opportunity to learn the skills and content to practice and engage in scalable pattern discovery methods on massive transactional data, discuss pattern evaluation measures, and study methods for mining diverse kinds of patterns, sequential patterns, and sub-graph patterns.

    #MOOC #data_visualization #visualisation #cours_online #cartographie #cursera

  • How Big data mines personal info to manipulate voters and craft fake news
    (June 2017, Nina Burleigh)

    #Facebook, #Cambridge_Analytica, #artificial_intelligence #big_data #psychographics #OCEAN #surveillance

    “It’s my privilege to speak to you today about the power of Big Data and psychographics in the electoral process,” [Alexander Nix] began. As he clicked through slides, he explained how Cambridge Analytica can appeal directly to people’s emotions, bypassing cognitive roadblocks, thanks to the oceans of data it can access on every man and woman in the country.

    After describing Big Data, Nix talked about how Cambridge was mining it for political purposes, to identify “mean personality” and then segment personality types into yet more specific subgroups, using other variables, to create ever smaller groups susceptible to precisely targeted messages.


    Big Data, artificial intelligence and algorithms designed and manipulated by strategists like the folks at Cambridge have turned our world into a Panopticon.


    It made tens of millions of “friends” by first employing low-wage tech workers to hand over their Facebook profiles: it spiders through Facebook posts, friends and likes and, within a matter of seconds, spits out a personality profile, including a score on the so-called OCEAN psychological tendencies test (openness, conscientiousness, extraversion, agreeableness and neuroticism).


    Facebook was even more useful for Trump, with its online behavioral data on nearly 2 billion people around the world, each of whom is precisely accessible to strategists and marketers who can afford to pay for the peek. Team Trump created a 220 million–person database, nicknamed Project Alamo, using voter registration records, gun ownership records, credit card purchase histories and the monolithic data vaults Experian PLC, Datalogix, Epsilon and Axiom Corporation.


    One tool Facebook offers advertisers is its Lookalike Audiences program. An advertiser (or a political campaign manager) can come to Facebook with a small group of known customers or supporters, and ask Facebook to expand it. Using its access to billions of posts and pictures, likes and contacts, Facebook can create groups of people who are “like” that initial group, and then target them with advertising made specifically to influence them.
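
    The expansion step can be sketched as a simple similarity ranking. Everything below is invented for illustration (the user names, the binary like-vectors, the cosine-similarity choice); Facebook’s actual Lookalike system is proprietary and far more sophisticated, but the core idea of ranking candidates by closeness to a seed group’s profile is the same.

```python
import numpy as np

# Hypothetical "lookalike" expansion: rank non-seed users by how similar
# their like-vectors are to the average profile of a seed group.
users = {
    "alice": [1, 1, 0, 1, 0],
    "bob":   [1, 1, 0, 0, 0],
    "carol": [1, 1, 1, 1, 0],
    "dave":  [0, 0, 1, 0, 1],
    "erin":  [0, 1, 0, 1, 0],
}
seed = ["alice", "bob"]  # the advertiser's known customers/supporters

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

centroid = np.mean([users[u] for u in seed], axis=0)  # seed group's profile
candidates = [u for u in users if u not in seed]
ranked = sorted(candidates, key=lambda u: cosine(users[u], centroid), reverse=True)
print(ranked)  # most "lookalike" candidates first
```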


    By 2012, there had been huge advances in what Big Data, social media and AI could do together. That year, Facebook conducted a happy-sad emotional manipulation experiment, splitting a million people into two groups and manipulating the posts so that one group received happy updates from friends and another received sad ones. They then ran the effects through algorithms and proved (surprise) that they were able to affect people’s moods. (Facebook, which has the greatest storehouse of personal behavior data ever amassed, is still conducting behavioral research, mostly, again, in the service of advertising and making money.)


    Psychographic algorithms allow strategists to target not just angry racists but also the most intellectually gullible individuals, people who make decisions emotionally rather than cognitively. For Trump, such voters were the equivalent of diamonds in a dark mine. Cambridge apparently helped with that too. A few weeks before the election, in a Sky News report on the company, an employee was actually shown on camera poring over a paper on “The Need for Cognition Scale,” which, like the OCEAN test, can be applied to personal data, and which measures the relative importance of thinking versus feeling in an individual’s decision-making.


    Big Data technology has so far outpaced legal and regulatory frameworks that discussions about the ethics of its use for political purposes are still rare. No senior member of Congress or administration official in Washington has placed a very high priority on asking what psychographic data mining means for privacy, nor about the ethics of political messaging based on evading cognition or rational thinking, nor about the AI role in mainstreaming racist and other previously verboten speech.


    After months of investigations and increasingly critical articles in the British press (especially by The Guardian’s Carole Cadwalladr, who has called Cambridge Analytica’s work the framework for an authoritarian surveillance state, and whose reporting Cambridge has since legally challenged), the British Information Commissioner’s Office (ICO), an independent agency that monitors privacy rights and adherence to the U.K.’s strict laws, announced May 17 that it is looking into Cambridge and SCL for their work in the Brexit vote and other elections.


    Now in the White House, Kushner heads the administration’s Office of Technology and Innovation. It will focus on “technology and data,” the administration stated. Kushner said he plans to use it to help run government like a business, and to treat American citizens “like customers.”

  • EU copyright reform is coming. Is your startup ready?

    Last Friday, members of Berlin’s startup community gathered at Silicon Allee for a copyright policy roundtable discussion hosted by Allied for Startups. The event sparked debate and elicited feedback on the European Commission’s complex draft legislation, which would have a significant impact on startups in the EU. Our Editor-in-Chief, Julia Neuman, gives you the rundown here, along with all the details you should know about the proposed reform.

    ‘Disruption’ in the startup world isn’t always a good thing — especially when it involves challenging legislation. Over the past five years, as big data and user-generated content began to play an increasing role in our society, startups have worked tirelessly to navigate laws regarding privacy and security in order to go about business as usual. Now, they may soon be adding copyright concerns to their list of potential roadblocks.

    The forthcoming copyright reform proposed by the European Commission severely threatens the success and momentum that startups have gained in the EU, and it’s being introduced under the guise of “a more modern, more European copyright framework.”

    On September 14, 2016, the European Commission tabled its Proposal for a Directive on Copyright in the Digital Single Market (commonly referred to as the “Copyright Directive”) — a piece of draft legislation that would have significant impact on a wide variety of modern copyrighted content. Consequently, it poses a direct threat to startups.

    Members of the startup community are now coming together, unwilling to accept these measures without a fight. On Friday, members of Allied for Startups and Silicon Allee — alongside copyright experts and Berlin-based entrepreneurs and investors — met at Silicon Allee’s new campus in Mitte for a policy roundtable discussion. Additional workshop discussions are taking place this week in Warsaw, Madrid and Paris. The ultimate goal? To get startups’ voices heard in front of policymakers and counter this legislation.

    Sparking conversation at Silicon Allee

    Bird & Bird Copyright Lawyer and IP Professor Martin Senftleben led the roundtable discussions in Berlin, outlining key clauses and offering clarifying commentary. He then invited conversation from guests — which included representatives from content-rich startups such as Fanmiles, Videopath, and Ubermetrics. The result was a well-balanced input of perspectives and testimonials that sparked an increased desire to fight back. The roundtable covered the three main areas affected by the proposed reforms: user-generated content, text and data mining, and the neighboring right for press publishers.

    User-generated content

    The internet has allowed us all to become content creators with an equal opportunity to make our voices heard around the world. With this transition comes evolving personal responsibilities. Whereas in the past, copyright law only concerned a small percentage of society — today it concerns anyone posting to social media, uploading unique content, or founding a company that relies on user-generated content as part of its business model.

    The proposed EU copyright reform shifts copyright burden to content providers, making them liable for user content and forcing them to apply content filtering technology to their platforms. As it stands now, management of copyright infringement is a passive process. Companies are not required to monitor or police user-generated content, instead waiting for infringement notices to initiate relevant takedowns.

    New laws imply that companies would have to constantly police their platforms. As you can imagine, this would quickly rack up operating costs, not to mention deter investors from committing if there’s such an inherently persistent and high legal risk of copyright infringement. Furthermore, filtering technology would not exactly promote public interest or media plurality, as an efficiency-based filtering system would be more likely to result in overblocking and censoring (even if unintentional). This result is counter to the expressed aims of the reform.

    “Having this necessity to add filtering technology from the start would kill any innovation for new startups, which is the reason why we’re all here and this economy is booming and creating jobs,” said Fabian Schmidt, Founder of Fanmiles. “The small companies suddenly cannot innovate and compete anymore.”

    Text and data mining

    The proposed reform also blocks startups from using text and data mining technology, consequently preventing the rich kind of data analysis that has added value and yielded deeper insights for growing startups. Copyright law today accounts for lawful access and consultation, but not for the automated process of reading and drawing conclusions. The scraping and mining of freely available texts could give rise to complex, costly legal problems from the get-go, problems that not even the most prudent founding teams could navigate (unless they work for the benefit of research institutions, which are exempt from the measure).
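
    At its simplest, the text and data mining at issue here is just automated reading at scale: turning lawfully accessed documents into frequency data that supports analysis. A minimal sketch, with an invented three-document corpus:

```python
from collections import Counter
import re

# Toy "text and data mining": count word frequencies across a corpus of
# documents to which we already have lawful access.
corpus = [
    "Copyright reform affects startups across the EU.",
    "Startups rely on text and data mining for insights.",
    "Data mining turns raw text into structured insights.",
]

def tokenize(doc):
    # Lowercase and split into alphabetic tokens.
    return re.findall(r"[a-z]+", doc.lower())

counts = Counter(tok for doc in corpus for tok in tokenize(doc))
print(counts.most_common(3))
```

    The legal question raised above is precisely whether this kind of automated consultation, trivial to implement, requires a separate permission beyond lawful access to the texts themselves.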

    What kind of message does this send out to new startups? As with laws dealing with user-generated content, these measures don’t entice entrepreneurs to turn their seeds of ideas into profitable companies. Nor do they get VCs jumping to invest. Data input from mining and scraping suddenly gives rise to a huge legal issue that certainly does not benefit the public interest.

    Senftleben reminded the group in Berlin that these types of legislation normally take several years to implement, and that the proposed policy could have amplified effects down the road as the role of data mining increases. “If this legislation is already limiting now, who knows what kind of text and data mining will be used in ten years and how it will play in,” he said.

    Neighboring right for press publishers

    The third and final point discussed at the roundtable has gathered the most media attention thus far. It’s the “elephant in the room,” unjustly pitting established publishers against startups. The proposed legislation creates an exclusive right for publishers that protects their content for digital use in order “to ensure quality journalism and citizens’ access to information.”

    Sure, this reasoning sounds like a positive contribution to a free and democratic society. But closer examination reveals that these publishers’ outdated and financially unviable business models are being grandfathered in for protection at the expense of more innovative content models.

    It’s not hard to see why this is happening. Publishers have lobbying power, and they are bleeding money in today’s digital climate. “I work a lot with publishers. Their position here in Europe is a little more old school,” said one of the founders present at the discussion. “Their business model and revenues are going down, so they’re going to fight hard.”

    Axel Springer, for example, is lobbying for greater protection; they want a piece of Google’s success. But the most interesting aspect of this measure is that it’s unclear how much value it would add for publishers, who already have rights to digital reproduction from the individual content creators employed under contract with their firms. A freelance journalist contributing to Die Zeit, for example, is already transferring digital reproduction rights to the newspaper just by agreeing to publish.

    The drafted legislation makes it pretty clear that content-aggregating search engines would take a big hit when they would inevitably have to pay content reproduction fees to publishers. But the interdependent relationship between publishers and online search aggregation services makes this legislation unlikely to generate a meaningful revenue stream for publishers anyway: publishers want compensation for snippets of articles that show up on search engines, and search engines want compensation for bringing attention to them in the first place. In the end, content aggregators would likely just stop using content fragments rather than pay license fees to publishers.

    It’s unclear how the proposed legislation could promote media plurality and freedom; instead, it seems to promote market concentration and monopolization of content publishing, potentially stifling free and open access to information.

    “I know two small aggregators here in Germany that have given up because of this,” said Tobias Schwarz, Coworking Manager at Sankt Oberholz in Berlin.

    What comes next? Turning discussion into action

    What is clear now is that copyright law has the potential to affect anyone, and startups in Europe especially are at risk under these new reforms. Although they are players in the European economy, they have not been present in the policy debate so far. Allied for Startups and Silicon Allee are inviting founders, entrepreneurs, and interested members of the tech community to come forward and make their voices heard. They invite contributions to an open letter to the European Parliament which dives into this topic in more detail, explaining how toxic the Copyright Directive is for companies that are trying to stay alive without incurring €60 million in development costs.

    “A lot of startup leaders have their heads down working on their next feature, without realizing policymakers are also creating something that can instantly kill it,” said Silicon Allee co-founder Travis Todd. “But if more startups come to the table and tell others what they learned, they will become more aware of these potential roadblocks and ultimately help change them.”

    To find out more information, participate at the next discussion, or share your ideas and testimonials on this policy discussion, please get in touch! Drop a line to, tweet to @allied4startups, or join the online conversation using #copyright4startups.

  • Be Careful Celebrating Google’s New Ad Blocker. Here’s What’s Really Going On.

    Google, a data mining and extraction company that sells personal information to advertisers, has hit upon a neat idea to consolidate its already-dominant business: block competitors from appearing on its platforms. The company announced that it would establish an ad blocker for the Chrome web browser, which has become the most popular in America, employed by nearly half of the nation’s web users. The ad blocker (which Google is calling a “filter”) would roll out next year, and would be the (...)

    #Google #AdBlock #données #data-mining

    • The Chrome ad blocker would stop ads that provide a “frustrating experience,” according to Google’s blog post announcing the change. The ads blocked would match the standards produced by the Coalition for Better Ads, an ostensibly third-party group. For sure, the ads that would get blocked are intrusive: auto-players with sound, countdown ads that make you wait 10 seconds to get to the site, large “sticky” ads that remain constant even when you scroll down the page.

      But who’s part of the Coalition for Better Ads? Google, for one, as well as Facebook. Those two companies accounted for 99 percent of all digital ad revenue growth in the United States last year, and 77 percent of gross ad spending. As Mark Patterson of Fordham University explained, the Coalition for Better Ads is “a cartel orchestrated by Google.”

      So this is a way for Google to crush its few remaining competitors: pre-installing an ad zapper that it controls into the most common web browser. That’s a great way for a monopoly to remain a monopoly.

  • Secret Back Door in Some U.S. Phones Sent Data to China, Analysts Say

    For about $50, you can get a smartphone with a high-definition display, fast data service and, according to security contractors, a secret feature: a backdoor that sends all your text messages to China every 72 hours. Security contractors recently discovered preinstalled software in some Android phones that monitors where users go, whom they talk to and what they write in text messages. The American authorities say it is not clear whether this represents secretive data mining for (...)

    #Google #smartphone #Android #profiling #Shanghai_Adups_Technology_Company #backdoor

  • EU copyright proposal reinforces DRM

    On 14 September the European Commission (EC) published its long-awaited proposal for a Directive on copyright in the Digital Single Market. While we welcome the proposal to introduce a mandatory exception for “text and data mining” (TDM) in the field of scientific research, we are concerned about the inclusion of a far-reaching “technical safeguards” clause granted to rightholders in order to limit the newly established exception.

    The proposal grants a mandatory exception allowing research organisations to carry out TDM of copyrighted works to which they have lawful access. The exception applies only to research organisations, thus narrowing its scope and excluding everyone else with lawful access to the copyrighted works.

    According to the accompanying Impact Assessment, the TDM exception could result in a high number of downloads of the works, which is why rightholders are allowed to apply “necessary” technical measures in the name of the “security and integrity” of their networks and databases.

    Such a requirement, as it is proposed by the EC in the current text, gives rightholders a wide-reaching right to restrict the effective implementation of the new exception. Rightholders are free to apply whichever measure they deem “necessary” to protect their rights in the TDM exception, and to choose the format and modalities of such technical measures.

    This provision will lead to a wider implementation of “digital restrictions management” (DRM) technologies. These technologies are already used extensively to arbitrarily restrict the lawful use of works that would be accessible under the new TDM exception. This reference to “necessary technical safeguards” is excessive and can make the mandatory TDM exception useless. It is worth repeating that the exception is already heavily limited, covering only research organisations acting in the public interest.

    Further reasons to forbid the use of DRM technologies in the exception are:

    DRM leads to vendor lock-in. As researchers will need specific compatible software in order to be able to access the work, they will be locked to a particular vendor or provider for arbitrary reasons. These technical safeguards will most likely stop researchers from exercising their right under the exception to use their own tools to extract data, and can lead to a de facto monopoly of a handful of companies providing these technologies.
    DRM excludes free software users. DRM always relies on proprietary components to work. These components, by definition, are impossible to implement in Free Software. The right of Free Software users to access resources under the exception will be violated.
    DRM technologies increase the cost of research and education. Accessing DRM-protected resources typically requires purchasing specific proprietary software. Such technology is expensive and it is important to ask how much the implementation of these technologies would cost for research and educational institutions throughout Europe. Furthermore, very often this software cannot be shared, so every research workstation would need to purchase a separate copy or license for the software.
    DRM artificially limits sharing between peers. A typical functionality DRM provides is to cap the number of copies you can make of documents and data. This will force different researchers to access and download data and documents several times even if they are working on the same team. This is a waste of time and resources. As DRM also typically limits the number of downloads, teams could find themselves cut off from resources they legitimately have a right to access under the exception.

    We ask the European Parliament and the EU member states to explicitly forbid the use of harmful DRM practices in the EU copyright reform, especially with regard to already heavily limited exceptions.

  • Geographical Analysis, Urban Modeling, Spatial Statistics
    Eleventh International Conference - GEOG-AND-MOD 16

    During the past decades the main problem in geographical analysis was the lack of spatial data availability. Nowadays the wide diffusion of electronic devices containing geo-referenced information generates a great production of spatial data. Volunteered geographic information activities (e.g. OpenStreetMap, Wikimapia), public initiatives (e.g. Open Data, Spatial Data Infrastructures, Geo-portals) and private projects (e.g. Google Earth, Bing Maps, etc.) have produced an overabundance of spatial data which, in many cases, does not improve the efficiency of decision processes. The increase in geographical data availability has not been matched by an increase in the knowledge needed to support spatial decisions. The inclusion of spatial simulation techniques in recent GIS software has favoured the diffusion of these methods, but in several cases it has reduced them to button-pressing, applied without geography or the underlying processes in mind. Spatial modelling, analytical techniques and geographical analyses are therefore required in order to analyse data and to facilitate the decision process at all levels, with a clear identification of the geographical information needed and the reference scale to adopt. Old geographical issues can find answers thanks to new methods and instruments, while new issues are emerging and challenging researchers to find new solutions. This Conference aims to contribute to the development of new techniques and methods to improve the process of knowledge acquisition.

    The programme committee especially requests high quality submissions on the following Conference Themes:

    Geostatistics and spatial simulation;
    Agent-based spatial modelling;
    Cellular automata spatial modelling;
    Spatial statistical models;
    Space-temporal modelling;
    Environmental Modelling;
    Geovisual analytics, geovisualisation, visual exploratory data analysis;
    Visualisation and modelling of track data;
    Spatial Optimization;
    Interaction Simulation Models;
    Data mining, spatial data mining;
    Spatial Data Warehouse and Spatial OLAP;
    Integration of Spatial OLAP and Spatial data mining;
    Spatial Decision Support Systems;
    Spatial Multicriteria Decision Analysis;
    Spatial Rough Set;
    Spatial extension of Fuzzy Set theory;
    Ontologies for Spatial Analysis;
    Urban modeling;
    Applied geography;
    Spatial data analysis;
    Dynamic modelling;
    Simulation, space-time dynamics, visualization and virtual reality.

    #géographie #modélisation #statistiques

  • A Plethora of Open Data Repositories (i.e., thousands!) - Data Science Central

    Posted by Kirk Borne on August 30, 2015 at 2:09pm

    Open data repositories are valuable for many reasons, including:

    (1) they provide a source of insight and transparency into the domains and organizations that are represented by the data sets;

    (2) they enable value creation across a variety of domains, using the data as the “fuel” for innovation, government transformation, new ideas, and new businesses;

    (3) they offer a rich variety of data sets for data scientists to sharpen their data mining, knowledge discovery, and machine learning modeling skills; and

    (4) they allow many more eyes to look at the data and thereby to see things that might have been missed by the creators and original users of the data.

    Here are some sources and meta-sources of open data:

    #data #statistiques #open_data

  • Bosses Harness Big Data to Predict Which Workers Might Get Sick - WSJ

    Some firms, such as Welltok and GNS Healthcare Inc., also buy information from data brokers that lets them draw connections between consumer behavior and health needs.

    Employers generally aren’t allowed to know which individuals are flagged by data mining, but the wellness firms—usually paid several dollars a month per employee—provide aggregated data on the number of employees found to be at risk for a given condition.

    To determine which employees might soon get pregnant, Castlight recently launched a new product that scans insurance claims to find women who have stopped filling birth-control prescriptions, as well as women who have made fertility-related searches on Castlight’s health app.

    That data is matched with the woman’s age, and if applicable, the ages of her children to compute the likelihood of an impending pregnancy, says Jonathan Rende, Castlight’s chief research and development officer. She would then start receiving emails or in-app messages with tips for choosing an obstetrician or other prenatal care. If the algorithm guessed wrong, she could opt out of receiving similar messages.

  • Data Mining Reveals the Extent of China’s Ghost Cities

    In recent years, China has undergone a period of urban growth that is unprecedented in human history. The number of square kilometers devoted to urban living grew from 8,800 in 1984 to 41,000 in 2010. And that was just the start. China used more concrete between 2011 and 2013 than the U.S. used in the entire 20th century.

    Some of this building has been misplaced. In various parts of China, developers have built so much housing so quickly that it has outstripped demand, even in the world’s most populous country. The result is the well-publicized phenomenon of ghost cities—entire urban areas that are more or less deserted.

    But much of the reporting on ghost cities is anecdotal or based on unreliable measurements such as a simple count of the number of lights on at night in residential buildings. That’s a particularly inaccurate method, not least because it ignores seasonal variations caused by tourism. Many places are busy during the tourist season but empty during the off-season, and not just in China. So being unable to distinguish these from ghost cities is something of a problem.

    And that raises an interesting question: how bad, really, is the problem of ghost cities in China?


  • Top 10 data mining algorithms in plain English

    Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

    Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining.

    What are we waiting for? Let’s get started!


    1. C4.5
    2. k-means
    3. Support vector machines
    4. Apriori
    5. EM
    6. PageRank
    7. AdaBoost
    8. kNN
    9. Naive Bayes
    10. CART
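    To give a flavour of how simple some of these algorithms are at their core, here is a minimal sketch of k-means (№2 on the list) in plain Python. The toy data and parameters are made up for illustration; in practice you would reach for a library such as scikit-learn rather than rolling your own.

    ```python
    import random

    def kmeans(points, k, iterations=20):
        """Naive k-means: repeatedly assign each point to its nearest
        centroid, then move each centroid to the mean of its points."""
        random.seed(42)  # fixed seed so the toy run is reproducible
        centroids = random.sample(points, k)
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:
                # Nearest centroid by squared Euclidean distance.
                i = min(range(k),
                        key=lambda i: sum((a - b) ** 2
                                          for a, b in zip(p, centroids[i])))
                clusters[i].append(p)
            # Recompute each centroid as the mean of its cluster
            # (keep the old centroid if the cluster ended up empty).
            centroids = [
                tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                for i, c in enumerate(clusters)
            ]
        return centroids, clusters

    # Two well-separated toy blobs, around (0, 0) and (10, 10).
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, clusters = kmeans(data, k=2)
    ```

    On data this cleanly separated, the two centroids converge to the two blob means regardless of which points are drawn as the initial centroids.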

  • Why Big Data Missed the Early Warning Signs of Ebola

    Thanks to @freakonometrics for flagging this article on Twitter

    With the Centers for Disease Control now forecasting up to 1.4 million new infections from the current Ebola outbreak, what could “big data” do to help us identify the earliest warnings of future outbreaks and track the movements of the current outbreak in real time? It turns out that monitoring the spread of Ebola can teach us a lot about what we missed — and how data mining, translation, and the non-Western world can help to provide better early warning tools.

    Earlier this month, Harvard’s HealthMap service made world headlines for monitoring early mentions of the current Ebola outbreak on March 14, 2014, “nine days before the World Health Organization formally announced the epidemic,” and issuing its first alert on March 19. Much of the coverage of HealthMap’s success has emphasized that its early warning came from using massive computing power to sift out early indicators from millions of social media posts and other informal media.

    #ebola #statistics #big_data

    • By the time HealthMap monitored its very first report, the Guinean government had actually already announced the outbreak and notified the WHO.

      and, on the dead end of #GDELT (the article’s author, Kalev H. Leetaru, being the creator of that database):

      Part of the problem is that the majority of media in Guinea is not published in English, while most monitoring systems today emphasize English-language material. The GDELT Project attempts to monitor and translate a cross-section of the world’s news media each day, yet it is not capable of translating 100 percent of global news coverage. It turns out that GDELT actually monitored the initial discussion of Dr. Keita’s press conference on March 13 and detected a surge in domestic coverage beginning on March 14, the day HealthMap flagged the first media mention. The problem is that all of this media coverage was in French — and was not among the French material that GDELT was able to translate those days.

  • Data Mining Reveals How Conspiracy Theories Emerge on Facebook | MIT Technology Review


    “Some people are more susceptible to conspiracy theories than others, say computational social scientists who have studied how false ideas jump the ‘credulity barrier’ on Facebook.” - Raffa