industryterm:data mining

  • Facebook and NYU School of Medicine launch research collaboration to improve MRI – Facebook Code
    https://code.fb.com/ai-research/facebook-and-nyu-school-of-medicine-launch-research-collaboration-to-improv

    How lovely, the flowery language of the public relations experts...

    Using AI, it may be possible to capture less data and therefore scan faster, while preserving or even enhancing the rich information content of magnetic resonance images. The key is to train artificial neural networks to recognize the underlying structure of the images in order to fill in views omitted from the accelerated scan. This approach is similar to how humans process sensory information. When we experience the world, our brains often receive an incomplete picture — as in the case of obscured or dimly lit objects — that we need to turn into actionable information. Early work performed at NYU School of Medicine shows that artificial neural networks can accomplish a similar task, generating high-quality images from far less data than was previously thought to be necessary.

    In practice, reconstructing images from partial information poses an exceedingly hard problem. Neural networks must be able to effectively bridge the gaps in scanning data without sacrificing accuracy. A few missing or incorrectly modeled pixels could mean the difference between an all-clear scan and one in which radiologists find a torn ligament or a possible tumor. Conversely, capturing previously inaccessible information in an image can quite literally save lives.
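
    To make the "capture less data" idea concrete, here is a minimal numpy sketch of retrospective k-space undersampling, the setting such work typically starts from. Everything in it (the acceleration factor, the number of retained low-frequency lines, the random phantom) is an illustrative assumption rather than the fastMRI pipeline; it only shows what the degraded input to a reconstruction network would look like.

    ```python
    # Illustrative sketch (not the fastMRI code): an MR scanner samples k-space,
    # the 2D Fourier transform of the image. Accelerated scans skip phase-encoding
    # lines; a reconstruction model is then trained to map the artifact-laden
    # zero-filled image back to the fully sampled one.
    import numpy as np

    def undersample(image: np.ndarray, acceleration: int = 4) -> np.ndarray:
        """Zero-filled reconstruction of a retrospectively undersampled image."""
        kspace = np.fft.fftshift(np.fft.fft2(image))       # fully sampled k-space
        mask = np.zeros(kspace.shape, dtype=bool)
        mask[::acceleration, :] = True                     # keep every Nth phase-encoding line
        center = kspace.shape[0] // 2
        mask[center - 8:center + 8, :] = True              # always keep the low frequencies
        return np.abs(np.fft.ifft2(np.fft.ifftshift(np.where(mask, kspace, 0))))

    # A network would be trained on (undersample(x), x) pairs so that it learns
    # to fill in the omitted views.
    phantom = np.random.rand(256, 256)                     # stand-in for a real MR slice
    print(undersample(phantom).shape)                      # (256, 256), but aliased
    ```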

    Advancing the AI and medical communities
    Unlike other AI-related projects, which use medical images as a starting point and then attempt to derive anatomical or diagnostic information from them (in emulation of human observers), this collaboration focuses on applying the strengths of machine learning to reconstruct the most high-value images in entirely new ways. With the goal of radically changing the way medical images are acquired in the first place, our aim is not simply enhanced data mining with AI, but rather the generation of fundamentally new capabilities for medical visualization to benefit human health.

    In the interest of advancing the state of the art in medical imaging as quickly as possible, we plan to open-source this work to allow the wider research community to build on our developments. As the project progresses, Facebook will share the AI models, baselines, and evaluation metrics associated with this research, and NYU School of Medicine will open-source the image data set. This will help ensure the work’s reproducibility and accelerate adoption of resulting methods in clinical practice.

    What’s next
    Though this project will initially focus on MRI technology, its long-term impact could extend to many other medical imaging applications. For example, the improvements afforded by AI have the potential to revolutionize CT scans as well. Advanced image reconstruction might enable ultra-low-dose CT scans suitable for vulnerable populations, such as pediatric patients. Such improvements would not only help transform the experience and effectiveness of medical imaging, but they’d also help equalize access to an indispensable element of medical care.

    We believe the fastMRI project will demonstrate how domain-specific experts from different fields and industries can work together to produce the kind of open research that will make a far-reaching and lasting positive impact in the world.

    #Resonance_magnetique #Intelligence_artificielle #Facebook #Neuromarketing

  • For Sale: Survey #Data on Millions of High School Students - The New York Times
    https://www.nytimes.com/2018/07/29/business/for-sale-survey-data-on-millions-of-high-school-students.html

    Consumers’ personal details are collected in countless ways these days, from Instagram clicks, dating profiles and fitness apps. While many of those efforts are aimed at adults, the recruiting methods for some student recognition programs give a peek into the widespread and opaque world of data mining for millions of minors — and how students’ profiles may be used to target them for educational and noneducational offers. MyCollegeOptions, for instance, says it may give student loan services, test prep and other companies access to student data.

    #données_personnelles #à_vendre

  • Cambridge Analytica demonstrates that Facebook needs to give researchers more access.
    https://slate.com/technology/2018/03/cambridge-analytica-demonstrates-that-facebook-needs-to-give-researchers-more

    In a 2013 paper, psychologist Michal Kosinski and collaborators from the University of Cambridge in the United Kingdom warned that “the predictability of individual attributes from digital records of behavior may have considerable negative implications,” posing a threat to “well-being, freedom, or even life.” This warning followed their striking findings about how accurately the personal attributes of a person (from political leanings to intelligence to sexual orientation) could be inferred from nothing but their Facebook likes. Kosinski and his colleagues had access to this information through the voluntary participation of Facebook users, recruited by offering them the results of a personality quiz, a method that can drive viral engagement. Of course, one person’s warning may be another’s inspiration.
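
    The core of that 2013 result is easy to prototype: a sparse user-by-page likes matrix, dimensionality reduction, and a simple classifier. Below is a bare-bones sketch on synthetic data; the 50 SVD components, the train/test split and the fake binary trait are my own arbitrary choices, and the published study's actual features and models were richer than this.

    ```python
    # Sketch of the likes-to-attributes idea (synthetic data, not the authors' code):
    # rows are users, columns are pages, entries are 1 if the user liked the page.
    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    likes = rng.integers(0, 2, size=(1000, 500))    # user x page like matrix (synthetic)
    trait = rng.integers(0, 2, size=1000)           # binary attribute, e.g. a survey answer

    model = make_pipeline(TruncatedSVD(n_components=50, random_state=0),
                          LogisticRegression(max_iter=1000))
    model.fit(likes[:800], trait[:800])
    print("held-out accuracy:", model.score(likes[800:], trait[800:]))
    # With random data this hovers around 0.5; with real likes and real traits the
    # paper reported far better than chance, which is exactly the concern.
    ```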

    Kosinski’s original research really was an important scientific finding. The paper has been cited more than 1,000 times and the dataset has spawned many other studies. But the potential uses for it go far beyond academic research. In the past few days, the Guardian and the New York Times have published a number of new stories about Cambridge Analytica, the data mining and analytics firm best known for aiding President Trump’s campaign and the pro-Brexit campaign. This trove of reporting shows how Cambridge Analytica allegedly relied on the psychologist Aleksandr Kogan (who also goes by Aleksandr Spectre), a colleague of the original researchers at Cambridge, to gain access to profiles of around 50 million Facebook users.

    According to the Guardian’s and New York Times’ reporting, the data that was used to build these models came from a rough duplicate of that personality quiz method used legitimately for scientific research. Kogan, a lecturer in another department, reportedly approached Kosinski and their Cambridge colleagues in the Psychometric Centre to discuss commercializing the research. To his credit, Kosinski declined. However, Kogan built an app named thisismydigitallife for his own startup, Global Science Research, which collected the same sorts of data. GSR paid Mechanical Turk workers (contrary to the terms of Mechanical Turk) to take a psychological quiz and provide access to their Facebook profiles. In 2014, under a contract with SCL, the parent company of Cambridge Analytica, that data was harvested and used to build a model of 50 million U.S. Facebook users that allegedly included 5,000 data points on each user.

    So if the Facebook API allowed Kogan access to this data, what did he do wrong? This is where things get murky, but bear with us. It appears that Kogan deceitfully used his dual roles as a researcher and an entrepreneur to move data between an academic context and a commercial context, although the exact method of it is unclear. The Guardian claims that Kogan “had a licence from Facebook to collect profile data, but it was for research purposes only” and “[Kogan’s] permission from Facebook to harvest profiles in large quantities was specifically restricted to academic use.” Transferring the data this way would already be a violation of the terms of Facebook’s API policies that barred use of the data outside of Facebook for commercial uses, but we are unfamiliar with Facebook offering a “license” or special “permission” for researchers to collect greater amounts of data via the API.

    Regardless, it does appear that the amount of data thisismydigitallife was vacuuming up triggered a security review at Facebook and an automatic shutdown of its API access. Relying on Wylie’s narrative, the Guardian claims that Kogan “spoke to an engineer” and resumed access:

    “Facebook could see it was happening,” says Wylie. “Their security protocols were triggered because Kogan’s apps were pulling this enormous amount of data, but apparently Kogan told them it was for academic use. So they were like, ‘Fine’.”

    Kogan claims that he had a close working relationship with Facebook and that it was familiar with his research agendas and tools.

    A great deal of research confirms that most people don’t pay attention to permissions and privacy policies for the apps they download and the services they use—and the notices are often too vague or convoluted to clearly understand anyway. How many Facebook users give third parties access to their profile so that they can get a visualization of the words they use most, or to find out which Star Wars character they are? It isn’t surprising that Kosinski’s original recruitment method—a personality quiz that provided you with a psychological profile of yourself based on a common five-factor model—resulted in more than 50,000 volunteers providing access to their Facebook data. Indeed, Kosinski later co-authored a paper detailing how to use viral marketing techniques to recruit study participants, and he has written about the ethical dynamics of utilizing friend data.

    #Facebook #Cambridge_analytica #Recherche

  • Data Visualization

    About this #cours: Learn the general concepts of #data_mining along with basic methodologies and applications. Then dive into one subfield in data mining: pattern discovery. Learn in-depth concepts, methods, and applications of pattern discovery in data mining. We will also introduce methods for pattern-based classification and some interesting applications of pattern discovery. This course provides you with the opportunity to learn skills and content to practice and engage in scalable pattern discovery methods on massive transactional data, discuss pattern evaluation measures, and study methods for mining diverse kinds of patterns, sequential patterns, and sub-graph patterns.
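
    Since the blurb above is really about frequent-pattern mining on transactional data, here is a toy, self-contained sketch of the idea (my own example, not course material): count which itemsets occur in at least a minimum number of transactions, level by level. Real Apriori additionally prunes candidates via the downward-closure property; this brute-force version only illustrates what a discovered "pattern" is.

    ```python
    # Frequent-itemset counting over toy transactions (brute force, for illustration).
    from itertools import combinations
    from collections import Counter

    transactions = [
        {"milk", "bread", "butter"},
        {"beer", "bread"},
        {"milk", "bread", "butter", "beer"},
        {"milk", "butter"},
    ]

    def frequent_itemsets(transactions, min_support=2, max_size=3):
        frequent = {}
        for size in range(1, max_size + 1):
            counts = Counter()
            for t in transactions:
                for itemset in combinations(sorted(t), size):
                    counts[itemset] += 1
            level = {s: c for s, c in counts.items() if c >= min_support}
            if not level:                      # no frequent itemsets of this size, stop growing
                break
            frequent.update(level)
        return frequent

    for itemset, support in sorted(frequent_itemsets(transactions).items(), key=lambda kv: -kv[1]):
        print(itemset, support)                # e.g. ('bread',) 3, ('butter', 'milk') 3, ...
    ```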

    https://www.coursera.org/learn/datavisualization?siteID=kLCTlhkookQ-wXuNzTbmio3I1hWUecbTDQ

    #MOOC #data_visualization #visualisation #cours_online #cartographie #coursera

  • EU copyright reform is coming. Is your startup ready?
    https://medium.com/silicon-allee/eu-copyright-reform-is-coming-is-your-startup-ready-4be81a5fabf7?source=user

    Last Friday, members of Berlin’s startup community gathered at Silicon Allee for a copyright policy roundtable discussion hosted by Allied for Startups. The event sparked debate and elicited feedback on the European Commission’s complex draft legislation, which would have a significant impact on startups in the EU. Our Editor-in-Chief, Julia Neuman, gives you the rundown here — along with all the details you should know about the proposed reform.

    ‘Disruption’ in the startup world isn’t always a good thing — especially when it involves challenging legislation. Over the past five years, as big data and user-generated content began to play an increasing role in our society, startups have worked tirelessly to navigate laws regarding privacy and security in order to go about business as usual. Now, they may soon be adding copyright concerns to their list of potential roadblocks.

    The forthcoming copyright reform proposed by the European Commission severely threatens the success and momentum that startups have gained in the EU, and it’s being introduced under the guise of “a more modern, more European copyright framework.”

    On September 14, 2016, the European Commission tabled its Proposal for a Directive on Copyright in the Digital Single Market (commonly referred to as the “Copyright Directive”) — a piece of draft legislation that would have significant impact on a wide variety of modern copyrighted content. Consequently, it poses a direct threat to startups.

    Members of the startup community are now coming together, unwilling to accept these measures without a fight. On Friday, members of Allied for Startups and Silicon Allee — alongside copyright experts and Berlin-based entrepreneurs and investors — met at Silicon Allee’s new campus in Mitte for a policy roundtable discussion. Additional workshop discussions are taking place this week in Warsaw, Madrid and Paris. The ultimate goal? To get startups’ voices heard in front of policymakers and counter this legislation.
    Sparking conversation at Silicon Allee

    Bird & Bird Copyright Lawyer and IP Professor Martin Senftleben led the roundtable discussions in Berlin, outlining key clauses and offering clarifying commentary. He then invited conversation from guests — which included representatives from content-rich startups such as Fanmiles, Videopath, and Ubermetrics. The result was a well-balanced input of perspectives and testimonials that sparked an increased desire to fight back. The roundtable covered the three main areas affected by the proposed reforms: user-generated content, text and data mining, and the neighboring right for press publishers.
    User-generated content

    The internet has allowed us all to become content creators with an equal opportunity to make our voices heard around the world. With this transition comes evolving personal responsibilities. Whereas in the past, copyright law only concerned a small percentage of society — today it concerns anyone posting to social media, uploading unique content, or founding a company that relies on user-generated content as part of its business model.

    The proposed EU copyright reform shifts copyright burden to content providers, making them liable for user content and forcing them to apply content filtering technology to their platforms. As it stands now, management of copyright infringement is a passive process. Companies are not required to monitor or police user-generated content, instead waiting for infringement notices to initiate relevant takedowns.

    New laws imply that companies would have to constantly police their platforms. As you can imagine, this would quickly rack up operating costs — not to mention deter investors from committing if there’s such an inherently persistent and high legal risk of copyright infringement. Furthermore, filtering technology would not exactly promote public interest or media plurality, as an efficiency-based filtering system would be more likely to result in overblocking and censoring (even if unintentional). This result is counter to the expressed aims of the reform.
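
    To see why filtering is both costly and error-prone, consider the crudest possible upload filter; this is a hypothetical sketch, not any platform's real system: exact fingerprint matching against rightsholder-supplied hashes. It misses trivially altered copies, and the fuzzier matching needed to catch those is precisely what produces the overblocking described above.

    ```python
    # Hypothetical exact-match upload filter (illustration only).
    import hashlib

    PROTECTED_FINGERPRINTS = {
        # hashes of known protected works, as a rightsholder might supply them
        hashlib.sha256(b"protected work #1").hexdigest(),
    }

    def allow_upload(payload: bytes) -> bool:
        """Reject uploads whose fingerprint matches a protected work."""
        return hashlib.sha256(payload).hexdigest() not in PROTECTED_FINGERPRINTS

    print(allow_upload(b"protected work #1"))    # False: exact copy is blocked
    print(allow_upload(b"protected work #1!"))   # True: one changed byte slips through
    ```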

    “Having this necessity to add filtering technology from the start would kill any innovation for new startups, which is the reason why we’re all here and this economy is booming and creating jobs,” said Fabian Schmidt, Founder of Fanmiles. “The small companies suddenly cannot innovate and compete anymore.”

    Text and data mining

    The proposed reform also blocks startups from using text and data mining technology, consequently preventing the rich kind of data analysis that has added value and yielded deeper insights for growing startups. Copyright law today accounts for lawful access and consultation, but not for the automated process of reading and drawing conclusions. The scraping and mining of freely available texts could give rise to complex, costly legal problems from the get-go — problems that not even the most prudent founder teams could navigate (unless they work to the benefit of research institutions, which are exempt from the measure).
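
    For readers unfamiliar with the term, "text and data mining" in its simplest form looks like the generic sketch below (nothing to do with any particular startup): an automated pass over lawfully accessible text that extracts structured signals such as term frequencies. The legal question is whether this machine "reading" needs a separate licence even where a human is already allowed to read the same pages.

    ```python
    # Minimal text-mining pass: per-document term frequencies over a toy corpus.
    import re
    from collections import Counter

    documents = {
        "press_release_1": "The company reported strong growth in cloud revenue.",
        "press_release_2": "Cloud revenue growth slowed while advertising revenue grew.",
    }

    def term_frequencies(text: str) -> Counter:
        return Counter(re.findall(r"[a-z]+", text.lower()))

    index = {doc_id: term_frequencies(text) for doc_id, text in documents.items()}
    print(index["press_release_2"]["revenue"])   # 2: the kind of derived fact TDM produces
    ```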

    What kind of message does this send out to new startups? As with laws dealing with user-generated content, these measures don’t entice entrepreneurs to turn their seeds of ideas into profitable companies. Nor do they get VCs jumping to invest. Data input from mining and scraping suddenly gives rise to a huge legal issue that certainly does not benefit the public interest.

    Senftleben reminded the group in Berlin that these types of legislation normally take several years to implement, and that the proposed policy could have amplified effects down the road as the role of data mining increases. “If this legislation is already limiting now, who knows what kind of text and data mining will be used in ten years and how it will play in,” he said.
    Neighboring right for press publishers

    The third and final point discussed at the roundtable has gathered the most media attention thus far. It’s the “elephant in the room,” unjustly pitting established publishers against startups. Proposed legislation creates an exclusive right for publishers that protects their content for digital use in order “to ensure quality journalism and citizens’ access to information.”

    Sure, this reasoning sounds like a positive contribution to a free and democratic society. But closer examination reveals that these publishers’ outdated and financially unviable business models are being grandfathered in for protection at the expense of more innovative content models.

    It’s not hard to see why this is happening. Publishers have lobbying power, and they are bleeding money in today’s digital climate. “I work a lot with publishers. Their position here in Europe is a little more old school,” said one of the founders present at the discussion. “Their business model and revenues are going down, so they’re going to fight hard.”

    Axel Springer, for example, is lobbying for greater protection; they want a piece of Google’s success. But the most interesting aspect of this measure is that it’s unclear how much value it would add for publishers, who already have rights to digital reproduction from the individual content creators employed under contract with their firms. A freelance journalist contributing to Die Zeit, for example, is already transferring digital reproduction rights to the newspaper just by agreeing to publish.

    The drafted legislation makes it pretty clear that content aggregating search engines would take a big hit when they would inevitably have to pay content reproduction fees to publishers. But the interdependent relationship between publishers and online search aggregation services makes this legislation unlikely to generate a meaningful revenue stream for publishers anyway: Publishers want compensation for snippets of articles that show up on search engines, and search engines want compensation for bringing attention to them in the first place. In the end, content aggregators would likely just stop using content fragments rather than pay license fees to publishers.

    It’s unclear how the proposed legislation could promote media plurality and freedom; instead, it seems to promote market concentration and monopolization of content publishing, potentially stifling free and open access to information.

    “I know two small aggregators here in Germany that have given up because of this,” said Tobias Schwarz, Coworking Manager at Sankt Oberholz in Berlin.

    What comes next? Turning discussion into action

    What is clear now is that copyright law has the potential to affect anyone. Startups in Europe, especially, are at risk with these new reforms. As players in the European economy, they have not been present in the policy debate so far. Allied for Startups and Silicon Allee are inviting founders, entrepreneurs, and interested members of the tech community to come forward and make their voices heard. They invite contributions to an open letter to the European Parliament which dives into this topic in more detail, explaining how toxic the Copyright Directive is for companies that are trying to stay alive without incurring €60 million in development costs.

    “A lot of startup leaders have their heads down working on their next feature, without realizing policymakers are also creating something that can instantly kill it,” said Silicon Allee co-founder Travis Todd. “But if more startups come to the table and tell others what they learned, they will become more aware of these potential roadblocks and ultimately help change them.”

    To find out more information, participate at the next discussion, or share your ideas and testimonials on this policy discussion, please get in touch! Drop a line to hello@alliedforstartups.org, tweet to @allied4startups, or join the online conversation using #copyright4startups.

  • EU copyright proposal reinforces DRM
    https://fsfe.org/news/2016/news-20160928-01.de.html

    On 14 September the European Commission (EC) published its long-awaited proposal for a Directive on copyright in the Digital Single Market. While we welcome the proposal to introduce a mandatory exception for “text and data mining” (TDM) in the field of scientific research, we are concerned about the inclusion of a far-reaching “technical safeguards” clause granted to rightholders in order to limit the newly established exception.

    The proposal grants a mandatory exception to research organisations to carry out TDM of copyrighted works to which they have lawful access. The exception is only applicable to research organisations, thus narrowing its scope and excluding everyone else with lawful access to the copyrighted works.

    According to the accompanying Impact Assessment, the TDM exception could result in a high number of downloads of the works, which is why rightholders are allowed to apply “necessary” technical measures in the name of the “security and integrity” of their networks and databases.

    Such a requirement, as it is proposed by the EC in the current text, gives rightholders a wide-reaching right to restrict the effective implementation of the new exception. Rightholders are free to apply whichever measure they deem “necessary” to protect their rights in the TDM exception, and to choose the format and modalities of such technical measures.

    This provision will lead to a wider implementation of “digital restrictions management” (DRM) technologies. These technologies are already used extensively to arbitrarily restrict the lawful use of accessible works under the new TDM exception. This reference to “necessary technical safeguards” is excessive and can make the mandatory TDM exception useless. It is worth repeating that the exception is already heavily limited to cover only research organisations with public interest.

    Further reasons to forbid the use of DRM technologies in the exception are:

    DRM leads to vendor lock-in. As researchers will need specific compatible software in order to be able to access the work, they will be locked to a particular vendor or provider for arbitrary reasons. These technical safeguards will most likely stop researchers from exercising their right under the exception of using their own tools to extract data, and can lead to a de facto monopoly of a handful of companies providing these technologies.
    DRM excludes free software users. DRM always relies on proprietary components to work. These components, by definition, are impossible to implement in Free Software. The right of Free Software users to access resources under the exception will be violated.
    DRM technologies increase the cost of research and education. Accessing DRM-protected resources typically requires purchasing specific proprietary software. Such technology is expensive and it is important to ask how much the implementation of these technologies would cost for research and educational institutions throughout Europe. Furthermore, very often this software cannot be shared, so every research workstation would need to purchase a separate copy or license for the software.
    DRM artificially limits sharing between peers. A typical functionality DRM provides is to cap the number of copies you can make of documents and data. This will force different researchers to access and download data and documents several times even if they are working on the same team. This is a waste of time and resources. As DRM also typically limits the number of downloads, teams could find themselves cut off from resources they legitimately have a right to access under the exception.

    We ask the European Parliament and the EU member states to explicitly forbid the use of harmful DRM practices in the EU copyright reform, especially with regard to already heavily limited exceptions.

  • Geographical Analysis, Urban Modeling, Spatial Statistics
    Eleventh International Conference - GEOG-AND-MOD 16

    http://oldwww.unibas.it/utenti/murgante/geog_An_Mod_16/index.html

    During the past decades the main problem in geographical analysis was the lack of available spatial data. Nowadays the wide diffusion of electronic devices containing geo-referenced information generates a huge volume of spatial data. Volunteered geographic information activities (e.g. OpenStreetMap, Wikimapia), public initiatives (e.g. Open data, Spatial Data Infrastructures, Geo-portals) and private projects (e.g. Google Earth, Bing Maps, etc.) have produced an overabundance of spatial data which, in many cases, does not improve the efficiency of decision processes. The increase in the availability of geographical data has not been matched by a corresponding increase in the knowledge needed to support spatial decisions. The inclusion of spatial simulation techniques in recent GIS software has favoured the diffusion of these methods, but in several cases has reduced analysis to pressing buttons without geography or the underlying processes in mind. Spatial modelling, analytical techniques and geographical analyses are therefore required in order to analyse data and to facilitate the decision process at all levels, with a clear identification of the geographical information needed and the reference scale to adopt. Old geographical issues can find answers thanks to new methods and instruments, while new issues emerge, challenging researchers to find new solutions. This conference aims to contribute to the development of new techniques and methods that improve the process of knowledge acquisition.

    The programme committee especially requests high quality submissions on the following Conference Themes:

    Geostatistics and spatial simulation;
    Agent-based spatial modelling;
    Cellular automata spatial modelling;
    Spatial statistical models;
    GeoComputation;
    Space-temporal modelling;
    Environmental Modelling;
    Geovisual analytics, geovisualisation, visual exploratory data analysis;
    Visualisation and modelling of track data;
    Spatial Optimization;
    Interaction Simulation Models;
    Data mining, spatial data mining;
    Spatial Data Warehouse and Spatial OLAP;
    Integration of Spatial OLAP and Spatial data mining;
    Spatial Decision Support Systems;
    Spatial Multicriteria Decision Analysis;
    Spatial Rough Set;
    Spatial extension of Fuzzy Set theory;
    Ontologies for Spatial Analysis;
    Urban modeling;
    Applied geography;
    Spatial data analysis;
    Dynamic modelling;
    Simulation, space-time dynamics, visualization and virtual reality.

    #géographie #modélisation #statistiques

  • A Plethora of Open Data Repositories (i.e., thousands !) - Data Science Central

    http://www.datasciencecentral.com/profiles/blogs/a-plethora-of-open-data-repositories-i-e-thousands

    Posted by Kirk Borne on August 30, 2015 at 2:09pm

    Open data repositories are valuable for many reasons, including:

    (1) they provide a source of insight and transparency into the domains and organizations that are represented by the data sets;

    (2) they enable value creation across a variety of domains, using the data as the “fuel” for innovation, government transformation, new ideas, and new businesses;

    (3) they offer a rich variety of data sets for data scientists to sharpen their data mining, knowledge discovery, and machine learning modeling skills; and

    (4) they allow many more eyes to look at the data and thereby to see things that might have been missed by the creators and original users of the data.

    Here are some sources and meta-sources of open data:

    #data #statistiques #open_data

  • Bosses Harness Big Data to Predict Which Workers Might Get Sick - WSJ
    http://www.wsj.com/articles/bosses-harness-big-data-to-predict-which-workers-might-get-sick-1455664940

    Some firms, such as Welltok and GNS Healthcare Inc., also buy information from data brokers that lets them draw connections between consumer behavior and health needs.

    Employers generally aren’t allowed to know which individuals are flagged by data mining, but the wellness firms—usually paid several dollars a month per employee—provide aggregated data on the number of employees found to be at risk for a given condition.

    To determine which employees might soon get pregnant, Castlight recently launched a new product that scans insurance claims to find women who have stopped filling birth-control prescriptions, as well as women who have made fertility-related searches on Castlight’s health app.

    That data is matched with the woman’s age, and if applicable, the ages of her children to compute the likelihood of an impending pregnancy, says Jonathan Rende, Castlight’s chief research and development officer. She would then start receiving emails or in-app messages with tips for choosing an obstetrician or other prenatal care. If the algorithm guessed wrong, she could opt out of receiving similar messages.
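
    Castlight has not published its model, so purely as an illustration of the kind of rule-based scoring the article describes, here is a deliberately crude hypothetical: a handful of claims-derived signals combined into a score, with messaging triggered above a threshold. Every field name, weight and threshold below is invented.

    ```python
    # Hypothetical, heavily simplified scoring of the kind described in the article.
    from dataclasses import dataclass

    @dataclass
    class MemberSignals:
        age: int
        stopped_birth_control_refills: bool
        fertility_related_searches: int

    def pregnancy_intent_score(s: MemberSignals) -> float:
        score = 0.0
        if s.stopped_birth_control_refills:
            score += 0.4
        score += min(s.fertility_related_searches, 5) * 0.08   # cap the search signal
        if 25 <= s.age <= 40:                                   # arbitrary age band for the sketch
            score += 0.2
        return min(score, 1.0)

    member = MemberSignals(age=31, stopped_birth_control_refills=True, fertility_related_searches=3)
    if pregnancy_intent_score(member) > 0.5:
        print("enroll in prenatal-care messaging (with an opt-out, as the article notes)")
    ```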

  • Top 10 data mining algorithms in plain English | rayli.net
    http://rayli.net/blog/data/top-10-data-mining-algorithms-in-plain-english

    Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

    Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining.

    What are we waiting for? Let’s get started!

    Contents

    1. C4.5
    2. k-means
    3. Support vector machines
    4. Apriori
    5. EM
    6. PageRank
    7. AdaBoost
    8. kNN
    9. Naive Bayes
    10. CART
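
    To make one of the ten concrete, here is k-means (#2 on the list) in a few lines of numpy. This is a minimal sketch of my own, not anything from the survey paper, and it omits the refinements (smarter initialisation, convergence checks) that real implementations use: alternately assign each point to its nearest centroid, then move each centroid to the mean of its points.

    ```python
    # Minimal k-means: assign points to nearest centroid, recompute centroids, repeat.
    import numpy as np

    def kmeans(points: np.ndarray, k: int, iterations: int = 50, seed: int = 0) -> np.ndarray:
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        labels = np.zeros(len(points), dtype=int)
        for _ in range(iterations):
            distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)               # assignment step
            for j in range(k):                              # update step
                members = points[labels == j]
                if len(members):                            # keep old centroid if cluster is empty
                    centroids[j] = members.mean(axis=0)
        return labels

    data = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])
    print(np.bincount(kmeans(data, k=2)))                   # roughly [100 100]
    ```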

  • Why Big Data Missed the Early Warning Signs of Ebola

    http://www.foreignpolicy.com/articles/2014/09/26/why_big_data_missed_the_early_warning_signs_of_ebola

    Thanks to @freakonometrics for flagging this article on Twitter

    With the Centers for Disease Control now forecasting up to 1.4 million new infections from the current Ebola outbreak, what could “big data” do to help us identify the earliest warnings of future outbreaks and track the movements of the current outbreak in real time? It turns out that monitoring the spread of Ebola can teach us a lot about what we missed — and how data mining, translation, and the non-Western world can help to provide better early warning tools.

    Earlier this month, Harvard’s HealthMap service made world headlines for monitoring early mentions of the current Ebola outbreak on March 14, 2014, “nine days before the World Health Organization formally announced the epidemic,” and issuing its first alert on March 19. Much of the coverage of HealthMap’s success has emphasized that its early warning came from using massive computing power to sift out early indicators from millions of social media posts and other informal media.

    #ebola #statistics #big_data

    • By the time HealthMap monitored its very first report, the Guinean government had actually already announced the outbreak and notified the WHO.

      see http://seenthis.net/messages/286853#message286960 and http://seenthis.net/messages/287766

      and on the dead end #GDELT ran into (the article’s author, Kalev H. Leetaru, is the creator of that database):

      Part of the problem is that the majority of media in Guinea is not published in English, while most monitoring systems today emphasize English-language material. The GDELT Project attempts to monitor and translate a cross-section of the world’s news media each day, yet it is not capable of translating 100 percent of global news coverage. It turns out that GDELT actually monitored the initial discussion of Dr. Keita’s press conference on March 13 and detected a surge in domestic coverage beginning on March 14, the day HealthMap flagged the first media mention. The problem is that all of this media coverage was in French — and was not among the French material that GDELT was able to translate those days.
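
      A toy illustration of the coverage gap described above (this is not HealthMap's or GDELT's pipeline, and the two articles are invented): a keyword monitor that only inspects English-language items walks straight past the French report that carried the earlier signal.

      ```python
      # English-only keyword alerting over a toy multilingual feed.
      ARTICLES = [
          {"lang": "fr", "text": "Une fièvre hémorragique d'origine inconnue fait plusieurs morts en Guinée."},
          {"lang": "en", "text": "WHO confirms Ebola outbreak in Guinea."},
      ]

      ENGLISH_KEYWORDS = {"ebola", "hemorrhagic", "outbreak"}

      def english_only_alerts(articles):
          for article in articles:
              if article["lang"] == "en" and any(k in article["text"].lower() for k in ENGLISH_KEYWORDS):
                  yield article["text"]

      print(list(english_only_alerts(ARTICLES)))   # only the later English report; the French one is missed
      ```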

  • US government invokes special privilege to stop scrutiny of data mining | World news | guardian.co.uk
    http://www.guardian.co.uk/world/2013/jun/07/us-government-special-privilege-scrutiny-data

    In comments on Friday about the surveillance controversy, Obama insisted that the secret programmes were subjected “not only to congressional oversight but judicial oversight”. He said federal judges were “looking over our shoulders”.

    But civil liberties lawyers say that the use of the privilege to shut down legal challenges was making a mockery of such “judicial oversight”. Though classified information was shown to judges in camera, the citing of the precedent in the name of national security cowed judges into submission.

    “The administration is saying that even if they are violating the constitution or committing a federal crime no court can stop them because it would compromise national security. That’s a very dangerous argument,” said Ilann Maazel, a lawyer with the New York-based Emery Celli firm who acts as lead counsel in the Shubert case.

    “This has been legally frustrating and personally upsetting,” Maazel added. “We have asked the government time after time what is the limit to the state secrets privilege, whether there’s anything the government can’t do and keep it secret, and every time the answer is: no.”

  • The development of the internet, computing and free software is a godsend for authoritarian regimes the world over and for their benevolent European and American protectors:

    everything a regime would need to build an incredibly intimidating digital police state—including software that facilitates data mining and real-time monitoring of citizens—is commercially available right now. What’s more, once one regime builds its surveillance state, it will share what it has learned with others. We know that autocratic governments share information, governance strategies and military hardware, and it’s only logical that the configuration that one state designs (if it works) will proliferate among its allies and assorted others. Companies that sell data-mining software, surveillance cameras and other products will flaunt their work with one government to attract new business. It’s the digital analog to arms sales, and like arms sales, it will not be cheap. Autocracies rich in natural resources—oil, gas, minerals—will be able to afford it. Poorer dictatorships might be unable to sustain the state of the art and find themselves reliant on ideologically sympathetic patrons.

    And don’t think that the data being collected by autocracies is limited to Facebook posts or Twitter comments. The most important data they will collect in the future is biometric information, which can be used to identify individuals through their unique physical and biological attributes. Fingerprints, photographs and DNA testing are all familiar biometric data types today. Indeed, future visitors to repressive countries might be surprised to find that airport security requires not just a customs form and passport check, but also a voice scan. In the future, software for voice and facial recognition will surpass all the current biometric tests in terms of accuracy and ease of use.

    http://online.wsj.com/article/SB10001424127887324030704578424650479285218.html

  • The differences between machine learning, data mining, and statistics

    http://flowingdata.com/2012/12/10/the-differences-between-machine-learning-data-mining-and-statistics

    The differences between machine learning, data mining, and statistics
    December 10, 2012 to Statistics by Nathan Yau

    From machine learning to data mining. From statistics to probability. A lot of it seems similar, so what are the differences? Statistician William Briggs explains in an FAQ.

    What’s the difference between machine learning, deep learning, big data, statistics, decision & risk analysis, probability, fuzzy logic, and all the rest?

    None, except for terminology, specific goals, and culture. They are all branches of probability, which is to say the understanding and sometimes quantification of uncertainty. Probability itself is an extension of logic.

    #data #données #statistiques #visualisation #cartographie

  • #Big_Data Hype (and Reality) - Gregory Piatetsky-Shapiro - Harvard Business Review
    http://blogs.hbr.org/cs/2012/10/big_data_hype_and_reality.html

    hopes of predicting behaviour from #données quickly run up against the fundamental unpredictability of human beings

    The winning algorithm was a very complex ensemble of many different approaches — so complex that it was never implemented by Netflix. With three years of effort by some of the world’s best data mining scientists, the average prediction of how a viewer would rate a film improved by less than 0.1 star.
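
    To put “less than 0.1 star” in concrete terms, the commonly cited contest figures (quoted here from memory, so treat them as approximate) were a root-mean-square error of about 0.9514 for Netflix’s own Cinematch and about 0.8567 for the winning ensemble, i.e. the roughly 10% relative improvement the prize required:

    ```python
    # Back-of-the-envelope arithmetic on the (approximate) Netflix Prize RMSE figures.
    cinematch_rmse = 0.9514     # Netflix's own recommender, quiz set
    winning_rmse = 0.8567       # the winning ensemble

    absolute_gain = cinematch_rmse - winning_rmse
    relative_gain = absolute_gain / cinematch_rmse
    print(f"absolute improvement: {absolute_gain:.4f} stars")   # ~0.095, i.e. less than 0.1 star
    print(f"relative improvement: {relative_gain:.1%}")         # ~10%
    ```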

    http://blogs.hbr.org/cs/assets_c/2012/10/piatetskychart-thumb-372x181-2509.jpg

    #algorithmes #marketing

  • Do-Not-Track Movement Is Drawing Advertisers’ Fire - NYTimes.com
    http://www.nytimes.com/2012/10/14/technology/do-not-track-movement-is-drawing-advertisers-fire.html?nl=todaysheadlines&e

    “If we do away with this relevant advertising, we are going to make the Internet less diverse, less economically successful, and frankly, less interesting,” says Mike Zaneis, the general counsel for the Interactive Advertising Bureau, an industry group.

    But privacy advocates argue that in a digital ecosystem where there may be dozens of third-party entities on an individual Web page compiling and storing information about what a user reads, searches for, clicks on or buys, consumers should understand data mining’s potential costs to them and have the ability to opt out.

  • KNIME | KNIME
    http://www.knime.org/knime

    KNIME, pronounced [naim], is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models.

    (...)

    The KNIME base version already incorporates over 100 processing nodes for data I/O, preprocessing and cleansing, modeling, analysis and data mining as well as various interactive views, such as scatter plots, parallel coordinates and others. It integrates all analysis modules of the well known Weka data mining environment and additional plugins allow R-scripts to be run, offering access to a vast library of statistical routines.

    KNIME is based on the Eclipse platform and, through its modular API, easily extensible. When desired, custom nodes and types can be implemented in KNIME within hours thus extending KNIME to comprehend and provide first-tier support for highly domain-specific data. This modularity and extensibility permits KNIME to be employed in commercial production environments as well as teaching and research prototyping settings. If you would like to read a more detailed description of the software, please download the KNIME White Paper.

    KNIME is released under a dual licensing scheme. The open source license (GPL) allows KNIME to be downloaded, distributed, and used freely.

    #data_mining #statistiques