Accessing the Data | CommonCrawl

hubertguillaud CC BY 31/01/2013

An open database of the Web could give rise to the next Google - Technology Review
▻http://www.technologyreview.com/news/509931/a-free-database-of-the-entire-web-may-spawn-the-next-google

Common Crawl - ►http://commoncrawl.org - uses a web crawler to make a giant copy of the Web and make it accessible to everyone. The idea: to provide resources that would make it possible to compete with Google. Hosted in Amazon's cloud, the database can be accessed by a programmer for $25. The system is used notably by TinEye, a reverse image search engine. In any case, Common Crawl looks set to be a valuable tool for start-ups. Tags: (...)

#moteurderecherche #opensource

#Google
#USD


Fil @fil 31/01/2013

Common Crawl URL Index
▻http://commoncrawl.org/common-crawl-url-index
If you want to create a new search engine, compile a list of congressional sentiment, monitor the spread of Facebook infection through the web, or create any other derivative work, that first starts when you think “if only I had the entire web on my hard drive.” Common Crawl is that hard drive, and using services like Amazon EC2 you can crunch through it all for a few hundred dollars.
#indexation #archivage #données #web

Fil @fil 31/01/2013

Elbaz says he noticed around five years ago that researchers with new ideas about how to use Web data felt compelled to take jobs at #Google because it was the only place they could test those ideas.
#open_source versus #monopole_intellectuel (intellectual monopoly)

Fil @fil 31/01/2013

Sorry to spam, but this is really very interesting: the project (strongly) suggests using #EC2 to process the data, which is stored on #S3 (▻http://commoncrawl.org/data/accessing-the-data); it is as if everyone moved into the same machine room (at #amazon) and shared not only the data but also the processing capacity, at minimal cost.
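
To make the "same machine room" idea concrete, here is a minimal sketch of what reading the shared data from an EC2 instance might look like. It is not taken from the Common Crawl documentation: it assumes the `boto3` library, AWS credentials already configured on the instance, a bucket named `commoncrawl` in region `us-east-1`, and a `crawl-data/` prefix. The helper names `iter_keys` and `main` are hypothetical.

```python
from typing import Any, Iterator


def iter_keys(client: Any, bucket: str, prefix: str) -> Iterator[str]:
    """Yield every object key under `prefix`, following S3 pagination."""
    # list_objects_v2 returns a limited number of keys per call, so use a paginator.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching objects has no "Contents" entry.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def main() -> None:
    # Imported here so the helper above can be used without boto3 installed.
    import boto3

    # Assumptions: bucket name, region and prefix are illustrative, not from the thread.
    s3 = boto3.client("s3", region_name="us-east-1")
    for i, key in enumerate(iter_keys(s3, "commoncrawl", "crawl-data/")):
        print(key)
        if i >= 9:  # stop after the first ten keys
            break
```

Calling `main()` from a machine in the same region as the bucket is what keeps the data close to the compute, which is the point of the setup described above.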

hubertguillaud @hubertguillaud CC BY 31/01/2013

@fil, no no, comments like these are not spam!

