#web_archive

  • The Archives Unleashed Toolkit - The Archives Unleashed Project
    https://archivesunleashed.org/aut

    The Archives Unleashed Toolkit is an open-source platform for analyzing web archives built on Apache Spark, which provides powerful tools for analytics and data processing.

    "What is it for?": https://aut.docs.archivesunleashed.org

    The Archives Unleashed project website: https://archivesunleashed.org

    The Archives Unleashed project aims to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past. Supported by The Andrew W. Mellon Foundation, we develop web archive search and data analysis tools to enable scholars, librarians and archivists to access, share, and investigate recent history since the early days of the World Wide Web.

    Toolkit documentation: https://aut.docs.archivesunleashed.org/docs/home

    GitHub repository: https://github.com/archivesunleashed/aut

    #The_Archives_Unleashed #archive #WARC #internet_archive #open_source #data_visualisation #web_archive

  • BBC - Future - Why there’s so little left of the early internet
    http://www.bbc.com/future/story/20190401-why-theres-so-little-left-of-the-early-internet

    Tew, who now runs the meditation and mindfulness app Calm, indeed became a millionaire. But the homepage he created has also become something else: a living museum to an earlier internet era. Fifteen years may not seem a long time, but in terms of the internet it is like a geological age. Some 40% of the links on the Million Dollar Homepage now link to dead sites. Many of the others now point to entirely new domains, their original URL sold to new owners.

    The Million Dollar Homepage shows that the decay of this early period of the internet is almost invisible. In the offline world, the closing of, say, a local newspaper is often widely reported. But online sites die, often without fanfare, and the first inkling you may have that they are no longer there is when you click on a link to be met with a blank page.

    You could, quite reasonably, assume that if I ever needed to show proof of my time there it would only be a Google search away. But you’d be wrong. In April 2013, AOL abruptly closed down all its music sites – and the collective work of dozens of editors and hundreds of contributors over many years. Little of it remains, aside from a handful of articles saved by the Internet Archive, a San Francisco-based non-profit foundation set up in the late 1990s by computer engineer Brewster Kahle.

    It is the most prominent of a clutch of organisations around the world trying to rescue some of the last vestiges of the first decade of humanity’s internet presence before it disappears completely.

    Dame Wendy Hall, the executive director of the Web Science Institute at the University of Southampton, is unequivocal about the archive’s work: “If it wasn’t for them we wouldn’t have any” of the early material, she says. “If Brewster Kahle hadn’t set up the Internet Archive and started saving things – without waiting for anyone’s permission – we’d have lost everything.”

    One major problem with trying to archive the internet is that it never sits still. Every minute – every second – more photos, blog posts, videos, news stories and comments are added to the pile. While digital storage has fallen drastically in price, archiving all this material still costs money. “Who’s going to pay for it?” asks Dame Wendy. “We produce so much more material than we used to.”

    “The Internet Archive first started archiving pages in 1996. That’s five years after the first webpages were set up. There’s nothing from that era that was ever copied from the live web.” Even the first web page set up in 1991 no longer exists; the page you can view on the World Wide Web Consortium is a copy made a year later.

    “I think there’s been very low level of awareness that anything is missing,” Webber says. “The digital world is very ephemeral, we look at our phones, the stuff on it changes and we don’t really think about it. But now people are becoming more aware of how much we might be losing.”

    We consider the material we post onto social networks as something that will always be there, just a click of a keyboard away. But the recent loss of some 12 years of music and photos on the pioneering social site MySpace – once the most popular website in the US – shows that even material stored on the biggest of sites may not be safe.

    #Archive #Web_archive #Brewster_Kahle #Internet_Archive