• Fathom: a framework for understanding web pages

    https://hacks.mozilla.org/2017/04/fathom-a-framework-for-understanding-web-pages

    #Readability, the basis of Safari and Firefox’s reader modes, is 1,800 lines of #JavaScript and was recently shut down. Chrome’s DOM Distiller is 23,000 lines of Java. But what if understanders were cheap to write? What if Readability could be implemented in just 4 simple rules? The ruleset below does that, and it scores within 7% of Readability’s output on a selection of its own test cases, measured by Levenshtein distance. The framework enabling this is Fathom, and it drives the cost of writing understanders through the floor. Fathom is a mini-language for writing semantic extractors. The sets of rules that make up its programs are embedded in JavaScript, so you can use it client- or server-side as privacy dictates.

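    const rules = ruleset(
       // Tag candidate block elements as 'paragraphish'; scoreByLength
       // (not shown here) supplies the starting score.
       rule(dom('p,div,li,code,blockquote,pre,h1,h2,h3,h4,h5,h6'),
            props(scoreByLength).type('paragraphish')),
       // Scale the score down as the share of link text rises.
       rule(type('paragraphish'),
            score(fnode => (1 - linkDensity(fnode,
                                            fnode.noteFor('paragraphish')
                                                 .inlineLength))
                           * 1.5)),
       // Boost plain <p> elements.
       rule(dom('p'),
            score(4.5).type('paragraphish')),
       // Take the best cluster of paragraphish nodes and emit it,
       // in DOM order, as 'content'.
       rule(type('paragraphish')
               .bestCluster({splittingDistance: 3,
                             differentDepthCost: 6.5,
                             differentTagCost: 2,
                             sameTagCost: 0.5,
                             strideCost: 0}),
            out('content').allThrough(domSort))
    );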
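
    The 7% figure quoted above is measured by Levenshtein distance, the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below shows one way such a comparison could be computed. The excerpt does not say how the distance was normalized into a percentage, so relativeDifference is a hypothetical helper that divides by the longer string's length; it is an assumption, not the article's method.

    ```javascript
    // Classic dynamic-programming edit distance, keeping only two rows of the table.
    function levenshtein(a, b) {
        // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
        let prev = Array.from({length: b.length + 1}, (_, j) => j);
        for (let i = 1; i <= a.length; i++) {
            const curr = [i];
            for (let j = 1; j <= b.length; j++) {
                const substitution = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
                curr[j] = Math.min(prev[j] + 1,      // deletion
                                   curr[j - 1] + 1,  // insertion
                                   substitution);
            }
            prev = curr;
        }
        return prev[b.length];
    }

    // Hypothetical normalization (an assumption): distance as a fraction of the longer string.
    function relativeDifference(expected, actual) {
        const longest = Math.max(expected.length, actual.length);
        return longest === 0 ? 0 : levenshtein(expected, actual) / longest;
    }
    ```

    Under this normalization, two extractions would count as "within 7%" when relativeDifference(expected, actual) is at most 0.07.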

    Fathom is a JavaScript framework for #extracting meaning from web pages, identifying parts like Previous/Next buttons, address forms, and the main textual content—or classifying a page as a whole. Essentially, it scores #DOM nodes and extracts them based on conditions you specify. A Prolog-inspired system of types and annotations expresses dependencies between scoring steps and keeps state under control. It also provides the freedom to extend existing sets of scoring rules without editing them directly, so multiple third-party refinements can be mixed together.
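
    To make "scores DOM nodes and extracts them based on conditions you specify" concrete, here is a toy sketch in plain JavaScript. It does not use Fathom's API: the node objects, scoringRules, scoreNodes, and extractBest are all hypothetical names invented for illustration, and the multiplicative combination of scores is an assumption made for this example.

    ```javascript
    // Toy stand-ins for DOM nodes; a real extractor would work on actual elements.
    const nodes = [
        {tag: 'div', text: 'Home | About | Contact', linkChars: 20},
        {tag: 'p', text: 'Fathom scores DOM nodes and extracts the ones that match your rules.', linkChars: 0},
        {tag: 'li', text: 'Next', linkChars: 4},
    ];

    // Each hypothetical rule returns a multiplier for a node's score.
    const scoringRules = [
        // Longer text scores higher.
        node => Math.max(1, node.text.length / 10),
        // Text that is mostly links scores lower.
        node => node.text.length === 0 ? 0 : (1 - node.linkChars / node.text.length) * 1.5,
        // Plain paragraphs get a boost.
        node => node.tag === 'p' ? 4.5 : 1,
    ];

    // Multiply every rule's contribution into one score per node.
    function scoreNodes(candidates, rules) {
        return candidates.map(node => ({
            node,
            score: rules.reduce((score, rule) => score * rule(node), 1),
        }));
    }

    // Extract the highest-scoring node, or null if there are no candidates.
    function extractBest(candidates, rules) {
        const scored = scoreNodes(candidates, rules);
        if (scored.length === 0) return null;
        return scored.reduce((best, entry) => entry.score > best.score ? entry : best);
    }
    ```

    Calling extractBest(nodes, scoringRules) selects the long, link-free paragraph over the navigation-like entries. Because each rule is a separate function in a list, extra rules can be appended without editing the existing ones, which loosely mirrors the extensibility described above.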

    https://mozilla.github.io/fathom