babbage is a library for easily gathering data and computing summary measures in a declarative way.
The summary measure functionality allows you to compute multiple measures over arbitrary partitions of your input data simultaneously and in a single pass. You just say what you want to compute:
> (def my-fields {:y (stats :y count)
                  :x (stats :x count)
                  :both (stats #(+ (or (:x %) 0) (or (:y %) 0)) count sum mean)})
and the sets that are of interest:
> (def my-sets (-> (sets {:has-y #(contains? % :y)})
                   (complement :has-y))) ;; could also take intersections, unions
And then run it with some data:
> (calculate my-sets my-fields [{:x 1 :y 2} {:x 10} {:x 4 :y 3} {:x 5}])
{:not-has-y
{:y {:count 0}, :x {:count 2}, :both {:mean 7.5, :sum 15, :count 2}},
:has-y
{:y {:count 2}, :x {:count 2}, :both {:mean 5.0, :sum 10, :count 2}},
:all
{:y {:count 2}, :x {:count 4}, :both {:mean 6.25, :sum 25, :count 4}}}
The functions :x, :y, and #(+ (or (:x %) 0) (or (:y %) 0)) defined in
the fields map are called once per input element, no matter how many
sets the element contributes to. The function #(contains? % :y) is also
called once per input element, no matter how many unions,
intersections, complements, etc. the set :has-y contributes to.
A variety of measure functions, and structured means of combining them, are supplied; it's also easy to define additional measures.
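To make the once-per-element behaviour concrete, here is a minimal sketch in plain Clojure of one way to compute a summary in a single pass. It is not babbage's implementation and uses none of its API. The names extractors, predicates, memberships, add-value, step, and summarize are made up for illustration, and only count and sum are tracked.

```clojure
;; Plain-Clojure illustration only; this is NOT how babbage is implemented.
;; All names below are hypothetical.

(def extractors   ; one extractor function per field
  {:x    :x
   :y    :y
   :both #(+ (or (:x %) 0) (or (:y %) 0))})

(def predicates   ; one predicate per base set
  {:has-y #(contains? % :y)})

(defn memberships
  "Evaluates the :has-y predicate once, then derives every set the
  element belongs to (including the complement) from that one answer."
  [elem]
  (let [has-y? ((:has-y predicates) elem)]
    (cond-> #{:all}
      has-y?       (conj :has-y)
      (not has-y?) (conj :not-has-y))))

(def empty-acc {:count 0 :sum 0})

(defn add-value
  "Folds one extracted value into a running accumulator; nil is skipped."
  [acc v]
  (if (nil? v)
    acc
    (-> acc
        (update :count inc)
        (update :sum + v))))

(defn step
  "Processes one element: each extractor runs exactly once, and the
  extracted values are shared by every set the element is a member of."
  [result elem]
  (let [values (into {} (for [[k f] extractors] [k (f elem)]))
        sets   (memberships elem)]
    (reduce (fn [res [s k]]
              (update-in res [s k] (fnil add-value empty-acc) (get values k)))
            result
            (for [s sets, k (keys values)] [s k]))))

(defn summarize
  "Single pass over the data."
  [data]
  (reduce step {} data))

(def summary
  (summarize [{:x 1 :y 2} {:x 10} {:x 4 :y 3} {:x 5}]))

(get-in summary [:all :both])     ;; => {:count 4, :sum 25}
(get-in summary [:not-has-y :y])  ;; => {:count 0, :sum 0}
```

The counts and sums agree with the calculate output shown earlier; a mean could be derived at the end by dividing :sum by :count.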
babbage also supplies a method for running computations structured as dependency graphs; this can make gathering the initial data for summarizing simpler to express. To give an example that's probably familiar from another context:
> (defgraphfn sum [xs]
    (apply + xs))
> (defgraphfn sum-squared [xs]
    (sum (map #(* % %) xs)))
> (defgraphfn count-input :count [xs]
    (count xs))
> (defgraphfn mean [count sum]
    (double (/ sum count)))
> (defgraphfn mean2 [count sum-squared]
    (double (/ sum-squared count)))
> (defgraphfn variance [mean mean2]
    (- mean2 (* mean mean)))
> (run-graph {:xs [1 2 3 4]} sum variance sum-squared count-input mean mean2)
{:sum 10
:count 4
:sum-squared 30
:mean 2.5
:variance 1.25
:mean2 7.5
:xs [1 2 3 4]}
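As a rough picture of what running a computation as a dependency graph involves, the same example can be written in plain Clojure as a map of nodes, each naming the keys it needs. This is only a sketch under that assumption: nodes and run-nodes are hypothetical names, and nothing here reflects how defgraphfn or run-graph actually work internally.

```clojure
;; Plain-Clojure illustration only; `nodes` and `run-nodes` are hypothetical
;; and are NOT part of babbage.

(def nodes
  {:sum         {:deps [:xs]                 :f (fn [xs] (apply + xs))}
   :sum-squared {:deps [:xs]                 :f (fn [xs] (apply + (map #(* % %) xs)))}
   :count       {:deps [:xs]                 :f (fn [xs] (count xs))}
   :mean        {:deps [:count :sum]         :f (fn [n s] (double (/ s n)))}
   :mean2       {:deps [:count :sum-squared] :f (fn [n ss] (double (/ ss n)))}
   :variance    {:deps [:mean :mean2]        :f (fn [m m2] (- m2 (* m m)))}})

(defn run-nodes
  "Repeatedly evaluates every node whose dependencies are already present
  in the result map, until no nodes are left. Throws if no progress can
  be made (a missing input or a cycle)."
  [nodes inputs]
  (loop [result  inputs
         pending nodes]
    (if (empty? pending)
      result
      (let [ready (filter (fn [[_ {:keys [deps]}]]
                            (every? #(contains? result %) deps))
                          pending)]
        (when (empty? ready)
          (throw (ex-info "Unresolvable dependencies"
                          {:pending (keys pending)})))
        (recur (reduce (fn [res [k {:keys [deps f]}]]
                         ;; look up each dependency's value and apply the node fn
                         (assoc res k (apply f (map res deps))))
                       result
                       ready)
               (apply dissoc pending (map first ready)))))))

(:variance (run-nodes nodes {:xs [1 2 3 4]}))  ;; => 1.25
```

The order in which the nodes are listed does not matter here, just as the order of the functions passed to run-graph above does not follow their dependencies.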
Options are provided for parallel, sequential, and lazy computation of
the elements of the result map, and for resolving the dependency graph
in advance of running the computation for a given input, either at
runtime or at compile time.
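The idea of resolving the graph ahead of time can be illustrated with a small plain-Clojure sketch: the evaluation order depends only on the shape of the graph and the names of the inputs, so it can be computed once and reused for many inputs. The names graph, plan-order, and run-in-order are hypothetical and are not babbage functions; babbage's own options for this are not shown here.

```clojure
;; Plain-Clojure illustration only; `graph`, `plan-order` and `run-in-order`
;; are hypothetical names and are NOT part of babbage.

(def graph
  {:total {:deps [:xs]       :f (fn [xs] (apply + xs))}
   :n     {:deps [:xs]       :f (fn [xs] (count xs))}
   :avg   {:deps [:n :total] :f (fn [n total] (double (/ total n)))}})

(defn plan-order
  "Returns the node names in an order where every node comes after its
  dependencies. Needs only the graph and the input key names, not the
  input values."
  [graph input-keys]
  (loop [known   (set input-keys)
         order   []
         pending (set (keys graph))]
    (if (empty? pending)
      order
      (let [ready (filter #(every? known (get-in graph [% :deps])) pending)]
        (when (empty? ready)
          (throw (ex-info "Unresolvable dependencies" {:pending pending})))
        (recur (into known ready)
               (into order (sort ready))   ; sorted for a deterministic order
               (apply disj pending ready))))))

(defn run-in-order
  "Evaluates the nodes sequentially in a previously planned order."
  [graph order inputs]
  (reduce (fn [res k]
            (let [{:keys [deps f]} (graph k)]
              (assoc res k (apply f (map res deps)))))
          inputs
          order))

(def order (plan-order graph [:xs]))   ; resolved once: [:n :total :avg]

(run-in-order graph order {:xs [1 2 3 4]})  ;; => {:xs [1 2 3 4], :n 4, :total 10, :avg 2.5}
(run-in-order graph order {:xs [10 20]})    ;; => {:xs [10 20], :n 2, :total 30, :avg 15.0}
```

Planning once and then reducing over the plan keeps the per-input work to plain function application, which is the benefit the paragraph above alludes to.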