  1. SimplyE 2.0
  2. SIMPLY-2177

Attach coarse-grained location to analytics events

    • Type: Story
    • Resolution: Done
    • Priority: Medium
    • 3.0.2cm
    • None
    • None
    • S19 SIMPLY September 3 - 17, S20 SIMPLY Sep 17 - Oct 1, S21 SIMPLY Oct 1 - Oct 15

      A library should be able to measure the geographic distribution of usage of its circulation manager. The system must respect patron privacy through the process of measuring location information, storing it, using it to generate reports, and finally expunging it.

      This story covers the minimal product changes needed to implement the NYPL program that's motivating this work. The design keeps an eye toward future expansion for use by other libraries, and excludes operational issues specific to how NYPL will use the feature.

      Taking the measurement

      The default behavior for a library is not to associate requests with locations at all. If a library chooses to associate incoming requests with locations, there are three ways of doing this:

      1. The IP address of the request can be mapped to a latitude/longitude pair using a geolocation database. This gives a precise but inaccurate measurement of the location of the device at the time the request was made.
      2. The ILS that authenticated a patron's request (assuming it is authenticated) may know something about the patron's home address. This gives an accurate but imprecise measurement of the neighborhood where the patron lives.
      3. The ILS that authenticated a patron's request (assuming it is authenticated) may know something about the patron's preferred branch of the library system. This gives an accurate and precise measurement of the neighborhood where the patron most often makes use of the library. This is probably near their home or near their workplace.

      In general, we value accuracy over precision, but any given library will only have one or two ways of taking this measurement.
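One way to read "accuracy over precision" is as a preference order among whichever methods a library has enabled. The following is only a sketch of that reading: the method names, the `measure_location` function, and the callables in `measurers` are all hypothetical, not part of the circulation manager.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical method names, ordered from most to least preferred.
# The two accurate methods come before the precise-but-inaccurate IP lookup.
PREFERENCE = ("preferred_branch", "home_address", "ip_geolocation")


def measure_location(
    enabled: Sequence[str],
    measurers: Dict[str, Callable[[], Optional[str]]],
) -> Optional[str]:
    """Return the first successful measurement among the enabled methods.

    With nothing enabled (the default), no location is ever returned.
    """
    for method in PREFERENCE:
        if method not in enabled:
            continue
        location = measurers[method]()
        if location is not None:
            return location
    return None
```

A library that enables nothing gets `None` for every request, which matches the default behavior described above.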

      For MVP, the only technique we need to support is #2. In particular, we need to be able to extract a ZIP code from a Sierra patron record.
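As a rough illustration of the ZIP extraction, the sketch below assumes the patron record has already been fetched and parsed into a dict with an `addresses` list whose entries carry free-text `lines`. That shape is an assumption for this example, as is the `extract_zip` name; the real Sierra record may need different handling.

```python
import re
from typing import Optional

# A 5-digit US ZIP code at the end of a line, optionally followed by a
# ZIP+4 suffix. Only the 5-digit part is captured, which keeps the
# measurement coarse-grained.
ZIP_RE = re.compile(r"\b(\d{5})(?:-\d{4})?\s*$")


def extract_zip(patron_record: dict) -> Optional[str]:
    """Return the first ZIP code found in the patron's addresses, or None.

    Assumes (hypothetically) a record shaped like:
    {"addresses": [{"lines": ["123 Main St", "NEW YORK, NY 10018"]}]}
    """
    for address in patron_record.get("addresses", []):
        # The ZIP code normally sits on the last line, so search bottom-up.
        for line in reversed(address.get("lines", [])):
            match = ZIP_RE.search(line)
            if match:
                return match.group(1)
    return None
```

Returning `None` when nothing matches lets the caller treat the request as having no location, the same as a library that never enabled the feature.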

      Associating the measurement with a circulation event

      If a request is in fact associated with a location, and the request spawns a circulation event, the location may be associated with the circulation event – it depends on the type of event.

      To start with, we're going to associate location with these three types of events:

      • circulation_manager_check_out
      • circulation_manager_fulfill
      • open_book

      Theoretically, any event initiated by the circulation manager in the course of serving an HTTP request can have an associated location, but "new_patron" is the only one not covered here where that information might be useful.
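The rule that only some event types carry a location could be expressed as a small allow-list. This is a sketch; `LOCATION_EVENT_TYPES` and `location_for_event` are illustrative names, and only the three event type strings come from this story.

```python
from typing import Optional

# The three event types this story associates with a location.
LOCATION_EVENT_TYPES = frozenset({
    "circulation_manager_check_out",
    "circulation_manager_fulfill",
    "open_book",
})


def location_for_event(event_type: str, request_location: Optional[str]) -> Optional[str]:
    """Return the request's location only for event types on the allow-list."""
    if event_type in LOCATION_EVENT_TYPES:
        return request_location
    return None
```

Adding "new_patron" later would then be a one-line change to the set.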

      Locations are not associated with patron data in the circulation manager database (though location information is probably part of a patron's ILS record). I'm pretty sure that circulation events are never currently associated with patrons, but just in case: no circulation event may be associated with both a patron ID and a location.
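The "just in case" rule can be enforced at construction time. The class below is a stand-in written for this example, not the circulation manager's actual event model; its field names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CirculationEvent:
    """Illustrative event record; field names are hypothetical."""

    event_type: str
    patron_id: Optional[int] = None
    location: Optional[str] = None

    def __post_init__(self) -> None:
        # An event may carry a patron ID or a location, never both.
        if self.patron_id is not None and self.location is not None:
            raise ValueError(
                "a circulation event may not have both a patron ID and a location"
            )
```

In a real schema the same invariant could also be backed by a database-level check constraint.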

      Propagating the event

      Only local analytics sinks will receive the location associated with a circulation event.

      The location associated with a circulation event must not be propagated through any third-party service, e.g. Google Analytics. Those analytics sinks will receive circulation events without a location. Basically, we just don't trust third-party services with that data. We don't know what other data a third-party service has available or how useful a location- and date-tagged event would be in "improving" it.
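A minimal sketch of the propagation rule, assuming each analytics sink can be flagged as local or third-party. The `Event`, `Sink`, and `propagate` names and the `is_local` flag are hypothetical, not the existing analytics provider interface.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Event:
    event_type: str
    location: Optional[str] = None


@dataclass
class Sink:
    name: str
    is_local: bool  # False for third-party services such as Google Analytics
    collect: Callable[[Event], None]


def propagate(event: Event, sinks: Iterable[Sink]) -> None:
    """Send the event to every sink, removing the location for non-local ones."""
    stripped = replace(event, location=None)
    for sink in sinks:
        sink.collect(event if sink.is_local else stripped)
```

Stripping the location in the dispatcher, rather than in each third-party sink, means a newly added sink cannot leak locations by forgetting to strip them.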

      Reporting on events

      Reporting is the main value proposition of third-party analytics services. Since we're not sending location information to analytics services, we need to improve our local reporting functionality.

      We currently have an admin interface UI for generating a CSV file containing information on all circulation events for a specific date. This UI needs to be generalized and improved so that we can specify a date range. We'll also need to add a filter by location, so that we only see events that happened in certain locations.
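The generalized report could look roughly like this: an inclusive date range plus an optional set of locations. The sketch works on in-memory tuples purely for illustration; the `events_csv` name and the column layout are assumptions, and the real report would query the database.

```python
import csv
import io
from datetime import date
from typing import Iterable, Optional, Set, Tuple

# (event date, event type, location or None) -- an illustrative row shape.
Row = Tuple[date, str, Optional[str]]


def events_csv(
    events: Iterable[Row],
    start: date,
    end: date,
    locations: Optional[Set[str]] = None,
) -> str:
    """Render events between start and end (inclusive) as CSV text.

    If `locations` is given, only events tagged with one of them are kept.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "type", "location"])
    for day, event_type, location in events:
        if not (start <= day <= end):
            continue
        if locations is not None and location not in locations:
            continue
        writer.writerow([day.isoformat(), event_type, location or ""])
    return out.getvalue()
```

Passing the same date for `start` and `end` reproduces the existing single-day report.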

      From a UI perspective this shouldn't be too difficult, but we know that the existing CSV report has serious performance problems even for a single day's activity: it takes so long to generate the report that the admin interface HTTP request times out. We don't know whether the bottleneck is in the database or in the generation of a CSV with tens of thousands of rows. An aggregated report would have far fewer rows, which would help if CSV generation is the problem.
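To show why an aggregated report has far fewer rows, here is a sketch that collapses individual events into one count per (event type, location) pair. The `aggregate` function and the tuple shape are hypothetical; a real version would more likely push the grouping into the database query, and whether that fixes the timeout depends on where the bottleneck actually is.

```python
from collections import Counter
from datetime import date
from typing import Iterable, List, Optional, Tuple

# (event date, event type, location or None) -- an illustrative row shape.
Row = Tuple[date, str, Optional[str]]


def aggregate(events: Iterable[Row]) -> List[Tuple[str, str, int]]:
    """Count events per (type, location), returning sorted report rows.

    Events without a location are grouped under the empty string.
    """
    counts = Counter(
        (event_type, location or "") for _, event_type, location in events
    )
    return sorted((t, loc, n) for (t, loc), n in counts.items())
```

The output size is bounded by the number of distinct type and location combinations rather than by the number of events.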

      We may need to set up a system where reports are generated in the background on demand, or generated on a regular basis.

      Expunging the location data

      The location associated with a circulation event should be wiped one year after the event is collected. The event itself can stay; only the associated location needs to go. This can be handled with a reaper.
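A reaper for this could be a single UPDATE that nulls out old locations and leaves the events in place. The sketch below uses SQLite and a hypothetical `circulation_events` table with `start` and `location` columns, with timestamps stored as ISO 8601 strings; the real schema and database will differ.

```python
import sqlite3
from datetime import datetime, timedelta

# One year, approximated as 365 days for this sketch.
RETENTION = timedelta(days=365)


def reap_locations(conn: sqlite3.Connection, now: datetime) -> int:
    """Clear the location on events older than the retention period.

    The event rows themselves are kept. Returns the number of rows changed.
    """
    # ISO 8601 strings in the same format sort chronologically, so a plain
    # string comparison against the cutoff is enough here.
    cutoff = (now - RETENTION).isoformat()
    cursor = conn.execute(
        "UPDATE circulation_events SET location = NULL "
        "WHERE location IS NOT NULL AND start < ?",
        (cutoff,),
    )
    conn.commit()
    return cursor.rowcount
```

Because the statement only touches rows that still have a location, running it repeatedly on a schedule is harmless.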

            leonardrichardson Leonard Richardson [X] (Inactive)
            andrewshelton Andrew Shelton