  1. SimplyE 2.0
  2. SIMPLY-2266

Current rules for importing OPDS feeds can lead to very expensive metadata wrangler queries


    • S19 SIMPLY September 3 - 17

      When you run a monitor that does an OPDS import, the timestamp for the monitor is set to the latest <updated> tag found in one of the <entry> tags.

      If there are no <entry> tags, the timestamp is left alone.
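As a rough illustration of the rule described above, here is a minimal Python sketch. It is not the actual monitor code: the names `parse_timestamp` and `next_timestamp` are hypothetical, and it assumes timestamps always use the `YYYY-MM-DDTHH:MM:SSZ` form.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_timestamp(text):
    # Assumes the 'Z'-suffixed UTC format; other Atom date forms are not handled.
    return datetime.strptime(text.strip(), "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )


def next_timestamp(feed_xml, current):
    """Return the monitor's new timestamp under the current rule (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    # Only <updated> tags nested inside an <entry> are considered.
    stamps = [
        parse_timestamp(tag.text)
        for tag in root.findall(f"{ATOM}entry/{ATOM}updated")
        if tag.text
    ]
    if not stamps:
        # No entries: the timestamp is left alone.
        return current
    return max(stamps)
```

With a feed that has no entries, `next_timestamp` returns `current` unchanged, so the next run asks for the same time range again.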

      Now, consider how that works when going against the metadata wrangler.

      In most library collections, new books are always being added and processed, so the timestamp keeps moving forward, a little bit at a time.

      But the Enki collection hasn't had a new book added since 2017. That's fine, but it means that the metadata sync process for an Enki collection keeps asking for all the new books since 2017.

      This puts enough strain on the metadata wrangler to temporarily bring down one of the servers. The other server has more RAM and is able to serve an empty feed in response:

       

      <feed xmlns:bibframe="http://bibframe.org/vocab/" xmlns:drm="http://librarysimplified.org/terms/drm" xmlns:app="http://www.w3.org/2007/app" xmlns:bib="http://bib.schema.org/" xmlns:opds="http://opds-spec.org/2010/catalog" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:simplified="http://librarysimplified.org/terms/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:schema="http://schema.org/" xmlns="http://www.w3.org/2005/Atom">
       <id>https://metadata.librarysimplified.org/Ulc1cmFRPT06Ulc1cmFRPT0%3D/updates?last_update_time=2017-11-21T21%3A59%3A03Z</id>
       <title>Enki Collection Updates for [circ manager]</title>
       <updated>2019-09-03T15:09:08Z</updated>
       <link href="https://metadata.librarysimplified.org/abcde/updates?last_update_time=2017-11-21T21%3A59%3A03Z" rel="self"/>
      </feed>
      
       
      

      I'll look into whether the metadata wrangler side can be improved, but if you ask for all the updates since 2017 and there haven't been any, it seems reasonable to bump up the timestamp a little bit, so that you don't ask the exact same question two hours later.

      One obvious rule is to consider the <feed>'s <updated> tag as a potential timestamp. I don't know why we didn't do that initially – maybe we were worried about losing updates due to work that was incomplete at the time the feed was generated. But we could subtract a month from the feed-level <updated> tag and still solve this problem, without (I think) introducing any risk of lost updates.
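One way the proposed rule might look, as a hedged Python sketch rather than a finished design. The function name `next_timestamp_with_fallback` and the `safety_margin` parameter are hypothetical, "a month" is approximated as 30 days, and the guard against moving the timestamp backwards is an added assumption. This reading uses the feed-level `<updated>` only when the feed has no entries.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_timestamp(text):
    # Assumes the 'Z'-suffixed UTC format; other Atom date forms are not handled.
    return datetime.strptime(text.strip(), "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )


def next_timestamp_with_fallback(feed_xml, current, safety_margin=timedelta(days=30)):
    """Hypothetical helper: fall back to the feed-level <updated> for empty feeds."""
    root = ET.fromstring(feed_xml)
    entry_stamps = [
        parse_timestamp(tag.text)
        for tag in root.findall(f"{ATOM}entry/{ATOM}updated")
        if tag.text
    ]
    if entry_stamps:
        # Existing behavior: the latest entry-level <updated> wins.
        return max(entry_stamps)

    feed_updated = root.find(f"{ATOM}updated")
    if feed_updated is None or not feed_updated.text:
        return current

    # Back off by the safety margin to allow for work that was incomplete
    # when the feed was generated, and never move the timestamp backwards
    # (the latter is an assumption, not stated in the ticket).
    candidate = parse_timestamp(feed_updated.text) - safety_margin
    return max(current, candidate)
```

For the empty Enki feed shown earlier, with a stored timestamp of 2017-11-21T21:59:03Z and a feed-level `<updated>` of 2019-09-03T15:09:08Z, this sketch would return 2019-08-04T15:09:08Z, so the next request covers about a month rather than nearly two years.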

            leonardrichardson Leonard Richardson [X] (Inactive)