Add timestamp support

author Oliver Matthews <oliver@codersoffortune.net>

Sat, 2 Nov 2019 09:46:58 +0000 (09:46 +0000)

committer Oliver Matthews <oliver@codersoffortune.net>

Sat, 2 Nov 2019 09:46:58 +0000 (09:46 +0000)
diff --git a/README.md b/README.md

index 5687736..ddc207c 100644
--- a/README.md
+++ b/README.md
@@ -2,12 +2,11 @@
  Script for archiving thingiverse things. Due to this being a glorified webscraper, it's going to be very fragile.
  
  ## Usage:
-`thingy_grabber.py user_name collection_name`
+`thingy_grabber.py [-v] user_name collection_name`
  
  Where `user_name` is the name of the creator of the collection (not nes. your name!) and `collection_name` is the name of the collection you want.
  
  This will create a series of directorys `user-collection/thing-name` for each thing in the collection.
-If a thing's directory already exists, it will be skipped.
  
  If for some reason a download fails, it will get moved sideways to `thing-name-failed` - this way if you rerun it, it will only reattmpt any failed things.
  
@@ -16,10 +15,14 @@ python3, beautifulsoup4, requests, lxml
  
  ## Current features:
  - can download an entire collection, creating seperate subdirs for each thing in the collection
+- If you run it again with the same settings, it will check for updated files and only update what has changed. This should make it suitible for syncing a collection on a cronjob
+CAVEAT: This script will *not delete files*. So if there has been an update and some files have been moved or renamed, they will be mixed in with the old stuff.
  
-## Todo features:
+
+## Todo features (maybe):
  - download a single thing
  - download things by designer
  - less perfunctory error checking / handling
  - attempt to use -failed dirs for resuming
-- detect updated models and redownload them
+- pull down images as well
+- handle old/deleted files on update
diff --git a/thingy_grabber.py b/thingy_grabber.py

index f2b57d8..587d47e 100755
--- a/thingy_grabber.py
+++ b/thingy_grabber.py
@@ -117,25 +117,57 @@ def download_thing(thing):
      try:
          os.mkdir(title)
      except FileExistsError:
-        print("Directory for {} ({}) already exists, skipping".format(thing, title))
-        return
+        pass
+
      print("Downloading {} ({})".format(thing, title))
      os.chdir(title)
+    last_time = None
+
+    try:
+        with open('timestamp.txt', 'r') as fh:
+            last_time = fh.readlines()[0]
+        if VERBOSE:
+            print("last downloaded version: {}".format(last_time))
+    except FileNotFoundError:
+        # Not run on this thing before.
+        if VERBOSE:
+            print('Directory for thing already exists, checking for update.')
+        last_time = None
  
      file_links = file_soup.find_all('a', {'class':'file-download'})
-    files = [("{}{}".format(URL_BASE, x['href']), x["title"]) for x in file_links]
+    new_last_time = last_time
+    new_file_links = []
+
+    for file_link in file_links:
+        timestamp = file_link.find_all('time')[0]['datetime']
+        if VERBOSE:
+            print("Checking {} (updated {})".format(file_link["title"], timestamp))
+        if not last_time or timestamp > last_time:
+            new_file_links.append(file_link)
+        if not new_last_time or timestamp > new_last_time:
+            new_last_time = timestamp
+
+    if last_time and new_last_time <= last_time:
+        print("Thing already downloaded. Skipping.")
+    files = [("{}{}".format(URL_BASE, x['href']), x["title"]) for x in new_file_links]
  
      try:
          for url, name in files:
+            if VERBOSE:
+                print("Downloading {} from {}".format(name, url))
              data_req = requests.get(url)
              with open(name, 'wb') as handle:
                  handle.write(data_req.content)
+        # now write timestamp
+        with open('timestamp.txt', 'w') as fh:
+            fh.write(new_last_time)
      except Exception as exception:
          print("Failed to download {} - {}".format(name, exception))
          os.chdir(base_dir)
          os.rename(title, "{}_failed".format(title))
          return
  
+
      os.chdir(base_dir)
  
  def main():
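
The update check added to `download_thing` can be read in isolation as a small filter over (name, timestamp) pairs. The following is a minimal sketch, not the script's own code: the helper name `select_updated` and the sample file names are hypothetical, and it assumes, as the patch does, that the `datetime` attributes are uniformly formatted ISO 8601 strings, so that plain string comparison orders them chronologically.

```python
from typing import List, Optional, Tuple

FileStamp = Tuple[str, str]  # (file name, ISO 8601 timestamp string)


def select_updated(files: List[FileStamp],
                   last_time: Optional[str]) -> Tuple[List[FileStamp], Optional[str]]:
    """Hypothetical helper mirroring the loop in the patch.

    Returns the files stamped later than last_time, plus the newest
    timestamp seen (the value that would be written to timestamp.txt).
    """
    newest = last_time
    updated = []
    for name, stamp in files:
        # No previous run (last_time is None) means everything is new.
        if last_time is None or stamp > last_time:
            updated.append((name, stamp))
        if newest is None or stamp > newest:
            newest = stamp
    return updated, newest


if __name__ == "__main__":
    # Sample data is made up for illustration.
    sample = [
        ("base.stl", "2019-10-01T10:00:00+00:00"),
        ("lid.stl", "2019-11-01T10:00:00+00:00"),
    ]
    updated, newest = select_updated(sample, "2019-10-15T00:00:00+00:00")
    print(updated, newest)
```

When nothing is newer than the stored timestamp, the returned list is empty, so a download loop over it does no work and the stored timestamp is unchanged.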