
How to batch download documents

  • maxwellapex
  • Aug 16
  • 1 min read

This post is about batch downloading papers through a paid account. The algorithm is not difficult.

1.      Enter the user name and password, and pass the cookies to a customized “browser”.

2.      Once logged in, fetch the HTML of the website. This shows you how many pages, links or items are listed.

3.      Crawl those links and filter out the invalid ones. Sometimes (as in this case) a link brings me to another page, so I need to dig deeper.

4.      Download the documents, and use the name shown in the URL to rename the files. You don’t want files that look like file1, file2…

5.      Check for missing and duplicated files. Log those for further processing.

6.      Most of the time more than 90% of the documents download successfully. Then you can repeat the whole process, or deal with the rest manually.

7.      To speed things up, consider multithreading. I did this because I had thousands of tiny files to download.
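
The steps above can be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not the script I actually ran: the login form field names (`username`, `password`), the `.pdf` suffix filter, and all function names are assumptions made for the example, and a real site will need its own login details and link rules. When a link leads to another page first (step 3), the same `document_links` helper could be applied again to that intermediate page.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import unquote, urlencode, urljoin, urlparse
from urllib.request import HTTPCookieProcessor, build_opener


def login(login_url, username, password):
    """Step 1: log in and return an opener that keeps the session cookies.
    The form field names are hypothetical; check the site's login form."""
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    form = urlencode({"username": username, "password": password}).encode()
    opener.open(login_url, data=form, timeout=30).close()
    return opener


class LinkCollector(HTMLParser):
    """Steps 2-3: collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def document_links(html, base_url, suffix=".pdf"):
    """Step 3: resolve relative links and keep those that look like documents."""
    parser = LinkCollector()
    parser.feed(html)
    urls = [urljoin(base_url, h) for h in parser.hrefs]
    return [u for u in urls if urlparse(u).path.lower().endswith(suffix)]


def filename_from_url(url):
    """Step 4: name the file after the last path segment of its URL."""
    return unquote(os.path.basename(urlparse(url).path))


def download_all(urls, fetch, out_dir, workers=8):
    """Steps 4 and 7: download in parallel threads; return the URLs that failed."""
    os.makedirs(out_dir, exist_ok=True)

    def job(url):
        try:
            data = fetch(url)
            with open(os.path.join(out_dir, filename_from_url(url)), "wb") as f:
                f.write(data)
            return None
        except Exception:
            return url  # keep the failure so it can be logged and retried

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [u for u in pool.map(job, urls) if u is not None]


def audit(urls, out_dir):
    """Step 5: list expected files that are missing and names requested twice."""
    names = [filename_from_url(u) for u in urls]
    present = set(os.listdir(out_dir))
    missing = sorted(set(names) - present)
    duplicated = sorted({n for n in names if names.count(n) > 1})
    return missing, duplicated
```

With an opener from `login`, the `fetch` argument could be something like `lambda url: opener.open(url, timeout=30).read()`. The URLs returned by `download_all` and the names returned by `audit` are the ones to log and retry (steps 5 and 6).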

I am pretty sure there are lots of tools and scripts that do the same thing. But considering the time it takes to find them, writing a new one can save me time.


 
 
 
