python - Downloading then deleting many files which together are very large, issues?



I'm planning a project in Python that makes heavy use of the PyPDF2 and urllib modules: large-scale (text) analysis of PDF files. My current plan is to download the files using urllib, save them to my computer, then open them and extract the text using PyPDF2.

The .pdf files range in size from 10 to 500 MB each, which (with ~16,000 PDF files) means that the scale of this project will be in the GB to TB range. The extracted data will not be large, just a set of tags / counts of word associations, but the .pdf files themselves will be an issue.
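As a rough check of the scale described above, the bounds can be worked out directly. This is only a back-of-the-envelope sketch: it assumes exactly 16,000 files, decimal units (1 GB = 1,000 MB), and that every file sits at one end of the 10 to 500 MB range. The names are illustrative.

```python
# Back-of-the-envelope bounds on total download volume.
# Assumptions: exactly 16,000 files, decimal units (1 GB = 1,000 MB).
NUM_FILES = 16_000
MIN_MB, MAX_MB = 10, 500

lower_gb = NUM_FILES * MIN_MB / 1_000       # every file at the minimum size
upper_tb = NUM_FILES * MAX_MB / 1_000_000   # every file at the maximum size

print(f"lower bound: {lower_gb:.0f} GB, upper bound: {upper_tb:.0f} TB")
```

The real total will fall somewhere between these two figures, depending on the size distribution of the files.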

I'm not planning to download them all at once, but iteratively, so that my system is not overwhelmed. Below is a high-level workflow:

for pdf_url in all_list:
    pdf_file = download_using_urllib(pdf_url)
    text = extract_text_using_pypdf2(pdf_file)
    delete(pdf_file)
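The workflow above could be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: `process_via_temp_file`, its `extract` callback, and the `opener` parameter are hypothetical names introduced here. Each PDF is streamed to a temporary file, handed to an extraction function, and deleted again even if extraction fails, so only one file is on disk at any time.

```python
import os
import shutil
import tempfile
from urllib.request import urlopen


def process_via_temp_file(url, extract, opener=urlopen):
    """Download url to a temporary file, run extract(path), then delete the file.

    extract is a caller-supplied function (hypothetical) that takes a file
    path and returns whatever should be kept, e.g. the extracted text.
    """
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        # Stream the response to disk in chunks rather than reading it whole.
        with os.fdopen(fd, "wb") as out, opener(url) as response:
            shutil.copyfileobj(response, out)
        return extract(path)
    finally:
        # Remove the file whether or not extraction succeeded.
        os.remove(path)
```

Calling this in a loop over the URL list keeps disk usage bounded by the largest single file, though the total volume written to the drive is still the sum of all file sizes.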

Most of the code is already written, and I can post it if it is relevant. My question is: can downloading and then deleting up to 8 TB of data on my hard drive cause problems for my computer? As you can see, I am not storing it all at once, but I am a little worried because I have never done anything at this scale before. If this is an issue, how can I structure my project to avoid it?

Thank you!

I would say that you can simply keep each PDF in memory as you download it. That can be a good way to handle this: you hold the file in memory, read it, and then discard it. This spares your hard drive from a very large number of intensive writes.
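A minimal sketch of that in-memory approach, under a few assumptions: it uses the standard library's `urllib.request` for downloading and PyPDF2's `PdfReader` (PyPDF2 2.0 or later is assumed) for extraction, and the function names `fetch_to_memory`, `extract_text`, and `process_all` are hypothetical. Note that a 500 MB PDF will occupy at least that much RAM while it is being processed.

```python
import io
from urllib.request import urlopen


def fetch_to_memory(url, opener=urlopen):
    """Download url and return its bytes wrapped in an in-memory buffer."""
    with opener(url) as response:
        return io.BytesIO(response.read())


def extract_text(pdf_buffer):
    """Extract text from an in-memory PDF using PyPDF2 (third-party)."""
    # Imported lazily; assumes PyPDF2 >= 2.0, which provides PdfReader.
    from PyPDF2 import PdfReader
    reader = PdfReader(pdf_buffer)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def process_all(urls, handle_text, opener=urlopen, extractor=extract_text):
    """Process PDFs one at a time so only one is held in memory at once."""
    for url in urls:
        buffer = fetch_to_memory(url, opener)
        try:
            handle_text(url, extractor(buffer))
        finally:
            buffer.close()  # release the PDF bytes before the next download
```

Here `handle_text` is whatever the caller uses to store the result, for example a function that writes the tag counts to a database. Nothing is written to disk by this sketch itself.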

You can also use an alternative library instead of urllib; it is more comfortable than urllib and, as a bonus, works on both Python 2 and 3.
