Skip to content

gh-111791: delegating extraction to zipfile module's extractall() method #111824

Closed
sepastian wants to merge 10 commits into
python:mainfrom
sepastian:shutil_unpack_archive_false_positives
Closed

gh-111791: delegating extraction to zipfile module's extractall() method #111824
sepastian wants to merge 10 commits into
python:mainfrom
sepastian:shutil_unpack_archive_false_positives

Conversation

@sepastian
Copy link
Copy Markdown

@sepastian sepastian commented Nov 7, 2023

shutil.unpack_archive: deletage extracting ZIP files to zipfile (#111791)

gh-111791: rely on `zipfile.extractall` when extraxting ZIP archives, 
           preventing false negatives and code duplication

As reported in #111791, if the path of a file inside a ZIP file contains "..", e.g. myfile..txt (probably misspelled), shutil.unpack_archive will silently skip extracting the file, because it wrongly assumes a relative path.

This is problematic for two reasons:

  1. The current implementation of shutil.unpack_archive wrongly identifies relative path components. Scanning for ".." does not tell whether a path contains relative components, or not; one must scan for "../" instead.
  2. Besides, files inside a ZIP archive described through relative paths should be extracted, see below.

Python's own zipfile module and the unzip are handling relative path components "../"and names containins ".." correctly. For reference, the man unzip page says:

For security reasons, unzip normally removes parent dir path components (../) from the names of extracted file. This safety feature (new for version 5.50) prevents unzip from accidentally writing files to sensitive areas outside the active extraction folder tree head.

Solution: delegate extracting ZIP archives to Python's own zipfile.extractall method.

Appendix

The following example shows, how extracting a ZIP archive containing paths containing relative components "../" and files with names containing ".." differs in shutil.unpack_archive, zipfile.extractall and the Linux tool unzip.

# Create some files and directories.
$ cd /tmp
$ mkdir -p a/b/c
$ touch a/a.txt a/a..txt a/b/b.txt a/b/b..txt a/b/c/c.txt a/b/c/c..txt
$ find a
a
a/b
a/b/b.txt
a/b/b..txt
a/b/c
a/b/c/c.txt
a/b/c/c..txt
a/a.txt
a/a..txt

# Create a ZIP file for testing.
$ cd /tmp/a/b
$ zip /tmp/test.zip ../a.txt /tmp/a/b/c/c.txt ../../a/b/b.txt

# Inspect the newly created ZIP file.
$ zipinfo /tmp/test.zip
zipinfo /tmp/test.zip 
Archive:  /tmp/test.zip
Zip file size: 482 bytes, number of entries: 3
-rw-rw-r--  3.0 unx        0 bx stor 23-Nov-07 16:03 ../a.txt
-rw-rw-r--  3.0 unx        0 bx stor 23-Nov-07 16:04 tmp/a/b/c/c.txt
-rw-rw-r--  3.0 unx        0 bx stor 23-Nov-07 16:03 ../../a/b/b.txt
3 files, 0 bytes uncompressed, 0 bytes compressed:  0.0%

# Prepare directories holding extraction results.
$ mkdir -p /tmp/extract/unzip /tmp/extract/zipfile /tmp/extract/shutil

# Extract using 'unzip' extracts all files 
# and warns about relative path components.
$ cd /tmp/extract/unzip
$ unzip /tmp/test.zip
Archive:  /tmp/test.zip
warning:  skipped "../" path component(s) in ../a.txt
 extracting: a.txt                   
 extracting: tmp/a/b/c/c.txt         
warning:  skipped "../" path component(s) in ../../a/b/b.txt
 extracting: a/b/b.txt               
$ find
.
./a
./a/b
./a/b/b.txt
./tmp
./tmp/a
./tmp/a/b
./tmp/a/b/c
./tmp/a/b/c/c.txt
./a.txt

# Extract using 'zipfile' (CLI).
$ cd /tmp/extract/zipfile
$ python3 -m zipfile -e /tmp/test.zip .
$ find
.
./a
./a/b
./a/b/b.txt
./tmp
./tmp/a
./tmp/a/b
./tmp/a/b/c
./tmp/a/b/c/c.txt
./a.txt

# Extract using 'shutil.unpack_archive'.
$ cd /tmp/extract/shutil
$ python3 -c 'import shutil; shutil.unpack_archive("/tmp/test.zip",".")' 
$ find
.
./tmp
./tmp/a
./tmp/a/b
./tmp/a/b/c
./tmp/a/b/c/c.txt

# As can be seen, and as can be confirmed using `diff`,
# zipfile and unzip produce the same result,
# while shutil silently skips paths containing "..",
# i.e. both names containing ".." and paths containing
# actual relative components.

# Diff.
$ cd /tmp/extract
$ diff -r zipfile/ unzip/

$ diff -r unzip/ shutil/
Only in unzip/: a
Only in unzip/: a..txt
Only in unzip/: a.txt
Only in unzip/: tmp

…to zipfile

shutil.unpack_archive fails, if file name contains '..'; zipfile handles everything correctly, i.e. in the same way than 'unzip'; let zipfile unpack archives, instead of reinventing the wheel here
@ghost
Copy link
Copy Markdown

ghost commented Nov 7, 2023

All commit authors signed the Contributor License Agreement.
CLA signed

@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented Nov 7, 2023

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented Nov 7, 2023

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@sepastian sepastian changed the title fixing #111791: delegating extraction to zipfile delegating extraction to zipfile, fixing #111791 Nov 7, 2023
@sepastian sepastian changed the title delegating extraction to zipfile, fixing #111791 gh-111791: delegating extraction to zipfile module's extractall() method Nov 7, 2023
@sepastian
Copy link
Copy Markdown
Author

This is a bug that needs to be fixed. Any progress on this?

@python-cla-bot
Copy link
Copy Markdown

python-cla-bot Bot commented Apr 18, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

@github-actions
Copy link
Copy Markdown

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions Bot added the stale Stale PR or inactive for long period of time. label Apr 14, 2026
@serhiy-storchaka
Copy link
Copy Markdown
Member

Thank you for your contribution @sepastian and sorry for missing your PR. Similar solution was applied in #146591.

I have noticed that your PR fixes also a part of the docs which I missed. We will fix it too.

@sepastian
Copy link
Copy Markdown
Author

I detected this problem and provided a fix for this 3 years ago. I asked you specifically, @serhiy-storchaka, about the status of this, that was two years ago. You never replied to me. Although you referenced my PR elsewhere. See #111824 (comment), for example.

And now you are merging a copy of my solution, which is only 2 months old, and make this an official contribution?

What's going on here?

@serhiy-storchaka
Copy link
Copy Markdown
Member

I admit my guilt.

That issue looked like an ordinary bug affecting not many people, until a security issue was reported few months ago, to which this was the simplest solution. I completely forgot about your PR! It seems my memory is failing me. I reviewed almost 2000 PRs in the last 3 years, have more than 100 PRs in my backlog, and add new PRs to it every day. Add to this that some issues need several days to work. It's inevitable that I'll fail many reviews. My apologies.

@sepastian
Copy link
Copy Markdown
Author

I have noticed that your PR fixes also a part of the docs which I missed. We will fix it too.

What does that mean exactly? I take it for granted that you will apply my PR here, at least.

@sepastian
Copy link
Copy Markdown
Author

I admit my guilt.

That issue looked like an ordinary bug affecting not many people, until a security issue was reported few months ago, to which this was the simplest solution. I completely forgot about your PR! It seems my memory is failing me. I reviewed almost 2000 PRs in the last 3 years, have more than 100 PRs in my backlog, and add new PRs to it every day. Add to this that some issues need several days to work. It's inevitable that I'll fail many reviews. My apologies.

Thank you for your time and all the work you put into Python - seriously! My PR is not even tiny compared to that. Still, I put in a lot of work. If my PR had been found not good enough for merging, I would totally understand. If I had received some feedback at the very least, also good. But I have been ignored for three years, even after asking several times. Just silence.

Today I have to learn that my idea and solution has been taken and merged under a different name. I understand what you are saying and, among all the things you are doing, this happened. Again, highly appreciate what you do for this project.

This is not about blaming someone, I'm just wondering how this can be fixed. Because the way it is now, I'm mad and disappointed. From what you wrote above, I get the impression you are even updating the docs with what I wrote, without mentioning me. This is not acceptable.

@serhiy-storchaka
Copy link
Copy Markdown
Member

serhiy-storchaka commented May 18, 2026

Your PR was not not good enough. It just fell out of the attention span. Bad luck.

For now, there are 2000 open PRs and 69000 closed PRs. Almost 1000 new PRs created each month, 70% of them are closed in the first month, 90% are closed in the first year, but there are PRs waiting for several years.

I try to clean languishing PRs. Your PR somehow fell through my filters. This is a sad lesson for me.

I updating the docs in different way: #149994.

@StanFromIreland
Copy link
Copy Markdown
Member

Almost 1000 new PRs created each month, 70% of them are closed in the first month, 90% are closed in the first year, but there are PRs waiting for several years.

FYI, I did some calculations a few weeks ago (excluding backports):

pr_fate

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting review stale Stale PR or inactive for long period of time.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants