Hacker News new | past | comments | ask | show | jobs | submit login

This doesn't quite get all the way there, but the failure is being caused by the closure of a _io.FileIO due to GC, which happens to have the same fd as the sqlite database. The closure happens in the middle of a SQLite operation, which causes a subsequent flock() call to fail.

strace log from the failure:

    open("/home/travis/apprise-api/.coverage", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 31
    fstat(31, {st_mode=S_IFREG|0644, st_size=192512, ...}) = 0
    fstat(31, {st_mode=S_IFREG|0644, st_size=192512, ...}) = 0
    lseek(31, 0, SEEK_SET)                  = 0
    read(31, "SQLite format 3\0\4\0\1\1\0@  \0\0\0\n\0\0\0\274"..., 100) = 100
    write(3, "00001299 0001 6169 execute('prag"..., 54) = 54
    write(3, "Executing 'pragma journal_mode=o"..., 36) = 36
    write(3, "self: <SqliteDb @0x7fa0feef9fd0 "..., 86) = 86
    fcntl(31, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
    fcntl(31, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510}) = 0
    fcntl(31, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
    fstat(31, {st_mode=S_IFREG|0644, st_size=192512, ...}) = 0
    fstat(31, {st_mode=S_IFREG|0644, st_size=192512, ...}) = 0
    lseek(31, 0, SEEK_SET)                  = 0
    read(31, "SQLite format 3\0\4\0\1\1\0@  \0\0\0\n\0\0\0\274"..., 1024) = 1024
    lseek(31, 8192, SEEK_SET)               = 8192
    read(31, "\r\3-\0\4\0\316\0\3^\1X\0035\0\316\0\245\0v\0v\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
    lseek(31, 9216, SEEK_SET)               = 9216
    read(31, "\r\0\0\0\4\0\370\0\0\370\3d\0035\1!\0\277\0\277\2\36\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
    lseek(31, 14336, SEEK_SET)              = 14336
    read(31, "\r\0\0\0\4\0\320\0\3\315\1\313\1\244\0\320\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
    fcntl(31, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
    write(3, "00001299 0001 6169 execute retur"..., 76) = 76
    write(3, "00001299 0001 6170 execute('prag"..., 53) = 53
    write(3, "Executing 'pragma synchronous=of"..., 35) = 35
    lseek(31, 0, SEEK_CUR)                  = 15360
    close(31)                               = 0
    write(3, "self: <SqliteDb @0x7fa0feef9fd0 "..., 86) = 86
    write(3, "00001299 0001 6170 execute retur"..., 76) = 76
    write(3, "00001299 0001 6168 _connect retu"..., 40) = 40
    write(3, "00001299 0001 6167 __enter__ ret"..., 116) = 116
    write(3, "00001299 0001 6171 execute('sele"..., 82) = 82
    write(3, "Executing 'select tracer from tr"..., 68) = 68
    write(3, "self: <SqliteDb @0x7fa0feef9fd0 "..., 86) = 86
    fcntl(31, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = -1 EBADF (Bad file descriptor)
    open("/home/travis/apprise-api/.coverage", O_RDONLY|O_CLOEXEC) = 31
    fstat(31, {st_mode=S_IFREG|0644, st_size=192512, ...}) = 0
    ioctl(31, TCGETS, 0x7fff3bfa4920)       = -1 ENOTTY (Inappropriate ioctl for device)
    lseek(31, 0, SEEK_CUR)                  = 0
    read(31, "SQLite format 3\0\4\0\1\1\0@  \0\0\0\n\0\0\0\274"..., 4096) = 4096
    close(31)                               = 0
    write(3, "EXCEPTION from execute: disk I/O"..., 39) = 39
    write(3, "self: <SqliteDb @0x7fa0feef9fd0 "..., 86) = 86
    write(3, "00001299 0001 6172 __exit__(<cla"..., 208) = 208
    write(3, "00001299 0001 6173 close()\n", 27) = 27
    fstat(31, 0x7fff3bfa4220)               = -1 EBADF (Bad file descriptor)
    close(31)                               = -1 EBADF (Bad file descriptor)
Notice the close(31) right after the lseek. This close(31) is not generated by SQLite; rather, it's being generated by the GC finalization of an unrelated file object:

    #0  internal_close (self=<optimized out>) at ./Modules/_io/fileio.c:126
    #1  _io_FileIO_close_impl (self=<optimized out>) at ./Modules/_io/fileio.c:171
    #2  _io_FileIO_close (self=<optimized out>, _unused_ignored=<optimized out>) at ./Modules/_io/clinic/fileio.c.h:23
    #3  0x00007ffff7975be7 in _PyCFunction_FastCallDict (func_obj=func_obj@entry=0x7fffeefaee58, args=args@entry=0x7fffffff7590, nargs=0, kwargs=kwargs@entry=0x0) at Objects/methodobject.c:192
    #4  0x00007ffff791b89e in _PyObject_FastCallDict (func=func@entry=0x7fffeefaee58, args=args@entry=0x7fffffff7590, nargs=<optimized out>, kwargs=kwargs@entry=0x0) at Objects/abstract.c:2313
    #5  0x00007ffff791c8a4 in PyObject_CallMethodObjArgs (callable=0x7fffeefaee58, name=<optimized out>) at Objects/abstract.c:2759
    #6  0x00007ffff7a976b4 in buffered_close (self=0x7fffee02ae08, args=<optimized out>) at ./Modules/_io/bufferedio.c:538
    #7  0x00007ffff7975be7 in _PyCFunction_FastCallDict (func_obj=func_obj@entry=0x7fffedf7eab0, args=args@entry=0x7fffffff77c0, nargs=0, kwargs=kwargs@entry=0x0) at Objects/methodobject.c:192
    #8  0x00007ffff791b89e in _PyObject_FastCallDict (func=func@entry=0x7fffedf7eab0, args=args@entry=0x7fffffff77c0, nargs=<optimized out>, kwargs=kwargs@entry=0x0) at Objects/abstract.c:2313
    #9  0x00007ffff791c8a4 in PyObject_CallMethodObjArgs (callable=0x7fffedf7eab0, callable@entry=0x7fffee02ae08, name=<optimized out>) at Objects/abstract.c:2759
    #10 0x00007ffff7a90def in iobase_finalize (self=0x7fffee02ae08) at ./Modules/_io/iobase.c:266
    #11 0x00007ffff7a59e4f in finalize_garbage (collectable=0x7fffffff7950) at Modules/gcmodule.c:806
    #12 collect (generation=generation@entry=1, n_collected=n_collected@entry=0x7fffffff79f0, n_uncollectable=n_uncollectable@entry=0x7fffffff79f8, nofail=nofail@entry=0) at Modules/gcmodule.c:1005
    #13 0x00007ffff7a5a63b in collect_with_callback (generation=1) at Modules/gcmodule.c:1128
    #14 0x00007ffff7a5ae8b in collect_generations () at Modules/gcmodule.c:1151
    #15 _PyObject_GC_Alloc (basicsize=<optimized out>, use_calloc=0) at Modules/gcmodule.c:1729
    #16 _PyObject_GC_Malloc (basicsize=<optimized out>) at Modules/gcmodule.c:1739
    #17 0x00007ffff7a5afdd in _PyObject_GC_New (tp=tp@entry=0x7ffff7d53ea0 <PyGen_Type>) at Modules/gcmodule.c:1751
    #18 0x00007ffff794766f in gen_new_with_qualname (qualname=0x7ffff4c1b450, name=0x7ffff4c218f0, f=0x7ffff292daf8, type=0x7ffff7d53ea0 <PyGen_Type>) at Objects/genobject.c:802
    #19 PyGen_NewWithQualName (f=0x7ffff292daf8, name=0x7ffff4c218f0, qualname=0x7ffff4c1b450) at Objects/genobject.c:830
    #20 0x00007ffff7a07f31 in _PyEval_EvalCodeWithName (_co=0x7ffff4c23270, globals=globals@entry=0x7ffff4cdd9d8, locals=locals@entry=0x0, args=<optimized out>, argcount=1, kwnames=kwnames@entry=0x0, kwargs=0x7ffff4c32930, 
        kwcount=0, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7fffede2cfd0, name=0x7ffff4c218f0, qualname=0x7ffff4c1b450) at Python/ceval.c:4150
    #21 0x00007ffff7a081d7 in fast_function (kwnames=0x0, nargs=<optimized out>, stack=<optimized out>, func=0x7fffedc03f28) at Python/ceval.c:4978
    #22 call_function (pp_stack=pp_stack@entry=0x7fffffff7c60, oparg=oparg@entry=1, kwnames=kwnames@entry=0x0) at Python/ceval.c:4858
    #23 0x00007ffff7a0a633 in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3335
Note that normal SQLite3 closes look like this:

    #0  0x00007ffff769b560 in __close_nocancel () from /lib/x86_64-linux-gnu/libpthread.so.0
    #1  0x00007ffff3a23d8c in robust_close (lineno=29378, h=<optimized out>, pFile=0x113cfd8) at sqlite3.c:28631
    #2  closeUnixFile (id=id@entry=0x113cfd8) at sqlite3.c:29378
    #3  0x00007ffff3a24f43 in unixClose (id=0x113cfd8) at sqlite3.c:29427
    #4  0x00007ffff3a4858b in sqlite3OsClose (pId=0x113cfd8) at sqlite3.c:17944
    #5  sqlite3PagerClose (pPager=0x113ce68) at sqlite3.c:47961
    #6  0x00007ffff3a5930b in sqlite3BtreeClose (p=0xfb8088) at sqlite3.c:58266
    #7  0x00007ffff3a59501 in sqlite3LeaveMutexAndCloseZombie (db=0x1260598) at sqlite3.c:134102
    #8  0x00007ffff3a59ae2 in sqlite3Close (db=0x1260598, forceZombie=0) at sqlite3.c:134045
    #9  0x00007ffff3cc8fe2 in pysqlite_connection_close (self=0x7fffedf5fd50, args=<optimized out>) at /tmp/python-build.20181021062245.3423/Python-3.6.7/Modules/_sqlite/connection.c:337
    #10 0x00007ffff7975be7 in _PyCFunction_FastCallDict (func_obj=0x7fffedf7eab0, args=0x7fffede2e890, nargs=0, kwargs=kwargs@entry=0x0) at Objects/methodobject.c:192
    #11 0x00007ffff7975eb7 in _PyCFunction_FastCallKeywords (func=func@entry=0x7fffedf7eab0, stack=stack@entry=0x7fffede2e890, nargs=<optimized out>, kwnames=kwnames@entry=0x0) at Objects/methodobject.c:294
    #12 0x00007ffff7a082c1 in call_function (pp_stack=pp_stack@entry=0x7fffffff9640, oparg=oparg@entry=0, kwnames=kwnames@entry=0x0) at Python/ceval.c:4837
EDIT 1: OK, I found the line of code that allocates this file, at line 398 in apprise_api/api/views.py:

             with NamedTemporaryFile() as f:
                # Write our content to disk
                f.write(config.encode())
                f.flush()
If you disable this block of code everything runs fine (although the tests fail). It's kinda creepy that the NamedTemporaryFile is leaking - it really shouldn't...



OK, I found the root cause.

The test harness uses mock to cause NamedTemporaryFile to fail in a few places:

        with patch('tempfile._TemporaryFileWrapper') as mock_ntf:
            mock_ntf.side_effect = OSError()
Unfortunately tempfile.NamedTemporaryFile is not properly set up to detect failures from tempfile._TemporaryFileWrapper:

    try:
        file = _io.open(fd, mode, buffering=buffering,
                        newline=newline, encoding=encoding)
        return _TemporaryFileWrapper(file, name, delete)
    except BaseException:
        _os.unlink(name)
        _os.close(fd)
        raise
Notice how `file` itself isn't cleaned up in the event that _TemporaryFileWrapper throws; instead, the underlying `fd` is closed directly. If `file` was actually opened correctly, it will now dangle, and some time later `file` will get GCed, attempt to close its fd, and fireworks ensue.

I don't know who's fault this ultimately is, but it does illustrate a possible danger of mock testing when messing with the internals of other libraries.


That would do it. Calling _os.close() on an fd that is now owned by a file object is a big no-no.

Both pieces of code are broken. "except BaseException" is wrong because it'll catch KeyboardInterrupt and such, after the file object exists. The mock is wrong because it's introducing an exception that the original code can never raise and was written to assume wouldn't be raised, thus turning a corner case rare bug (fd reuse when a user ^Cs a program) into a bigger problem.

The tempfile code should change to only catch Exceptions, not BaseExceptions; it may opt to additionally catch all BaseExceptions and only unlink the name, but that's probably not sufficient anyway since a KeyboardInterrupt may be raised before the whole try block. Really, the unlink should happen in a wider scope try block, and the _os.close should be more tightly constrained to the _io.open call. And the exception matching should be narrower, because it's safer to leak an fd than to accidentally close it twice.

Then either the test harness should find some other way to implement that, or they can ask the tempfile people to move the _TemporaryFileWrapper out of the try block so they can keep using that hook point, if it hasn't been already by that point.


It's still like that in CPython: https://github.com/python/cpython/blob/master/Lib/tempfile.p...

If there's a strong enough case that this is broken (and the KeyboardInterrupt example is a good one), I think we should file a bug about it.

EDIT: Created issue39318: https://bugs.python.org/issue39318. I think the possibility of this failing due to a KeyboardInterrupt, even if remote, is sufficient to consider this a bug (and with KeyboardInterrupt, it would be extremely subtle and hard to reproduce - more reason to get it fixed).


Impressive debugging expertise!

relevant history:

https://bugs.python.org/issue21058

https://bugs.python.org/issue26385


Wacky: the patch given in the initial submission of issue21058 is correct, but the patch that was actually committed incorrectly placed the wrapper inside the try block.


Interesting, that bug/patch was written by a guy I worked with at weta. He was always pushing code past its breaking point. I remember reading this bug before.

Looks like his fix didn’t have this bug. It may have been modified later, his patch was from 2014.


Very nice work.

> I don't know who's fault this ultimately is, but it does illustrate a possible danger of mock testing when messing with the internals of other libraries.

It doesn't look like there's any way that _TemporaryFileWrapper() could throw an OSError. I'd consider that mock a bug.


It's also why you keep your try/catch blocks as small as possible. In this case, if the return was outside, which it probably should've been, then the error would've bubbled up instead of causing I/O failure.


Here's Ned's blog post describing the solution: https://nedbatchelder.com/blog/202001/bug_915_solved.html


Wow, awesome find.

macOS and probably iOS has an undocumented set of filesystem calls which require a "cookie." You call open() and provide a large integer of your choice (probably random); to close the file you need to provide the same integer. This fixed a lot of crashes in CoreData, where some rando was closing its SQLite fd.


> macOS and probably iOS has an undocumented set of filesystem calls which require a "cookie." You call open() and provide a large integer of your choice (probably random); to close the file you need to provide the same integer.

I want to learn more. Is there anything I can read about this? (I tried searching but I just keep on getting stuff about HTTP cookies not this.)


guarded_open_np guarded_close_np change_fdguard_np

https://opensource.apple.com/source/xnu/xnu-4903.241.1/bsd/k...


I too am very interested in this, so tried searches with "macos", then "mach" and "xnu".

This led me to tentatively conclude that it's not in XNU.

On the one hand, http://newosxbook.com/src.jl?tree=xnu-1504.15.3&file=/bsd/sy... seems interesting.

On the other hand, https://github.com/apple/darwin-xnu/blob/0a798f6738bc1db0128... is suggested to be a red herring because of https://github.com/apple/darwin-xnu/blob/0a798f6738bc1db0128.... I have no idea if these are connected but they might be.

I'm not sure if my GitHub and/or Google queries aren't finding the relevant bits of magic though; had (only) a couple minutes and got curious.



No, those are libc functions that let you use stdio (fprintf, fwrite, fscanf, etc.) with arbitrary "file" implementations. glibc has these too, but these have nothing to do with the underlying syscalls.


Then the problem isn't the fd leaking, the problem is that the fd is being closed twice. If SQLite owns that fd then it's obviously no longer owned by the NamedTemporaryFile, so something closed it. So either NamedTemporaryFile is closing an underlying fd twice (that sounds like a horrible Python bug, it should never be allowed) or some piece of code somewhere did one of those horrible "close all fds from 3 to 1024" things at one point, and SQLite then reused an fd that was still assigned to a live File object.


Very curious about this wonderful magic. How were you able to get the call stack from the strace line?


I basically set up my GDB with commands to stop on a specific pattern of "lseek, then close", and if the pattern isn't met it just automatically continues the program.

This is what the gdb script looks like:

    set height 0

    catch syscall close
    catch syscall read
    catch syscall lseek
    disable 1 2

    commands 2
     disable 1 2
     continue
    end

    commands 3
     if $rdi == 31
      enable 1 2
      continue
     else
      continue
     end
    end
The lseek catchpoint (3) enables both read and close catchpoints; if the read catchpoint (2) is hit first it disables both and continues. This way we look for lseek followed by close without intervening reads.

It generates a few false positives but otherwise fairly quickly stops on the right syscall, at which point I could backtrace and prod the live program.


This is fantastic. I thought I was pretty good at gdb, but this is next level (and it isn't complicated).

I've never seen any resource on "how to gdb a really gnarly bug"; this is a great example.


Debuggers are great! They're printf debugging, except that your printf can respond to the program state almost arbitrarily at runtime; not only that, they can modify the program state and can do so from a lot more lenient environment than your compiler would usually allow you to do. An advanced user can end up doing more programming in the debugger than outside of it.


Glad to see people writing in C-like languages finally get a taste of lisp programs ;)


This is very nice and interesting. Does that also work for lldb? Would sth like this also be possible with dtrace?

Can you recommend any good resources to learn more about this? Should I just study the LLDB/GDB references? Or maybe if you think there are not much good resources on this, maybe it would be a good idea if you could write a blog post about this?


> Does that also work for lldb?

LLDB doesn't appear to have a way to set a breakpoint at syscalls, unfortunately. If it's dynamically linked, however, it will probably go through your platform's libc so you can probably achieve the same effect by breaking on all of the syscall wrapper functions.


I am also interested in this. So far this video [0] from CPPCon'18 is the most complete and fresh I managed to find. Other than that, GDB manual is okay (LLDB's is not complete, as seems to be the case with most LLVM-based tools).

[0] <https://youtu.be/V1t6faOKjuQ>


and you can do it in a rr trace on linux I think if its hard to reproduce which sometimes happen with similar programs


... wow. For other people, RR is a debugging tool for recording and playing back program execution. During the replay phase, you get great gdb debugger support, including running code backwards to find bugs.

It's not a toy: it can debug Google Chrome, Libre Office, and QEMU. However, RR is Linux-only.

Thanks alehander42!

Ref: https://en.wikipedia.org/wiki/Rr_(debugging)


Would https://www.python.org/dev/peps/pep-0446/ have anything to do with this?

We recently upgraded to celery 4 which had a side effect of file descriptors leaking between master and child processes, and database fds were being used for redis. Still no idea how it's happening, but the linked PEP seems to suggest there may be issues that I don't quite understand.


> If you disable this block of code everything runs fine (although the tests fail). It's kinda creepy that the NamedTemporaryFile is leaking - it really shouldn't...

My guess is the tempfile is leaking due to the attribute caching here https://github.com/python/cpython/blob/master/Lib/tempfile.p... . But that doesn't really explain why the file is getting closed twice. (Presuming it's getting closed twice as otherwise the file descriptor wouldn't be reallocated)

Edit: realised it's just attaching the closer object to the function not the full object so avoiding this.


This whole wrapper nonsense to get with working with NamedTemporaryFile seems really suspect in general.

My gut instinct would be to rewrite the section OP found as a standard NamedTemporaryFile instantiation, then close it manually. The __getattr__ hack feels like a real code smell.


Nice find. Is there a good way to find out which line of code does a particular system call in the strace log? E.g. how did you find out that the close(31) is caused by the GC of the mock?


See my answer above (https://news.ycombinator.com/item?id=22031581) - the gist is that I set up GDB to break on a syscall matching the pattern of the offending close(31) (without also breaking on the zillions of legitimate close(31) calls generated by SQLite). Then I backtraced at that point and observed that this particular close wasn't from SQLite (which I'd already suspected), and the backtrace had several lines indicating that the close was from the GC finalization of a FileIO object.

The second step was to figure out which FileIO object was being GCed. I ran the program again and broke inside the FileIO __init__ function (which is written in C), conditioned on the fd being 31. From that breakpoint I did a Python backtrace (using the tools from https://github.com/python/cpython/blob/master/Misc/gdbinit) to figure out the Python code that was responsible for the allocation.

The third step was futzing around with the NamedTemporaryFile code until I could determine why it was leaking a file. This took a little while because I did not initially realize that it had been mocked. Once I did, the solution presented itself.


I suppose adding f.close() right after the flush might help.


The file is in a with block - it should be automatically closed at the end of the block.


Yes, but if it fixes the issue then there's something off with NamedTemporaryFile and its __exit__. You mentioned the file descriptor was being closed by a GC run, right?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: