Eliminate lock c->dirtylk which is the cause of fossil hanging while
taking a snapshot, as mentioned in BUGS of fossilcons(8).

Dirtylk prevents any thread from dirtying new blocks while flushThread
is writing old dirty blocks to disk. Deadlock occurs when a thread T
tries to dirty a block B while holding a lock on an already-dirty
block A. T can't proceed until c->dirtylk is unlocked, which won't
happen until flushThread has written out block A; but block A can't be
written out until T unlocks it.

Dirtylk serves no useful purpose: whenever cacheFlush is called with
wait=1, a write lock is already held on the "epoch lock" fs->elk, which
prevents any new 9p requests from touching the file system. Any
dirtying of blocks is limited to internal fossil activities for
requests which were already in progress when cacheFlush was called,
and thus strictly bounded even without c->dirtylk.

Reference: /n/sources/patch/applied/fossil-snap-deadlock
Date: Thu Mar 22 12:55:40 CET 2012
Signed-off-by: miller@hamnavoe.com

--- /sys/src/cmd/fossil/cache.c	Thu Mar 22 12:55:24 2012
+++ /sys/src/cmd/fossil/cache.c	Thu Mar 22 13:09:24 2012
@@ -25,7 +25,6 @@
 struct Cache
 {
 	VtLock	*lk;
-	VtLock	*dirtylk;
 	int 	ref;
 	int	mode;
 
@@ -163,7 +162,6 @@
 	nbl = nblocks * 4;
 
 	c->lk = vtLockAlloc();
-	c->dirtylk = vtLockAlloc();	/* allowed to dirty blocks */
 	c->ref = 1;
 	c->disk = disk;
 	c->z = z;
@@ -1100,14 +1098,12 @@
 		return 1;
 	assert(b->iostate == BioClean);
 
-	vtLock(c->dirtylk);
 	vtLock(c->lk);
 	b->iostate = BioDirty;
 	c->ndirty++;
 	if(c->ndirty > (c->maxdirty>>1))
 		vtWakeup(c->flush);
 	vtUnlock(c->lk);
-	vtUnlock(c->dirtylk);
 
 	return 1;
 }
@@ -2088,12 +2084,6 @@
 void
 cacheFlush(Cache *c, int wait)
 {
-	/*
-	 * Lock c->dirtylk so that more blocks aren't being dirtied
-	 * while we try to write out what's already here.
-	 * Otherwise we might not ever finish!
-	 */
-	vtLock(c->dirtylk);
 	vtLock(c->lk);
 	if(wait){
 		while(c->ndirty){
@@ -2106,7 +2096,6 @@
 	}else if(c->ndirty)
 		vtWakeup(c->flush);
 	vtUnlock(c->lk);
-	vtUnlock(c->dirtylk);
 }
 
 /*
--- /sys/man/8/fossilcons	Thu Mar 22 12:55:29 2012
+++ /sys/man/8/fossilcons	Thu Mar 22 12:55:26 2012
@@ -1205,13 +1205,3 @@
 .EX
 snap -a -s /snapshot/2003/1220/0700 -d /archive/2003/1220
 .EE
-.SH BUGS
-It is prudent to avoid taking a snapshot at the same time as an
-archival dump.
-.I Fossil
-has been seen to sometimes hang when they collide.
-Snapshots are taken when
-.BI time(0)/60% interval
-is zero, so
-an interval of 60 will take snapshots on the hour.
-It's easiest to schedule the archival dumps to happen not exactly on the hour.
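
Illustration (not part of the patch): a minimal sketch of the wait-for
cycle described in the commit message, using POSIX mutexes as stand-ins
for fossil's VtLock. Every name here (dirtylk, blocka, isheld,
oldschemecycle, newschemecycle) is hypothetical and none of it is fossil
code. The roles of T and flushThread are acted out in a single thread
with pthread_mutex_trylock, so the sketch reports the cycle instead of
hanging.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins; these are not fossil's types or functions. */
static pthread_mutex_t dirtylk = PTHREAD_MUTEX_INITIALIZER;	/* plays c->dirtylk */
static pthread_mutex_t blocka = PTHREAD_MUTEX_INITIALIZER;	/* lock on dirty block A */

/* Probe a mutex without blocking: 1 if it is held, 0 if it was free. */
static int
isheld(pthread_mutex_t *m)
{
	int rc;

	rc = pthread_mutex_trylock(m);
	if(rc == 0){
		pthread_mutex_unlock(m);
		return 0;
	}
	assert(rc == EBUSY);
	return 1;
}

/*
 * Old scheme: T holds A and needs dirtylk to dirty B, while the
 * flusher holds dirtylk and needs A in order to write it out.
 * Returns 1 if both sides would block, i.e. a wait-for cycle exists.
 */
int
oldschemecycle(void)
{
	int twaits, flushwaits;

	pthread_mutex_lock(&blocka);	/* T: holds already-dirty block A */
	pthread_mutex_lock(&dirtylk);	/* cacheFlush: holds dirtylk while flushing */
	twaits = isheld(&dirtylk);	/* T: would block trying to dirty B */
	flushwaits = isheld(&blocka);	/* flushThread: would block trying to write A */
	pthread_mutex_unlock(&dirtylk);
	pthread_mutex_unlock(&blocka);
	return twaits && flushwaits;
}

/*
 * Patched scheme: there is no dirtylk, so T dirties B and releases A,
 * after which the flusher can lock A. Returns 1 if the flusher is
 * still blocked, 0 if it can make progress.
 */
int
newschemecycle(void)
{
	int flushwaits;

	pthread_mutex_lock(&blocka);	/* T: holds A */
	/* T dirties B here without needing any global lock */
	pthread_mutex_unlock(&blocka);	/* T: done with A */
	flushwaits = isheld(&blocka);	/* flushThread: A is now available */
	return flushwaits;
}
```

With these definitions, oldschemecycle() reports the cycle and
newschemecycle() reports none. The sketch models only the lock ordering;
it says nothing about the bounded-dirtying argument involving fs->elk.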