Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on reset for nextBatchID by assert in FSTLevelDBMutationQueue.start #2237

Closed
shosokawa opened this issue Jan 7, 2019 · 4 comments
Closed
Assignees
Milestone

Comments

@shosokawa
Copy link

Describe your environment

  • Xcode version: Version 10.1 (10B61)
  • Firebase SDK version: 5.15.0
  • Firebase Component: Firestore
  • Component version: 0.16.1

Describe the problem

An app on a certain device was crashed by HARD_ASSERT in this code repeatedly. I couldn't find a way to handle or recover this error except uninstalling the app once to delete persistent data.

metadata.lastAcknowledgedBatchId = kFSTBatchIDUnknown;
} else {
BatchId lastAcked = metadata.lastAcknowledgedBatchId;
if (lastAcked >= nextBatchID) {
HARD_ASSERT([self isEmpty], "Reset nextBatchID is only possible when the queue is empty");
lastAcked = kFSTBatchIDUnknown;
metadata.lastAcknowledgedBatchId = lastAcked;
_db.currentTransaction->Put([self keyForCurrentMutationQueue], metadata);
}
}

I'm not sure why data was stored like that. Although I suppose that there might be chances as requirement to reset nextBatchID with isEmpty == false.

Steps to reproduce:

I couldn't find a way to create same situation.

Stack trace for the crash:

Fatal Exception: NSInternalInconsistencyException
0  CoreFoundation                 0x196a90ec4 __exceptionPreprocess
1  libobjc.A.dylib                0x195c61a40 objc_exception_throw
2  CoreFoundation                 0x1969a6b3c +[_CFXNotificationTokenRegistration keyCallbacks]
3  Foundation                     0x1974e3980 -[NSAssertionHandler handleFailureInFunction:file:lineNumber:description:]
4  Alpaca                         0x104a548d0 firebase::firestore::util::internal::Fail(char const*, char const*, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (hard_assert_apple.mm:34)
5  Alpaca                         0x104a549e0 firebase::firestore::util::internal::Fail(char const*, char const*, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, char const*) (hard_assert_apple.mm:54)
6  Alpaca                         0x1049f370c -[FSTLevelDBMutationQueue start] (FSTLevelDBMutationQueue.mm:133)
7  Alpaca                         0x104a01320 -[FSTLocalStore startMutationQueue] (FSTLocalStore.mm:124)
8  Alpaca                         0x104a011e4 -[FSTLocalStore start] (FSTLocalStore.mm:119)
9  Alpaca                         0x1049e7720 -[FSTFirestoreClient initializeWithUser:settings:] (FSTFirestoreClient.mm:245)
10 Alpaca                         0x1049e97ec std::__1::__function::__func<-[FSTFirestoreClient initWithDatabaseInfo:settings:credentialsProvider:userExecutor:workerQueue:]::$_1, std::__1::allocator<-[FSTFirestoreClient initWithDatabaseInfo:settings:credentialsProvider:userExecutor:workerQueue:]::$_1>, void ()>::operator()() (string:1255)
11 Alpaca                         0x1049b7850 firebase::firestore::util::AsyncQueue::ExecuteBlocking(std::__1::function<void ()> const&) (atomic:921)
12 Alpaca                         0x1049bef0c firebase::firestore::util::internal::DispatchAsync(NSObject<OS_dispatch_queue>*, std::__1::function<void ()>&&)::$_0::__invoke(void*) (executor_libdispatch.mm:58)
13 libdispatch.dylib              0x1964ca484 _dispatch_client_callout
14 libdispatch.dylib              0x196471bd0 _dispatch_lane_serial_drain$VARIANT$mp
15 libdispatch.dylib              0x196472718 _dispatch_lane_invoke$VARIANT$mp
16 libdispatch.dylib              0x19647aeb8 _dispatch_workloop_worker_thread
17 libsystem_pthread.dylib        0x1966ad0dc _pthread_wqthread
18 libsystem_pthread.dylib        0x1966afcec start_wqthread

screen shot 2019-01-07 at 12 17 11

@wilhuff
Copy link
Contributor

wilhuff commented Jan 7, 2019

So unfortunately, the stack trace for this crash isn't informative: if this assertion were to fail it can only fail during startup exactly in the way it has.

Do you still have a device that's failing this way? If so we may be able to discover what's happening by examining the stored persistent data.

The failure here indicates that there's a mismatch between the mutation queue contents and the metadata about the queue. These should always be in sync (they're modified together transactionally) and if they're not in sync it points to a serious error somewhere.

Have you changed Firestore SDK versions recently?

@shosokawa
Copy link
Author

Do you still have a device that's failing this way?

I'm sorry, I already have made the device recovered and I don't know a way to reproduce that.

Have you changed Firestore SDK versions recently?

We fixed version of Firestore SDK to 5.12.0 for a while in order to prevent a problem around #2138. It seems that #2138 had been resolved, we updated version to 5.15.0 recently. Before fixing version to 5.12.0, we used 5.13.0 once.

@wilhuff
Copy link
Contributor

wilhuff commented Jan 9, 2019

OK. Looking through the changes around those versions I think I've figured out you could get in this state.

In #2029 (released with Firebase 5.13.0) we changed the way we store writes and documents that have been acknowledged by the server but not yet observed via an active listen. Previously we would hold these in our outbound queue and keep track in memory of these held write acknowledgements. On restart these held writes would be dropped because the on-disk structures were insufficient to reconstruct the in-memory tracking structures. As a result, prior to Firebase 5.13.0, it was possible to observe flicker across restarts, where a value would fleetingly revert to a state prior to a local write if the app happened to stop in this in-between state.

#2029 fixed this flicker (and made a bunch of other internal improvements possible) by storing the SDK's estimate of the resulting document in the cache and removing acknowledged writes immediately. A one-time schema migration was included that performed a final delete of any existing held write acks on restart.

What I think is happening in your case is that while the schema migration deletes acknowledged writes it doesn't reset the highest acknowledged batch id if all writes were deleted. This in turn causes the violation of the invariant you're observing.

The good news is that this in-between state does not exist for very long and to get in this state a user would have to shut down your app just as the last write the queue was acknowledged but not yet observed. That should make this bug rare.

I need to get back with the team to evaluate possible fixes. Specifically I'm not yet convinced that it's safe to just remove the assertion that your device tripped over.

@wilhuff wilhuff self-assigned this Jan 9, 2019
@wilhuff wilhuff added this to the M41 milestone Jan 9, 2019
@wilhuff wilhuff assigned var-const and unassigned wilhuff Jan 9, 2019
@var-const
Copy link
Contributor

var-const commented Jan 11, 2019

PR #2243, which will be part of the next release, fixes this.

@firebase firebase locked and limited conversation to collaborators Oct 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.