BigQuery: ensure that `KeyboardInterrupt` during `to_dataframe` no longer hangs. #7698
Conversation
✅ Note to self: Need to update the minimum
force-pushed from 0971781 to 6fc4af2
force-pushed from fa7bb05 to 9016469
`KeyboardInterrupt` during `to_dataframe` (with BQ Storage API) no longer hangs: force-pushed from 8593692 to 44186b6
`KeyboardInterrupt` during `to_dataframe` no longer hangs.

```python
return rowstream.to_dataframe(session, dtypes=dtypes)

# Use _to_dataframe_finished to notify worker threads when to quit.
# See: https://stackoverflow.com/a/29237343/101923
self._to_dataframe_finished = False
```
Mostly I'm trying to reason about scope here. This is a worker pool, but is it possible we're generating multiple independent dataframes that would share the same access to _to_dataframe_finished? Do we need to key the workers to specific invocations, or is the nature of the access always blocking so that this isn't an issue?
There are a couple of reasons I don't think it's an issue:

- Yes, `to_dataframe()` is a blocking call, so you wouldn't really have multiple calls going at once.
- `RowIterator` isn't really something you'd want to use across threads, or even more than once, anyway. Because of how the pagination state works, once you loop through all the rows, `to_dataframe()` returns an empty DataFrame, even when the BQ Storage API isn't used.
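As a hedged illustration (plain Python, no BigQuery client involved), the one-shot behavior described above is the same as exhausting any iterator: a second pass yields nothing.

```python
# Illustration only: RowIterator's pagination state behaves like a
# one-shot iterator, so a second pass over the rows yields nothing.
rows = iter([{"x": 1}, {"x": 2}])

first_pass = list(rows)   # consumes all rows
second_pass = list(rows)  # iterator is exhausted, so this is empty

print(first_pass)   # [{'x': 1}, {'x': 2}]
print(second_pass)  # []
```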
…no longer hangs

I noticed in manually testing `to_dataframe` that it would stop the current cell when I hit Ctrl-C, but data kept on downloading in the background. Trying to exit the Python shell, I'd notice that it would hang until I pressed Ctrl-C a few more times.

Rather than get the DataFrame for each stream in one big chunk, loop through each block and exit if the function needs to quit early. This follows the pattern at https://stackoverflow.com/a/29237343/101923

Update tests to ensure multiple progress interval loops.

force-pushed from 44186b6 to bf73284
follows the pattern at https://stackoverflow.com/a/29237343/101923
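A minimal sketch of that quit-early pattern, under stated assumptions: the class and attribute names here (`StreamDownloader`, `_finished`) are illustrative, not the actual library code. Worker threads check a shared flag between blocks, and the main thread sets the flag when it finishes or is interrupted, so downloads stop promptly instead of running on in the background.

```python
from concurrent import futures


class StreamDownloader:
    """Hypothetical sketch of the quit-early pattern used in this PR.

    Workers process a stream in small blocks and check a shared flag
    between blocks, so a KeyboardInterrupt in the main thread stops
    them promptly instead of letting downloads continue.
    """

    def __init__(self):
        # Analogous to _to_dataframe_finished in the PR.
        self._finished = False

    def _download_stream(self, blocks, results):
        for block in blocks:
            if self._finished:
                return  # quit early between blocks
            results.append(block)

    def download_all(self, streams):
        results = []
        with futures.ThreadPoolExecutor() as pool:
            tasks = [
                pool.submit(self._download_stream, stream, results)
                for stream in streams
            ]
            try:
                futures.wait(tasks)
            finally:
                # Runs on KeyboardInterrupt too: tell workers to stop,
                # then the executor shutdown joins them quickly.
                self._finished = True
        return results
```

The key design point is that workers never block indefinitely inside a single huge read: because they re-check the flag on every block, the `finally` clause gives the process a bounded path to exit.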
This depends on: Add page iterator to ReadRowsStream #7680 (the new `.pages` feature in the BQ Storage client).