Modify

Ticket #4345 (closed Bugs: fixed)

Opened 4 years ago

Last modified 22 months ago

thread::id and joining problem with cascade of threads

Reported by: bartek szurgot <bartosz.szurgot@…> Owned by: viboes
Milestone: Boost 1.50.0 Component: thread
Version: Boost 1.43.0 Severity: Problem
Keywords: thread::id Cc:

Description

i believe there is a problem when saving thread's ids in an external (for a thread) collection, when joining cascade of threads and collection lives on.

example code, showing this issue has been attached this this report. what happens there:

  1. main() creates collection of thread::id objects (i.e. Data)
  2. main() creates two threads (in classes A and B)
  3. ThreadJoiner? helper class interrupts and joins threads in d-tor, and so main() int+join thread in B class' instance, that in turn int+join A class' instance thread (member of B object)
  4. check if threads exited, as they should.

each thread increments global counter when entering operator()() and decrements it when leaving operator()()'s scope, so when leaving scope of ThreadJoiner? root object all threads should be already joined. it happens this way (as expected) as long as one does not save thread::ids. if they are saved, and live longer then thread, threads are not (always) joined properly (in attached example program: assertion on global counter equals 0 fails). it looks like race, but is reproducible in ~95% of the cases.

problem does not seem to appear when threads are run directly, and joined from main as well (not cascade).

in example code there are 2 ways to make code work as expected:

  1. uncomment main.cpp:139 (deallocation of collection of saved IDs before calling joins in d-tors)
  2. change main.cpp:32 from '#if 1' to '#if 0' - this changes collection to hold std::string (generated via stream, from boost::thread::id) from boost::thread::id

i was able to reproduce this issue with the same results with boost versions:

  1. 1.38
  2. 1.40
  3. 1.43
  4. today's trunk

toolchain is g++ 4.4.x under Linux (Debian and Ubuntu).

to build and run example code extract bzip2 archive, enter directory and run: make && gen/debug/a.out

Attachments

thread_bug_test.tar.bz2 Download (2.8 KB) - added by bartek szurgot <bartosz.szurgot@…> 4 years ago.
thread_bug_test.tar.2.bz2 Download (2.9 KB) - added by anonymous 3 years ago.
patch_thread_id.diff.bz2 Download (766 bytes) - added by anonymous 3 years ago.
patch_cond_warn.diff.bz2 Download (386 bytes) - added by anonymous 3 years ago.

Change History

Changed 4 years ago by bartek szurgot <bartosz.szurgot@…>

comment:1 follow-up: ↓ 3 Changed 4 years ago by anonymous

g_count is not protected against race conditions. You should use an atomic integer or use a mutext to protect g_count.

comment:2 follow-up: ↓ 4 Changed 4 years ago by steven_watanabe

This code is very hard to follow, but here's what's happening in detail:

  • An object of type B is created
    • This creates a shared_ptr with a new A object
      • The new A object starts thread A
  • Thread B is created, and receives a copy of the B object, this copy is not destroyed until the last thread::id for this thread is gone.
  • Thread B runs to completion and is joined. The last copy of B has not yet been destroyed, so there is no attempt to join thread A.
  • Kaboom.

I don't think this should be considered a bug in the library. If you want to join another thread when a thread exits, join at the end of the thread function instead of in the destructor.

comment:3 in reply to: ↑ 1 ; follow-up: ↓ 5 Changed 4 years ago by bartek szurgot <bartosz.szurgot@…>

Replying to anonymous:

g_count is not protected against race conditions. You should use an atomic integer or use a mutext to protect g_count.

it's true, what you say - the same goes for g_started. i left it this way intentionally since i work on x86 where ++x on integer is atomic any way and i didn't want to complicate example code any more.

comment:4 in reply to: ↑ 2 Changed 4 years ago by bartek szurgot <bartosz.szurgot@…>

Replying to steven_watanabe:

This code is very hard to follow, but here's what's happening in detail:

this is the easy case - trust me. i've spent last 2 days tracking it down in "real" system we're developing. ;)

  • An object of type B is created
    • This creates a shared_ptr with a new A object
      • The new A object starts thread A
  • Thread B is created, and receives a copy of the B object, this copy is not destroyed until the last thread::id for this thread is gone.

this is the key - thread::id holds identifier of a thread as well as handle to it. it means that thread::id has in fact double responsibility, thus non-obvious behavior (id!=handle).

  • Thread B runs to completion and is joined. The last copy of B has not yet been destroyed, so there is no attempt to join thread A.
  • Kaboom.

I don't think this should be considered a bug in the library. If you want to join another thread when a thread exits, join at the end of the thread function instead of in the destructor.

technically this can be done, though i don't think it's a proper way - by doing so one must remember in each thread's operator()()'s implementation to explicitly join all owned threads, instead of doing it in error-prone, RAII-way (d-tor perform clean up - here: int+join).

a took a look in thread::id implementation - it holds shared_ptr to "thread's data" (i assume this is a handle to thread itself). my suggestion is to change it to weak_ptr, so that thread::id would become only "observer" of the thread, that can be compared, put into stream, etc... this would eliminate double responsibility of thread::id. alternatively only data required to identify thread should be stored in thread::id - ex.: name string.

comment:5 in reply to: ↑ 3 ; follow-up: ↓ 7 Changed 4 years ago by anthonyw

Replying to bartek szurgot <bartosz.szurgot@…>:

Replying to anonymous:

g_count is not protected against race conditions. You should use an atomic integer or use a mutext to protect g_count.

it's true, what you say - the same goes for g_started. i left it this way intentionally since i work on x86 where ++x on integer is atomic any way and i didn't want to complicate example code any more.

++x is not atomic on integers, even on x86. See  http://www.devx.com/cplus/Article/42725

comment:6 Changed 4 years ago by anthonyw

I agree that the thread function should be destroyed when the thread is joined. I am not sure of the best way to handle this --- maybe weak_ptr is the way to go, maybe another way is preferable.

comment:7 in reply to: ↑ 5 Changed 4 years ago by bartek szurgot <bartosz.szurgot@…>

Replying to anthonyw:

Replying to bartek szurgot <bartosz.szurgot@…>:

Replying to anonymous:

g_count is not protected against race conditions. You should use an atomic integer or use a mutext to protect g_count.

it's true, what you say - the same goes for g_started. i left it this way intentionally since i work on x86 where ++x on integer is atomic any way and i didn't want to complicate example code any more.

++x is not atomic on integers, even on x86. See  http://www.devx.com/cplus/Article/42725

i've read linked article and run tests, but got different results. since this discussion is a bit out of scope of this ticket, please contact me directly (bartosz dot szurgot at pwr dot wroc dot pl) so we can proceed with this topic.

comment:8 Changed 3 years ago by viboes

  • Cc viboes added

Please, could you give a try to week_ptr and see how the solution you propose behaves?

Changed 3 years ago by anonymous

Changed 3 years ago by anonymous

Changed 3 years ago by anonymous

comment:9 Changed 3 years ago by bartosz.szurgot@…

i've tested it with boost from svn (revision 66677) and created patch that changes shared_prt<> to weak_ptr<> in thread::id. it appears to work (i.e. solves the problem). i've also attached changed version of test-program (locks for ++ and -- operations).

when doing so i've also found warning on condition variable (unused variable when no assert is present). i've made separate patch for this as well.

comment:10 Changed 2 years ago by viboes

  • Cc viboes removed
  • Owner changed from anthonyw to viboes
  • Status changed from new to assigned
  • Milestone changed from Boost 1.43.0 to To Be Determined

comment:11 Changed 2 years ago by viboes

I was wondering what prevent from making thread::id an opaque type around the thread::native_handle, Anthony?

comment:12 Changed 2 years ago by viboes

  • Keywords thread::id added

comment:13 Changed 2 years ago by viboes

See also #5173 Bug/design issue with boost::thread::id

Last edited 2 years ago by viboes (previous) (diff)

comment:14 Changed 2 years ago by viboes

I've attached a patch to #5173 that should solve this issue also. Please could you give it a try?

comment:15 Changed 2 years ago by viboes

  • Milestone changed from To Be Determined to Boost 1.50.0

Committed in trunk r77838

comment:16 Changed 2 years ago by bartek szurgot <bartek.szurgot@…>

i've checked r77838 - it appears that it does not fix the problem. using boost's trunk (r78191, as of writing these words) however appears to fix it, i.e. using provided test program it does join both threads, as it should.

comment:17 Changed 23 months ago by viboes

Have you defined BOOST_THREAD_PROVIDES_BASIC_THREAD_ID or BOOST_THREAD_VERSION=2?

comment:18 Changed 23 months ago by viboes

  • Status changed from assigned to closed
  • Resolution set to fixed

Committed in release branch at [78543]

comment:19 Changed 22 months ago by bartek.szurgot@…

when BOOST_THREAD_PROVIDES_BASIC_THREAD_ID is set in r77838 test app works fine. i see that on trunk (r79003 as of writing these words) is is enabled by default. this probably was the difference.

View

Add a comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
The resolution will be deleted. Next status will be 'reopened'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.