Opened 6 years ago

Last modified 6 years ago

#7606 new Bugs

u32regex causes bus error

Reported by: a.sanders@…
Owned by: John Maddock
Milestone: To Be Determined
Component: regex
Version: Boost 1.51.0
Severity: Problem
Keywords:
Cc:

Description

The Unicode regular expression

boost::make_u32regex ("[pq]\\.\\.[xy]");

causes a Bus Error. The same r.e. as a boost::regex is fine.

The r.e. "[pq]..[xy]" seems to be okay, so it looks like the repeated "\\." is at least part of the problem.

The following program causes a Bus Error on Solaris 10 with gcc 4.6.1 and Boost 1.51 (and all previous versions of Boost, as far as I can tell).


#include <boost/regex/icu.hpp>

int main (int, char**) {
    // Constructing the Unicode regex is enough to trigger the Bus Error.
    const boost::u32regex re = boost::make_u32regex ("[pq]\\.\\.[xy]");
    return 0;
}


Attachments (5)

regex.cpp (142 bytes) - added by a.sanders@… 6 years ago.
regex-crash.cpp (142 bytes) - added by a.sanders@… 6 years ago.
Causes Bus Error when run
regex gdb trace.txt (4.8 KB) - added by Ashley Sanders <a.sanders@…> 6 years ago.
gdb stack trace
regex-gcc-test.txt.bz2 (7.2 KB) - added by Ashley Sanders <a.sanders@…> 6 years ago.
Output from first run of bjam toolset=gcc
regex-gcc-test-2.txt (20.1 KB) - added by Ashley Sanders <a.sanders@…> 6 years ago.
Output from second run of bjam toolset=gcc

Change History (19)

Changed 6 years ago by a.sanders@…

Attachment: regex.cpp added

comment:1 Changed 6 years ago by anonymous

Wiki formatting made a mess of the quoted program so have attached it.

Changed 6 years ago by a.sanders@…

Attachment: regex-crash.cpp added

Causes Bus Error when run

comment:2 Changed 6 years ago by a.sanders@…

Just noticed I attached the wrong version of the file before (a version that didn't cause a Bus Error.) So I have now attached the version that does cause the Bus Error. At least you now have both versions.

Apologies for any confusion.

comment:3 Changed 6 years ago by John Maddock

I'm unable to reproduce this on either Win32 or Ubuntu Linux with current SVN Trunk and ICU 49. Are you able to debug locally?

comment:4 Changed 6 years ago by Ashley Sanders <a.sanders@…>

Probably. What are you after? A stack trace? Anything else?

comment:5 Changed 6 years ago by Ashley Sanders <a.sanders@…>

Built a debug version of libboost_regex. I'll attach a stack trace from gdb. If you'd like me to investigate further please give me some pointers as to what to look for!

Changed 6 years ago by Ashley Sanders <a.sanders@…>

Attachment: regex gdb trace.txt added

gdb stack trace

comment:6 Changed 6 years ago by John Maddock

Thanks for the stack trace. Unfortunately, that makes even less sense now :-(

Can you set a breakpoint (basic_regex.hpp:399) inside:

   template <class InputIterator>
   basic_regex(InputIterator arg_first, InputIterator arg_last, flag_type f = regex_constants::normal)
   {
      typedef typename traits::string_type seq_type;
      seq_type a(arg_first, arg_last);
      if(a.size())
         assign(static_cast<const charT*>(&*a.begin()), static_cast<const charT*>(&*a.begin() + a.size()), f);
      else
         assign(static_cast<const charT*>(0), static_cast<const charT*>(0), f);
   }

What are the contents of "a" after construction?

Any chance that your code is compiled using a compiler code page that results in the input string not being valid ASCII/UTF8?

Thanks, John.

comment:7 Changed 6 years ago by Ashley Sanders <a.sanders@…>

Here you are. Not sure that this looks helpful. At the "if" statement following the "a" construction took the true branch.

I'm not setting any compiler code page anywhere.

Breakpoint 2 at 0x15738: file /export/home/ashley/src/boost_1_51_0/boost/regex/v4/basic_regex.hpp, line 399.
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /export/home/ashley/tmp/regex 
[Thread debugging using libthread_db enabled]
[New Thread 1 (LWP 1)]
[Switching to Thread 1 (LWP 1)]

Breakpoint 2, boost::basic_regex<int, boost::icu_regex_traits>::basic_regex<boost::u8_to_u32_iterator<char const*, int> > (
    this=0xffbffab0, arg_first=..., arg_last=..., f=0) at /export/home/ashley/src/boost_1_51_0/boost/regex/v4/basic_regex.hpp:400
400        {
(gdb) list
395           assign(p, f); 
396        }
397     
398        template <class InputIterator>
399        basic_regex(InputIterator arg_first, InputIterator arg_last, flag_type f = regex_constants::normal)
400        {
401           typedef typename traits::string_type seq_type;
402           seq_type a(arg_first, arg_last);
403           if(a.size())
404              assign(static_cast<const charT*>(&*a.begin()), static_cast<const charT*>(&*a.begin() + a.size()), f);
(gdb) n
402           seq_type a(arg_first, arg_last);
(gdb) n
403           if(a.size())
(gdb) print a
$1 = {<std::_Vector_base<int, std::allocator<int> >> = {
    _M_impl = {<std::allocator<int>> = {<__gnu_cxx::new_allocator<int>> = {<No data fields>}, <No data fields>}, _M_start = 0x27cd0, 
      _M_finish = 0x27d00, _M_end_of_storage = 0x27d00}}, <No data fields>}

comment:8 Changed 6 years ago by Ashley Sanders <a.sanders@…>

What I meant to say was the "if(a.size())" evaluated true.

comment:9 Changed 6 years ago by anonymous

You're right, doesn't really help :-(

Try:

#include <boost/regex/icu.hpp>
#include <iostream>
#include <string>

template <class C>
void printout(const C& c)
{
   for(unsigned i = 0; i < c.size(); ++i)
      std::cout << std::hex << (int)c[i] << " ";
   std::cout << std::endl;
}

int main()
{
   using namespace boost;

   typedef u32regex::traits_type::string_type st;
   typedef boost::u8_to_u32_iterator<std::string::const_iterator, UChar32> conv_type;

   const std::string p = "[pq]\\.\\.[xy]";

   st t(conv_type(p.begin(), p.begin(), p.end()), conv_type(p.end(), p.begin(), p.end()));

   printout(p);
   printout(t);

   return 0;
}

Which should output:

5b 70 71 5d 5c 2e 5c 2e 5b 78 79 5d
5b 70 71 5d 5c 2e 5c 2e 5b 78 79 5d

Thanks! John.

comment:10 Changed 6 years ago by Ashley Sanders <a.sanders@…>

It does indeed output

5b 70 71 5d 5c 2e 5c 2e 5b 78 79 5d 
5b 70 71 5d 5c 2e 5c 2e 5b 78 79 5d 

comment:11 Changed 6 years ago by anonymous

Which leaves me stumped again.... what does Valgrind say?

comment:12 Changed 6 years ago by anonymous

Some supplementary questions before we get too involved:

1) Is

regex e("[pq]\\.\\.[xy]");

OK?

2) Is:

wregex we(L"[pq]\\.\\.[xy]");

OK?

3) Are there multiple versions of ICU installed on this system? Any chance there's a mismatch between the headers included and libraries loaded, or between the version used when Boost was built, and the one used by the test program?

4) Do the regex tests run OK? To run, cd into libs/regex/test and do a "bjam toolset=sun". Assuming ICU is installed in the usual location, you should see a message at the start to say it's being used/tested.

Thanks again! John.

comment:13 in reply to:  12 Changed 6 years ago by Ashley Sanders <a.sanders@…>

Apologies for the delay in doing the stuff you asked for. Life and work get in the way...

Replying to anonymous:

Some supplementary questions before we get too involved:

1) Is

regex e("[pq]\\.\\.[xy]");

Compiles and runs okay.

2) Is:

wregex we(L"[pq]\\.\\.[xy]");

Also compiles and runs okay.

3) Are there multiple versions of ICU installed on this system? Any chance there's a mismatch between the headers included and libraries loaded, or between the version used when Boost was built, and the one used by the test program?

There are multiple versions of the .so files, but as far as I can tell there is only one set of header files. I don't think this should be a problem.

4) Do the regex tests run OK? To run, cd into libs/regex/test and do a "bjam toolset=sun". Assuming ICU is installed in the usual location, you should see a message at the start to say it's being used/tested.

I'm using gcc to compile so I did "bjam toolset=gcc". I'll attach the output separately. There were errors, but they are a bit hard to spot among the warnings and errors spat out by the compiler. I'll attach two files. The first is the output from running bjam the first time (rather a large file). The second is from running bjam again; it has less output, which hopefully makes it easier to spot the bus error from one of the tests.

Ashley.

Changed 6 years ago by Ashley Sanders <a.sanders@…>

Attachment: regex-gcc-test.txt.bz2 added

Output from first run of bjam toolset=gcc

Changed 6 years ago by Ashley Sanders <a.sanders@…>

Attachment: regex-gcc-test-2.txt added

Output from second run of bjam toolset=gcc

comment:14 Changed 6 years ago by John Maddock

My turn to apologize for the delay - I blame Christmas! :)

Thanks for running the tests. They show the same issue you reported in your test case. I don't understand why it would work for wregex but not u32regex, though :(

There must be some memory corruption/overrun going on, but it's going to be hard to diagnose by email! Is Valgrind available for that platform? If so, its output might help a lot; otherwise I'll have to write a special instrumented version for you to test with, I guess.

Thanks, John.
