Conversation with #inferno at Mon Mar 22 15:11:16 2010 on powerman-asdf@irc.freenode.net (irc) (15:59:07) ericvh [~ericvh@32.97.110.63] entered the room. (16:15:11) wrtp: what about the data structures allocated within re2 itself? what allocator are they using? (16:15:23) wrtp: powerman-asdf: ^^ (16:18:56) mjl- [~mjl@82-171-55-43.ip.telfort.nl] entered the room. (16:19:18) mycrofti1 [~ircguy@h69-128-47-242.mdsnwi.dedicated.static.tds.net] entered the room. (16:21:13) mkn left the room (quit: Quit: Page closed). (16:24:02) mjl-_ left the room (quit: *.net *.split). (16:24:03) maht left the room (quit: *.net *.split). (16:24:03) digi9 left the room (quit: *.net *.split). (16:24:03) mycroftiv left the room (quit: *.net *.split). (16:24:19) maht [~maht__@85-189-31-174.proweb.managedbroadband.co.uk] entered the room. (17:10:59) powerman: wrtp: i've added support for compiled regex (17:12:44) mycrofti1 is now known as mycroftiv (17:14:48) less1 [~pravin@32.97.110.63] entered the room. (17:17:15) robot12 left the room (quit: Quit: Ухожу я от вас (xchat 2.4.5 или старше)). (17:49:04) wrtp: powerman-asdf: any chance you could make the module interface look exactly like this: Regex: module { (17:49:04) wrtp: PATH: con "$Re2"; (17:49:04) wrtp: Re: type ref RE; (17:49:05) wrtp: compile: fn(nil:string,nil:int): (Re, string); (17:49:05) wrtp: execute: fn(nil:Re, nil:string): array of (int, int); (17:49:06) wrtp: executese: fn(nil:Re, nil:string, se: (int, int), bol: int, eol: int): array of (int, int); (17:49:06) wrtp: RE: adt { (17:49:06) wrtp: x: int; (17:49:06) wrtp: } (17:49:07) wrtp: } (17:49:07) wrtp: ? (17:49:36) wrtp: that would enable plug n play replacement in existing regex users (17:50:27) wrtp: note that it's important that ints are used for submatches, not strings, as otherwise you can't know where the submatch occurred (17:50:36) wrtp: (no pointer arithmetic in limbo) (17:51:26) powerman: wrtp: i don't think this should be done in this module (17:51:34) wrtp: powerman-asdf: no? (17:51:38) powerman: re2 doesn't return offset for matches (17:52:13) wrtp: it does, because you can always do (match - orig-string) to find out where it occurred (17:52:15) powerman: so it require substring search of each match, etc - it looks like task for another module build on top of my re2 module (17:52:41) wrtp: i'm sure (i haven't checked though) that re2 will return a pointer into the original string (17:53:40) powerman: it doesn't (17:54:09) powerman: actually, it can return submatches as non-strings - with automatic conversion to int/real/whatever (17:54:24) wrtp: are you walking about FullMatch here? (17:54:27) wrtp: s/walk/talk/ (17:54:29) powerman: yep (17:55:10) powerman: FullMathN, actually, but it isn't differ in this aspect (17:55:11) wrtp: why not just use a StringPiece, or an int (17:55:14) wrtp: ? (17:56:21) powerman: it return Arg objects, and while they probably can be StringPiece, I doubt these StringPiece will be pointers to substring of initial StringPiece (17:56:34) wrtp: that's the whole point, i think (17:56:39) wrtp: but you can use ints too (17:56:52) powerman: ints? for what? (17:57:16) powerman: it return int for regex like (\d+), it's not position of match as int (17:57:52) wrtp: really? (17:57:57) powerman: yep (17:58:46) wrtp: the thing is, the way the interface is currently, you can't use it for submatch replacement (17:59:03) powerman: I don't see how to get match position/offset from re2. Only matched substrings, optionally converted on-the-fly to compatible types like int or into different encodings (17:59:21) powerman: For submatch replacement there are Replace() and GlobalReplace() (17:59:30) powerman: right now i'm working on wrapping them :) (17:59:41) wrtp: i don't believe it's not possible! (18:00:12) powerman: ask rsc :) (18:02:51) powerman: btw, actually trying to detect offset using matched substrings will fail, so it's impossible to simulate regex(2) for any non-trivial regexps. for example: "abcdce"=~/d(c)/ (18:04:53) powerman: is there any examples of kernel modules which throw exception? this looks like best way to handle syntax error in regexp (18:05:00) wrtp: yeah but i don't believe that (18:06:55) powerman: probably out-of-memory errors should throw exception in kernel modules, but I afraid this code too deep (18:06:57) wrtp: just pass a string piece, call data() and do the subtraction (18:17:00) powerman: question about exceptions expired (18:17:38) powerman: ... and should be invalidated from cache :) (18:18:12) wrtp: ack (18:39:03) mennis left the room (quit: Quit: mennis). (18:43:28) wrtp: i'd forgotten how impenetrable C++ could be :-) (18:44:14) wrtp: ... so where's FullMatch declared then? (18:46:09) wrtp: oh i see, it's some template magic (18:51:52) bvalek2 left the room (quit: Quit: Page closed). (18:56:09) powerman: wrtp: do you see any way to detect case when regexp match failed because user provide not enough placeholders for all parens? (18:56:49) powerman: only idea I've is to call NumberOfCapturingGroups() on every failed match (18:57:39) powerman: or when regexp created, and remember it somewhere for checking before each call to *MatchN() (18:59:57) wrtp: that's why regexp returns an array that it creates (19:00:43) wrtp: have you managed to get the match offset yet? i can't even get the match out - i don't understand exactly what RE2::Arg does... (19:02:34) powerman: RE2::Arg is funny thing :) (19:02:51) wrtp: like, i thought that this should work: string m; (19:02:51) wrtp: if(RE2::FullMatch("axbyc", "a(.*b).*c"), &m) { (19:03:04) wrtp: and the match succeeds, but m isn't set to anything (19:04:04) wrtp: as far as i can see, Arg is just there so other things can be cast to it (and encapsulates the type with the parsing function) (19:04:07) powerman: you closed parens before ,&m ? (19:04:44) wrtp: no. good f***in' point! (19:04:48) wrtp: jeeze (19:04:59) powerman: hmm. I think it was irc-only typo (19:05:07) powerman: is if() in c++ accept comma??? (19:06:09) powerman: hm. i think it was usual expr1,expr2 in c-style. but it's unusual to see that in if() (19:06:29) wrtp: yeah, it does accept it (19:06:38) wrtp: and it's always true (19:06:55) wrtp: so it was looking as if it was succeeding (19:07:03) wrtp: anyway, after all that crap (19:07:08) wrtp: my first hunch was true (19:07:22) wrtp: you *can* get the offset by passing in a StringPiece pointer (19:10:50) wrtp: got it working with FullMatch. just can't quite figure out the correct incantation for FullMatchN (19:11:19) wrtp: it doesn't like me casting re2::StringPiece* to RE2::Arg* (19:14:01) wrtp: the VariadicFunction2 template magic knows how to do it... (19:20:11) wrtp: ok, i just saw the way you did it... just replace std::string with StringPiece in your wrapper, and it'll work fine. (19:20:37) wrtp: you could also create a custom "parser" that just saved the offsets directly (19:21:00) mkmks [~nf@90-230-91-71-no148.tbcn.telia.com] entered the room. (19:33:09) powerman: wrtp: can you show me example of fetching initial string offsets from returned Arg/StringPiece's? (19:37:32) powerman: use initial_str.data() as pointer to string start, use argN.data as pointer to substring start, and argN.size as size of substring? (19:37:40) wrtp: no wait (19:38:24) wrtp: http://paste.lisp.org/display/96765 (19:39:37) wrtp: but as i said, you could create a type with a "bool T::ParseFrom(const char*, int)" method and have it parse directly into the limbo (int, int) array (19:41:14) powerman: wrtp: if you remember, I don't know c++, really. so I don't understand what you just said. (19:42:14) powerman: i suppose you speak about adding new method to StringPiece, which probably involve subclassing, which is completely outside of my c++ skills (19:42:56) wrtp: powerman-asdf: i don't know C++ at all! (19:43:20) wrtp: powerman-asdf: i'm just trying to write some code to test my hypothesis... (19:44:01) wrtp: the ParseFrom method is mentioned in the documentation (just above FullMatchN in the header file) (19:47:34) wrtp: thinking about it, i don't think it's worth it, as you'll have to initialise N of the types anyway; might as well use StringPieces and bung 'em in the offset array after the match. (19:48:16) powerman: to provide regex(2) compatibility we'll need "The element with index 0 gives the character positions of the first character of some longest leftmost match and the first character beyond the match." (19:48:33) powerman: this mean we'll have to enforce additional parens around full regexp (19:49:02) wrtp: i don't think so (19:49:12) wrtp: because you'll be filling out the array anyway (19:49:21) wrtp: uh (19:49:43) wrtp: is there no way to find out that info from the normal re2 interface? (19:50:34) powerman: no, but I think this isn't a big deal (19:50:57) powerman: only regex(2) emulation layer/module will enforce additional parens (19:51:05) wrtp: yeah except it means you have to modify the pattern string (19:51:49) powerman: if you'll use regex(2) compatible compile() - it will quietly add parens around regex, that's it (19:52:03) wrtp: hmm, you can find the *end* of the match by using ConsumeN (19:57:27) wrtp: ah, looks like you might be able to use re.Match directly (19:57:34) wrtp: that's what the Replace functions do (19:59:15) wrtp: it's actually a better interface for what you're doing, i think (20:01:49) mennis [~mennis@adsl-068-016-104-079.sip.asm.bellsouth.net] entered the room. (20:14:16) wrtp: here's what i'm talking about: http://paste.lisp.org/+22NX/1 (20:15:02) powerman: wrtp: what about utf8-16? we take offset for char*, it's utf-8, but for limbo it has to be converted to offset in characters, i.e. utf-16 (20:15:25) wrtp: in Match, the zero'th element of the StringPiece array is the offset of the whole match (20:15:34) wrtp: very good point (20:16:04) wrtp: well, if there are only ascii chars in the input string (which you can know instantly) then you know you don't have to do any conversion (20:16:24) wrtp: otherwise you're going to have to convert the offsets back into runes (20:20:59) powerman: if match() is going to return both substrings and offsets, looks like it's better to change it interface, so it will return ref adt (nil on failed match), and this adt will contain two fields: "s: array of string; off: array of int;" (20:22:31) powerman: to fix offsets we'll have to extract part of initial string from beginning to offset of each match, do c2string(), and length of returned limbo string will be corrected offset (20:22:58) powerman: but I've no idea how to get limbo string length in C yet. any hints? (20:25:12) rapidfx left the room (quit: Quit: Leaving.). (20:29:17) wrtp: powerman-asdf: just go with the original adt. just return offsets. (20:29:26) wrtp: powerman-asdf: the user can take slices if the user wants. (20:30:11) wrtp: to fix offsets, it's probably best just to count from the start of the string, filling out offsets as you go (20:30:48) wrtp: i *think* that offsets are always going to be in numerical order... ah, but lengths aren't hmm (20:30:48) powerman: return offsets where? substrings returned in 3rd param. create 4th? which will be nil in most cases? (20:31:24) powerman: oh, I understand (20:31:25) wrtp: return offsets in the array that's returned from match (20:31:38) powerman: no, I don't like original regex(2) interface (20:31:42) wrtp: why not? (20:32:10) powerman: in my experience user almost never need offsets, but need substrings instead (20:32:27) wrtp: it's trivial to get substrings. (20:32:37) powerman: I agree to support regex(2) compatibility layer, but not live with it interface in my code (20:33:10) wrtp: thing is, creating a new string is expensive (relatively) because it requires an allocation (20:33:26) wrtp: so returning offsets is not a bad way of doing it (20:33:39) wrtp: rob pike designed the interface, so i don't think it can be too bad :-) (20:33:55) wrtp: anyway, gotta go (20:34:01) wrtp: will look in tomorrow (20:35:17) powerman: trivial, but why do (20:35:17) powerman: if((m = regex->execute(re_uk, s)) != nil){ (20:35:17) powerman: n1 := s[m[1].t0:m[1].t1]; (20:35:17) powerman: n2 := s[m[3].t0:m[3].t1]; (20:35:17) powerman: if we can use much more readable (20:35:17) powerman: if((m = re2->match(s, re_uk)) != nil){ (20:35:17) powerman: # just use m.s[0] instead of creating new var n1 (20:35:17) powerman: # just use m.s[1] instead of creating new var n2 (20:37:12) powerman: as for cost of creating a new string - string slices create same new string anyway (20:37:42) powerman: so, if you need substring _text_ - new substring will be created anyway (20:38:16) powerman: and, as I said, in my practice I nearly never need plain offsets without substring (20:38:59) wrtp: if you return strings, you can't tell the difference between a null match and no match (20:41:05) powerman: yep. that deficiency documented in re2 interface (20:41:15) wrtp: also, it's common to do substitutions with only one or two matched subexpressions picked out. e.g. s/.*([0-9]+.(abc)*)/\1x/ (21:10:56) vs is now known as vsrinivas (21:44:27) wrtp left the room (quit: Ping timeout: 268 seconds). (22:05:29) wrtp [~rog@78.148.71.250] entered the room. (22:09:27) powerman: wrtp: I've checked details of libinterp/string.c, and gonna agree with your about offsets vs substrings. (22:11:56) powerman: providing regex(2) compatible interface will require significantly less alloc()+memmove() operations, and fetching matched substrings using string slices with these offset+end will also require less alloc/memmove than if we try to deliver already extracted substrings from re2 to Limbo (22:16:35) vsrinivas: cool. (22:21:39) mennis left the room (quit: Quit: mennis). (22:22:20) visof [~visof@41.206.149.165] entered the room. (22:22:21) mennis [~mennis@adsl-068-016-104-079.sip.asm.bellsouth.net] entered the room. (22:24:03) mennis left the room (quit: Read error: Connection reset by peer). (22:24:18) mennis [~mennis@adsl-068-016-104-079.sip.asm.bellsouth.net] entered the room. (22:42:12) wrtp: powerman-asdf: i knew you'd come round :-) (22:43:14) powerman: wrtp: but I anyway wanna substrings and not offsets as result of regexp match! :) I'll just implement this on top of regex(2) interface. (22:43:33) wrtp: powerman-asdf: sure, it's only a couple of lines of code (22:45:23) powerman: wrtp: i'm now trying to avoid ambiguity between "parens not match" and "parens match empty substring" (22:45:39) wrtp: powerman-asdf: if i was after a really low overhead regex interface, i'd consider one where you pass in the array of offsets instead of returning it, so you can use the same array each call (22:46:06) wrtp: return empty array vs return nil array (22:46:10) powerman: looks like in both cases word[i] == NULL, but word[i].data() == NULL only if parens not match (22:46:28) wrtp: oh i see, in re2 (22:46:53) powerman: how they managed to call .data() on NULL pointer? and, more importantly, why they need this crap in C++? (22:47:49) wrtp: hmm, doesn't seem quite right to me. (22:50:25) wrtp: are you using re.Match() or FullMatchN ? (22:50:48) powerman: http://pastebin.com/GLcYPiYb (22:51:02) powerman: output example at end (22:53:03) wrtp: use re.Match() - it's a much better interface for you to use (22:53:34) wrtp: and it's easy to tell the difference between no-match and null-match (22:53:52) wrtp: it's also more efficient and uses less code (22:56:46) wrtp: and much less C++-hacky (22:56:56) wrtp: (i.e. i can understand it!) (22:59:18) powerman: wrtp: I don't see any differences between re.Match and PartialMatchN(re) - they fill array with substrings in exactly same way (22:59:44) wrtp: re.Match is defined in terms of StringPieces not Args (23:00:08) wrtp: so you can just pass in an array of StringPieces, not an array of pointers to StringPieces (23:01:00) powerman: yeah, and it also allow to define startpos. but we was talking about detecting no match vs match empty string, and they doesn't differ from this view (23:02:06) wrtp: for a given match m, m.data() is null iff the match was not made. an empty match will have m.data() != null but m.length() == 0 (23:03:39) powerman: yeah, I already figured this out. but funny thing is m in both cases == NULL (23:04:57) wrtp: with re.Match, the stringpieces aren't pointers (23:05:11) wrtp: so they can't be null (23:09:06) powerman: are you sure? I've switched from PartialMatchN to Match, but output of my test code wasn't changed (except there one more additional element - overall match in word[0] - just like regex(2) wanted) (23:09:29) wrtp: oh yeah, that's the other reason it's good to use Match (23:10:12) wrtp: don't look at word[i] - just look at word[i].data() (23:10:16) wrtp: that's all you need (23:10:47) wrtp: and don't ask me why the system allows you to compare a StringPiece to NULL :-) (23:11:43) wrtp: it's some C++ hackery i have no doubt (23:12:19) wrtp: above, when i said m.length(), i actually meant m.size() (23:17:03) powerman: int size() const { return length_; } (23:17:03) powerman: int length() const { return length_; } (23:17:04) powerman: :) (23:17:28) wrtp: :-) didn't see that (23:17:42) wrtp: i think your most interesting challenge here is to translate the utf-8 offsets into rune offsets without going O(n^2) (23:20:00) adelfino [~username@201-212-160-19.net.prima.net.ar] entered the room. (23:20:22) wrtp: maybe it's best just to do the initial offsets by going once through the string, but to do the end offsets using utfnlen (23:21:07) powerman: i plan to copy c2string() source into my module, limit it to only calculate character positions without actually creating new strings, and use to fix offsets (23:22:01) wrtp: no need to do that (23:22:18) wrtp: or, at least, i don't think so (23:22:19) powerman: parens can't intersect, so chances are it's possible to implement as O(n) where n is initial string length, not amount of parens (23:22:45) wrtp: just go through the string calling char2rune() (23:23:06) wrtp: when you get a matching offset, store the rune position and advance (23:24:02) wrtp: take advantage of the fact that for all j > i, m[j].data() >= m[i].data() (23:24:19) wrtp: (unless m[j].data() is null of course) (23:24:39) wrtp: where m represents the array of matches (23:25:23) wrtp: this algorithm is O(1) for finding all the starting offsets. (23:25:42) wrtp: oops i mean O(n) where n is the length of the string (23:26:19) wrtp: oh, just saw your comment (23:26:22) powerman: hehe :) (23:27:06) wrtp: the simple way to do O(n) for everything is to malloc an array of int of size n (23:27:24) wrtp: then store the rune offset for each char in that array (23:27:38) wrtp: only problem with that is it will use a lot of space if n is large (23:28:05) wrtp: it's probably worth not doing that (23:29:24) powerman: everything right except two things: at first we can optimize all this away in case initial string was ascii, at second I don't have .data() because I avoid copying substring returned by re2 to C module, instead I'll return "int off[], int end[];" and will need to fix both off[] and end[] (23:30:14) powerman: and while off[n] <= off[n+m], same isn't true for end[n] and end[n+m], which may increase alg complexity (23:34:04) powerman: another thing - it doesn't clear how to emulate executese() (23:35:33) powerman: start point can be set using re.Match. end point support will require additional string copy (to slice initial string[0:end]). (23:36:08) powerman: bol support will probably require slice initial string[start:end] instead of using re.Match support for startpos (23:37:15) powerman: eol is a problem because we'll already have to slice string to support endpos, and thus eol flag will always work as set (23:46:26) robot12 [~robot12@inferno.kgts.ru] entered the room. (23:49:06) wrtp: powerman-asdf: yeah, it's the end offsets that are the problem (23:49:39) powerman: we can detect eol=0 and raise (23:49:53) powerman: incompatible compatiblity layer - usual thing :) (23:49:55) wrtp: does "xx" match "xx$" ? (23:50:19) powerman: sure (23:50:36) powerman: at least in perl (23:51:05) powerman: $ at end of text or line (m=true) (23:51:12) powerman: that's from re2 syntax page (23:59:54) wrtp: thing is the eol thing is necessary if you're going to do sam-like regexp evaluation (00:00:13) wrtp: e.g. x/./g/$/d (00:01:33) visof left the room (quit: Ping timeout: 252 seconds). (00:06:59) wrtp: might be able to do something with the one_line option (00:09:00) robot12 left the room (quit: Quit: Ухожу я от вас (xchat 2.4.5 или старше)). (00:10:51) ericvh left the room. (00:24:48) vsrinivas is now known as me____ (00:27:50) wrtp: one_line = !(eol || bol) (00:27:57) wrtp: should do a good enough job (00:28:11) wrtp left the room (quit: Quit: wrtp). (00:55:04) less1 left the room (quit: Ping timeout: 245 seconds). (01:10:57) less1 [~pravin@32.97.110.64] entered the room. (01:12:01) less1 left the room (quit: Client Quit). (01:20:49) mkmks left the room (quit: Quit: ERC Version 5.3 (IRC client for Emacs)). (01:25:33) less1 [~pravin@cpe-66-68-151-36.austin.res.rr.com] entered the room. (01:39:00) less1 left the room (quit: Quit: Leaving.). (01:48:59) adelfino left the room (quit: Remote host closed the connection). (02:16:54) me____ left the room (quit: Quit: leaving). (03:07:21) mennis left the room (quit: Quit: mennis). (07:17:11) mkn [~7c7cdbfa@gateway/web/freenode/x-mcdfcpqiugwirmuz] entered the room. (07:29:56) robot12 [~robot12@szhilkin.broker.freenet6.net] entered the room. (08:36:43) bvalek2 [~c11a2f4d@gateway/web/freenode/x-sbvdwbulntofmpot] entered the room. (09:23:28) ***robot12 can't compile j2d :( (09:37:03) anth_x: unsurprising; it's ancient. where'd you even get it from? (10:10:28) robot12: on vitanuova site :) (10:10:37) robot12: just typed url :) (10:10:56) robot12: http://www.vitanuova.com/dist/old.java.tgz (10:11:19) robot12: ops // (10:11:31) robot12: http://www.vitanuova.com/dist/old/java.tgz (10:55:16) wrtp [~rog@78.148.71.250] entered the room. (12:22:40) adelfino [~username@201-212-160-19.net.prima.net.ar] entered the room. (12:31:48) powerman: wrtp: I had push yesterday implementation with offsets. Function to fix off/end is O(n), but I was very sleepy while implementing it (it happens at ~5-7 am), so there probably more simple solution exists, which will use less local variables. But it works and even passed my tests (which I can't push to googlecode because I've no idea where to create directory for tests in inferno tree). (12:32:34) wrtp: powerman-asdf: cool (12:32:40) wrtp: i'll have a look (12:33:01) wrtp: what time zone are you in? (12:33:50) powerman: Ukraine/EET/GMT+2. There also new proposed module interface commented in re2.m. So, if you don't have other critical ideas about what to change, only thing left is implement wrappers for other functions. (12:42:11) wrtp: FYI standard formatting in inferno is single-tab indentation. it'd be good if re2.c followed that. (12:49:56) powerman: i'm already working on fixing this. (13:05:43) powerman: done (13:22:56) adelfino left the room (quit: Quit: Leaving). (13:27:50) wrtp: one other minor thing: it's conventional to put braces around single-statement blocks of more than one line, even if it's not necessary. (13:31:52) powerman: wrtp: really? i don't see this in most of inferno sources, for ex.: libinterp/string.c:/^OP(cvtcf) (13:33:10) powerman: i see only necessary braces everywhere I look at. maybe I looking at wrong places... :) (13:35:53) wrtp: a dangling else if is a special case (13:37:20) ***wrtp looks for examples (13:39:00) wrtp: e.g. emu/port/chan.c:/^createdir (13:40:01) powerman: ok, i see what you mean, thanks (13:42:33) wrtp: i'm trying to understand fixoffend. i think you might be assuming that the match ends are in reverse order. is that true? (13:50:30) powerman: no (13:50:46) powerman: i use stack(lifo) to re-order them on-the-fly (13:51:38) powerman: this way, next offset or endpos to fix is always one of two - either next offset or endpost on top of stack (13:52:14) wrtp: don't you need a heap rather than a stack? (13:52:40) powerman: i need lifo queue :) (13:53:09) wrtp: the problem is that you're always going forwards through the string, but the endpoints can go in either direction (13:53:50) wrtp: a heap would solve the problem, but is probably overkill (13:56:40) powerman: ok, see. the thing i'm fixing right now is either offset or endpos. if it's offset, then fixing it mean i'm just "open parens" => next thing to fix will be either closing parens for _this_ open parens, or next open parens => I push closing parens for this open parens to lifo queue => now next thing to wait is either next open parens or next closing parens _in_lifo_ (13:57:59) powerman: if it's endpos, then fixing it mean i'm just "closed parens" => next thing to fix will be either next open parens or last-not-closed closing parens (which is now next one in lifo) (13:59:12) powerman: and so we back to initial situation - next thing to fix is either next open parens off[pos] or last not closed closing parens end[endpos[endq-1]] (14:01:00) wrtp: hmm. can you explain how your algorithm will deal with matching "((a)?b)y(c)" against the text "aabyc" (14:01:43) powerman: 1. start with looking for first open parens (pos=0, off[pos]) (14:02:11) wrtp: here are the matching ranges, BTW: 0 3, 0 2, 4 5 (14:03:12) powerman: 2. when it found push closing parens for this open parens to lifo (lifo = (0) - it keeps index in end[]) (14:04:18) powerman: 3. check which one comes first (i.e. which one is less): next offset (off[pos+1]) or last not closed open parens (end[0]), found it's next open parens and start looking for it (14:04:49) powerman: 4. found it, push into lifo it's endpos (lifo = (0, 1)) (14:05:35) powerman: 5. check which one comes first: found it's last not closed parens (endpos[1] - index 1 is on top in lifo) (14:06:02) wrtp: oh i see (14:06:21) powerman: loop string to offset of endpos[1], found it, pop it's index (1) from lifo, so now lifo again = (0) (14:06:39) powerman: next looking for next of either offset[2] or endpos[0] (14:06:40) wrtp: s/it's/its/ :-) (14:10:42) powerman: as per my taste, there too much variables, and variable names isn't clear enough. probably using usual C pointer arithmetic things may become simpler, but I don't feel comfortable enough to think in *p++ terms yet and prefer to use array/index syntax for sure (14:10:47) wrtp: clever. i *think* you can do it quite a bit more simply... (14:12:59) powerman: but I think for something implemented at 6am it's good enough :) (15:13:47) ericvh [~ericvh@32.97.110.63] entered the room. (15:14:48) wrtp: powerman-asdf: what about something like this: http://pastebin.com/tBe4eHuD (15:16:52) wrtp: if you declare something like the Range struct in the library interface, you can get the library to fill out the limbo data structures directly (15:17:45) wrtp: ericvh: hi (15:24:19) powerman: wrtp: thanks, I'll check it (15:24:41) wrtp: powerman-asdf: it essentially uses your approach, but pares it down a bit (15:25:14) wrtp: i've only tested it briefly, and it worked first time, which is a bad sign :-) (15:26:15) powerman: :) i'm right now trying to implement replace(), and wanna finish it first and then return to offsets (15:27:26) wrtp: BTW, you do realise (i'm sure you do) that your signature for replace was wrong - you need to return the string... (15:29:52) powerman: wrtp: btw, I always wonder - is there exists automated tests for inferno source? it's not in googlecode or vn site, and it wasn't mentioned anywhere, but... it's impossible to implement such a large thing without automated tests? or my kung fu just isn't strong enough? :) (15:30:32) wrtp: there probably should be automated tests, but there aren't (15:40:47) powerman: wrtp: hmm. yeah, replace() interface is broken. solution: don't develop in 3 different languages simultaneously (15:41:04) wrtp: :-) (15:49:36) mennis [~mennis@adsl-065-012-170-146.sip.asm.bellsouth.net] entered the room. (15:51:03) mennis left the room (quit: Client Quit). (16:20:06) mennis [~mennis@adsl-068-016-104-079.sip.asm.bellsouth.net] entered the room. (16:36:45) mkn left the room. (16:42:17) ericvh left the room (quit: Read error: Connection reset by peer). (16:43:13) ericvh [~ericvh@32.97.110.63] entered the room. (17:01:55) mkmks [~nf@90-230-91-71-no148.tbcn.telia.com] entered the room. (17:02:59) robot12 left the room (quit: Quit: Ухожу я от вас (xchat 2.4.5 или старше)). (17:07:31) less1 [~pravin@32.97.110.63] entered the room. (17:20:04) powerman: when C module function return tuple, like (int, string), in standard modules I see this snippet at very beginning of the function, before it start doing actual work: (17:20:04) powerman: tmp = f->ret->t1; (17:20:04) powerman: f->ret->t1 = H; (17:20:04) powerman: destroy(tmp); (17:20:44) powerman: I don't see why it's needed. Everything works just fine without it, no memory leaks/segfault. (17:22:17) powerman: Even in case it's needed, I don't see what's tmp var used for - why not "destroy(f->ret->t1); f->ret->t1=H;"? or it's protection against killing thread in the middle? (18:20:25) less1 left the room (quit: Ping timeout: 264 seconds). (18:30:27) wrtp: hmm, i suppose the tmp form is faster (18:30:41) wrtp: oh no, i see (18:30:43) wrtp: i think (18:31:23) wrtp: if you don't set it to nil first, then you've got a garbage pointer, and if the GC was concurrent, that would be a problem (18:32:44) wrtp: and if you don't destroy the dest pointer, then the garbage will only be reclaimed when the cyclic GC catches up to it, rather than immediately as it should (18:32:51) wrtp: powerman-asdf: ^^ (18:37:06) less1 [~pravin@32.97.110.64] entered the room. (18:42:30) less1 left the room (quit: Ping timeout: 268 seconds). (18:42:47) powerman: wrtp: you talking about f->ret? (18:42:58) wrtp: yes (18:43:17) powerman: who will store any alive pointers inside f->ret before calling module? (18:43:45) powerman: f->ret should be either non-initialized (may trouble with gc) or set to H (18:43:55) powerman: in both cases there nothing to destroy() (18:44:42) wrtp: powerman-asdf: what about: x := f(); x = f() (18:44:42) wrtp: ? (18:44:59) wrtp: in that case the f->ret in both cases points to x (18:45:16) wrtp: the second time it holds the previously returned value, which needs to be destroyed (18:46:16) powerman: i'm not sure f->ret point to real target... if that's will be true, they without destroy(f->ret) there should be leaks, but there are no leaks in this case (18:46:30) powerman: they->then (18:50:05) wrtp: there wouldn't be leaks - because the GC would collect it eventually (18:55:21) bvalek2 left the room (quit: Quit: Page closed). (18:58:56) less1 [~pravin@32.97.110.63] entered the room. (19:05:50) anth [none@cpe-76-189-197-62.neo.res.rr.com] entered the room. (19:16:41) powerman: wrtp: ok, let me check is i understand that correctly: (19:16:41) powerman: 1) f->ret points to real, existing user's variable, which current value should be freed before replacing with new one; so in this case for Limbo code "int i = 5; i = f();", if f() is in C module, then f->ret will be equal to 5 when f() called. (19:16:41) powerman: 2) so old value in f->ret should be freed (if it's reference type value) before replacing by new value (but if we forget to do this it will be eventually freed by gc anyway) (19:16:41) powerman: 3) destroy(x) free object, but doesn't change x value, so it become garbage pointer (why not protect against this by changing destroy(x) to destroy(&x) and have destroy set x=H?) (19:16:41) powerman: 4) to avoid seeing this garbage pointer by gc (if it one day become concurrent) this pointer first moved to local variable (while gc able to find such pointers inside f-> it can't find them inside local variables) (19:18:17) powerman: btw, why it was implemented this way? I mean, why C module function have access to _current_value_ of variable where result of this function will be stored? (19:19:07) wrtp: that all seems about right (19:19:16) wrtp: to answer your questions: (19:20:11) wrtp: it's quite possible that you have a pointer but it's not stored anywhere; so no need to set any memory location to H (19:21:23) wrtp: powerman-asdf: it's implemented this way for efficiency. limbo doesn't know it's calling a C-implemented function - and that's just the way that the calling convention works (19:22:57) powerman: ok, thanks! (19:27:45) wrtp: BTW, i slightly modified the offset calculation function i posted earlier to be slightly smaller/more efficient. http://pastebin.com/4dY9mefb (19:39:48) bvalek2 [~bela@unaffiliated/bvalek2] entered the room. (19:48:21) bvalek2 left the room (quit: Quit: I've seen things you people wouldn't believe). (19:50:10) powerman: wrtp: I see many standard modules do this: "destroy(*f->ret); *f->ret = H;". And this code is safe only while gc isn't concurrent. Correct? (19:51:28) wrtp: i think it's different for structured return types (19:52:11) wrtp: in the above example, f->ret is wholly owned by the callee (19:52:22) wrtp: hmm (19:52:46) wrtp: if in doubt, follow existing conventions! (19:53:18) wrtp: there are all kinds of subtleties, many of which i am not aware of... (20:05:51) mkmks left the room (quit: Quit: ERC Version 5.3 (IRC client for Emacs)). (20:23:39) wrtp left the room (quit: Ping timeout: 240 seconds). (20:33:21) wrtp [~rog@89.241.199.20] entered the room. (20:55:34) powerman: wrtp: i'm checking your last code, isn't malloc should be using n+1 instead of n? imagine this case "((()))" (20:57:12) powerman: wrtp: next, in case there bug in re2, you can get wrong offsets, but there no protection against this case and so you can go outside string looking for these offsets (21:00:38) wrtp: you're right about n+1 (21:00:55) wrtp: if there's a bug in re2, then the whole thing could crash anyway, so it's a moot point (21:01:15) wrtp: i think it's better to assume that re2 works correctly (21:01:28) wrtp: it's been very well tested (21:03:12) wrtp: you could use utfnlen instead of the char2rune loop if you wanted to protect against running off the end of the string (21:04:40) wrtp: hold on, no, the offsets have been calculated by subtracting pointers anyway. so unless re2 is returning a pointer outside the string (which could crash any program), the offsets will be within s. (21:07:01) wrtp: but utfnlen is probably faster for the usual case of lots of ascii and an occasional non-ascii char (21:07:58) powerman: i think utfnlen isn't needed, strlen is enough: we use if for char* string returned by string2c, so we can assume it's correct utf+\0 (21:08:45) powerman: so, using += chartorune we can at most eat one more byte outside string, which is \0, and chartorune will return 1 in this case (21:09:09) wrtp: no, i mean you can use utfnlen for the entire chartorune loop (21:09:26) wrtp: (the point of the loop is to count runes) (21:12:08) powerman: btw, are you sure it's safe to calculate offsets using StringPiece's .data()? I don't see any documented guarantees it will always point inside original string. if implementation details of RE2 or StringPiece classes will change, this may not be true anymore (21:12:43) wrtp: the whole point about StringPiece is it points into another string (21:13:05) wrtp: BTW the chartorune can change into this: fix = *--q; (21:13:05) wrtp: ch += utfnlen(t, *fix - (t - s)); (21:13:06) wrtp: t = s + *fix; (21:13:06) wrtp: *fix = ch; (21:13:18) wrtp: the chartorune loop, i meant to say (21:16:45) mennis left the room (quit: Quit: mennis). (21:17:12) wrtp: the re2 source compares pointers like that (although it uses .begin() instead of .data() - they return the same thing) (21:19:46) powerman: re2 source can do this - afaik authors of re2 and StringPiece are same (21:20:14) wrtp: yeah, but it means they're not gonna change that assumption in a hurry (21:20:30) wrtp: and anyway, if you can't do that, you can't find out where the match was in your string. (21:20:45) wrtp: and i'm certain they want you to be able to do that (21:22:15) wrtp: string (matched piece is copied to string) (21:22:16) wrtp: StringPiece (StringPiece is mutated to point to matched piece) (21:22:30) wrtp: that's from the docs (21:22:35) powerman: yeah, ok then (21:37:48) powerman: wrtp: my old tests passed with your function :) (21:41:28) wrtp: good. i still can't quite believe that it worked first time. that usually indicates a deep-seated bug :-) (21:52:15) mennis [~mennis@adsl-065-012-170-146.sip.asm.bellsouth.net] entered the room. (22:04:16) powerman: wrtp: pushed (22:09:15) anth_x left the room (quit: Read error: Connection reset by peer). (22:12:33) wrtp: powerman-asdf: you can delete origr - it's a hangover from testing (22:12:52) powerman: wrtp: but I think my version was more ease to read. things like *q[-1] and keeping many end offsets plus maximum one start offset in same queue are clever but I think that's more like hack/overoptimization than ease-to-maintain code (22:16:57) anth_x [~a@adsl-99-25-148-5.dsl.bcvloh.sbcglobal.net] entered the room. (22:27:21) wrtp: using the end offset and comparing pointers is a standard idiom (22:27:42) wrtp: *q[-1] is probably the worst infraction (22:28:12) wrtp: and q should probably be called something else, because it's a stack not a queue (22:31:37) wrtp: but i didn't optimise - i simplified. see that there's only 2 conditions and 2 loops. compared to the original with 5 conditions and 4 loops (22:38:40) powerman: i agree you simplified obvious logic, but not for free - part of this logic was moved to not-so-obvious pointer arithmetic and data structs. anyway, in general it's better so we'll use it and hope there no promised deep-seated bugs :) (22:39:40) wrtp: powerman-asdf: i have to say that i failed to understand your original code until you explained it to me :-) (22:40:58) powerman: yeah, I agree my opinion about comparing readability of mine and yours code can't be trusted :) (22:41:17) wrtp: yeah (22:41:29) wrtp: i think that in general it's a non-obvious algorithm (22:41:50) powerman: obvious one unlikely will be O(n) (22:41:51) wrtp: i wonder if a comment like this before the *q++ lines might work: (22:41:52) wrtp: // if next range starts before the (22:41:53) wrtp: // current range's end, then it will end (22:41:53) wrtp: // before it too. hence we can use a simple (22:41:53) wrtp: // LIFO stack to maintain the next offsets to fix up. (22:43:08) powerman: but i'm not sure there real needs in optimizing this code to O(n) (22:43:19) powerman: actually, it was done mostly for fun (22:49:27) wrtp: yeah, you're right (22:49:43) wrtp: could've done the whole lot with a single loop and utfnlen... (22:50:26) wrtp: i think that's what i'd've done first, but your code was a challenge, because i hadn't realised it could be done, and then i couldn't understand it! (22:53:28) visof [~visof@41.238.234.17] entered the room. (22:58:05) powerman: i'm sorry :) (22:58:40) wrtp: i really suffered (23:02:09) visof left the room (quit: Ping timeout: 240 seconds). (23:19:19) wrtp left the room (quit: Quit: wrtp). (23:22:26) mkmks [~nf@90-230-91-71-no148.tbcn.telia.com] entered the room. (23:55:19) adelfino [~username@201-212-160-19.net.prima.net.ar] entered the room. (23:55:27) adelfino left the room (quit: Remote host closed the connection).