[illumos-Developer] To sed, or not to sed...

Tue Dec 14 23:57:43 PST 2010

Btw, last night I updated this webrev with a fix for another interesting 
bug, which is that the FreeBSD version of regex we imported doesn't 
support \< and \> word delimiters (but it does support [[:<:]] and 
[[:>:]] for the same purpose.)  So the fix for *that* is part of this 
webrev.
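
For anyone who wants to see the mapping concretely, here is a minimal sketch of one way to rewrite the GNU-style delimiters into the bracket forms the FreeBSD regex understands. It is not taken from the webrev; the function name translate_word_delims is hypothetical, and the sketch deliberately ignores bracket expressions (where a backslash is an ordinary character), so treat it as an illustration only.

```python
def translate_word_delims(pattern: str) -> str:
    """Rewrite \\< and \\> as [[:<:]] and [[:>:]]; copy everything else.

    Hypothetical helper for illustration. It does not track bracket
    expressions, so a pattern such as [\\<] would be rewritten incorrectly.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "<":
                out.append("[[:<:]]")   # start-of-word
            elif nxt == ">":
                out.append("[[:>:]]")   # end-of-word
            else:
                # Copy any other escape pair untouched, so an escaped
                # backslash followed by a literal "<" is left alone.
                out.append(ch + nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(translate_word_delims(r"\<foo\>"))
```

Consuming each escape as a pair is what keeps an escaped backslash followed by "<" from being mistaken for a word delimiter.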

The timer on this is ticking, so I'd really like to hear any feedback 
soon.  (Or, if someone has any reason why I shouldn't drive forward with 
this change, then I'd like to hear that too.)

    - Garrett

On 12/12/10 10:34 PM, Garrett D'Amore wrote:
>
> Here's my webrev, for my version:
>
> http://mexico.purplecow.org/gdamore/webrev/sed/
>
> Note that this depends on a change in libc, to enable REG_STARTEND.
>
>     - Garrett
>
> On 12/12/10 09:52 PM, Garrett D'Amore wrote:
>> So one of our "closed" gaps is "sed".
>>
>> Rich Lowe and I have each independently ported FreeBSD sed to 
>> illumos.  There are some minor differences though, which brings me to 
>> a question where I'd like to hear opinions -- preferably those backed 
>> by concrete supporting evidence.
>>
>> First off, a bit of background:
>>
>> As far as I can tell, xpg4's sed implementation attempts to adhere to 
>> POSIX by fully supporting multibyte characters, whereas legacy 
>> /usr/bin/sed treats the file as a stream of bytes.  In fact, legacy 
>> sed treats the file as pure ASCII.  Furthermore, legacy sed uses a 
>> different output format for the "l" command: some characters are 
>> escaped oddly (backspaces and tabs become < and >) and a two-digit 
>> octal form is used.  xpg4 sed uses backslash escapes for a few 
>> characters (\\, \a, \b, \f, \r, \t, \v) and 3-digit octal format for 
>> non-printable characters.
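
To make the xpg4-style format concrete, here is a minimal sketch of the escaping described above. It is not code from either port. The name l_escape is hypothetical, the input is assumed to be plain ASCII bytes in the C locale, and line folding and any end-of-line marker are left out.

```python
# Escapes listed in the message above, keyed by byte value.
ESCAPES = {
    0x5C: "\\\\",  # backslash
    0x07: "\\a",   # alert
    0x08: "\\b",   # backspace
    0x0C: "\\f",   # form feed
    0x0D: "\\r",   # carriage return
    0x09: "\\t",   # tab
    0x0B: "\\v",   # vertical tab
}


def l_escape(data: bytes) -> str:
    """Render bytes the way the xpg4-style "l" output is described above.

    Hypothetical helper: assumes ASCII input, no line folding.
    """
    out = []
    for byte in data:
        if byte in ESCAPES:
            out.append(ESCAPES[byte])
        elif 0x20 <= byte <= 0x7E:
            out.append(chr(byte))            # printable ASCII passes through
        else:
            out.append("\\%03o" % byte)      # 3-digit octal for the rest
    return "".join(out)


if __name__ == "__main__":
    print(l_escape(b"a\tb\x01"))
```

A multibyte-aware version would have to decode characters in the current locale before deciding what counts as printable, which this sketch does not attempt.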
>>
>> So, I believe Rich's work adds support for building a separate XPG4 
>> and /usr/bin version, that gives the "traditional" behavior for "l".  
>> However, his version does not address the CSI problem at all.  
>> (Neither does mine, since I make no attempt at providing the non-CSI 
>> compliant legacy behavior.)
>>
>> IMO, this is an excellent time for us to simply ditch the legacy 
>> behavior, and move to the POSIX syntax that all other OSes use.  This 
>> would enhance our compatibility with GNU sed, and *BSD sed.  (In 
>> fact, there are several other features that we will get to improve 
>> such compatibility, such as -i support, regardless of which port we 
>> ultimately go with.)
>>
>> I'm not emotional here.  I just don't want to integrate new code to 
>> support the legacy behavior if there is no need for it, or if it 
>> hurts us more than it helps us.
>>
>> If folks really think we should retain the legacy behavior (or as 
>> much of it as we can), I'm willing to go that route.  Personally, I 
>> *think* we may stand more to gain here by breaking with that legacy 
>> and going more towards POSIX/GNU/BSD compatibility.  However, I don't 
>> do much with sed beyond simple scripts, and indeed I've never used 
>> the "l" command.  So I freely admit that someone else may have a more 
>> complete picture here, and I'd like to hear more.
>>
>> If there are any sed wizards out there who have some good test 
>> scripts that I can easily test (send me the script, input files, and 
>> expected output), I'll be happy to verify correct functionality 
>> before I push towards integration of any sed replacement.
>>
>> I'd like to have a decision, and ideally code reviews and integration 
>> done, before the end of the week.  So please be timely in your feedback.
>>
>> Thanks!
>>
>>     - Garrett
>>
>>
>> _______________________________________________
>> Developer mailing list
>> Developer at lists.illumos.org
>> http://lists.illumos.org/m/listinfo/developer