echo: rberrypi
to: MARTIN@MYDOMAIN.INVALID
from: DAN CROSS
date: 2020-03-19 15:19:00
subject: Re: Regexes and C

Martin Gregorie wrote:
>I spent more time than I should have yesterday trying to understand
>regcomp(), regexec() and regerror() well enough to validate a string
>containing an e-mail address string to make sure that: its structure is
>correct and neither the username nor the domain contains characters they
>shouldn't.

It's not clear to me that the full syntax of email addresses
can be represented in the regular languages.  Undoubtedly
a useful subset _can_, but in their full generality, you
may need a pushdown automaton.

>The upshot was that I couldn't do it because I could not write a regex
>that would detect spaces in the address because apparently regcomp
>doesn't provide any way to anchor a regex to either end of a string, so I
>ended up with a negated regex that detects invalid characters in the
>string and hasn't a clue whether it's syntactically correct:

What do you mean "doesn't provide any way to anchor a regex to either
end of a string"?  That's what the `^` and `$` metacharacters in the
regex are for, and they're fully supported by the library.
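
For illustration, here is a minimal sketch of anchoring with the POSIX
calls.  The helper names `regex_search()` and `is_simple_email()` are
hypothetical, and the pattern is my own assumption: it covers only a
simplified subset of addresses, not the full grammar.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if pattern matches somewhere in text,
 * 0 if it does not (or regexec() fails), -1 if the pattern is invalid. */
int regex_search(const char *pattern, const char *text)
{
    regex_t re;
    char errbuf[128];
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_EXTENDED selects POSIX EREs; REG_NOSUB skips submatch capture. */
    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        /* regerror() turns the regcomp() error code into a message. */
        regerror(rc, &re, errbuf, sizeof errbuf);
        fprintf(stderr, "regcomp: %s\n", errbuf);
        return -1;
    }

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}

/* Deliberately simplified address check (an assumption for this sketch);
 * the leading ^ and trailing $ force the whole string to match. */
int is_simple_email(const char *addr)
{
    return regex_search(
        "^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+$", addr);
}
```

Because of the anchors, an address with a space anywhere in it (embedded
or trailing) is rejected, whereas the same pattern without `^` and `$`
would happily match a valid-looking substring.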

>This does the trick, but no thanks to the man pages regex(3), which
>describes the C functions, and regex(7), which describes the regex syntax.
>Both are poorly formatted, hard to read, and seem to have omitted useful
>information, such as the inability of specifying anchor points in strings
>that DO NOT contain newlines.

Could you clarify what you mean?  `$` matches the empty string at
the end of the string and `^` matches the empty string at its
beginning.  By default, the library treats newlines as ordinary
characters; they're only significant as line boundaries if you pass
the `REG_NEWLINE` flag to `regcomp()`.
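
A small sketch of what `REG_NEWLINE` changes, as I understand the POSIX
behaviour.  The helper `search_with_flags()` is a hypothetical name for
this example, not part of the library.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile pattern as an ERE with extra_flags added,
 * then return 1 on match, 0 on no match, -1 if the pattern is invalid. */
int search_with_flags(const char *pattern, const char *text, int extra_flags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | extra_flags) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With the text "first\nsecond", the pattern "^second$" does not match when
`extra_flags` is 0, because the anchors apply to the whole string, but it
does match with `REG_NEWLINE`.  Conversely, "^first.second$" matches by
default, since `.` is allowed to match the newline, and stops matching
once `REG_NEWLINE` is set.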

>So, can any of you do better, i.e. write a regex that CAN validate the
>syntax of an e-mail address in terms of its structure and the set of
>permitted characters on the username and domain parts (the permitted
>character sets are not the same).
>
>Also, if anybody can suggest a better tutorial on using these functions
>or suggest another, better, set of C functions for doing the same job,
>that would be wonderful.

Perhaps if you could post your code, one might be able to see
an issue?

As for other libraries, if you can link against C++ code, the
RE2 library is very nice.

>PS: I did check my old reliable standby text - David Curry's "UNIX
>Systems Programming for SVR4", but it wasn't helpful in this case
>because, unusually, the set of functions in the C Standard Library have
>changed both names and parameters since it was written.

You'd want something that covers the POSIX interfaces.

 - Dan C.

