echo: c_echo
to: AUKE REITSMA
from: DARIN MCBRIDE
date: 1998-02-12 22:03:00
subject: Comment parsing

 MS>> Comment extraction is not as easy as one might first think.
 AR> ...
 WM> Some pseudocode:
 WM>  Read source file until '/' or EOF
 WM>   If '/', check next character:
 WM>    If '/', read comment (C++) to EOL (optional)
 WM>    If '*', read comment (C) to '*/'
 WM>  Loop
 WM> What have I missed?
 AR> String constants. E.g.:
 AR>     char foo[] = " /* comment ??? ";
 AR>     char bar[] = " end comment */ ";
 AR> After that is handled:  char baz = '"';
If one can get all the states of a C-parser into one's head, eliminate 3/4 of 
them due to being irrelevant, and put them down into a state-graph, it 
quickly becomes simple.  Tedious, but simple.
I recently went through the task of finding links in web pages this way.  C 
is a little (anyone sense a bit of understatement here?) more complex, but 
not that much more so when you're limited to finding only certain things, 
such as comments.
Your baz example, for instance, would be handled simply by knowing when 
you've entered, and when you've left, a single-quote "string" (I can't recall 
the term - it can't be "character" because you can have more than one 
character in it).  Same for string constants - in and out of double quotes.  
Then you have to ignore the character after a backslash.  Finally, you 
disregard ALL of the above when you're in a comment.

read next byte until EOF
  if inComment 
    if is end of comment
      inComment := false
      append a cr to output
    else
      add byte to output
    endif
  else if backslash
    skip next byte
  else if inSingleQuote
    if is single quote
      inSingleQuote := false
    endif
  else if inDoubleQuote
    if is double quote
      inDoubleQuote := false
    endif
  else if is single quote
    inSingleQuote := true
  else if is double quote
    inDoubleQuote := true
  else if start of comment
    inComment := true
  endif
loop

There is some obvious expanding to do in the "start of comment" and "end of 
comment", but the basics are there.  This can quickly grow, however, to 
difficult-to-manage code where you end up being better off using yacc/bison 
or some similar tool.  I'm gonna have to learn one of these so I can improve 
my HTML parser... :-)  However, I won't just be able to pick one up off the 
shelf - I need it to produce Java code.  :-/
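
For anyone who wants the "start of comment" and "end of comment" tests spelled out, here is a minimal sketch in C of the same idea. It is not the code from my own parser. The names extract_comments and put are made up for illustration, and it folds the boolean flags into a single state variable. It departs from the pseudocode in one place: it only honours the backslash inside quotes, where it actually matters. It assumes the whole source is already in a NUL-terminated buffer, and it deliberately ignores trigraphs, backslash-newline splicing and C++ raw strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: store c in out if there is still room for it plus
   the terminating NUL; returns the new length. */
static size_t put(char *out, size_t cap, size_t n, char c)
{
    if (n + 1 < cap) {
        out[n] = c;
        n++;
    }
    return n;
}

/* Hypothetical function, not from the original post: copies the text of
   every comment in src into out (capacity cap bytes), with a newline after
   each comment. Returns the number of bytes written, not counting the NUL. */
size_t extract_comments(const char *src, char *out, size_t cap)
{
    enum { CODE, STRING, CHARLIT, BLOCK, LINE } state = CODE;
    size_t len = strlen(src);
    size_t n = 0;

    assert(cap > 0);

    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        char next = src[i + 1];          /* safe: src[len] is the NUL */

        switch (state) {
        case BLOCK:                      /* inside a C comment */
            if (c == '*' && next == '/') {
                state = CODE;
                i++;                     /* consume the '/' as well */
                n = put(out, cap, n, '\n');
            } else {
                n = put(out, cap, n, c);
            }
            break;
        case LINE:                       /* inside a C++ comment */
            if (c == '\n')
                state = CODE;
            n = put(out, cap, n, c);
            break;
        case STRING:
        case CHARLIT:
            if (c == '\\' && next != '\0')
                i++;                     /* ignore the byte after a backslash */
            else if (c == (state == STRING ? '"' : '\''))
                state = CODE;
            break;
        case CODE:
            if (c == '"')
                state = STRING;
            else if (c == '\'')
                state = CHARLIT;
            else if (c == '/' && next == '*') {
                state = BLOCK;
                i++;                     /* consume the '*' as well */
            } else if (c == '/' && next == '/') {
                state = LINE;
                i++;                     /* consume the second '/' */
            }
            break;
        }
    }
    if (state == LINE || state == BLOCK) /* comment ran into EOF */
        n = put(out, cap, n, '\n');
    out[n] = '\0';
    return n;
}
```

Fed Auke's foo/bar lines, this should produce nothing at all, and the baz line followed by a real comment should yield only the comment text. Once it has to cope with line splicing and the like, this is about the point where a generated lexer starts to look attractive.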
Just my 2 cents.
---
---------------
* Origin: Tanktalus' Tower BBS (1:250/102)

SOURCE: echomail via exec-pc
