Tuesday, June 1st

  • Re: above: The Solaris thing was something in the socket code.
  • Re: Wiki. Okay! Let's do this outside the firewall.

    Spent some thinking about next steps as well as shepherding some paperwork through here (and talking to Clint and Steve Z about it).

    (1) Can we demonstrate the ability to use Clint's Unicon visualization tools from Python? (2) Let's get inside the interpreter ... (3) Can we write a tool using the sys.profile hooks already there? This might the best thing to do to show the IRAD committee we can make progress.

    We can get inside the "old" profiler, but we can't track C routines. This might be the thing to do to get a tool going. but, the newer, "better" profiler is all written in C (hotshot) and we probably need to understand that. That's a little harder.

    My current proposal is that Clint and I work on (2) concurrently (so we can both understand the internals) while I work on (3) and Clint work on (1) so we can make progress across all 3 axis.

    The idea for a paper is, hopefully, that maybe we can do cross-language profiling inside of Python and the C extensions. Show we can do it and maybe even have some reasonable tools.

    For future monies, we can demonstrate that we can port some of Clint's Unicon tools, get some stuff going and say "look we can do it ... can we get more money to make this a real product for Python and X-Midas?"

    Thoughts? Clint? Derek?

    Hey, look at this paper. Maybe we don't need python hooks. http://www.python.org/pycon/dc2004/papers/1/presentation.pdf No seriously. Isn't this Ken Walker's thesis?

    There are some interesting points from the talk.

    (1) Converting python code to C "directly" with p2c doesn't give much speedup. It's getting rid of the dynamic dispatch. (2) Numeric is hard to use ... it would be nicer to use Lists (!) (3) Writing in C may not be the thing to do.

    Now Self (an old Stanford dynamic system) used similar techniques: specializing code for speed. Psycho does the same thing. The times I have seen "How to speed up Python" invariably offer two techniques: (1) Use C or (2) Make "x" a local instead of global or namespace. This talk validates that "lookup" and dynamic lookup are bottlenecks for our programs. Which may offer us more direction: Perhaps Python Hooks should be trying to time the dynamic dispatch?

    Wednesday, June 2nd

    StarKiller: From the context, a decision point sounds like it is any place where you have to choose, at runtime, among competing types/code branches. Really, all he wants to do is take code, look it, and get as much static information possible out of it so he can 'specialize' code to avoid all the extra checks. No LGB rule, no dynamic dispatch, no virtual call overhead if you don't need it. (Now if he could do this for everything, he might be on his way to solving the Halting problem which is pretty hard. ;)

    I talked to Clint last night, he agrees: it sounds pretty vaporous. Moreover, Ken Walker's work went back into Icon and was much more real and thought out. But yes, it's a good idea. Both Starkiller and Psyco use the same ideas.

    Philosophy: Consider, the Linux kernel is written in C because, according to belief, it needs to be as fast as possible and we are willing to spend the extra time writing in C for that. There is also a philosophy, which we seem to be adherent to for this project, is that most of your program can be written in Python then write the few deep parts in C/C++. These two philosophies are at odds with each other slightly.

    If the "write high 90%, write low 10%" philosophy were true, then the Linux kernel would be written in mostly Python. Is that a good idea? Maybe.

    We also believe that tight C/C++ data structures are important for speed (thus the whole C Linux kernel).

    I am not trying to go off into left field here but point out something: there's something missing in characterizing performance. Perhaps because a kernel is a different type of entity that a DSP app or database, this may not apply. But I don't think so. The whole StarKiller and Pysco python compilers point out that "code specialization is good", but how is this characterized? THAT is the question I think we need to answer. Gilbert points out that function dispatch could be a lot faster if they specialized for string lookup. How do we know this? We need tools for this.

    This reminds me of the early midas 2k days when memory was a bottleneck of shared memory CPUs, but we didn't have any tools, reallly, to determine it. It became true computer science: Postulate that malloc is the bottleneck, try experiments (rewrite massive amounts of code) to determine this. This worked, but it was a mess. Why didn't our tools point this out? What tools would we have wanted to point this out? (To revisit, the same problem happened with too much locking in the initial M2k baseline. What tools would have helped us find this?)

    Blue sky mode on: Wwaiting. We somehow needed to know we were waiting: for a socket, for a malloc, for a "too much locking" point. There seems to be a bunch of things we could wait for: (1) CPU: there are so many threads, a thread may just be waiting to be scheduled. The thread is waiting for CPU. Or a CPU bound job has the data but just can't process it fast enough. (2) Data: A thread is waiting for a buffer to be filled by DMA, disk, socket

    There seems to be a mutual relationship here. Either I am being waited for, or I am waiting for something. A CPU bound job is waiting for CPU, the results are being waited for by a consumer. In an ideal world, there is no waiting for data, no waiting for CPU.

    If we expand this idea to the StarKiller metaphor, we are waiting for Dynamic Dispatch meaning we are 'waiting' for resolution. of names/lookups. Why don't we see this in our tools? Because it is small grained, a very small amount of overhead for a hash table lookup at a single site, but amortized over the life of a program, it's huge.

    So, how can we find the Starkiller overhead? We would need to be able to see all dynamic dispatch events and charge them all to the same place. We need to be able to see that we are waiting for the dynamic dispatch. So we need to classify and aggregate. But that begs a different question, how do we classify? We know, maybe from our experience, maybe from some real "computer science" that dynamic dispatch is slow. But how could we have classified this apriori?

    Our profile tools tell us how much time we spend in a routine, they even tell us how long they are on the stack. We need to know how long we wait for a resource OR how long we are waited for.

    (Another thought: how do we "detect" inline code? This may not be relevant in a Python program, but a "sequence of code" (like function call overhead) is spread out over an entire program, but it's all the same kind of code: In this case, moving registers in and out of memory. If we "inlined" just a certain type of code to avoid this function call overhead, we can save. But again. How?)

    More blue sky: most profilers will give us::

          while (i<100) {
     90%    a[ i ] += sin(i++);
          }
    
     2%   u.lock();
     5%   u.recount++;
     3%   u.unlock()
    
    But we need to know lock got swapped out... we actually waited for 5 minutes until we swapped backed for an activity that was only charged 2% of time. Maybe we need "actual time vs. CPU time"? Context switches of the machine. How about a display like this? A box which is the total CPU time in the routine, with red being actual CPU time. A cpu bound routine would have this being all red, a waiter would be almost all blue (blue being the color of actual time).

    Okay, but how do you charge callees? Maybe a callee returns both the actual time and CPU time and we can view callees either way. Actually, we probably do so we can characterize what was the biggest wait point. I can imagine two graphics, a little box on the side with red/blue percentages or different sized arrows pointing in and out (the bigger the arrow the more time waiting).

    That tool might have been useful for the locking or malloc problem, but what about the dispatch problem.

    Thursday, June 3rd

    Dispatch problem. Been keeping this one on my brain. So, the time between you call a routine and actual enter a routine is overhead. For plain old C programs, this overhead would be be function call overhead (save context, goto new context). It would be nice if we could record call overhead. For C++ virtual functions, the overhead would be more (indirection to vptr table, then indirect call through vptr), as of course would python (LGB rule).

    So, "overhead" is the time between the call and the time you actually execute the first statement of the called function. For inline functions, that would be nothing (or maybe 1 or 2 statements for copying depending on how you count it), C functions would be the time to do a function call context switch, C++ virtual functions would be C overhead plus indirection. Python calls would involve time to run the virtual machine, do the lgb lookup.

    So, how about this: ideally we would like a little "progress bar" by every function with 3 colors in it: red being the amount of "actual CPU time", yellow being "call overhead", and blue being "time spent waiting": Add up all three times and you'd get the total amount of time spent in the function.

    Then of course, we have to be able to ask for agreggates of data: how much time was spent TOTAL in function call overhead for this section of code, or this call, how much time was spent waiting in routines named "fred", how much CPU compute was spent in "compute".

    So, I am imagining a tool like quantify, except that it's available at run-time. You can see callers, calless, the times called plus all the info we ask for for certain routines. We turn on or off as we run to see more or less info.

    More thoughts: The C extensions can't (at this point) be measured with the debugger. So, there seems to be this "can't measure" portion of the process. Maybe if a routine calls a routine that can't be measured, the "can't be measured" portion should be clear (white). This metaphor could extend through the whole program: we only want to measure certain routines and don't care about profiling the performance of ancilliary routines, so even those contribute to the total overrunning time of a function their "contribution" to the performance bar would be "white" meaning, ya some time was spent there, but we don't know how much was spent in waiting,CPU or overhead. Turning off and on these is of course, important.

    More thoughts (I am thinking way too much today): Clint and I had this discussion about threads and the architecture of using Alamo and Python in concert, how should they communicate? Socket? Co-routine? Shared Memory? Here's an important thought: WE WANT TO PROFILE WITHOUT TOO MUCH IMPACT ON PERFORMANCE. If an application does not use an entire 4 cpu machine, the profiler should sit in another thread/process and take "simple" events and turn them into the runtime monitor, ie., getting a "compute called at 4:45", "compute returned at 4:46" and take these events over into the concirrent monitor which takes them and "visualizes" them (sorts the events, puts them in a display), but for this to be a "good" system, it can't interfere too much... a truly concurrent task could do all the processing "orthogonally" and visualize without too much interference. Ugh. Is this going to be a problem because Python is not multithreaded? It sounds like we may have to have two python processes talk over shared memory. (Again, Python should be thread-safe/hot/re-entrant)

    Friday, June 4th

    Lost half a day yesterday because the pizza I had for lunch was bad. Ugh.

    "demoed" the ideas discussed in this Wiki page with Derek. Wish Clint was here to bounce ideas off of, but I will send him this page. Colors were important, they showed how I was thinking. Here's a little more formal (and hopefully clearer) summary.

    Things To time

    A display would have a little bar next to each source code line showing how it spent its time for each function. At the top of the function would be a progress bar made up of the various colors showing the "total" time spent in the function doing each activity.

    As Derek pointed out, we want a "demo at every point". Activity for Richie: using existing framework in Python, which only reports stats to file, make it so it reports at run time. Slowly add features to this: allow us to move profile events to another machine for visualization, allow us to move profile events to another process for visualization, create better and better tools.

    Plan of attack

    Intercept profile events, figure out how to report them at run-time. May involve crude TCL display Consider architecture for shared memory dispatcher so we can "offload" profile events to another thread or machine or process (Maybe at this point, We could talk to Alamo and Unicon) Figure out how to compute CPU time (timing unencumbered Python codes), integrate info back in Better displays

    Goal, try to keep a demo available at all times after initial capability.

    Tuesday, July 5th

    (1) Spent some time getting back into the game. Looked at Clint's Wiki and seeing what he's done. Spent some time trying out and getting used to the Wiki over here ... talked to Clint about a few dumb bugs. Seems to work now. (2) Talked on the phone with Clint and sent him the "better" shared memory files. (3) Discussions about what we need to send from Python to Alamo/Unicon. I need to send a C struct to Clint.

    Wednesday, July 6th

    Looking at documentation for "Extending Python". Some of the docs in there might be very useful for the documenting python subproject.

    Wednesday, July 15th

    Been busy with SSGS and other stuff at RRC. I have looked at the extendind Python Docs and think that making a new type "SharedPipe" for Python is the way to go. Clint came today and we chatted. One major thing we decided is that I need to do a TDY to NMSU to really get some stuff done. Thinking even as early as next week ... Building Unicon and Python and get them talking is what we want to do. We also mentioned maybe a "coverage" tool? Maybe this is outside the scope right now ... i thought it might be a useful project for a student ... Clint and I also looked at why "pthread_mutexattr_getpshared" compiles and links but doesn't work: returned error code 38 "not implemented". D'oh! Brought in Bob Arendt and we discussed possible solutions: Upgrading the kernel, upgrading to a i686 rpms, upgrading to an SMP kernel. I did a little google search and found some people who say "make sure you liknk with -pthread" ... the web page I found is http://www.webservertalk.com/message275669.html

    Thursday, July 22nd

    Been a good week. I finally waded through all the embedding and writing C Extensions documentation. Currently, I am building a new type called "Piper" from a "shm" module in Python. [This is a good milestone]. I am looking into "buffer" operations (perusing the C API, I found some docs on "buffers" which is what I want to use for the actuall memory). The good news is that I seemed to be able to create a shared memory module/type in Python. Before I come to Cruces, I want to have a memory module in hand where I can send some "buffers" over shared memory. Feel like we need three things: (1) make sure I can talk over shared memory from python to python. (2) I need to make sure I can send a "structure" (which I may have to make a new type for) of C event that UNICON understands (3) I need to work with Clint to make sure UNICON can use these same shared memory routines. Before I hot Las Cruces next week (1) must be done (2) I would like to be done and (3) I can work on when I get there. Configuring XMPY:
    Get the xmpy280.tgz.
    untar unzip
    cd xmpy280
    setenv XMPY_ROOT $PWD
    cd unx
    ./configure
    make -f Makefile_support
    source $XMPY_ROOT/unx/pyenv.csh
    

    Friday, July 23rd

    Derek and I built XMPY: It is python with Numeric, TK and Tix (extension to TK) modules built in. With this, we can try out wireframe.py if we get a chance next week. [If X-midas exists, then XMPY builds with -taso and some other options] I need to send to NMSU: Wireframe and some other demos to try to profile, the SharedMemory object interface in Python, and ... is that it? More discussions withg Derek (yesterday) about direction: I don't think I need to worry about the "buffer" objects: I just get a string and blit the information over. So I think I am ready to come to Cruces. Spent some more time today on the interface: got it so we can send some strings across, reading and such. Good enough for us to make progress I think. I found a memory leak as well which I immediately plugged: boy this C API stuff is kinda messy. ;) I will probably e-mail you (Clint) a whole bunch of files here in the next few hours. Be on the lookout!

    Monday, July 26th

    After a brief fiasco with Internet connectivity, Clint and I had some discussions. (See Clint's Wiki entry). My job for today is to try to be reporting statistics to a log file ... The question is, what does that mean? The profile module (the current 2.2.3 and is the same for 2.3.4 but will change in 2.4) is written completely in Python and is actually pretty easy to understand. Internally, it keeps a table of "timings" that it marshalls and writes to a file. It may be the case that satisfies our constraints, BUT what we really need to do (I think) is write out something UNICON can read (so in text or some very simple encoding). I think we will be inside this code very soon modifiying it ...

    Some more thoughts; One of the most important things that came out of writing the Shm module was understanding how the Python Type Objects work. These objects are basically giant "jumping off points" for creation/initializtion/destruction routines all call. This will be a PERFECT place to add new hooks besides the standard sys.profile hooks. I know that some of Clint's UNICON tools count (for example) all string creations. If we "wedge" in the tp_init method in the PyStringType object, we can intercept all string creations and get an idea/visualizatioj about how many were called!

    Wedge: An old Operating Systems term for a "linear string" of calling routines that, when finished, would call the next one. We can probably easily turn on/off wedging/instrumentation (esp. since Python has a giant global lock) by wedging routines WHILE PROGRAMS ARE RUNNING.

    So, I suppose my next task is to take the Python profile and write out a TEXT log file so that UNICON can read it.

    I am more convinced that moving the profiling to another thread machine or process is the right thing to do, and also why we don't see more of the profiling module. In my tests for above (I did get Python spitting out some text), the first few times I did anything, I messed up and I was profiling when I was putting things inside the strings! I.e., I was profiling the profiling .... And, it feels (based on the interface given to us) like turning off the profile hooks while we are doing profile work (for visualization) would be difficult or invasive, etc. Now, my guess is the newer profile modules in Python 2.4 are written in C (like the hotshot module from 2.3.3) and is a little better at "staying out of the way" of Python. BUT, I do think the right thing to do is to shoot information over to another process and stay OUT OF THE WAY of the the real Python appplication is running. Otherwise, we will be profiling the profiling the profiling ...

    I am currently downloading the Python 2.4 alpha to see how they do profiling there.

    Perturbation Drives C-based Instrumentation

    Even minimal experiments with producing instrumentation using Python (e.g. Profile module) shows immediate strong intrusion into the behavior being observed. The solution is probably to write the instrumentation in C. A corollary is that the analysis is the job of the monitor, not the instrumentation. Also: we are reminded disk I/O is potentially a lot slower than network or shared memory

    What events do we want to instrument (in C)? Time to go back and review our original goals/requirements meeting notes.

    Because we have to stringize everything, Richie is worried that the performance might be slow, but he has not demonstrated that this is actually a problem yet. The Alamo style of whole-program shared memory instead of a little window of shared memory might be cheaper since things don't have to be stringized. Comparing and contrasting the relative performance and capabilities of these two approaches might be a decent research paper.

    Tuesday, July 27th, 2004

    Been inside the 'profile.py' module using that for examples of how to use the sys.setprofile. I wrote a filelogger that basically put all events to a file, ala Clint. This forced us to hammer out a file format. The info we print may be excessive (for example, we actually don't need a lot of info when we "return" from a function because we always know the last function we returned from, or at least we could maintain it.), but a quick copy of a c string isn't that much info to pass around. (Again, this harkens back to the argument of putting Unicon and Python in the same address with different threads (or even multiple Unicons) so that more sharing could go on, but for now we are just trying to pass info around ...)

    We are currently going to use the SharedMemory module I wrote last week, but surprisingly the POSIX shm_open isn't implemented on every machine here at NMSU?? So, I am zooming over to a Solaris 9 machine .. currently I am porting OpenContainers to Solaris 9 (not much work, very easy) (remember, the SharedMemory abstraction depends on the OpenContainers).

    The hope is that by the end of the day, I will be passing some Shared Memory to Unicon.

    GRUMBLE! Building XMPY on Solaris 9 gave me some issues, but I think it's because we are using a "non-standard" environment (NMSU) where make and gcc are in very different places, so I had to go inside of makefile_install and "expand" the path so that it could find the cc compiler and make. Weird!!!!

    XMPY built on Solaris 9. OpenContainers builds on Solaris 9.

    Wednesday, July 28th, 2004

    Finally got a parking pass so I can come in earlier than Clint. ;) And with XMPY, which uses python 2.2, you can't build C++ modules, or at least it doesn't appear that you can using the "xmpy setup.py build" mantras. I have a couple of choices, build a "C only" version, upgrade to Python 2.3.4 (but then I have to go through the process of building Tk and Tix so that wireframe.py works) without XMPY, or try to get XMPY working with 2.3.4 (the latest greatest).

    Some good news: The shared memory stuff seems to work on Solaris (outside of Python at least).

    Currently moving the SharedMemory module so it works a python C module so I can build it directly under XMPY.

    GRUMBLE: Spent some time trying to get C++ to link the shared library under python 2.2 and couldn't get it to work (went in and hardcoded the SunCC compiler in unixccompiler.py and it linked with C++, but still had issues). Don't know why. I had this working with python 2.3.4. I am bailing on XMPY, I have already wasted half of today trying to build my extension module on XMPY, when I know it works under Python 2.3.4. It may be that Solaris 9 is killing me, because I don't think XMPY has been built under Solaris 9. What this means: I may have to manually install TIX and TCL and Numeric to use the wireframe.py demo, or I can make progress without that particular demo for now (which is what I am probably going to do).

    GRUMBLE: Okay and fought with and rebuilt python 2.3.4 with g++ and it appears I can finally link with the shm stuff on SOlarus BUT NOW readline doesn't work !! ARGHGH! Also spent some time chasing a seg fault: perror with ":invalid arg" to shm_open: you need a '/' at the beginning of your name on Solaris. The good news is that now I have a python that works with the SharedMemory module and Clint has a version of UNICON that works with it on the same machine. We should be able to talk soon!

    Tested and got a Python talking to Python on the same machine (but not with XMPY). Basically, I had to make sure that I used gcc and g++ (and configure --with-cxx=g++) to make sure it worked. Anyway, it seems to. [We did find a bug or two; on Solaris with POSIX shared memory, you have to have a '/' at the front and if you don't, the shm_open fails ... and I don't catch the error very gracefully. I need to fix that.] Talked to Derek last night about the Grad RA ship that we are funding, we would prefer an American citizen only so we can potentially hire him. Clint sent out an announcement to csgrads.

    Thursday, July 29th, 2004

    Found the number of the PLEASE lab where I am working: 646-3602. Derek, feel free to give me a call.

    After last night's success building, I tried XMPY again. Nope. The configure script for XMPY doesn't seem to use the --with-cxx=g++ flag. It automatically uses the default CC and cc for Solaris.

    Permissions issue: on Solaris, permissions are a bit sticky on the shared memory file, which lives in /tmp/SHMD* rather than /dev/shm. Setting the umask, and manually removing an older, permission-protected copy, were necessary to get writer/reader talking from separate userids.

    We couldn't get UNICON to Python to talk at first (we think there's a bug), so I backed off and tried having the python profiler talk to a python program. No problems there. This points to *I think* a build issue of how Python builds things and how UNICON builds things.

    [BTW, prstat is the perfmeter/top on Solaris 9]
    Okay, a few things: Moved to Clint's house and running on SUSE Linux. Got Python 2.3.4 built and then SHM extension module built. Works. Clint can do a UNICON to UNICON shared memory that works. We have found a few places where he was treating the opaque impl as a FILE* so that might explain why we had so many issues under Solaris. Some things to fix: All perrors need to throw an exception (some of the permission error ones don't). Unresponsive to CTRL-C: Need to make sure a CTRL-C will disconnect something some shared memory ... also need to add a piece of shared memory as a parameter so that we can tell when CTRL-C is pressed. Clint is currently tracking down the errors.

    Friday, July 30th, 2004

    So, Clint did a little work last night and got the Alamo framework working with our SharedMemory event. We had a little demo that visualized the stack frames: So SUCCESS! We got Alamo/UNICON?Python out the door. We did a few timing demos (on a single processor machine), but it is expensive. We demod a little fibonaci program and it took a while to run, partly (we think) because the formatting of the strings to send over shared memory is happening in Python ,as well as some oher hooks: We do a lot of work when we generate a single event, so I am going to revisit the speed a little. There are a couple of ways to handle that: first and foremost, we want to write a few more routines in C and avoid building excess strings, etc. That's for me to work on next week.
    Clint and I are deciding what to do next for tools. I think in about a week or so, I will probably want to do another TDY while Clint and I work independently on stuff we mentioned above.
    Right now, Clint is going to give me a demo of some more Alamo tools so wecan figure out what to do next. Come Monday, I am going to download all we have done this week and give a quick demo to show where we are.
    Some of the visualization tools are for lists: these might be good to work on, ... In order to get the next step, I have to be able to sending Clint the different types of list events ... he also has tools for Algae ... (it shows suspends in a hex map): We might be able to adapt that to show more history of a call stack.

    So, another one of my goals next week is to figure out how to send events from Python: Intercepting the TypeObject stuff.

    Clint and I need to "bargain" and figure out how to send him the events.

    Monday, August 2nd, 2004

    Half day today maybe?
    Clint uploaded our work, and I am currently downloading it to make sure I can build and run it here at Rincon. Then I can give Derek and and maybe Stephan a demo. Derek is in my office : we are chatting, we need to talk to Ervin about Grad students. Derek is thinking of some of the things we want to measure. Looking at the Events measured in Clint's Alamo is interesting ... but we want to know WHERE IS THE CPU going? Want to remember that we want LOW OVERHEAD. (I promised we clean it upand write in C to avoid Python overhead). Real time/ Low Overhead / Resource Budget

    Tuesday, August 17th, 2004

    Well, after 10 days of being in Denver, I am refreshing my cache. Clint has been doing work on the ULMA (Ultra Light Weight Architecture).

    Sunday, August 22nd, 2004

    Back into work. One of the things I am going to work on is to make the SharedMemory extensions more 'responsive' to Control-C events. Once I talked to Clint, we'll get more on track with with each other's work. My trip to Denver nuked me.

    Monday, August 23rd, 2004

    Taling with Clint, downloading ULMA.

    Tuesday, August 24th, 2004

    Got and built UNICON. In order to get the libshm.so in there, we had to change the LD_LIBRARY_PATH to include that from the sharedmem directory. Then we could build unicon. Next thing was to hook the 'shm' library into Python: Have to run the "setup.xxx" script under the shared mem directory: this puts a "build" directory with the shm.so that PYTHON needs: still not done, need to put the build/libxxx/shm.so in the PYTHONPATH so python can find it with an "import shm". (Also had to make sure we were using Python 2.3.4 so it supports C++). So, this allowed me to build and run python using the shared memory module, as well as unicon.
    The next step was running the algae demo: algae 'shmfoo' which reads events from the profiler: Currently we have a python program which runs "fact" called fib: It hooks into the profile events of python constructs an event and sends it over to the algae demo. Whoo! A lot of work for such a little demo.

    Showed Derek the demo, and he reminded me that we need to keep track of all the things we are doing so we can go back and show the IRAD committee.
    Next steps: (1) Make instrumentation faster .. use C routines because the python instrumentation is SLOW!
    (2) Instrument the TypeObjects and bring some more Alamo tools to the forefront
    Instrument virtual machine? Timings? One of the final products we want is a "visualizer". Who builds this? Who pays for it? We at Rincon know what we want. Do we go after the Python.org grant? Do we use one of "Rincon's" staples of money and go after that? I need to talk to Ervin tomorrow.

    A reminder for Derek: The SharedMemory routines we are using for ULMA are the ones the new pipe system will be using in X-Midas: for warmstart So this can be a "feather" in the AI's reuse cap. (Similar for OpenContainers which was recently approved for OpenSourcing): We are using libraries more and more rather than big monolothic frameworks.

    Monday, August 30th, 2004

    Came in Friday, did some quick checking in with Clint. On Saturday, we discussed future plans: We WANT to write a grant proposal for Python.org: Later this week, we will figure out what to do and start writing. Now that I am not going to Ireland, this isn't as big a deal (I thought we were going to have to do this before I left for Ireland). Here's another note: Another grad student pointed me to http://lxr.linux.no: Someone has a source code browser for the Linux kernel. Two things come to mind here: (1) Should we have something like this for Python? Should we try to install it? (2) Heck, could it be useful for XMPY source trees? Allright: Goal for today: Get the current python profiling using the C interface (to avoid all python overhead when the profiling code is called). Hopefully easy, but have to make sure we can rebuild everything in UNICON and Python. This also means I will have to build another "C Module" and link it in.

    How to get C string info out (from stringobject.h)
    /* Macro, trading safety for speed */
    #define PyString_AS_STRING(op) (((PyStringObject *)(op))->ob_sval)
    #define PyString_GET_SIZE(op)  (((PyStringObject *)(op))->ob_size)
    
    
    


    Have something that I think may work so we can use the C profile hook ... I need a "working" system to try this out under, so I am going to build things under Solaris. Unfortunately, we never got things working under Solaris (linux), but there's good reason to think it will.
    Grumble ..hate Solaris! Exact same problems as last time. So, we went ahead and moved back to Linux at Clint's house (Fedora Core 2). We got the demo working, and discussed ways to clean up the build process so Stephan and Derek could build it without us in the future.
    Also thought about "determining" speed: How much faster will the C hook be? So right now, on Clint's machine: 50% of the CPU is for X, 28% for Unicon and 22% for Python.

    Food for thought: sprintf is slow because ... it has to parse the format string and deal with tons of issues. Do you know of a standard way (that's FAST) to convert a string to a number? The only way I know is covered in C Traps and Pitfalls: Write it your self.

    Tuesday, August 31th, 2004

    All set up. I am hoping by lunch to have the algae demo working, but with the C interface, so it'll be a lot faster. I also have to meet with Mike today to discuss what we are doing next.

    Ah! Now I understand the section of the Python documentation where they talk about "how to call C modules": The Piper stuff is over in one module, and I want to write another module which will be using Piper stuff. Amazingly, this is quite convoluted to do: The Python docs go through an dizzying array of things you must do to get this to work. I think I will simply move the routine inside the module for now to make sure it works. (Seriously, it's PAINFUL to set up C calling other C because of dynamic loading issues, name clashing etc. I think it's probably because one or two platforms don't support it, and they have to go out of their way for it).

    Hot dog! Got the C Profile hook working with the Algae demo. A preliminary "top" look shows 48% X, 40% Iconx, 10% python ... so that's a good sign! It shows that the bottleneck is the graphics, and our % time in Python has shrunk from 22% to 10% for how much CPU it uses. I will time it, and post the results comparing the "python hook" and the "C Python Hook".

    For the CPython hook: X taking 48%, Iconx taking 40%, Python taking 10% gives:
    time python testshm.py
    done 10
    done 20
    2.522u 0.775s 0:32.14 10.2%     0+0k 0+0io 0pf+0w
    


    For the Python hook: X taking 40%, Iconx 34%, Python taking 20% gives:
    [jeffery@localhost python_code]$ time python testshm.py
    done 10
    done 20
    6.987u 1.289s 0:39.05 21.1%     0+0k 0+0io 0pf+0w
    
    Result: Slower by about 20%, and user time was twice as much in the Python hook. On a multiprocesser, when things could proceed indepently, we could probably (pure speculation) get the C version going twice as fast?

    Demoed to Clint, we conjectured on Speed: I am using sprintf to "build" a string, and that is taking us quite a bit of time! I am going to write that by hand, or just send the int.

    Wednesday, September 1st, 2004

    Reeling with joy over last night, I realize that what we need is a 2-CPU machine to really test this out on ... I will have to bring the demo back to Rincon and try it out on a DL380 or something.

    Today: Plan for today, Start trying to get events for List or String events. I have an "idea" what to do, get in an intercept some of the TypeObject calls, but need to figure out how to get those out. When this is "done", I head back to Tucson. I can't leave Cruces without getting another Alamo tool done! ;)

    Before we leave this entirely, Here's another timing test. Here, we've taken the sprintf and replaced it with 4 bytes (an int_u4) for the line number. Now, under top: X is 50%, Icon is 40%, python 8%
    time python testshm.py
    done 10
    done 20
    1.736u 0.967s 0:29.99 8.9%      0+0k 0+0io 0pf+0w
    [jeffery@localhost cprof]$
    
    about 1/2 the user time, about 5% faster: Obviously, X is the bottleneck! Again, on a multiprocessor, when X, Icon, and Python can proceed independently, I think we'll see a major speedup.

    Let's come at it from the other direction: Time how long UNINSTRUMENTED costs. Here's fib: we have to add the fib(30) cause it's so fast:.
    
    time python testshm.py
    done 10
    done 20
    done 30
    3.094u 0.040s 0:03.28 95.4%     0+0k 0+0io 0pf+0w
    
    
    Now, we'll JUST turn on the C profiling events: Nothing happens except an empty stub gets called.
    time python testshm.py
    done 10
    done 20
    done 30
    3.579u 0.031s 0:03.75 96.0%     0+0k 0+0io 0pf+0w
    
    Now, let's build the event to send, but not actually send it (drop it on the floor)
     time python testshm.py
    done 10
    done 20
    done 30
    3.860u 0.029s 0:04.03 96.2%     0+0k 0+0io 0pf+0w
    
    Now, let's send the event, but the other side is ONLY reading the event as fast aspossible and dropping it on the floor.
    time python testshm.py
    done 10
    done 20
    done 30
    29.316u 34.531s 2:26.68 43.5%   0+0k 0+0io 0pf+0w
    
    Strip the event to 4 bytes and other wise is reading and dropping on the floor.
    time python testshm.py
    done 10
    done 20
    done 30
    29.849u 34.733s 2:24.28 44.7%   0+0k 0+0io 0pf+0w
    


    So what does all this tell us? That there is a fundamental bottleneck in the shared memory mechanishm: It might be the implementation on this platform of CandVars, it might be a fundamental flaw on my part, it might something else. Our intuition tells us that the shared memory should only introduce an extra 1x (from 3 seconds to maybe 6 seconds) for all the extra memory copies, etc. But it seems to be 40x slower!!! What gives? Clint and I discussed this, but we need to get some more Alamo tools going. I ABSOLUTELY have to fix that before we can pronounce any success. But, the good news is that the C hook is in there and that gave us an extra 2x right off the bat.



    Moving on... object.h has this to say about Object and TypeObject thingees.
    /* Object and type object interface */
    
    /*
    Objects are structures allocated on the heap.  Special rules apply to
    the use of objects to ensure they are properly garbage-collected.
    Objects are never allocated statically or on the stack; they must be
    accessed through special macros and functions only.  (Type objects are
    exceptions to the first rule; the standard types are represented by
    statically initialized type objects, although work on type/class unification
    for Python 2.2 made it possible to have heap-allocated type objects too).
    
    An object has a 'reference count' that is increased or decreased when a
    pointer to the object is copied or deleted; when the reference count
    reaches zero there are no references to the object left and it can be
    removed from the heap.
    
    An object has a 'type' that determines what it represents and what kind
    of data it contains.  An object's type is fixed when it is created.
    Types themselves are represented as objects; an object contains a
    pointer to the corresponding type object.  The type itself has a type
    pointer pointing to the object representing the type 'type', which
    contains a pointer to itself!).
    
    Objects do not float around in memory; once allocated an object keeps
    the same size and address.  Objects that must hold variable-size data
    can contain pointers to variable-size parts of the object.  Not all
    objects of the same type have the same size; but the size cannot change
    after allocation.  (These restrictions are made so a reference to an
    object can be simply a pointer -- moving an object would require
    updating all the pointers, and changing an object's size would require
    moving it if there was another object right next to it.)
    
    Objects are always accessed through pointers of the type 'PyObject *'.
    The type 'PyObject' is a structure that only contains the reference count
    and the type pointer.  The actual memory allocated for an object
    contains other data that can only be accessed after casting the pointer
    to a pointer to a longer structure type.  This longer type must start
    with the reference count and type fields; the macro PyObject_HEAD should be
    used for this (to accommodate for future changes).  The implementation
    of a particular object type can cast the object pointer to the proper
    type and back.
    
    A standard interface exists for objects that contain an array of items
    whose size is determined when the object is allocated.
    */
    
    Why is this important? (1) One question Clint and I had: Do objects in Python move around? The answer is NO (according to the comment anyway). Why is this important to us? When reporting an event from Python to Alamo, we normally have to send "some info", esp. to distinguish lists from each other. Well, we can just use the pointer as a handle to the list so we know which list we are operating on without much work. (2) Another reason this is important: More discussion of TypeObjects. Any and all discussion of those things is good.

    After reading through some source code, I have discovered something a bit disconcerting: there seem to be many ways to access a list! I am just concentrating on a single "get item" idea (a single subscript) I was really hoping that everything jumped through the TypeObject, but there are lots of helper routines that seem to be accessed through a multitude of places. What I really need is a "thorough" analysis and document, but here's what I see: (1) The "listobject.h" gives way to do it PyList_GetItem. This is listed in the include file, so it appears that this is public API: i.e., yes, people can use it (and a grep through some VM code shows it being used). (2) Do a grep for "ob_item[" in the Objects/listobject.c: You see (besides PyList_GetItem) many many uses, and that's how you get an object off the list. (3) Just for completeness, there IS a jump through the TypeObject:
    static PySequenceMethods list_as_sequence = {
     (inquiry)list_length,   /* sq_length */
     (binaryfunc)list_concat,  /* sq_concat */
     (intargfunc)list_repeat,  /* sq_repeat */
     (intargfunc)list_item,   /* sq_item */
     (intintargfunc)list_slice,  /* sq_slice */
     (intobjargproc)list_ass_item,  /* sq_ass_item */
     (intintobjargproc)list_ass_slice, /* sq_ass_slice */
     (objobjproc)list_contains,  /* sq_contains */
     (binaryfunc)list_inplace_concat, /* sq_inplace_concat */
     (intargfunc)list_inplace_repeat, /* sq_inplace_repeat */
    };
    
    Notice that list_item and list_ass_item are "helper" routines that DO NOT call PyList_GetItem/PyList_SetItem. The point here, accesses can happen ALL OVER THE place.

    So, here's a brainstorm: what if we rewrote in C++ so that ob_item[] overloads operator[] and intercepts that? That would require the least amount of invasion.

    Here's another thought: Maybe Python SHOULD be unified so that all accesses are through such jump tables. Sigh. We all know THAT's not going to happen. But, it may be another idea for a proposal. Clint's Ph.D. was going into the virtual machine, but he said he had some "help" in the form that there were already instrumentations in place, he just took advantage of them and started visualization of them. Our job may be harder is some cases because Python has a "large developer" community, Icon was basically just a few people at U of A (Gregg Townsend, Ralph Griswald, etc.). However, I still believe that some of the TypeObject indirections may pay off.

    Okay, so after some discussion, we are going to go after different fruit: we could probably visualize lists and sequences, but that feels a bit invasive. There is an alamo tool which tracks memory (a couple) some that track creation/deletion of types (which I think we can still track): This is still in the TypeObject.

    Thursday, September 2nd, 2004

    In the PLEASE lab working today. A few thoughts occurred to me while working in the shower today: The right way to deal "avoiding synchronization" in the pipe calls is to ask for two things: A buffer of contiguous memory, and the number of bytes we can copy into it. Have the Piper "remember" these things so that everytime we can proceed with out synchronization. We only have to synchronize then when we reach (1) full buffer or (2) end of the contiguous buffer (and have to wrap). This will hopefully get us back in the ballpark for the speed we need. There might be some other things to rewrite: We actually search 16 locations for "active" pipes: Maybe one array and one linked list? (Ugh, linked lists mean synchronization). Anyway, I have to play with this next week. We still think this is important for building a fast, responsive profiler: We have to make sure we do as little work as possible copying the events to another process to deal with.

    So, more looking at listobject.h,.c files. There seems to be two ways that lists can be created:
    PyList_New:  A "global helper" routine available in the listobject.h.  Since it is exposed, I guess anyone can create one.  If you start looking through
    listobject.c, you'll notice it getting called from multiple points for the
    list helper routines.  It seems to be an "easy" way for the VM to create
    a list without jumping through the TypeObject .
    Jump though Type Object:  A sequence for creating/destroying a list would be 
     PyList_Type : tp_new (PyType_GenericNew): which only calls type->tp_alloc
                   tp_alloc (PyType_GenericAlloc) : allocates and zeros memory
                   tp_init (list_init) : Dizzying array of calls to list_fill 
                                         and list_ass_slice.  There is a
                                         macro which resizes lists NRESIZE
                                         as well as other random assortment.
                   (List is now created ... some later time is GCed)
                   tp_dealloc(list_dealloc) : PyMemFREE on ob_item 
                                       also call op->ob_type->tp_free on object
                   tp_free (PyObject_GC_Del) : (over in Modules/gcmodule): 
                                       calls PyObject_FREE on PyObject
    
    Okay, this brings up a similar question from the investigation the other day. Clint calls this the difference between "source level" events and VM event: The PyList_Type sequence is called probably when a piece of Python code creates List. The other routines, which create sublists, etc., seem to be more "internal helper" routines for creating lists WHICH ARE STILL LISTS!

    Again, not as easy as I had hoped, but this seems to have more hope than tracing the get_item events for lists. What BOTHERS me is the NRESIZE which seems to happen all over the place (does the resize of a list count as a new list?). Also, the sort routine scares the crap out of me: There is a LOT of code for sorting. It IS a merge sort, so is it creating new lists? I really don't like that code: I understand that it may be fast, but I have seen faster sorts that are much easier to grok. It's close to lunch, so I can ask Clint about this. Here's my recommendation: First, get some instrumentation on one of the TypeList methods: We can do this without changing any Python code, we can wedge a routine in tp_new or something. See if we can get Alamo working with this. Then Maybe CHANGE the Python VM (gasp!) with Macros so that PyList_New also has instrumentation (add a wedge routine?). [Oops, also want to instrument tp_free, so we know when the list disappears].

    But I still have to do analysis for string objects, table objects, etc.

    Huh. I made the changes and we only seem to call the "method" table if we explicitly say "a = list()" or "a = list([1,2,3]). Something like a = [1,2,3] WILL NOT show up as an event even though it's a list creation. I have been searching through the VM code and PyList_New code is EVERYWHERE. Sigh. Maybe I do need to change the virtual machine?

    Before that, I still need to instrument deletions of lists. I have more hope that is through the TypeList object because nothing in the "listobject.h" talks about how to destroy lists.

    Ok, sucessfully intercepted "tp_dealloc" events. Also successfully intercepted PyList_New events after recompiling Python ... just added a single "if" check inside PyList_New: This meant changing listobject.h,.c and recompiling. Seems to work.

    Now that I am intercepting these events, we have to integrate these events into the shared memory events and ship them to UNICON like we did with Algae. We have to go to Clint's house todo that...

    Friday, September 3rd, 2004

    Got food poisoning last night. Still trying to recover.

    Got the code moved over and found a few bugs that caused seg faults: But we are all fixed now. Got the list events printing. Next, easy milestone, need to start sending the events to the SharedMemory sections.

    Sunday, September 5th, 2004



    So, wrote a little reader and am now sending the events to shared memory. There they go! Structure: a list event is 12 bytes: 2 bytes if length (which is the length of the REST OF THE MESSAGE), 1 byte of event type (l for List construction and L for list destruction) then 4 bytes of "handle" (the front of the list which AS GUARANTEED by the Python implementation, will not move) and 4 bytes of length and 1 final byte of type (which will fill in later).

    Clint is currently putting the finishing touches on an Alamo visualization for these events.

    So, what's next? So things loom: Get a proposal written for the Python.org, Get the shared memory up to speed, move Alamo and Python into the same address space so they can co-exist better,

    Clint and I added more information: The "typestring" associated with the object in the first element of the list. However, for PyList_NEW: It's always empty! It's not like Icon, where you give the initial value of all elements in the list. We were hoping to visualize the list better. Anyway, the information we are sending is the variable length (no longer 12 bytes) message which includes the typestring (which is always EMPTY right now for creation, but of course for destruction, we know what it is).

    Tuesday, September 7th, 2004

    Talked with Derek, gave him an update of status. I guess there's still some money in the IRAD, and Derek wants to be more involved with writing code (cool beans!). We talked about him mabe helping write some native Python visualizers? We'll chat.

    Today: (1) Need to talk to Ervin about grants. (2) Get the new Demo working to show Stephan and Derek (3) Clean up makefile process so more relative paths (4) Get the shared memory MUCH FASTER (I think I know how to do this) (5) More analysis to see what other events we can go after.

    Downloaded ULMA2, building and preparing to show to DTJ and SAF.

    Got unicon built, got the algae demo working. Got the cprof working, so I am using c hooks. Currently doing some timing testing:

    2_CPU machine, (aidl380-01), one process generating fibonacci events to another process which just throws away the event (NULL reader). Using python hooks (not c profile hook) 24.9% CPU for python, 10.8% for lreader ... and that's it.
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    73.220u 12.410s 1:25.77 99.8%   0+0k 0+0io 487pf+0w
    


    2_CPU machine, one process generating fibonacci events to another process using C Profile hooks. Same scenario, but using C profile hook. 24.9% python, 24.9% lreader.
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    9.290u 15.430s 0:24.88 99.3%    0+0k 0+0io 492pf+0w
    
    So, the C profile hook seems 3 times faster (wall clock time), but the amount of user CPU is almost 10x difference!


    Let's come from the other direction again: Time JUST the fib sequence, no profiling at all.
     
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    1.460u 0.010s 0:01.49 98.6%     0+0k 0+0io 454pf+0w
    
    Turn on C profiling, but ONLY the subroutines, don't actually do anything
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    1.610u 0.000s 0:01.62 99.3%     0+0k 0+0io 489pf+0w
    
    Adds very little to the time, even WITH all the events (fib(30) is a LOT of events).

    Now just build the event, but dont' send it over shared memory.
    aidl380-01 % !ti
    time python testshm.py
    done 10
    done 20
    done 30
    1.920u 0.000s 0:02.19 87.6%     0+0k 0+0io 489pf+0w
    
    A little more time, but not too bad.

    Now, have a NULL reader on the other end, just reading and throwing away the bytes. Okay: Here's part of the problem, both should be taking up 10o% of a CPU and they are NOT! Both only take up 24% each.
    
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    9.180u 18.790s 0:28.31 98.7%    0+0k 0+0io 492pf+0w
    
    So, 10x slower ... which is better than the 40x slower we saw, but we should be at worst 2x slower. Sigh. Yes, synchronization is KILLING US!!

    After more thought, the shm as it stands was written for LARGE SIZE transfers, not teeny transfers. That may be part of the problem. It needs to have some special cases for small size transfers?

    ARGH!!!! Sometimes I hate "TOP"!!!! The old versions used to report processes when they were >100%, but the new version by default WON"T!!!!!! ARGH!!!! So, some new % to report (WHICH MAY INVALIDATE THE % in the PREVIOUS TESTS BACK AT RRC!!!!!!! Note that the timings are still good). Running on a machine where python reports to an empty null reader IS 25% each, but on a 4_CPU machine, that IS a full CPU each... an a LOT of time is being used by the SYSTEM (context switching), almost 100%!!!! Here's the actual snapshot:
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total   19.5%    0.0%   30.6%   0.0%     0.0%    0.1%   49.7%
               cpu00   10.0%    0.0%   16.6%   0.0%     0.0%    0.0%   73.4%
               cpu01   11.6%    0.0%   21.2%   0.0%     0.0%    0.4%   66.8%
               cpu02   30.4%    0.0%   50.2%   0.0%     0.0%    0.0%   19.4%
               cpu03   26.2%    0.0%   34.4%   0.0%     0.0%    0.0%   39.4%
    Mem:  3094804k av,  977048k used, 2117756k free,       0k shrd,   43340k buff
           499652k active,             260908k inactive
    Swap: 3145656k av,       0k used, 3145656k free                  611604k cached
    
      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
    20980 rts       25   0   844  844   728 R    24.8  0.0   0:25   3 lreader
    20981 rts       25   0  2772 2772  1672 R    24.7  0.0   0:26   2 python
     2330 wal       15   0 45452  44M  2508 S     0.1  1.4  52:36   1 Xvnc
     2421 wal       15   0 13916  13M 10932 S     0.1  0.4   0:12   3 kdeinit
        1 root      16   0   512  512   452 S     0.0  0.0   0:05   2 init
        2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
        3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
        4 root      RT   0     0    0     0 SW    0.0  0.0   0:00   2 migration/2
        5 root      RT   0     0    0     0 SW    0.0  0.0   0:00   3 migration/3
        6 root      15   0     0    0     0 SW    0.0  0.0   0:00   2 keventd
        7 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
    
    
    So, this IS actually (partly) what we want: Two CPUS being totally maxxed which the twp processes talk. Of course the problem is that an entire CPU of time is being used for context switching back and forth. Ugh. It would be great if we could do coroutines like we did in ICON (which Clint found to be 10x faster than his POSIX thread version of co-routines). Part of the issue (discussed earlier) is that it wakes up, checks then imeediately goes BACK TO THE OTHER PROCESS!

    Anyway, that's why this is research, eh?

    Talked to Clint on the phone: We had to "unicon alamo.icn" to make sure it worked: Apparently, this wasn't rebuilt as part of the makefile? Anyway, go the radial thing working. Great! Another demo!

    Also got some news: Micheal found some more instrumentation points. ! GREAT!

    Wednesday, September 8th, 2004

    Today: Going to try to get the speed up up the cprofile.

    Demoed to Bob Arendt: he thought the visualizations are pretty but need more info. In particular, what he wants is "something like top" where the runtimes are displayed while the program is running. This is the SECOND time I have heard this in two days! Derek said the exact same thing the other day. Perhaps THAT is exactly what we need to be writing the Python people about? (As I say this, I am realizing I need to spend some more time looking at the hotshot c cprofiler ...)

    Thursday, September 9th, 2004

    Thought more and more about how to fix up the shared pipe and came up with a double buffering idea: Don't synchronize except on full buffers (i.e., end a whole buffer over). Ideally, we'll be able to tune "automatically" so that the double buffer can be triple/n buffered for the reader or writer, depending on who is the faster/slower one. Have something going right now, but of course it is seg faulting.

    Friday, September 10th, 2004

    This morning, got the "reader/writer" demo working: Currently we have the same interface as the Piper so I should be able to just plug it in "as is" to the profile stuff I did before. I can actually probably make it faster by writing directly to the buffer (rather than building the event THEN copying it to the buffer), but let's get it working first then tune it some more.

    Allight, we got it working with CProfile hooks and a simple double buffering scheme that buffers the events and does minimal synchronization. Top shows: 24.9% time in python (1 full CPU), 10.8% time in lreader (the NULL reader) and maybe 2% time in system time! We are much much much faster than before!
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    28.440u 0.360s 0:29.03 99.2%    0+0k 0+0io 481pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    
    Compare this to NO PROFILING HOOKS AT ALL!
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    17.640u 0.030s 0:17.94 98.4%    0+0k 0+0io 441pf+0w
    
    So we're about twice as slow with hook. (odd, it's about a full CPU plus another 25% of a CPU? That might just be top).

    Let's time the cost (again) of the event instrumentation without ANY WORK being done (just immediate return).
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    18.670u 0.010s 0:18.87 98.9%    0+0k 0+0io 477pf+0w
    


    Now let's time it with building the event, but NOT putting it to shared memory.
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    23.640u 0.030s 0:23.90 99.0%    0+0k 0+0io 477pf+0w
    


    revisit the double buffering solution.
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    28.110u 0.540s 0:28.71 99.7%    0+0k 0+0io 481pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    


    Gave demos of nova to Derek and and Stephan. Afterwards, Derek and I did a lot of work and had some new ideas.
    (1) Add a "tics" number as a timestanp to the event:  Rather than have the profiled
        profile set off timers/turn them off, send over a number (quick to read of current time) and have
         the monitor use ITS CPU to compute times with it
    (2) Have a single step mode
    (3) Dial for frequency of reporter and monitor
    (4) Instrument dictionaries
    (5) "Make TOP" for Python Routine usage
    (6) Go through the interpreter for cleaning up?
    
    So, Derek and I demoed wireframe using the "nova" demo as well as the "buildtree" demo. Very cool! We had to build tk and tics and such (it was good to see how to do that outside of XMPY) and then just used the tools and they worked! Derek was disappointed that the hot connect/disconnect doesn't work yet. That's still to do: The original idea was "let's get it going" first, then add features. Right now, we have a couple of things we can demo. But of course, we want better demos and more robust demos.

    Monday, September 13th, 2004

    For part of today, I want to build the event directly to the shared buffer. This way, we "should" be able to get event recording cost down to the level BEFORE we were event putting it to shared memory.

    Here's with building the event directly in memory (without an intervening buffer and extra memcpy). Difference: -O5
    
    
    aidl380-01 % !tim time python testshm.py done 10 done 20 done 30 done 35 25.960u 0.450s 0:26.58 99.3% 0+0k 0+0io 474pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 26.280u 0.410s 0:26.90 99.2% 0+0k 0+0io 474pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % make -f Makefile.linux clean /bin/rm -f *.o *~ libcprof.so *.so ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % make -f Makefile.linux g++ -g -O3 -DLINUX_ -DOC_BYTES_IN_POINTER=4 -DOC_NEW_STYLE_INCLUDES -Wno-deprecated -I../opencontainers_1_3_4/include -I../Python-2.3.4/Include -I../Python-2.3.4 -I../sharedmem -c cprof.cc g++ -shared cprof.o -L../sharedmem -lrt -lshm -o cprof.so ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 27.430u 0.440s 0:28.01 99.5% 0+0k 0+0io 474pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 26.290u 0.330s 0:26.63 99.9% 0+0k 0+0io 474pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 %
    The recompile in the middle moves us back to -O3 so we can compare like results. Hm. I am surprised: We aren't getting as much speedup as I hoped from building the event directly to the buffer.

    Here's a summary:
    29.010u 0.520s 0:29.71 99.3% : CReport Func, Build Event lookaside, DB put
    26.290u 0.330s 0:26.63 99.9% : CReport Func, Build Event Inplace, DB put
    23.640u 0.030s 0:23.90 99.0% : CReport Func, Build Event lookaside, NO PUT
    18.670u 0.010s 0:18.87 98.9% : CReport Func, nothing else
    17.640u 0.030s 0:17.94 98.4% : NO INSTRUMENTATION
    
    Let's try one more test, let's "pretend" we can just pass the pointer to the stack frame of interest so that building the event is "trival" (8 bytes of something). Of course, this would require mapping the Python heap into our process so we could see it ...
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    24.080u 0.150s 0:24.42 99.2%    0+0k 0+0io 474pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    23.630u 0.070s 0:23.88 99.2%    0+0k 0+0io 474pf+0w
    
    Suprisingly, not much better. We could be getting down to where we have to start tweaking the call site. I suppose the real question is, can we get the same time as "CReport Func, nothing else?" Right now, we are still off that mark. So, we move ALL code to the function that's gets called, pass 8 bytes (2 bytes length, 2 bytes of "code" of event and 4 bytes of "frame": The pointer to the frame).
     
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    21.560u 0.190s 0:21.99 98.9%    0+0k 0+0io 474pf+0w
    
    That seems to be the best we can do. What if we make this module -O 5?
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    21.460u 0.150s 0:21.62 99.9%    0+0k 0+0io 474pf+0w
    
    Little bit of a difference? Seems to be in the noise.

    Let's build the event a little bit smarter, at absolute addresses so they can be done in parallel.
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    21.860u 0.090s 0:21.95 100.0%   0+0k 0+0io 474pf+0w
    
    No diff. We are down in the noise. We can't get much faster and still do work.

    So here's the problem with "just sending the frame pointer over": We need to up its ref count so that that frame doesn't "disappear" over in the other process too early. So, that's cheap: an increment. But the question is, when do we DECREMENT the refcount? The other process could do it, but only if it held the interpreter's lock (which stops the monitored program from making progress). Alternatively, the monitor could "feed back" a bunch of references it has to DECREF count on, but that would slow the program by 2x (as it would have to read another data stream). Actually creating the event wasn't THAT expensive! Here's a more radical option: Have a "list" of frames that's collected "every so often" (second? When buffer fills? Both?) and assuming the monitor "retrieves" the frame before it dies ... definitely timing dependant, which sucks and is unsafe programming: BUT: A "top-like" program wouldn't care too much: It has a "every second" update where the event has to be processed anyway? Hmm ... still don't like this. Anything that "defers" the deletion seems to suffer from "making the working set of the process bigger" which screws up cache behaviour, yada yada. Hm. I rememeber something about "weak references" in the Python documentation. I wonder if those could provide some answer?

    Tuesday, September 14th, 2004

    To answer the question about "weak references": No, that's not what we want. The idea is kind of relevant (keep a reference to something, but don't make it count towards prohibiting the thing from being garbage collected), but we were more concerned with "multi-process" reference counting and we DO want to keep the frame around until we are done inspecting it.

    This morning I have been investigating Derek's quite brilliant idea: call getttimeofday() and just submit the "number of ticks". Ideally, this would be "one instruction" (a LDA from a location in memory with the number of clock ticks), but we can't gurantee how a clock is implemented on a system. After doing some preliminary investigation, I am going to try "getttimeofday": It seems mighty light-weight, but we won't know until we try it.

    Decided to look a little closer at the "_hotshot" high performance profile module. It appears to write to files as well. (I need to look at it closer). What IS interesting is that it use the GETTIMEOFDAY (macro, its gettimeofday on UNIX, some Windows thing on that platform). That makes me believe that getttimeofday is the fastest way to get profile information. I still want one instruction though.

    Ouch! I took the last version of cprof (where I am just passing pointers at the call site: This is the fastest we can possibly be for profiling) and just added a call to "getttimeofday" to see what the cost is (this is on a linux platform).
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    35.720u 22.040s 0:57.89 99.7%   0+0k 0+0io 475pf+0w
    
    We are 3x off of our performance JUST by the call to gettimeofday. Why? See below ...
    3:54:38  up 7 days,  5:20,  9 users,  load average: 0.25, 0.18, 0.07
    94 processes: 92 sleeping, 2 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total   19.0%    0.0%    9.9%   0.0%     0.0%    0.1%   70.9%
               cpu00   60.6%    0.0%   39.4%   0.0%     0.0%    0.0%    0.0%
               cpu01    6.0%    0.0%    0.0%   0.0%     0.0%    0.2%   93.8%
               cpu02    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
               cpu03    9.4%    0.0%    0.2%   0.0%     0.0%    0.4%   90.0%
    Mem:  3094804k av, 2040724k used, 1054080k free,       0k shrd,  165572k buff
           695088k active,             703740k inactive
    Swap: 3145656k av,       0k used, 3145656k free                 1401556k cached
    
      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
    17010 rts       25   0  2552 2552  1612 R    24.9  0.0   0:09   0 python
    17009 rts       20   0   888  888   764 S     3.8  0.0   0:01   3 lreader
        1 root      15   0   512  512   452 S     0.0  0.0   0:08   1 init
        2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
        3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
        4 root      RT   0     0    0     0 SW    0.0  0.0   0:00   2 migration/2
        5 root      RT   0     0    0     0 SW    0.0  0.0   0:00   3 migration/3
        6 root      15   0     0    0     0 SW    0.0  0.0   0:00   3 keventd
        7 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
        8 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1
        9 root      34  19     0    0     0 SWN   0.0  0.0   0:00   2 ksoftirqd/2
    
    
    "Top" tells the story: We are spending 39.4% time (+ another 9.9%?) inside of the system. No no no! That's WAY too expensive. We are context switching into the kernel to get the times.

    Let's try replacing the call to "gettimeofday" with "clock()". Difference?
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    35.530u 14.660s 0:50.27 99.8%   0+0k 0+0io 476pf+0w
    
    "Top" reports (while it is running)
    4:18:18  up 7 days,  5:44,  9 users,  load average: 0.31, 0.08, 0.06
    92 processes: 90 sleeping, 2 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total   22.7%    0.0%    7.5%   0.0%     0.0%    0.0%   69.6%
               cpu00    0.2%    0.0%    0.0%   0.0%     0.0%    0.0%   99.8%
               cpu01    5.2%    0.0%    0.0%   0.0%     0.0%    0.2%   94.6%
               cpu02   69.8%    0.0%   30.2%   0.0%     0.0%    0.0%    0.0%
               cpu03   15.8%    0.0%    0.0%   0.0%     0.0%    0.0%   84.2%
    Mem:  3094804k av, 2043528k used, 1051276k free,       0k shrd,  165572k buff
           694972k active,             703748k inactive
    Swap: 3145656k av,       0k used, 3145656k free                 1401600k cached
    
      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
    17222 rts       25   0  2560 2560  1616 R    24.9  0.0   0:09   2 python
    17221 rts       21   0   888  888   764 S     5.2  0.0   0:02   1 lreader
        1 root      15   0   512  512   452 S     0.0  0.0   0:08   1 init
        2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
        3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
        4 root      RT   0     0    0     0 SW    0.0  0.0   0:00   2 migration/2
        5 root      RT   0     0    0     0 SW    0.0  0.0   0:00   3 migration/3
        6 root      15   0     0    0     0 SW    0.0  0.0   0:00   3 keventd
        7 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
        8 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1
        9 root      34  19     0    0     0 SWN   0.0  0.0   0:00   2 ksoftirqd/2
    
    
    So "clock()" has a little bit better performance, but is still costing us: It appears (again) that we are going through the kernel to get our values. Can't we just read a memory location and get the value?

    Solaris had "ddi_get_lbolt": the number of ticks since the machine rebooted. Looking for something like this on Linux: Seems very system dependent.

    Linux has clock_gettime: At first flush it looks expensive, but if you run with CLOCK_MONOTONIC: It seems pretty fast!
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    23.720u 0.140s 0:24.01 99.3%    0+0k 0+0io 475pf+0w
    
    (Notice that we don't spend much time in the kernel, so that our system time is very small: This means we are not going back and forth between the kernel!) However, I need to "look" at the numbers returned by clock_gettime and make sure they look good.

    The numbers returned by clock_gettime for CLOCK_MONOTONIC don't look for good (the numbers don't change much): It just may this version of Linux doesn't support it. I am currently trying CLOCK_REALTIME which seems to be a more standard. The numbers look good: tv_sec and tv_usec go up well. Here's the timings:
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    34.340u 21.260s 0:55.77 99.6%   0+0k 0+0io 476pf+0w
    
    Sigh. Same problem with gettimeofday: Too much time in the system.

    In looking at CLOCK_MONOTONIC I found out why it's numbers are so good: The result returned is -1, which basically means it's not implemented on this kernel. ;(

    Peter pointed me to /usr/include/asm/timex.h: There's a function which may or may not be implemented depending on the kernel installed: It is a single assembly language instruction for reading the number of cycles. Very machine dependent, but will be very fast and exactly what we want for a "time stamp" if I can get it to work. Currently, it's not configured in the kernel, but I believe it can be (it only works for x586 machines and up, and we are x686). Waiting for Bob Arendt to see if we can do this.

    Talked to Ervin and Derek for a while: Ervin gave us some direction to think about for future/follow on work (Python Parallel, visualization tools for Large Beowulf clusters, dealing with amounts of data (and visualizing it),) that would be useful to the community. The feeling is that the Python grant is too small for us, but Clint could put our name on it.

    Called Clint and caught up. I told him above and we need to talk some more tomorrow after things sink in.

    Wednesday, September 15th, 2004

    Working on the "fast time stamp" problem. The problem with asm/timex.h is that it depends on the kernel you gave set-up (the real explanation given somewhere on the web is that on NUMA boxes, this number would be inconsistent as a scheduled job bounced from CPU to CPU. This isn't a problem CURRENTLY on DL380 boxes because they are UMA). So, this means that ULMA works on UMA but not NUMA. ;)

    Anyway, the single instruction (rdtsc) will work for us, even if it's configured in the kernel. So, I am using inline assembly to get the instruction there: Here's an example from the web and the inline assembly with gcc seems to work:
    /*
     * This little x86 app prints out the # of tight loops performed per sec,
     * averaged over 10 seconds.
     */
    #include 
    #include 
    #include 
    
    #define rdtscll(val) \
         __asm__ __volatile__ ("rdtsc;" : "=A" (val))
    
    #define SECS 10ULL
    
    int main (int argc, char **argv)
    {
    	unsigned long long start, now, mhz, limit;
    	unsigned int count;
    
    	if (argc != 2) {
    		printf("usage: loop_print \n");
    		exit(-1);
    	}
    	mhz = atol(argv[1]);
    	limit = mhz * SECS;
    
    repeat:
    	rdtscll(start);
    	count = 0;
    	/*
    	 * Just burn cycles in a tight, cached loop and measure absolute
    	 * cycles elapsed, vs. loops performed:
    	 */
    	for (;;) {
    		count++;
    		rdtscll(now);
    		if (now - start > limit)
    			break;
    	}
    	printf("userspace speed: %d loops.\n", count/SECS); fflush(stdout);
    	goto repeat;
    }
    
    
    Of course, the most important piece of code is the rdtscll: it reads a monotonic clock to (a single asm instruction! yes!) and it seems to work for us. The relevant piece of code from this example is:
    #define rdtscll(val) \
         __asm__ __volatile__ ("rdtsc;" : "=A" (val))
    
    Excellent! A quick trial shows this working. So, now the scary part: How well does this work with our cprof module? How much will it cost us? In this example, all we do is add the instruction and call it, we don't yet send the info over the Piper.
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    22.880u 0.170s 0:23.28 99.0%    0+0k 0+0io 474pf+0w
    
    Notice: no system time (relatively speaking) being used. Just to be sure, here's a capture of top:
     10:00:02  up 8 days,  1:26,  7 users,  load average: 0.63, 0.20, 0.07
    105 processes: 102 sleeping, 3 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total   37.5%    0.0%    2.1%   0.0%     0.0%    0.3%   60.0%
               cpu00   37.6%    0.0%    1.0%   0.0%     0.0%    0.0%   61.4%
               cpu01   44.0%    0.0%    0.6%   0.0%     0.0%    0.8%   54.6%
               cpu02   36.4%    0.0%    6.4%   0.0%     0.0%    0.0%   57.2%
               cpu03   32.2%    0.0%    0.4%   0.0%     0.0%    0.4%   67.0%
    Mem:  3094804k av, 2087324k used, 1007480k free,       0k shrd,  166144k buff
           739664k active,             702448k inactive
    Swap: 3145656k av,       0k used, 3145656k free                 1404148k cached
    
      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
    20729 rts       25   0  2548 2548  1608 R    24.7  0.0   0:08   2 python
    20728 rts       24   0   888  888   764 R     9.7  0.0   0:03   3 lreader
    19666 wal       16   0  5548 5548  4220 S     1.9  0.1   0:11   3 artsd
    19604 wal       15   0 32696  31M  2452 S     0.9  1.0   6:10   1 Xvnc
    14496 rts       15   0  8708 8708  4128 S     0.2  0.2   1:08   0 emacs
    19678 wal       15   0 13436  13M 11396 S     0.1  0.4   0:02   0 kdeinit
    19680 wal       15   0 10644  10M  9204 S     0.1  0.3   0:05   0 kdeinit
        1 root      15   0   512  512   452 S     0.0  0.0   0:09   3 init
        2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
        3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
        4 root      RT   0     0    0     0 SW    0.0  0.0   0:00   2 migration/2
    
    Notice that the system time is "negligible", and the CPU times add up to about 1.3 CPUs, which is 1 cpu of python sending events and .3 cpu of lreader just grabbing them off the queue. Kudos!

    Just for kicks, let only report "every other event" as if we were "just sampling" the code every so often. This adds a "if" check at the top that avoids sending an event every other time. (This still has the timer call in it)
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    20.970u 0.090s 0:21.12 99.7%    0+0k 0+0io 474pf+0w
    
    Every 10 times? (I.e., we only report an event every 10 times)
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    21.170u 0.020s 0:21.36 99.2%    0+0k 0+0io 474pf+0w
    
    Every 100 times?
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    18.990u 0.020s 0:19.18 99.1%    0+0k 0+0io 474pf+0w
    
    So, we can "dial down the events", but it doesn't hurt us that much to see every event.

    You know, fib is a good test to see how fast we can deliver return call events, but we need a real program soon. I think we want to take wireframe.py and "topify" it.

    Looking at the python profiler for idea of how to put this together ...

    I am a bit "unhappy" with what I get when I put the timestamp in with the "copy the relevant frame info" into the event (as opposed to just send the frame point):
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    28.730u 0.440s 0:29.29 99.5%    0+0k 0+0io 474pf+0w
    
    Compare that to just sending the pointer (with the timestanp) directly from the call site:
    aidl380-01 % !tim
    time python testshm.py
    done 10
    done 20
    done 30
    done 35
    23.450u 0.210s 0:23.85 99.2%    0+0k 0+0io 474pf+0w
    
    23.85 seconds vs. 29.29 seconds: Almost 5 seconds! And that's for very short function names. I am thinking that once we get into "real function names" like in "wireframe.py", we'll see much more significant names (80 characters with full paths, etc.) which will really slow us down further. Originally, I didn't think it would be that big of a deal, but 5 seconds is a long time in a "ultra lightweight monitoring architecture".

    Just for grins, how big are names when running wireframe? Here's a sample output of some frames.
     39 39 41 41 77 81 76 73 73 76 76 76 84 74 74 78 78 72 72 72 72 72 72 72 72 29 29 74 74 74 74 74 74 74 74 84 84 82 81 81 82 84 81 70 74 74 70 77 40 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 79 79 81 82 82 81 81 81 84 82 83 82 82 81 82 82 81 83 82 84 84 82 83 82 82 81 82 82 82 81 81 82 81 83 82 84 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 76 76 81 81 81 81 82 82 79 79 81 82 82 82 81 81 82 81 80 80 81 81 87 81 82 82 81 87 79 79 81 81 82 82 79 79 81 82 82 82 81 81 82 81 80 80 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 82 81 81 82 81 80 80 81 81 87 81 82 82 81 87 81 81 76 76 81 81 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 80 80 81 81 79 79 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 80 80 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 80 80 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 76 76 81 81 81 81 82 82 79 79 81 82 82 81 80 80 81 81 77 78 82 81 81 82 78 77 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 76 76 81 81 81 81 82 82 79 79 81 82 82 81 80 80 81 81 77 78 82 81 81 82 78 77 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 76 76 81 81 81 81 82 82 79 79 81 82 82 82 81 81 82 81 80 80 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 82 81 81 82 81 80 80 81 81 87 81 82 82 81 87 79 79 81 81 82 82 79 79 81 82 82 82 81 81 82 81 80 80 81 81 87 81 82 82 81 87 40 40 40 81 81 43 84 81 81 79 79 81 82 82 81 81 78 78 84 84 84 84 81 81 81 81 84 82 81 81 82 84 81 81 81 84 84 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 77 78 82 81 81 82 78 77 81 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 77 78 82 81 81 82 78 77 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 80 80 81 81 87 81 82 82 81 87 81 81 81 81 77 78 82 81 81 82 78 77 77 78 82 81 81 82 78 77 82 83 82 82 81 82 82 81 80 80 83 82 81 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 80 80 81 81 87 81 82 82 81 87 82 83 82 82 81 82 82 81 80 80 83 82 81 81 81 81 77 78 82 81 81 82 78 77 77 78 82 81 81 82 78 77 81 81 82 82 79 79 81 82 82 82 81 81 82 81 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 82 81 81 82 81 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 82 81 81 82 81 81 81 87 81 82 82 81 87 84 82 81 81 82 84 77 78 82 81 81 82 78 77 77 78 82 81 81 82 78 77 84 78 78 78 73 70 69 69 70 73 86 73 73 72 72 76 76 79 79 72 72 79 79 86 89 79 79 72 72 79 79 89 90 86 76 76 78 78 73 73 73 73 86 86 73 73 79 79 72 72 79 79 86 72 72 73 70 69 69 70 73 73 76 76 76 76 77 75 68 67 69 69 67 68 68 67 69 69 67 68 68 67 69 69 67 68 68 67 69 69 67 68 75 69 70 78 78 77 78 78 73 76 74 74 76 76 76 78 74 76 76 71 74 74 71 74 74 71 74 74 71 75 75 75 75 80 76 76 80 75 75 75 75 79 79 79 79 73 73 79 79 71 74 74 71 75 75 74 74 71 74 74 71 74 74 71 74 74 71 74 74 71 74 74 71 74 74 71 74 74 71 74 74 71 74 74 71 74 73 73 78 71 74 74 71 73 75 83 76 76 76 76 83 78 77 76 76 79 79 77 78 79 79 79 79 78 78 75 77 70 69 77 73 72 72 73 70 69 69 70 73 73 76 76 76 76 77 77 73 72 72 73 70 69 69 70 73 73 76 76 76 76 77 77 73 72 72 73 70 69 69 70 73 73 76 76 76 76 77 77 73 72 72 73 70 69 69 70 73 73 76 76 76 76 77 77 73 72 72 73 70 69 69 70 73 73 76 76 76 76 77 77 73 72 72 73 70 69 69 70 73 73 76 76 76 76 77 77 73 79 79 79 79 79 79 79 79 89 76 76 78 78 89 73 73 89 79 79 72 72 79 79 89 90 82 82 81 81 81 81 84 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 78 78 85 85 85 85 85 84 94 76 76 89 79 79 72 72 79 79 89 94 81 81 86 89 76 76 78 78 89 80 77 77 80 86 81 81 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 78 29 29 41 91 91 41 36 79 79 77 77 69 69 77 77 69 69 43 43 87 80 81 82 82 81 80 
    
    So, lots of bigger numbers, like 70-80 characters. For "testshm.py", we see numbers like 33 or so. Let's, just for grins, make every event 80 characters and see what the slowdown is (when we build the event manually by copying into the the buffer instead of just sending a pointer).
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    30.440u 1.140s 0:31.60 99.9%    0+0k 0+0io 474pf+0w
    
    So, we lose about 2-2.3 seconds (from 29.29 to 31.6) for copying larger filenames over. Curses! I want the 23 seconds number. The only way to do that is to map the python heap into our space.

    THoughts: Maybe this method is fast enough, we lose some time by building bigger events, but not a lot: We do gain some time (25-50%?) by sending very small events with just the pointer and timestamp. But there's a hidden cost there: If send the event with pointers we have to (a) inc the ref count on the frame so it doesn't go away which could cause (b) memory "bloat" [frames that are created quickly and go away might stay around longer causing memory to get bigger than it would be in a normal program and therefore affect the performance by causing cache misses, less memory reuse, etc.] and finally (c) when and who does a decref on the python frame when we are done with it?

    First thought, have a stream coming from the monitor with DECREF events. Blech, too slow, (probably at least 2x hit to read from other direction) and to be fast it would to buffer events which would cause more memory bloat as frames would have to sit around in memory even longer. Second thought, bad coding practice, but have a list of items to be decred after a "second" (a timer which would "collect" all the frames). This is a dangerous method (becaue we couldn't "gurantee" we've processed the frames after a second and would have to worry about dangling pointers?) and we would like to decref the frames ASAP to avoid memory bloat. Third thought: Have each frame in an object container (i.e., add another reference count): One reference count is the "normal" ref count, the other ref count will be SET the the monitoring process. When a frames normal reference count goes to zero, it does a "check" (simple read) of the other ref count. If that ref count is zero as well, it get collected normally. If it's not, the monitor process is still hanging onto it so the object hangs out. When the monitor is done with the frame, it sets the ref count to zero (bTW, notice, no sync required yet) and puts it on some list [yes, that will require some sync between processes]. At some point, the monitored process will process the list (probably on the next deallocation of memory) and destroy the object then. There is a fine line here: We want to avoid synchronization [which we will need when we put things on/off a list], but we also want to service the list as often as possible to avoid heap bloat. This solution may work, but I still don't like it. Its on the way to being a solution ... Forth thought: Ahh! We don't actually need synchronization every time we check the list, only when there is actually something on it: We can get by with just an if check (is the list empty? move on) without any sync. Fifth thought: It seems, though, that these events will quickly come back, so it seems like there will always be something on that list. How about a list where insertion and deletion can be relatively sync free (as long as the front and back are "far enough" apart?). Actually not sure how to do this. Sixth thought: There's a eureka type thought here: The monitor can spend time syncing, we just want the monitored program to sync as little as possible. So, maybe with a "atomic swap" instruction, we can get things off the list with no syncronization and make the monitor pay extra to try put things on the list without the monitored program having to pay.

    Can you just every second collect things on the list? (No no, we just said that will cause memory bloat).

    Have a "volatile" index into a list, we may occasionally think the list is fuller than it really is. This might work.

    Thursday, September 16th, 2004

    Looking under the hood of the Python interpreter: most allocations are wrapped up in "PyMem_MALLOC" or "PyMem_NEW" and everything seems to default to "malloc" underneath. Frame objects are a little interesting: Since they are allocated and deallocated so quickly, there is a "pool" of frame objects that are always reused (standard practice), but I am a bit embarrassed to say that I saw some raw "mallocs" in the Python baseline. Really, it seems like most allocations show call one of the std macros/functions, and many many do. But there were just enough "malloc"s to give me pause.

    My hope was that it would be "easy" to intercept mallocs. It may still be, but here's the issue: We have to map a shared memory region in the monitored process to a region in the monitor, but the only way I know to do this is using shm_open and you have to be opened BEFORE. I suppose I should do some research on different ways to map memory regions? Ideally, I could just "open" a section of my process and share it. I am sure Linux has away to do this, but is there a "portable" way to do this?

    Greg suggested either the proc filesystem or ptrace. Ptrace was called by the UNIX man pages as "unique and arcane". Which it is. Apparently, Solaris uses the /proc filesystem as well as UNIX, so perhaps /proc is more portable than the man pages claim.

    Nope: Problem with /proc: You have to "read" parts into a temporary buffer. You don't get the speed of just a single read access (a single load) in the monitor: You have to read from the file system the values you want. It would be nice if you could "mmap" the /proc//mem file into memory. But that doesn't work (at least, not on RedHat Enterprise 3.0). This still might intersting to try, because maybe (really depends on the implementation) it might be quick to copy out of the "mem" file into the temp buffer. We know this is slowish, but it might be fast enough so that the monitor can still keep up the monitored program. Of course, it's not portable.

    The other idea Greg had was to hack malloc (because it uses mmap anyway) and then use the shm_open/mmap idea.

    The man pages claim that you can look at memory with open/reads. But the real question: Is it just a "copy" done at the time of the read? Or is it actually the memory page? I am almost sure it's a copy ... and I don't want to do a "read" of memory (through the SYSTEM CALL interface at least: I want a single instruction!).

    Perhaps it's time to revisit threads? At least there, I can get the addresses and not do any translation. Of course, then we are sharing the heap (read/write) and that's the LAST thing I want: The profiled program

    Friday, September 17th


    Another idea comes to me for handling the "freeing of frames": Inside the trace routine, INCREF the ref count on the frame (so it doesn't die), and it put it on a list. Put the event to the Piper. Traverse the list and see if any of them have been "freed" (the monitor just sets a flag saying it's done). Free the ones that have been freed. This solution doesn't require "synchronization" because only the monitor can "set" the special flag that the frame can be DECREFed, and the monitored program is the only thing that puts/deletes items off the list, so there's no need for synchronization. The question is, does the "checking" become a linear time activity? (Searching through all the elements at every event to see if the one field is set). Of course, we could defer the checking until after a full buffer has been sent (there's no reason to check until we know the buffer has been sent), but still the same problem exists.
    Let's assume that the monitor will always make progress and "consume" the frames as fast as it can: After every "full buffer ship", sometime thereafter all the frames will be consumed: We could have a counter of frames processed, and just check the counter. The monitor would be decrementing, and only when it hit zero would we claim all the frames. This will work, without needing syncronization (because the monitored program will just be checking a counter relevent to the current set of events just put), but it will defer deletion to when ALL the frames have been processed. I want something that will process the frames ASAP.

    Okay: Can we assume that the events will be processed in order? I think we can! I think we can assume that the first event processed on the monitored side (and put in the list) will be the first event processed on the monitor side. Yes yes yes: Have a counter and a queue. The monitored puts the events on the queue in order and increments it's counter for each event. After the event ships, only the monitor can decrement the event ... on the monitored side, whenever the counter decrements, we simply "DECREF" the frames on the list IN ORDER (only the monitor takes stuff off the list). Yes yes yes! The only communication is through the one counter, and it doesn't need to be synchronized: the worst thing that can happen is that we "snapshot" the counter and get a previous "big" number, but that means we just don't delete the frames until a little later. Yes! No sync, quick response to deleting farmes, and minor complexity.

    Showed the idea to Scott: He actually pointed out a subtle improvement. Rather storing a "counter" (which can go awry if we aren't careful about synchronization), have a single location that represents "the last frame worked off". So, rather than having to worry about the counter getting too big and such, just "look for that frame" when figuring out which frames to decref: So: while (LAST_FRAME_OF_MONITOR != current_frane) DECREF. Super easy! (Use a circular buffer as the queue). I think this is a paper.

    Monday, September 27th

    Returned to Rincon after a brief stay in Las Cruces for recruiting. We are currently set to give a review of the state of things on Oct. 11th, so I want to have about 5 demos if possible.
    (1) Original algae demo (first thing we got working)
    (2) List monitor demo (nova)
    (3) Better algae demo (with proc call clicks?)
    (4) Some other hooks (Strings, memory?)
    (5) top 
    
    I also ran my idea of the paper my Peter: he thinks it might be good, it just needs some research (ACM digital library, here I come!). I need to talk to Mary about that.

    Spent some time today figuring how to get a top demo going. Also want to spend a day quantifying the ULMA shared memory solution.

    Monday, October 4th, 2004

    Spent a lot of time thinking about data structures last week. (Also doing SSGS stuff taking up time). This is why I propose that all containers should use the same code base: you don't know what the STL containers implement (how do they implement map?). Anyway, I got something compiling and going. I noticed an interesting phenomenon: we are back in the arena where "how fast we read the messages" is important again. Here's why:

    As we read profile events from the shared memory section, we take the event off the pipe and use the timestamp in a few computations. The problem seems to be that we are not consuming the packets fast enough, I.e., we are doing too much work per packet! So, I want to spent a little time making this faster. The first and easiest optimization is noting that returns don't need nearly as much information. But, probably the biggest optimization will be reading the packets directly into a struct (that has to be aligned so we don't run into weird issues).

    Tuesday, October 5th, 2004

    So, at the end of yesterday, I had some data coming to top (I am still worried about performance, but in the interest of having a demo for next Monday I am avoiding this right now), but I was getting weird answers. And here's why: I was converting my int_u8 to real_8s to do some "% math" and found that converting int_u8 to real_8 doesn't seem to be supported very well under Linux? I remembered that converting to an int_8 then a real_8 works, but that didn't seem to work for me. I did find a web page that purports to solve the problem and well: http://www.cbau.freeserve.co.uk/Compiler/LongLongConversion.html This page is worth while enough that I have replicated in its entirety on the Wiki. Here the LongLongConversion page.

    Showed the topdemo to Bob: Suggested that we have a button to 'clear out' the history so we can start profiling anew.

    Okay, some good progress here today. First, there is no problem converting between int_u8s and real_8s: but printf has a hard time with them! I think that is the reason things weren't working well. I use printf to print the formatted top stuff, and now I convert everything to real_8s before I print it.

    Second problem today: I am using an int_u8 "timestamp" that seems to have pretty high resolution, so I was actually overflowing my int_u8s with times. There is, apparently, a long double on Linux which gives me 12 bytes of real, so I do all my accumulations there.

    Third: I was sending 2 bytes too few so my timestamps were corrupt. So, I need to revisit the accumulating parameters (my timestamps were larger than they should have been) as long doubles, but it's good to know that that works.

    Changed the sense of the AVLTree compares so that I can just iterate through the first 25 or so and print those out.

    Was trying to figure out why "mainloop" seems to be the only culprit for printing weird numbers, and why a few of the times I get that my time ends are less than my starts? Turns out that python 2.3.4 CANNOT trace into C routines. This is fixed in python 2.4 apparently, but a couple of places "mainloop" is called from C, but apparently "returned from" in Python? That's the most likely explanation at the moment.

    Anyway, the good news is that I have a "topdemo" working. I will spend the next few days cleaning it, making it interactive and hopefully making it better.

    Thursday, October 7th, 2004

    Had a false start yesterday: Thought I needed a Deque class but really didn't. I wanted to make sure I could build exactly one FrameIdentifier and make sure everyone pointed to it and that it DID NOT MOVE within the collection. Not sure I needed this strict of a guarantee, but I think that most collections (Deque, Map, Set) should guarantee this. Obviously vector should NOT have this requirement (the "resize" thing and guarantee that it's contiguous). Derek disagrees with me, stating that the "value" is important, but the "extra" work to make sure you have an extra container which contains the values which your pointers point to seems a bit much: The "values do not move" is actually not too terribly difficult an invariant to keep, not too much cost on efficiency, but darn it if it doesn't make things a lot simpler. Makes your iterators potentially more safe to use for deletions. But, I digress.

    Super good news from today: Got the profile event down. My "topdemo" was working, but the fibonacci demo was eating it alive: It would take almost 3 minutes to run the fib demo that take 18 seconds on an "uninstrumented machine". Completely restructured the event so that when it goes over into sharedmemory, it is a C-struct with C style strings. So, on the other side, the reader can "use the struct" WITHOUT COPYING IT OUT OF MEMORY. If it turns out it's a new frame, then of course it gets copied out but very quickly as a C-Struct. I have sharing going on all over the place so only a few actual structs get built. This brought the topdemo/testshmpy demo down to 38 seconds, about 2x. Part of the reason for this is that the event struct is a little bigger: always about 90 characters instead of about 40 before (I have 40 chars for two names). The good news is, we keep up: Under top "python takes a full CPU" and topdemo "takes half a CPU". This is the goal. Of course, I could probably cut the "python testshm.py" down by 30% if I could JUST send the frame pointer (cuts memory traffic by, 90%?). The "best" we can do right now is about 21 seconds or about 20% slower.

    Got a few more test programs under the belt: Sieve Of Eratosthenes and a text to pdf converter. Here's some sample runs and run times. (First, without the profiling)
    aidl380-01 % time python pyText2Pdf.py wiki3.txt
    Using font Courier  size = 10
    Input file => wiki3.txt
    Writing pdf file wiki3.txt.pdf ...
    Wrote file wiki3.txt.pdf
    17.370u 0.030s 0:17.63 98.6%    0+0k 0+0io 459pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof/samples )
    aidl380-01 % 
    
    Then, WITH
    aidl380-01 % time python pyText2Pdf.py wiki3.txt
    Using font Courier  size = 10
    Input file => wiki3.txt
    Writing pdf file wiki3.txt.pdf ...
    Wrote file wiki3.txt.pdf
    21.910u 0.150s 0:22.15 99.5%    0+0k 0+0io 496pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof/samples )
    
    So, it does add overhead. (This example was chosen because it has a little more I/O but it is still CPU intensive). For grins and giggles, here's the last topdemo output:
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :writestr        : 3.6311e+10 98.1 : 00.0 : 99.9 :         pyText2Pdf.py:  316 
    :StartPage       : 5.1111e+08 01.4 : 69.7 : 30.2 :         pyText2Pdf.py:  430 
    :EndPage         : 1.6711e+08 00.5 : 70.3 : 29.6 :         pyText2Pdf.py:  481 
    :?               : 3.3821e+06 00.0 : 00.0 : 99.9 : b/python2.3/string.py:   20 
    :WriteHeader     : 9.7501e+05 00.0 : 73.5 : 26.4 :         pyText2Pdf.py:  374 
    :parseArgs       : 2.6778e+05 00.0 : 19.2 : 80.7 :         pyText2Pdf.py:  282 
    :?               : 9.2917e+04 00.0 : 10.2 : 89.7 : b/python2.3/getopt.py:   16 
    :__init__        : 4.4529e+04 00.0 : 00.0 : 99.9 :         pyText2Pdf.py:  152 
    :getopt          : 3.8185e+04 00.0 : 00.0 : 99.9 : b/python2.3/getopt.py:   52 
    :pyText2Pdf      : 2.8017e+04 00.0 : 00.0 : 99.9 :         pyText2Pdf.py:  150 
    :argsCallBack    : 1.3365e+04 00.0 : 00.0 : 99.9 :         pyText2Pdf.py:  197 
    :GetoptError     : 9.4650e+03 00.0 : 00.0 : 99.9 : b/python2.3/getopt.py:   39 
    
    
    
    Here's two runs of the Sieve, the first WITHOUT profiling information, the second WITH.
    aidl380-01 % time python testera.py 
    1299709
    2750159
    4256233
    5800079
    7368787
    8960453
    10570841
    12195257
    13834103
    15485863
    17144489
    18815231
    15.450u 0.010s 0:16.02 96.5%    0+0k 0+0io 436pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof/samples )
    aidl380-01 % time python testera.py
    1299709
    2750159
    4256233
    5800079
    7368787
    8960453
    10570841
    12195257
    13834103
    15485863
    17144489
    18815231
    17.150u 0.110s 0:17.37 99.3%    0+0k 0+0io 478pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof/samples )
    aidl380-01 % 
    
    
    The good news here is that sieve, a very cpu intensive program, doesn't take much of a hit. That's probably because the number of procedure call events is small: I think most of the work is done in the C routines, which the current version of Python does not generate events for.

    Thursday, October 14th, 2004

    This week has been weird: My stomach has been bothering me, so I have been going home early.
    The good news: We had our IRAD review and it went well. The outcome: Write some papers so we can get our name out there and write some NSF or DARPA grants. The good news is that we have two reasonable ideas for papers. The other good news is that Beowulf seems to be in everyone's future, so we are thinking the next thing to do is put visualizations up for Beowulf programs. We could get AS&T funding, maybe collaborate with NMSU, take advantage of our usage, etc.

    Currently, we want to take our current demo and make it better and more useful. I have been talking with Jaime and Mike about using this tool ... problem: They are ALL Tru64. Have to port things to Tru64. Talked to Bill Reese about that a little bit, and he had some good ideas (as always).

    Friday, October 15th, 2004

    A flurry of activity yesterday as me and Derek played a cryptography toy program he has: we optimized it from 1 min, 30 secs down to 18 seconds! Originally written in Python, it takes hours. Rewritten in C++, it takes minutes, with optimization, seconds. BUT: That's not good enough for me! So, I am taking the ULMA work and making it so C++ programmers can access the tools. Here's a copy of the e-mail I sent to Derek describing how it will work:
    
    I shaved off MAYBE another second.  So I started thinking... "I wish I
    had ULMA for this
    decrypt."   And then I realized, I can!
    
    I am working on something simple that users can compile into their
    programs:
    
    main()
    {
       while (...) {
    
       }
    }
    
    You might instrument with:
    
    #include "ts.h"
    EventWriter MyEventWriter("hello", 'W')
    #define PROFILER_WRITER MyEventWriter // undef to turn off profiling
    
    main()
    { TS("main");
        while (...) { TS("first while");
    
    
        }
    }
    
    Where TS becomes ... ->  TimeStanpScope ps("function name", __FILE__,
    __LINE__)
    
    Which generates a "call event" at TimeStampScope construction and a
    "return
    event" and TimeStampcope destruction.   This way you can keep
    instrumenting and
    taking out profiling.  Eh?
    
    With this, we can use the ULMA tools in C++ too!
    (And in C:  There's a C routine called "generateULMAEvent" which is
    called for C++ constr/destr).
    
      Gooday,
    
      Richie
    
    P.S>  Curse you!  I can't get isopatterns out of my head! ;)
    

    Monday, October 18th, 2004

    Been thinking: Although the rdtscll asm call has been giving us our speedup, we are reading on "on-chip" register. On the DL360s/380s, we are getting lucky because they they are UMA (ULMA is UMA!) and the registers are initialized to 0 at power-up, at "approximately" the same time. My basic tests seem to indicate that they in sync, but we want some assuredness for this: especially if we want to move to Beowulf clusters. I did some searching on the web and did find some links which "imply" there are tools (for beowulf) that imply the registers acrosss a cluster are in sync (google TSC synchronize and you'll find some logs that seem to indicate this). So, it sounds like the BeoWulf clusters have this ... and it looks like they do what I expect: an "NTP-like" protocol to check sync (it turns out you can't SET the registers: you can only power them up at the same time for sync). So, future versions might have some kind "id the processor on, add the offset from true timestamp and continue". Of course, the right way to set the offset is via the scheduler: The scheduler is the entity that "moves" processes around, so it should know, with very little work, this information. Sigh. Here's the question: If we continue down this road with ULMA, trying to write a paper, etc., it is VITAL that we can have confidence that MultiProcessor TimeStamps are in sync. Without this guarantee, the information means nothing. We are getting away with it now because we have a "smallish" UMA multiprocessor which "seems" to be booting simultaneously and setting. This is especially important if we head towards BeoWulf clusters. If we go toward NUMA architectures, then perhaps we have to have a scale factor as well as an offset when read the TS register. So, is this another research project? Is this two papers? Is this already done? Some more bad news: Even the rtdtscll instruction is prone to problems:
    	
    	Table of Contents
    			
    	Introduction
    	Background
    	Issues with Performance Monitoring with Intel® SpeedStep®
    	Conclusion
    	Additional Resources
    	Sources
    	Executable
    	Dynamic Link Library
    	Driver
    		
    
    	Page Options
    			
    	Printer friendly version
    		
    	
    Issues with Performance Monitoring with Intel® SpeedStep®
    As of this writing, the Single Processor HAL (Hardware Abstraction Layer) of Windows 2000*/XP uses the same 1.193180 megahertz counter for its "QueryPerformanceCounter" API as all previous versions of Windows. However, the Multiprocessor HAL of Windows 2000 and XP use the system's own clock rate through the RDTSC processor instruction. This causes problems for software where performance monitoring is underway and Intel Speed Stepping is enabled as the values read from the RDTSC instruction are varying with respect to frequency and hence the timing values read are typically skewed.
    
    By changing the power usage on a system running Intel Speed Stepping, such that Speed Stepping is effectively pushed to where no frequency changes occur, the RDTSC instruction becomes reliable from one read to the next and is hence useful for the purposes of code profiling. This can be done simply enough by indirect calls to the Processor Power Scheme API calls ReadProcessorPwrScheme and WriteProcessorPwrScheme. However, the Windows XP operating system does not allow for a method of being interrupted when the Power scheme changes. The purpose of this mechanism is to be provided notification in the event some process, or user intervention changes the power scheme to other than the non-frequency changing state set state while the profiling session is underway.
    
    The method described here utilizes a power-managed driver to detect these subtleties and notify an event handler in user mode process space. In doing this, the user mode process can defer any power scheme changes to after the profiling session is completed thus retaining consistency in the profiling metrics.
    
    Performance Event Interception
    
    The method involves first registering a power managed event callback (DeviceDispatchPower) in a driver running in Kernel Mode. The Driver source for this routine is listed in section 2.2.2.3. This routine responds to all Power Managed events in the operating system. In a standard Power-managed scenario this routine is provided to the driver so that it can change the power states on the hardware that it is controlling. Since the Driver is filtering Power events for the entire system it provides an easy scheme for filtering these events and preventing deleterious event changes by deferring them off to a time after the profiling session has completed.
    
    (From the intel web site). Basically, if "power saving" settings are used, the rdtscll call becomes inconsistent. The intel site mentions a solution for the XP, but is there one for Linux? Actually, it sounds like "yet another axis" of problems to address. Again, we are probably getting luck because the Xeon probably doesn't use this? This link looks like it has some information: It's the smp-boot code for Linux. http://www-d0.fnal.gov/D0Code/source/l3xsbc_distrib/linux/arch/i386/kernel/smpboot.c Perfmon utility: http://www.cse.msu.edu/~enbody/pc_perfmon/ Beowulf: http://www.beowulf.org/pipermail/beowulf/2003-March/006131.html

    Wednesday, October 20th, 2004

    Today I figured out why the C++ TS utility wasn't working. It turns out I was creating a temporary TimeStampScope("name") without a variable name! This allows the C++ compiler to realize it's a "temp" and call it's destructor at the end end of the "temp scope", which is immediately! As soon as it becomes a real varaible TimeStampScope ts("name") then everything works fine. It has been interesting taking Derek's decrypt program and playing with it trying to make it faster. It looks like now we have real information about a real program in C++. Cool! So, another checkmark for the IRAD.

    Back on track to mapping the frames across the processes for speed. Reminder: Here's how long fib takes WITHOUT any instrumentation.
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    17.470u 0.010s 0:17.58 99.4%    0+0k 0+0io 433pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    
    
    Here's with an eventwriter that sends over the frame, basically as is as much as possible.
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    22.500u 0.210s 0:22.83 99.4%    0+0k 0+0io 473pf+0w
    
    I am convinced this is about the best we can do. The question NOW, is can we actually keep the frames around long enough to actually get the information out? Here's another test: INC and DEC the refcount: We'll have to do at LEAST this much work to keep the frame around. What's the effect?
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    22.890u 0.250s 0:23.27 99.4%    0+0k 0+0io 473pf+0w
    
    Half a second? Okay, not bad.

    Some more numbers: What if we run the topdemo in the same process? I.e., just directly post the events to some data structure. Currently, this builds a data structure and does just as many copies as chunky events and shares the heap.
    aidl380-01 % time python testshm.py 
    done 10
    done 20
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.9128e+11 99.9 : 99.8 : 00.1 :            testshm.py:    2 
    done 30
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 3.9541e+11 99.9 : 96.5 : 03.4 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 6.0489e+11 99.9 : 97.7 : 02.2 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 8.2450e+11 99.9 : 98.3 : 01.6 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.0647e+12 99.9 : 98.6 : 01.3 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.2607e+12 99.9 : 98.8 : 01.1 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.5483e+12 99.9 : 99.0 : 00.9 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.7446e+12 99.9 : 99.1 : 00.8 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.9771e+12 99.9 : 99.2 : 00.7 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.1712e+12 99.9 : 99.3 : 00.6 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.4981e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.7065e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.9259e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 3.1694e+12 99.9 : 99.5 : 00.4 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 3.3810e+12 99.9 : 99.5 : 00.4 :            testshm.py:    2 
    done 35
    WARNING:  starting profiling inside a routine, won't 
    count first return
    49.560u 0.010s 0:49.71 99.7%    0+0k 0+0io 486pf+0w
    
    Which is pretty much what I expected: When a full topdemo is running, it almost takes up a full CPU (actually I expected about 40 seconds, this is longer than I expected). When the topdemo is "folded in", then the almost two full CPUS becomes 1 very loaded CPU. The sad thing is that the "heap" is shared in this case. Don't like it, but that's the way it is. At one point, I will have to rewrite top demo so it can have a dedicated heap: separate from the main program.

    The next step is to put a thread around the topdemo and only send lightweight events (which immediately get transformed into chunky events in the monitor thread). Why do we have to do this? Remember, Python has the ONE BIG LOCK problem around the interpreter: We can't have two threads upping ref counts. What we CAN do is have is possibly have the monitored program simply up the ref count on the filename and program name (or do we? Do those things live forever?) For the first test, let's just turn the lightweight event into a chunky event, but have the MONITOR do the translation. Oh wait, can we? Ugh. Again, same problem. We decrements the ref counts?
    
    
    
    

    Tuesday, November 2nd, 2004

    Busy week. Met with Clint and Dr. Pontelli on Friday. He seems interested in collaborating. I think the way we put it was that we'll collaborate, I'll do most of the work, we'll try and write some NSF grants together with Clint, get a RRC/NMSU joint grant? Met with Micheal for a good portion of Saturday. We discussed where we were going: We want to get something together for PyCon 2005. The next tools he'll be working on are the Memory Display tools. My next visit, we'll try to put EVERYTHING together that we've done. Micheal and I also talk about the event-exchanging protocol. He had a very interesting observation, that UNICON doesn't use reference counting it (essentially) uses mark-and-sweep (kind of a copy collector). Allocations in UNICON are single adds: rather than pooling allocations, let's just use the "single add" for allocation and avoid all the overhread. This has a different performance profile than Python with its reference counting. This implies we will probably have to use a different protocol for performance. Excellent! (Sync at boundary barriers). Had a meeting with Clint and Micheal, later on Saturday, discussing future. I am trying to get the core of our paper written. I had the IRAD presentations yesterday: I wasn't at the top of my game (I believe I have been coming down with something) but they went okay. I think the IRAD committee was most excited about the Eclipse (an IDE for Linux) idea. The BeoWulf piping system sounded like it may be funded by someone else in the government at San Diego? So, My guess is that IRAD won't happen, or they will at least wait to see what happens with the San Diego money. I think my ULMA IRAD will be funded for a little bit, at a smaller rate so we can finish it off.

    Did some work yesterday: Got a POSIX thread running consuming LightWeight events, and was dismally disappointed with the performance. Discounting the fact that they share the heap (perhaps that's the bottleneck?), even just sending the lightweight event and turning it into a FrameIdentifier in the monitor was dismal timing. I think, for speed's sake, I HAVE to write a lightweight topdemo where the FrameIdentifier just has pointers. Really, something is going on here that I am not getting ... maybe because two threads ARE sharing the heap? Hm ... that was a HUGE deal in Midas 2k and led me to write OpenContainers.

    Another idea, to speed up the Chunky Events is make the hash function a little simpler. (Just hash on the name and number, and don't use the filename).

    Wednesday, November 3rd, 2004

    Okay. After yesterday's "failure" (times seem to be terrible for sending lightweight events), I am backing up and re-running the whole experiment. First, how long does it take to run the fib program with NO profiling whatsoever?
    aidl380-01 % time python testshm.py 
    done 10
    done 20
    done 30
    done 35
    18.160u 0.020s 0:18.33 99.1%    0+0k 0+0io 480pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % !!
    time python testshm.py
    done 10
    done 20
    done 30
    done 35
    18.640u 0.020s 0:18.72 99.6%    0+0k 0+0io 480pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % !!
    time python testshm.py
    done 10
    done 20
    done 30
    done 35
    18.190u 0.010s 0:18.21 99.9%    0+0k 0+0io 480pf+0w
    
    Average wall clock time of about 18.4 seconds. Now, sending lightweight events to ANOTHER PROCESS, but that process does NOTHING with the event (except read it).
    aidl380-01 % time python testshm.py 
    done 10
    done 20
    done 30
    done 35
    22.390u 0.100s 0:22.62 99.4%    0+0k 0+0io 511pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    21.980u 0.050s 0:22.06 99.8%    0+0k 0+0io 511pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    21.670u 0.090s 0:21.90 99.3%    0+0k 0+0io 511pf+0w
    
    About 22 seconds of elapsed time. Let's look at a "top" while the programs are running: Notice that the CPU is pegged (as expected) for the fib program, and the "sitereader" (a do nothing except read the SHMfib Shared memory event queue) is a mere 0.9%. Notice that the total CPU is 26.5% .. this is the 24.9% + .9% (plus other). This will be important to watch later to see if we are getting real concurrency.
     09:31:18  up 20 days,  2:05,  7 users,  load average: 0.46, 0.34, 0.39
    105 processes: 103 sleeping, 2 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total   26.5%    0.0%    0.6%   0.0%     0.0%    0.2%   72.6%
               cpu00   30.8%    0.0%    1.2%   0.0%     0.0%    0.0%   68.0%
               cpu01    1.4%    0.0%    0.6%   0.0%     0.0%    0.4%   97.6%
               cpu02   71.2%    0.0%    0.2%   0.0%     0.0%    0.0%   28.6%
               cpu03    2.8%    0.0%    0.4%   0.0%     0.0%    0.4%   96.4%
    Mem:  3094804k av, 3001104k used,   93700k free,       0k shrd,  270084k buff
                       1627584k actv,  745400k in_d,   52268k in_c
    Swap: 3145656k av,    8860k used, 3136796k free                 2244528k cached
    
      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
    25653 rts       25   0  2704 2704  1740 R    24.9  0.0   0:14   2 python
    25652 rts       20   0   888  888   764 S     0.9  0.0   0:00   3 sitereader
     3130 wal       16   0  5568 5568  4220 S     0.5  0.1   2:30   3 artsd
     3145 wal       15   0 11388  11M  9764 S     0.1  0.3   1:01   1 kdeinit
    25332 wal       15   0 10080 9.8M  7452 S     0.1  0.3   0:03   3 gaim
        1 root      15   0   508  476   448 S     0.0  0.0   0:20   0 init
        2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
        3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
        4 root      RT   0     0    0     0 SW    0.0  0.0   0:00   2 migration/2
        5 root      RT   0     0    0     0 SW    0.0  0.0   0:00   3 migration/3
        6 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 keventd
        7 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
        8 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1
        9 root      34  19     0    0     0 SWN   0.0  0.0   0:00   2 ksoftirqd/2
       10 root      34  19     0    0     0 SWN   0.0  0.0   0:00   3 ksoftirqd/3
       13 root      15   0     0    0     0 SW    0.0  0.0   0:01   1 bdflush
    
    Let's now try reading the event queue FROM A THREAD IN THE SAME PROCESS, where the reader just throws away the event. We expect exactly the same performance, right? After all, we are using a kernel thread which should be able to run on another CPU.
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    Top Demo for LightWeight Events
    here?shmfib
    done 10
    Opened ...
    done 20
    done 30
    done 35
    21.600u 0.050s 0:21.83 99.1%    0+0k 0+0io 511pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    Top Demo for LightWeight Events
    here?shmfib
    done 10
    Opened ...
    done 20
    done 30
    done 35
    21.620u 0.020s 0:21.67 99.8%    0+0k 0+0io 511pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    Top Demo for LightWeight Events
    here?shmfib
    done 10
    Opened ...
    done 20
    done 30
    done 35
    21.350u 0.080s 0:21.44 99.9%    0+0k 0+0io 511pf+0w
    
    21.6 seconds? Maybe a bit faster because of caching? Notice the top profile:
     09:42:58  up 20 days,  2:16,  7 users,  load average: 0.32, 0.20, 0.25
    104 processes: 102 sleeping, 2 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total   25.1%    0.0%    0.5%   0.0%     0.0%    0.0%   74.2%
               cpu00   99.0%    0.0%    0.9%   0.0%     0.0%    0.0%    0.0%
               cpu01    0.9%    0.0%    0.3%   0.0%     0.0%    0.0%   98.6%
               cpu02    0.1%    0.0%    0.0%   0.0%     0.0%    0.3%   99.4%
               cpu03    0.3%    0.0%    0.7%   0.0%     0.0%    0.0%   98.8%
    Mem:  3094804k av, 3000064k used,   94740k free,       0k shrd,  270084k buff
                       1627516k actv,  745424k in_d,   52268k in_c
    Swap: 3145656k av,    8860k used, 3136796k free                 2244552k cached
    
      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
    25690 rts       25   0  2736 2736  1764 R    24.9  0.0   0:17   0 python
        1 root      15   0   508  476   448 S     0.0  0.0   0:20   2 init
        2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
        3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
        4 root      RT   0     0    0     0 SW    0.0  0.0   0:00   2 migration/2
        5 root      RT   0     0    0     0 SW    0.0  0.0   0:00   3 migration/3
        6 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 keventd
        7 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
        8 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1
        9 root      34  19     0    0     0 SWN   0.0  0.0   0:00   2 ksoftirqd/2
       10 root      34  19     0    0     0 SWN   0.0  0.0   0:00   3 ksoftirqd/3
       13 root      15   0     0    0     0 SW    0.0  0.0   0:01   1 bdflush
       11 root      15   0     0    0     0 SW    0.0  0.0   0:29   1 kswapd
       12 root      15   0     0    0     0 SW    0.0  0.0   0:23   2 kscand
       14 root      15   0     0    0     0 SW    0.0  0.0   0:17   1 kupdated
       15 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 mdrecoveryd
    
    
    Hm. Another full CPU with about 25% on another. Notice that the top on Linux (for this version at least) will never give us more than 25% for CPU since this is a "4CPU" machine.

    Now, let's have the other thread not just read the events, but put them in the "top data structure". There is actually some sharing of heap going on here, but not much.
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    26.540u 0.130s 0:29.27 91.1%    0+0k 0+0io 511pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    26.540u 0.180s 0:29.30 91.1%    0+0k 0+0io 511pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    26.180u 0.140s 0:28.97 90.8%    0+0k 0+0io 511pf+0w
    
    About 29 seconds. Hm. I expected better. This feels about the same time as chunky events. But this seems so wrong! It feels like we should be at least twice as fast as CHUNKY events. Hm.
    Notice the top output below: We are DEFINITELY getting concurrency, as we are starting to use almost half of a 4_CPU machine (top doesn't set out the threads though, so it still reports the python program as only 24.9%). But notice the CPUs are being hammered: about 2 CPUs worth: One for the reader and one for the writer.
    103 processes: 101 sleeping, 2 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
     09:48:35  up 20 days,  2:22,  7 users,  load average: 0.88, 0.43, 0.31
    103 processes: 101 sleeping, 2 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total   47.6%    0.0%    0.1%   0.0%     0.0%    0.1%   52.1%
               cpu00   44.6%    0.0%    0.2%   0.0%     0.0%    0.4%   54.8%
               cpu01    3.2%    0.0%    0.0%   0.0%     0.0%    0.0%   96.8%
               cpu02   51.6%    0.0%    0.2%   0.0%     0.0%    0.0%   48.2%
               cpu03   91.2%    0.0%    0.0%   0.0%     0.0%    0.0%    8.8%
    Mem:  3094804k av, 2994820k used,   99984k free,       0k shrd,  270084k buff
                       1623404k actv,  745428k in_d,   52268k in_c
    Swap: 3145656k av,    8860k used, 3136796k free                 2244556k cached
    
      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
    25721 rts       25   0  2864 2864  1884 R    24.9  0.0   0:12   3 python
    25332 wal       15   0 10164 9.9M  7468 S     0.2  0.3   0:05   0 gaim
        1 root      15   0   508  476   448 S     0.0  0.0   0:20   0 init
        2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
        3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
        4 root      RT   0     0    0     0 SW    0.0  0.0   0:00   2 migration/2
        5 root      RT   0     0    0     0 SW    0.0  0.0   0:00   3 migration/3
        6 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 keventd
        7 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
        8 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1
        9 root      34  19     0    0     0 SWN   0.0  0.0   0:00   2 ksoftirqd/2
       10 root      34  19     0    0     0 SWN   0.0  0.0   0:00   3 ksoftirqd/3
       13 root      15   0     0    0     0 SW    0.0  0.0   0:01   1 bdflush
       11 root      15   0     0    0     0 SW    0.0  0.0   0:29   1 kswapd
    
    
    All right, let's add in the "showTop" for completeness.
    aidl380-01 % time python testshm.py 
    event_queue_name = 
    shmfib!
    Top Demo for LightWeight Events
    here?shmfib
    done 10
    Opened ...
    done 20
    done 30
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.0256e+11 99.9 : 95.2 : 04.7 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 4.0175e+11 99.9 : 97.5 : 02.4 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 6.1219e+11 99.9 : 98.3 : 01.6 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 8.4691e+11 99.9 : 98.8 : 01.1 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.0927e+12 99.9 : 99.0 : 00.9 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.3107e+12 99.9 : 99.2 : 00.7 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.5860e+12 99.9 : 99.3 : 00.6 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.8066e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.0322e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.2349e+12 99.9 : 99.5 : 00.4 :            testshm.py:    2 
    done 35
    27.510u 0.140s 0:31.72 87.1%    0+0k 0+0io 510pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    Top Demo for LightWeight Events
    here?shmfib
    done 10
    Opened ...
    done 20
    done 30
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.0381e+11 99.9 : 95.1 : 04.8 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    
    
    
    :fib             : 4.0425e+11 99.9 : 97.5 : 02.4 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 6.1248e+11 99.9 : 98.3 : 01.6 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 8.4501e+11 99.9 : 98.7 : 01.2 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.1022e+12 99.9 : 99.0 : 00.9 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.3043e+12 99.9 : 99.2 : 00.7 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.5109e+12 99.9 : 99.3 : 00.6 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.8056e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.0089e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.2213e+12 99.9 : 99.5 : 00.4 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.4532e+12 99.9 : 99.5 : 00.4 :            testshm.py:    2 
    done 35
    
    
    
    31.340u 0.110s 0:36.92 85.1%    0+0k 0+0io 510pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    Top Demo for LightWeight Events
    here?shmfib
    done 10
    Opened ...
    done 20
    done 30
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.0031e+11 99.9 : 95.6 : 04.3 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 4.0418e+11 99.9 : 97.8 : 02.1 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 6.1793e+11 99.9 : 98.5 : 01.4 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 8.4450e+11 99.9 : 98.9 : 01.0 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.1037e+12 99.9 : 99.1 : 00.8 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.3251e+12 99.9 : 99.3 : 00.6 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.6119e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.8147e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.0261e+12 99.9 : 99.5 : 00.4 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.2590e+12 99.9 : 99.5 : 00.4 :            testshm.py:    2 
    done 35
    29.910u 0.180s 0:32.78 91.7%    0+0k 0+0io 510pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    Top Demo for LightWeight Events
    here?shmfib
    done 10
    Opened ...
    done 20
    done 30
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    
    ____________________________________________________________________________
    :fib             : 1.9855e+11 99.9 : 96.0 : 03.9 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 4.0033e+11 99.9 : 98.0 : 01.9 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 6.4510e+11 99.9 : 98.7 : 01.2 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 8.4536e+11 99.9 : 99.0 : 00.9 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.0956e+12 99.9 : 99.2 : 00.7 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.3136e+12 99.9 : 99.3 : 00.6 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.5918e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.8135e+12 99.9 : 99.5 : 00.4 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.0372e+12 99.9 : 99.5 : 00.4 :            testshm.py:    2 
    done 35
    27.830u 0.130s 0:31.39 89.0%    0+0k 0+0io 510pf+0w
    
    The interesting thing here is that we don't think we are using all the CPU .. Perhaps it should be taking 27.8 seconds *user time* but with the extra Top and extra computer top work. Anyway, we are taking extra time with the ShowTop 32 seconds? (The 36 may be an aberration). Now, it's time to go back and revisit Chunky Events. (We have been playing with buffer sizes and such, so let's revisit Chunky Events so they have the same buffger size: 64K) Let's send Chunky Events to a process that does nothing but read's the events and throws them away.
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    32.390u 0.220s 0:32.58 100.0%   0+0k 0+0io 511pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    32.810u 0.260s 0:33.27 99.3%    0+0k 0+0io 511pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    done 10
    done 20
    done 30
    done 35
    33.580u 0.330s 0:33.96 99.8%    0+0k 0+0io 511pf+0w
    
    Here's what the machine is doing ... So, okay. Just sending Chunky events to a process reading the events is the same time as the lightweight events when we are doing "top!". Okay, now I feel better.

    Here's the LINUX top while it's running. The lreader, the "dummy" program is doing more work than the "sitereader" (for LW events). But, the python program moves as expected.
    109 processes: 106 sleeping, 3 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total   33.6%    0.0%    1.0%   0.0%     0.0%    0.1%   65.0%
               cpu00   20.1%    0.0%    0.7%   0.0%     0.0%    0.3%   78.6%
               cpu01   86.0%    0.0%    0.3%   0.0%     0.0%    0.0%   13.5%
               cpu02   25.8%    0.0%    3.2%   0.0%     0.0%    0.2%   70.8%
               cpu03    2.7%    0.0%    0.0%   0.0%     0.0%    0.0%   97.2%
    Mem:  3094804k av, 3015980k used,   78824k free,       0k shrd,  270084k buff
                       1641912k actv,  745360k in_d,   51744k in_c
    Swap: 3145656k av,    8860k used, 3136796k free                 2247360k cached
    
      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
    25960 rts       25   0  2704 2704  1740 R    24.9  0.0   0:09   1 python
    25959 rts       23   0  1000 1000   876 R     8.5  0.0   0:03   0 lreader
     3068 wal       15   0 34048  33M  2484 S     0.4  1.1  92:48   2 Xvnc
     3142 wal       15   0 14296  13M 11924 S     0.1  0.4   3:31   2 kdeinit
    23633 wal       15   0  8192 8156  4072 S     0.1  0.2   0:06   0 emacs
        1 root      15   0   508  476   448 S     0.0  0.0   0:20   2 init
        2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
        3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
        4 root      RT   0     0    0     0 SW    0.0  0.0   0:00   2 migration/2
        5 root      RT   0     0    0     0 SW    0.0  0.0   0:00   3 migration/3
        6 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 keventd
        7 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
        8 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1
        9 root      34  19     0    0     0 SWN   0.0  0.0   0:00   2 ksoftirqd/2
       10 root      34  19     0    0     0 SWN   0.0  0.0   0:00   3 ksoftirqd/3
       13 root      15   0     0    0     0 SW    0.0  0.0   0:01   1 bdflush
    
    Now, let's send Chunky events to a thread in the same process which only reads them and throws them away. We expect about the same elapsed time: 33 seconds.
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    sizeof(AccumulatingTimes)=12
    done 10
    done 20
    done 30
    done 35
    31.130u 0.230s 0:31.45 99.7%    0+0k 0+0io 511pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % !!
    time python testshm.py
    event_queue_name = 
    shmfib!
    sizeof(AccumulatingTimes)=12
    done 10
    done 20
    done 30
    done 35
    31.050u 0.290s 0:31.45 99.6%    0+0k 0+0io 511pf+0w
    
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    sizeof(AccumulatingTimes)=12
    done 10
    done 20
    done 30
    done 35
    32.500u 0.200s 0:32.78 99.7%    0+0k 0+0io 511pf+0w
    
    
    Interesting: we are a "touch" faster when we do the chunky events in a thread instead of another process. Perhaps because the context switch time is cheaper? Caching? Here's top:
     10:24:59  up 20 days,  2:58,  8 users,  load average: 0.92, 0.48, 0.30
    109 processes: 107 sleeping, 2 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total   27.5%    0.0%    0.7%   0.0%     0.0%    0.0%   71.6%
               cpu00    5.0%    0.0%    0.8%   0.0%     0.0%    0.0%   94.2%
               cpu01   33.4%    0.0%    0.6%   0.0%     0.0%    0.2%   65.8%
               cpu02    6.5%    0.0%    0.5%   0.0%     0.0%    0.0%   92.8%
               cpu03   65.0%    0.0%    0.9%   0.0%     0.0%    0.0%   33.9%
    Mem:  3094804k av, 3020768k used,   74036k free,       0k shrd,  270084k buff
                       1647664k actv,  745376k in_d,   51744k in_c
    Swap: 3145656k av,    8860k used, 3136796k free                 2252508k cached
    
      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
    26066 rts       25   0  2888 2888  1912 R    24.9  0.0   0:15   3 python
        1 root      15   0   508  476   448 S     0.0  0.0   0:20   0 init
        2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
        3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
        4 root      RT   0     0    0     0 SW    0.0  0.0   0:00   2 migration/2
        5 root      RT   0     0    0     0 SW    0.0  0.0   0:00   3 migration/3
        6 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 keventd
        7 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
        8 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1
        9 root      34  19     0    0     0 SWN   0.0  0.0   0:00   2 ksoftirqd/2
       10 root      34  19     0    0     0 SWN   0.0  0.0   0:00   3 ksoftirqd/3
    
       13 root      15   0     0    0     0 SW    0.0  0.0   0:01   1 bdflush
       11 root      15   0     0    0     0 SW    0.0  0.0   0:29   2 kswapd
       12 root      15   0     0    0     0 SW    0.0  0.0   0:23   0 kscand
       14 root      15   0     0    0     0 SW    0.0  0.0   0:17   3 kupdated
       15 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 mdrecoveryd
    
    
    Now, the stuff. Let's have the other thread compute Top info without showing the top.
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    sizeof(AccumulatingTimes)=12
    
    done 10
    done 20
    done 30
    done 35
    36.450u 0.240s 0:37.02 99.1%    0+0k 0+0io 511pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    sizeof(AccumulatingTimes)=12
    done 10
    done 20
    done 30
    done 35
    39.360u 0.230s 0:39.65 99.8%    0+0k 0+0io 511pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    sizeof(AccumulatingTimes)=12
    done 10
    done 20
    done 30
    done 35
    36.440u 0.300s 0:37.03 99.2%    0+0k 0+0io 511pf+0w
    
    So, then computing top seems to really limit us. It has brought us up to 38 seconds.
    10:28:16  up 20 days,  3:02,  8 users,  load average: 0.60, 0.47, 0.32
    109 processes: 108 sleeping, 1 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total   46.0%    0.0%    0.9%   0.0%     0.4%    0.4%   51.9%
               cpu00   70.5%    0.0%    0.0%   0.0%     1.9%    0.0%   27.4%
               cpu01   25.4%    0.0%    0.0%   0.0%     0.0%    1.9%   72.5%
               cpu02   29.7%    0.0%    1.9%   0.0%     0.0%    0.0%   68.3%
               cpu03   58.4%    0.0%    1.9%   0.0%     0.0%    0.0%   39.6%
    Mem:  3094804k av, 3019852k used,   74952k free,       0k shrd,  270084k buff
                       1647660k actv,  745372k in_d,   51744k in_c
    Swap: 3145656k av,    8860k used, 3136796k free                 2252504k cached
    
      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
    26087 rts       25   0  2888 2888  1912 S    24.9  0.0   0:18   0 python
     3068 wal       15   0 39060  38M  7400 S     1.4  1.2  93:03   1 Xvnc
     3145 wal       15   0 11388  11M  9764 S     0.2  0.3   1:03   2 kdeinit
    26090 rts       20   0  1236 1236   868 R     0.2  0.0   0:00   2 top
        1 root      15   0   508  476   448 S     0.0  0.0   0:20   0 init
        2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
        3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
        4 root      RT   0     0    0     0 SW    0.0  0.0   0:00   2 migration/2
        5 root      RT   0     0    0     0 SW    0.0  0.0   0:00   3 migration/3
        6 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 keventd
        7 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
        8 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1
        9 root      34  19     0    0     0 SWN   0.0  0.0   0:00   2 ksoftirqd/2
       10 root      34  19     0    0     0 SWN   0.0  0.0   0:00   3 ksoftirqd/3
       13 root      15   0     0    0     0 SW    0.0  0.0   0:01   1 bdflush
       11 root      15   0     0    0     0 SW    0.0  0.0   0:29   3 kswapd
    
    
    
    Finally, let's make sure the other thread shows the computed top demo.
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    sizeof(AccumulatingTimes)=12
    done 10
    done 20
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.9027e+11 99.9 : 99.9 : 00.0 :            testshm.py:    2 
    done 30
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 3.9586e+11 99.9 : 97.2 : 02.7 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 6.1902e+11 99.9 : 98.2 : 01.7 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 8.4755e+11 99.9 : 98.6 : 01.3 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.0542e+12 99.9 : 98.9 : 01.0 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.3100e+12 99.9 : 99.1 : 00.8 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.5316e+12 99.9 : 99.2 : 00.7 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.8277e+12 99.9 : 99.3 : 00.6 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.0319e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.2477e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.4775e+12 99.9 : 99.5 : 00.4 :            testshm.py:    2 
    done 35
    36.820u 0.310s 0:37.38 99.3%    0+0k 0+0io 510pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    sizeof(AccumulatingTimes)=12
    done 10
    done 20
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.9313e+11 99.9 : 99.8 : 00.1 :            testshm.py:    2 
    done 30
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 3.9928e+11 99.9 : 97.0 : 02.9 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 6.1877e+11 99.9 : 98.0 : 01.9 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 8.4832e+11 99.9 : 98.5 : 01.4 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.0517e+12 99.9 : 98.8 : 01.1 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.3116e+12 99.9 : 99.0 : 00.9 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.5323e+12 99.9 : 99.1 : 00.8 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.8320e+12 99.9 : 99.3 : 00.6 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.0363e+12 99.9 : 99.3 : 00.6 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.2475e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.4772e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.6780e+12 99.9 : 99.5 : 00.4 :            testshm.py:    2 
    done 35
    38.180u 0.220s 0:38.59 99.5%    0+0k 0+0io 510pf+0w
    ( ~/hooks/ULMA2/Hooks/cprof )
    aidl380-01 % time python testshm.py
    event_queue_name = 
    shmfib!
    sizeof(AccumulatingTimes)=12
    done 10
    done 20
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.8913e+11 99.9 : 99.8 : 00.1 :            testshm.py:    2 
    done 30
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 4.0850e+11 99.9 : 97.4 : 02.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 6.1906e+11 99.9 : 98.3 : 01.6 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 8.5266e+11 99.9 : 98.7 : 01.2 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.0588e+12 99.9 : 98.9 : 01.0 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.3157e+12 99.9 : 99.1 : 00.8 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.5358e+12 99.9 : 99.2 : 00.7 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 1.8343e+12 99.9 : 99.3 : 00.6 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.0436e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.2602e+12 99.9 : 99.4 : 00.5 :            testshm.py:    2 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :fib             : 2.4903e+12 99.9 : 99.5 : 00.4 :            testshm.py:    2 
    done 35
    37.020u 0.230s 0:37.66 98.9%    0+0k 0+0io 510pf+0w
    
    Well, good. The extra "show top" work didn't hold us up too much.
     10:33:54  up 20 days,  3:07,  8 users,  load average: 1.48, 1.02, 0.60
    109 processes: 107 sleeping, 2 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total   46.3%    0.0%    0.6%   0.0%     0.0%    0.2%   52.9%
               cpu00   49.2%    0.0%    0.4%   0.0%     0.0%    0.4%   50.0%
               cpu01   78.6%    0.0%    0.6%   0.0%     0.0%    0.0%   20.8%
               cpu02   34.2%    0.0%    0.6%   0.0%     0.0%    0.4%   64.8%
    
               cpu03   23.2%    0.0%    0.8%   0.0%     0.0%    0.0%   76.0%
    Mem:  3094804k av, 3020520k used,   74284k free,       0k shrd,  270084k buff
                       1647696k actv,  745380k in_d,   51744k in_c
    Swap: 3145656k av,    8860k used, 3136796k free                 2252508k cached
    
      PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
    26126 rts       25   0  2920 2920  1944 R    24.9  0.0   0:19   1 python
    25332 wal       15   0 11024  10M  7656 S     0.2  0.3   0:22   3 gaim
        1 root      15   0   508  476   448 S     0.0  0.0   0:20   0 init
        2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
        3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
        4 root      RT   0     0    0     0 SW    0.0  0.0   0:00   2 migration/2
        5 root      RT   0     0    0     0 SW    0.0  0.0   0:00   3 migration/3
        6 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 keventd
        7 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
        8 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1
        9 root      34  19     0    0     0 SWN   0.0  0.0   0:00   2 ksoftirqd/2
       10 root      34  19     0    0     0 SWN   0.0  0.0   0:00   3 ksoftirqd/3
       13 root      15   0     0    0     0 SW    0.0  0.0   0:01   1 bdflush
       11 root      15   0     0    0     0 SW    0.0  0.0   0:29   0 kswapd
       12 root      15   0     0    0     0 SW    0.0  0.0   0:23   0 kscand
       14 root      15   0     0    0     0 SW    0.0  0.0   0:17   2 kupdated
    
    
    I feel much better: the lightweight events are cheaper than the heavyweight. What I don't understand is why the "computing top" actually slows down the monitored program. It MUST be the case that the monitor is not reading the events fast enough. Even the double buffer is getting in the way: I _think_ what is happening is that the event writer gets to the end of the double buffer, and it "waits" for the event reader to catch up at the end of every buffer. So, it spends some time waiting. We should not allow this for ULMA. We could make the buffers "bigger" so it happens less often, but then we have to worry about "thrashing" as buffers get too big. Of course, the reader _IS_ keeping up, almost. But, the writer of events can't even afford to wait a little foe the reader to catch up that little bit. I _think_ the solution here is for a "n-buffer" solution, "writer centric" where the writer always has an extra buffer it can write to. Thus, when the writer reaches the end of a buffer, it immediately jumps to the next free one (because the reader will only be working on one at a time anyways). There will always be a free buffer for the reader to immediately continue to, and when the reader catches up, it can make progress more quickly as well. So, I need to program this up to see if I can reach the "theoretical peak" of just having an empty reader.

    Thursday, November 4th, 2004

    So, not only did my implemetation of NBuffering slow it down, it disheartened me. It seems like the wait when trying to switch between buffers is what was preventing me from reaching my theoretical maximum. My implementation might just suck. I am going to play some more. Went back to my double buffer scheme: Seems to work well. I am going to stay with that? Anyway, I seem to have lost my focus: The "real" issue here is that we need to transmit lightweight events.

    To recap: fib demo with lightweight events (and full top), 30 seconds. fib demo with chunky events (and full top), 37 seconds. Original program, 18 seconds. To be fair, however, what we are sending over (without the special protocol) is the names and linenumber (which, by luck) don't go away. What we have to do now is implement the protocol idea, which is send the frame and making sure reference counting occurs.

    Friday, November 5th

    I am crazy! Insane! I have been cackling all morning at my newest activity: monitoring the monitor! My slwtopdemo isn't quite keeping up, and I was wondering "what was taking all the time". Well, I have a tool to do that: my topdemo! So, using the C++ TS constructs, I instrumented my slw monitor code. I have one process, with the python virtual machine running. Inside that process, I have the "fib" demo running, sending slw events to a monitor: which is a kernel thread (POSIX) reading the events. That thread has been instrumented and is sending HeavyWeight events to the topdemo in another process, so I am watching it! Here's what I see:
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :SLWHandleEvent  : 1.1619e+11 58.9 : 51.5 : 48.4 :          slwtopdemo.h:    9 
    :LWHandleReturnE : 3.3192e+10 16.8 : 00.0 : 99.9 :           lwtopdemo.h:  208 
    :LWShowTop       : 2.1074e+10 10.7 : 00.0 : 99.9 :           lwtopdemo.h:  108 
    :LWHandleCallEve : 1.5498e+10 07.9 : 00.0 : 99.9 :           lwtopdemo.h:  172 
    :SLWHandleEvent: : 1.1218e+10 05.7 : 00.0 : 99.9 :          slwtopdemo.h:   21 
    **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE#
    ____________________________________________________________________________
    :SLWHandleEvent  : 1.2265e+11 58.9 : 51.5 : 48.4 :          slwtopdemo.h:    9 
    :LWHandleReturnE : 3.5071e+10 16.8 : 00.0 : 99.9 :           lwtopdemo.h:  208 
    :LWShowTop       : 2.2223e+10 10.7 : 00.0 : 99.9 :           lwtopdemo.h:  108 
    :LWHandleCallEve : 1.6370e+10 07.9 : 00.0 : 99.9 :           lwtopdemo.h:  172 
    :SLWHandleEvent: : 1.1846e+10 05.7 : 00.0 : 99.9 :          slwtopdemo.h:   21 
    
    Basically, half of our time is building the event (the %PROC in SLWHandleEvent) and the other half is in either LWHandleReturnEvent or LWHandleCallEvent. You'll see that LSHandleReturnEvent does more work.

    Thursday, November 11th

    Huh. I thought I put an entry in earlier this week. Oh well. Let's summarize what I've been doing this week: Bumping my head against a wall. I _did_ get the protocol working, but it's not as good as I had hoped. Here's what I am seeing (with testshm, the fib demo).
    Super lightweight events, in threads (with top)
    34.520u 0.940s 0:39.43 89.9%    0+0k 0+0io 490pf+0w
    Lightweight Events, in threads (with top)
    26.840u 0.700s 0:28.98 95.0%    0+0k 0+0io 483pf+0w
    Chunky Events, in threads  (with top)
    38.230u 1.720s 0:40.21 99.3%    0+0k 0+0io 483pf+0w
    
    Slightly better than SLW is slightly better than Chunky, but LightweightEvents are still much cheaper. What I can work on: Making the super lightweight demo a little bit cheaper. How: Find the right choice for throughout and latency for buffer sizes. Also, can I make the protocol on the monitored program side cheaper? (I.e., a little code tuning). I did use one of my observations to my advantage: I only have to build a chunky event (under SLW protocol on monitor side) when I get a CALL event: All the return cares about is the timestamp in some sense.

    Hmmm ... while thinking about how to make the protocol faster, here's some more data for test programs. First, the text to pdf:
    Super lightweight events, in threads (with top)
    21.340u 0.050s 0:21.59 99.0%    0+0k 0+0io 510pf+0w
    Lightweight Events, in threads (with top)
    19.110u 0.080s 0:19.46 98.6%    0+0k 0+0io 503pf+0w
    Chunky Events, in threads  (with top)
    20.480u 0.280s 0:20.81 99.7%    0+0k 0+0io 502pf+0w
    NO EVENTS (just the program without profiling)
    16.440u 0.060s 0:16.77 98.3%    0+0k 0+0io 501pf+0w
    
    Here's the sieve of Erast.
    Super lightweight events, in threads (with top)
    seg fault
    Lightweight Events, in threads (with top)
    18.770u 0.030s 0:18.96 99.1%    0+0k 0+0io 486pf+0w
    Chunky Events, in threads  (with top)
    17.330u 0.060s 0:17.56 99.0%    0+0k 0+0io 485pf+0w
    NO EVENTS (just the program without profiling)
    17.060u 0.010s 0:17.11 99.7%    0+0k 0+0io 484pf+0w
    
    Argh. I have a bug in the SLW. A bigger text to pdf doc:
    
    NO EVENTS
    48.340u 0.060s 0:48.59 99.6%    0+0k 0+0io 501pf+0w
    
    Chunky Events, using threads, with top
    60.940u 0.750s 1:01.93 99.6%    0+0k 0+0io 502pf+0w
    
    Lightweight Events, using threads, with top
    55.040u 0.220s 0:55.52 99.5%    0+0k 0+0io 503pf+0w
    
    SuperLightweight Events, using threads, with top
    59.820u 0.200s 1:00.22 99.6%    0+0k 0+0io 511pf+0w
    
    

    November 10th, 2004

    I've been out of the loop working on some RRC stuff, but also moving to a machine that's all mine so I can do some testing. I found a few bugs in the protocol, and fixed them up. Lo and behold, things are working and working well! I still need to write the protocol which deletes at boundaries (this will probably work better in UNICON), but I am pleased I got the performance I expetced. The interesting thing is that "top" takes up a lot of time and actually slows down the generating process and looks slower. With just the protocol, we apprach optimal speed. With the top demo, becauise we are building events in the "top" thread. Hm, another crazy idea. Use three threads. One thread which does the protocol with super lightweight events. Another thread which reads the lightweight events and turns them into Chunky Events, and finally a thread which takes the chunky events and stores them in data structures. Edit Me