Spent some thinking about next steps as well as shepherding some paperwork through here (and talking to Clint and Steve Z about it).
(1) Can we demonstrate the ability to use Clint's Unicon visualization tools from Python? (2) Let's get inside the interpreter ... (3) Can we write a tool using the sys.profile hooks already there? This might the best thing to do to show the IRAD committee we can make progress.
We can get inside the "old" profiler, but we can't track C routines. This might be the thing to do to get a tool going. but, the newer, "better" profiler is all written in C (hotshot) and we probably need to understand that. That's a little harder.
My current proposal is that Clint and I work on (2) concurrently (so we can both understand the internals) while I work on (3) and Clint work on (1) so we can make progress across all 3 axis.
The idea for a paper is, hopefully, that maybe we can do cross-language profiling inside of Python and the C extensions. Show we can do it and maybe even have some reasonable tools.
For future monies, we can demonstrate that we can port some of Clint's Unicon tools, get some stuff going and say "look we can do it ... can we get more money to make this a real product for Python and X-Midas?"
Thoughts? Clint? Derek?
Hey, look at this paper. Maybe we don't need python hooks. http://www.python.org/pycon/dc2004/papers/1/presentation.pdf No seriously. Isn't this Ken Walker's thesis?
There are some interesting points from the talk.
(1) Converting python code to C "directly" with p2c doesn't give much speedup. It's getting rid of the dynamic dispatch. (2) Numeric is hard to use ... it would be nicer to use Lists (!) (3) Writing in C may not be the thing to do.
Now Self (an old Stanford dynamic system) used similar techniques: specializing code for speed. Psycho does the same thing. The times I have seen "How to speed up Python" invariably offer two techniques: (1) Use C or (2) Make "x" a local instead of global or namespace. This talk validates that "lookup" and dynamic lookup are bottlenecks for our programs. Which may offer us more direction: Perhaps Python Hooks should be trying to time the dynamic dispatch?
I talked to Clint last night, he agrees: it sounds pretty vaporous. Moreover, Ken Walker's work went back into Icon and was much more real and thought out. But yes, it's a good idea. Both Starkiller and Psyco use the same ideas.
Philosophy: Consider, the Linux kernel is written in C because, according to belief, it needs to be as fast as possible and we are willing to spend the extra time writing in C for that. There is also a philosophy, which we seem to be adherent to for this project, is that most of your program can be written in Python then write the few deep parts in C/C++. These two philosophies are at odds with each other slightly.
If the "write high 90%, write low 10%" philosophy were true, then the Linux kernel would be written in mostly Python. Is that a good idea? Maybe.
We also believe that tight C/C++ data structures are important for speed (thus the whole C Linux kernel).
I am not trying to go off into left field here but point out something: there's something missing in characterizing performance. Perhaps because a kernel is a different type of entity that a DSP app or database, this may not apply. But I don't think so. The whole StarKiller and Pysco python compilers point out that "code specialization is good", but how is this characterized? THAT is the question I think we need to answer. Gilbert points out that function dispatch could be a lot faster if they specialized for string lookup. How do we know this? We need tools for this.
This reminds me of the early midas 2k days when memory was a bottleneck of shared memory CPUs, but we didn't have any tools, reallly, to determine it. It became true computer science: Postulate that malloc is the bottleneck, try experiments (rewrite massive amounts of code) to determine this. This worked, but it was a mess. Why didn't our tools point this out? What tools would we have wanted to point this out? (To revisit, the same problem happened with too much locking in the initial M2k baseline. What tools would have helped us find this?)
Blue sky mode on: Wwaiting. We somehow needed to know we were waiting: for a socket, for a malloc, for a "too much locking" point. There seems to be a bunch of things we could wait for: (1) CPU: there are so many threads, a thread may just be waiting to be scheduled. The thread is waiting for CPU. Or a CPU bound job has the data but just can't process it fast enough. (2) Data: A thread is waiting for a buffer to be filled by DMA, disk, socket
There seems to be a mutual relationship here. Either I am being waited for, or I am waiting for something. A CPU bound job is waiting for CPU, the results are being waited for by a consumer. In an ideal world, there is no waiting for data, no waiting for CPU.
If we expand this idea to the StarKiller metaphor, we are waiting for Dynamic Dispatch meaning we are 'waiting' for resolution. of names/lookups. Why don't we see this in our tools? Because it is small grained, a very small amount of overhead for a hash table lookup at a single site, but amortized over the life of a program, it's huge.
So, how can we find the Starkiller overhead? We would need to be able to see all dynamic dispatch events and charge them all to the same place. We need to be able to see that we are waiting for the dynamic dispatch. So we need to classify and aggregate. But that begs a different question, how do we classify? We know, maybe from our experience, maybe from some real "computer science" that dynamic dispatch is slow. But how could we have classified this apriori?
Our profile tools tell us how much time we spend in a routine, they even tell us how long they are on the stack. We need to know how long we wait for a resource OR how long we are waited for.
(Another thought: how do we "detect" inline code? This may not be relevant in a Python program, but a "sequence of code" (like function call overhead) is spread out over an entire program, but it's all the same kind of code: In this case, moving registers in and out of memory. If we "inlined" just a certain type of code to avoid this function call overhead, we can save. But again. How?)
More blue sky: most profilers will give us::
while (i<100) {
90% a[ i ] += sin(i++);
}
2% u.lock();
5% u.recount++;
3% u.unlock()
But we need to know lock got swapped out... we actually waited for 5 minutes
until we swapped backed for an activity that was only charged 2% of time.
Maybe we need "actual time vs. CPU time"? Context switches of the machine.
How about a display like this? A box which is the total CPU time in the
routine, with red being actual CPU time. A cpu bound routine would have this
being all red, a waiter would be almost all blue (blue being the color of
actual time).
Okay, but how do you charge callees? Maybe a callee returns both the actual time and CPU time and we can view callees either way. Actually, we probably do so we can characterize what was the biggest wait point. I can imagine two graphics, a little box on the side with red/blue percentages or different sized arrows pointing in and out (the bigger the arrow the more time waiting).
That tool might have been useful for the locking or malloc problem, but what about the dispatch problem.
So, "overhead" is the time between the call and the time you actually execute the first statement of the called function. For inline functions, that would be nothing (or maybe 1 or 2 statements for copying depending on how you count it), C functions would be the time to do a function call context switch, C++ virtual functions would be C overhead plus indirection. Python calls would involve time to run the virtual machine, do the lgb lookup.
So, how about this: ideally we would like a little "progress bar" by every function with 3 colors in it: red being the amount of "actual CPU time", yellow being "call overhead", and blue being "time spent waiting": Add up all three times and you'd get the total amount of time spent in the function.
Then of course, we have to be able to ask for agreggates of data: how much time was spent TOTAL in function call overhead for this section of code, or this call, how much time was spent waiting in routines named "fred", how much CPU compute was spent in "compute".
So, I am imagining a tool like quantify, except that it's available at run-time. You can see callers, calless, the times called plus all the info we ask for for certain routines. We turn on or off as we run to see more or less info.
More thoughts: The C extensions can't (at this point) be measured with the debugger. So, there seems to be this "can't measure" portion of the process. Maybe if a routine calls a routine that can't be measured, the "can't be measured" portion should be clear (white). This metaphor could extend through the whole program: we only want to measure certain routines and don't care about profiling the performance of ancilliary routines, so even those contribute to the total overrunning time of a function their "contribution" to the performance bar would be "white" meaning, ya some time was spent there, but we don't know how much was spent in waiting,CPU or overhead. Turning off and on these is of course, important.
More thoughts (I am thinking way too much today): Clint and I had this discussion about threads and the architecture of using Alamo and Python in concert, how should they communicate? Socket? Co-routine? Shared Memory? Here's an important thought: WE WANT TO PROFILE WITHOUT TOO MUCH IMPACT ON PERFORMANCE. If an application does not use an entire 4 cpu machine, the profiler should sit in another thread/process and take "simple" events and turn them into the runtime monitor, ie., getting a "compute called at 4:45", "compute returned at 4:46" and take these events over into the concirrent monitor which takes them and "visualizes" them (sorts the events, puts them in a display), but for this to be a "good" system, it can't interfere too much... a truly concurrent task could do all the processing "orthogonally" and visualize without too much interference. Ugh. Is this going to be a problem because Python is not multithreaded? It sounds like we may have to have two python processes talk over shared memory. (Again, Python should be thread-safe/hot/re-entrant)
"demoed" the ideas discussed in this Wiki page with Derek. Wish Clint was here to bounce ideas off of, but I will send him this page. Colors were important, they showed how I was thinking. Here's a little more formal (and hopefully clearer) summary.
Things To time
A display would have a little bar next to each source code line showing how it spent its time for each function. At the top of the function would be a progress bar made up of the various colors showing the "total" time spent in the function doing each activity.
As Derek pointed out, we want a "demo at every point". Activity for Richie: using existing framework in Python, which only reports stats to file, make it so it reports at run time. Slowly add features to this: allow us to move profile events to another machine for visualization, allow us to move profile events to another process for visualization, create better and better tools.
Plan of attack
Intercept profile events, figure out how to report them at run-time. May involve crude TCL display Consider architecture for shared memory dispatcher so we can "offload" profile events to another thread or machine or process (Maybe at this point, We could talk to Alamo and Unicon) Figure out how to compute CPU time (timing unencumbered Python codes), integrate info back in Better displays
Goal, try to keep a demo available at all times after initial capability.
Get the xmpy280.tgz. untar unzip cd xmpy280 setenv XMPY_ROOT $PWD cd unx ./configure make -f Makefile_support source $XMPY_ROOT/unx/pyenv.csh
Some more thoughts; One of the most important things that came out of writing the Shm module was understanding how the Python Type Objects work. These objects are basically giant "jumping off points" for creation/initializtion/destruction routines all call. This will be a PERFECT place to add new hooks besides the standard sys.profile hooks. I know that some of Clint's UNICON tools count (for example) all string creations. If we "wedge" in the tp_init method in the PyStringType object, we can intercept all string creations and get an idea/visualizatioj about how many were called!
Wedge: An old Operating Systems term for a "linear string" of calling routines that, when finished, would call the next one. We can probably easily turn on/off wedging/instrumentation (esp. since Python has a giant global lock) by wedging routines WHILE PROGRAMS ARE RUNNING.
So, I suppose my next task is to take the Python profile and write out a TEXT log file so that UNICON can read it.
I am more convinced that moving the profiling to another thread machine or process is the right thing to do, and also why we don't see more of the profiling module. In my tests for above (I did get Python spitting out some text), the first few times I did anything, I messed up and I was profiling when I was putting things inside the strings! I.e., I was profiling the profiling .... And, it feels (based on the interface given to us) like turning off the profile hooks while we are doing profile work (for visualization) would be difficult or invasive, etc. Now, my guess is the newer profile modules in Python 2.4 are written in C (like the hotshot module from 2.3.3) and is a little better at "staying out of the way" of Python. BUT, I do think the right thing to do is to shoot information over to another process and stay OUT OF THE WAY of the the real Python appplication is running. Otherwise, we will be profiling the profiling the profiling ...
I am currently downloading the Python 2.4 alpha to see how they do profiling there.
What events do we want to instrument (in C)? Time to go back and review our original goals/requirements meeting notes.
Because we have to stringize everything, Richie is worried that the performance might be slow, but he has not demonstrated that this is actually a problem yet. The Alamo style of whole-program shared memory instead of a little window of shared memory might be cheaper since things don't have to be stringized. Comparing and contrasting the relative performance and capabilities of these two approaches might be a decent research paper.
We are currently going to use the SharedMemory module I wrote last week, but surprisingly the POSIX shm_open isn't implemented on every machine here at NMSU?? So, I am zooming over to a Solaris 9 machine .. currently I am porting OpenContainers to Solaris 9 (not much work, very easy) (remember, the SharedMemory abstraction depends on the OpenContainers).
The hope is that by the end of the day, I will be passing some Shared Memory to Unicon.
GRUMBLE! Building XMPY on Solaris 9 gave me some issues, but I think it's because we are using a "non-standard" environment (NMSU) where make and gcc are in very different places, so I had to go inside of makefile_install and "expand" the path so that it could find the cc compiler and make. Weird!!!!
XMPY built on Solaris 9. OpenContainers builds on Solaris 9.
Some good news: The shared memory stuff seems to work on Solaris (outside of Python at least).
Currently moving the SharedMemory module so it works a python C module so I can build it directly under XMPY.
GRUMBLE: Spent some time trying to get C++ to link the shared library under python 2.2 and couldn't get it to work (went in and hardcoded the SunCC compiler in unixccompiler.py and it linked with C++, but still had issues). Don't know why. I had this working with python 2.3.4. I am bailing on XMPY, I have already wasted half of today trying to build my extension module on XMPY, when I know it works under Python 2.3.4. It may be that Solaris 9 is killing me, because I don't think XMPY has been built under Solaris 9. What this means: I may have to manually install TIX and TCL and Numeric to use the wireframe.py demo, or I can make progress without that particular demo for now (which is what I am probably going to do).
GRUMBLE: Okay and fought with and rebuilt python 2.3.4 with g++ and it appears I can finally link with the shm stuff on SOlarus BUT NOW readline doesn't work !! ARGHGH! Also spent some time chasing a seg fault: perror with ":invalid arg" to shm_open: you need a '/' at the beginning of your name on Solaris. The good news is that now I have a python that works with the SharedMemory module and Clint has a version of UNICON that works with it on the same machine. We should be able to talk soon!
Tested and got a Python talking to Python on the same machine (but not with XMPY). Basically, I had to make sure that I used gcc and g++ (and configure --with-cxx=g++) to make sure it worked. Anyway, it seems to. [We did find a bug or two; on Solaris with POSIX shared memory, you have to have a '/' at the front and if you don't, the shm_open fails ... and I don't catch the error very gracefully. I need to fix that.] Talked to Derek last night about the Grad RA ship that we are funding, we would prefer an American citizen only so we can potentially hire him. Clint sent out an announcement to csgrads.
After last night's success building, I tried XMPY again. Nope. The configure script for XMPY doesn't seem to use the --with-cxx=g++ flag. It automatically uses the default CC and cc for Solaris.
Permissions issue: on Solaris, permissions are a bit sticky on the shared memory file, which lives in /tmp/SHMD* rather than /dev/shm. Setting the umask, and manually removing an older, permission-protected copy, were necessary to get writer/reader talking from separate userids.
We couldn't get UNICON to Python to talk at first (we think there's a bug), so I backed off and tried having the python profiler talk to a python program. No problems there. This points to *I think* a build issue of how Python builds things and how UNICON builds things.
[BTW, prstat is the perfmeter/top on Solaris 9]
Okay, a few things: Moved to Clint's house and running on SUSE Linux.
Got Python 2.3.4 built and then SHM extension module built. Works.
Clint can do a UNICON to UNICON shared memory that works.
We have found a few places where he was treating the opaque impl
as a FILE* so that might explain why we had so many issues under Solaris.
Some things to fix:
All perrors need to throw an exception (some of the permission error ones
don't). Unresponsive to CTRL-C: Need to make sure a CTRL-C will
disconnect something some shared memory ... also need to add a piece
of shared memory as a parameter so that we can tell when CTRL-C is pressed.
Clint is currently tracking down the errors.
So, another one of my goals next week is to figure out how to send events from Python: Intercepting the TypeObject stuff.
Clint and I need to "bargain" and figure out how to send him the events.
/* Macro, trading safety for speed */ #define PyString_AS_STRING(op) (((PyStringObject *)(op))->ob_sval) #define PyString_GET_SIZE(op) (((PyStringObject *)(op))->ob_size)
time python testshm.py done 10 done 20 2.522u 0.775s 0:32.14 10.2% 0+0k 0+0io 0pf+0w
[jeffery@localhost python_code]$ time python testshm.py done 10 done 20 6.987u 1.289s 0:39.05 21.1% 0+0k 0+0io 0pf+0wResult: Slower by about 20%, and user time was twice as much in the Python hook. On a multiprocesser, when things could proceed indepently, we could probably (pure speculation) get the C version going twice as fast?
time python testshm.py done 10 done 20 1.736u 0.967s 0:29.99 8.9% 0+0k 0+0io 0pf+0w [jeffery@localhost cprof]$about 1/2 the user time, about 5% faster: Obviously, X is the bottleneck! Again, on a multiprocessor, when X, Icon, and Python can proceed independently, I think we'll see a major speedup.
time python testshm.py done 10 done 20 done 30 3.094u 0.040s 0:03.28 95.4% 0+0k 0+0io 0pf+0wNow, we'll JUST turn on the C profiling events: Nothing happens except an empty stub gets called.
time python testshm.py done 10 done 20 done 30 3.579u 0.031s 0:03.75 96.0% 0+0k 0+0io 0pf+0wNow, let's build the event to send, but not actually send it (drop it on the floor)
time python testshm.py done 10 done 20 done 30 3.860u 0.029s 0:04.03 96.2% 0+0k 0+0io 0pf+0wNow, let's send the event, but the other side is ONLY reading the event as fast aspossible and dropping it on the floor.
time python testshm.py done 10 done 20 done 30 29.316u 34.531s 2:26.68 43.5% 0+0k 0+0io 0pf+0wStrip the event to 4 bytes and other wise is reading and dropping on the floor.
time python testshm.py done 10 done 20 done 30 29.849u 34.733s 2:24.28 44.7% 0+0k 0+0io 0pf+0w
/* Object and type object interface */ /* Objects are structures allocated on the heap. Special rules apply to the use of objects to ensure they are properly garbage-collected. Objects are never allocated statically or on the stack; they must be accessed through special macros and functions only. (Type objects are exceptions to the first rule; the standard types are represented by statically initialized type objects, although work on type/class unification for Python 2.2 made it possible to have heap-allocated type objects too). An object has a 'reference count' that is increased or decreased when a pointer to the object is copied or deleted; when the reference count reaches zero there are no references to the object left and it can be removed from the heap. An object has a 'type' that determines what it represents and what kind of data it contains. An object's type is fixed when it is created. Types themselves are represented as objects; an object contains a pointer to the corresponding type object. The type itself has a type pointer pointing to the object representing the type 'type', which contains a pointer to itself!). Objects do not float around in memory; once allocated an object keeps the same size and address. Objects that must hold variable-size data can contain pointers to variable-size parts of the object. Not all objects of the same type have the same size; but the size cannot change after allocation. (These restrictions are made so a reference to an object can be simply a pointer -- moving an object would require updating all the pointers, and changing an object's size would require moving it if there was another object right next to it.) Objects are always accessed through pointers of the type 'PyObject *'. The type 'PyObject' is a structure that only contains the reference count and the type pointer. The actual memory allocated for an object contains other data that can only be accessed after casting the pointer to a pointer to a longer structure type. This longer type must start with the reference count and type fields; the macro PyObject_HEAD should be used for this (to accommodate for future changes). The implementation of a particular object type can cast the object pointer to the proper type and back. A standard interface exists for objects that contain an array of items whose size is determined when the object is allocated. */Why is this important? (1) One question Clint and I had: Do objects in Python move around? The answer is NO (according to the comment anyway). Why is this important to us? When reporting an event from Python to Alamo, we normally have to send "some info", esp. to distinguish lists from each other. Well, we can just use the pointer as a handle to the list so we know which list we are operating on without much work. (2) Another reason this is important: More discussion of TypeObjects. Any and all discussion of those things is good.
static PySequenceMethods list_as_sequence = { (inquiry)list_length, /* sq_length */ (binaryfunc)list_concat, /* sq_concat */ (intargfunc)list_repeat, /* sq_repeat */ (intargfunc)list_item, /* sq_item */ (intintargfunc)list_slice, /* sq_slice */ (intobjargproc)list_ass_item, /* sq_ass_item */ (intintobjargproc)list_ass_slice, /* sq_ass_slice */ (objobjproc)list_contains, /* sq_contains */ (binaryfunc)list_inplace_concat, /* sq_inplace_concat */ (intargfunc)list_inplace_repeat, /* sq_inplace_repeat */ };Notice that list_item and list_ass_item are "helper" routines that DO NOT call PyList_GetItem/PyList_SetItem. The point here, accesses can happen ALL OVER THE place.
PyList_New: A "global helper" routine available in the listobject.h. Since it is exposed, I guess anyone can create one. If you start looking through listobject.c, you'll notice it getting called from multiple points for the list helper routines. It seems to be an "easy" way for the VM to create a list without jumping through the TypeObject . Jump though Type Object: A sequence for creating/destroying a list would be PyList_Type : tp_new (PyType_GenericNew): which only calls type->tp_alloc tp_alloc (PyType_GenericAlloc) : allocates and zeros memory tp_init (list_init) : Dizzying array of calls to list_fill and list_ass_slice. There is a macro which resizes lists NRESIZE as well as other random assortment. (List is now created ... some later time is GCed) tp_dealloc(list_dealloc) : PyMemFREE on ob_item also call op->ob_type->tp_free on object tp_free (PyObject_GC_Del) : (over in Modules/gcmodule): calls PyObject_FREE on PyObjectOkay, this brings up a similar question from the investigation the other day. Clint calls this the difference between "source level" events and VM event: The PyList_Type sequence is called probably when a piece of Python code creates List. The other routines, which create sublists, etc., seem to be more "internal helper" routines for creating lists WHICH ARE STILL LISTS!
aidl380-01 % time python testshm.py done 10 done 20 done 30 73.220u 12.410s 1:25.77 99.8% 0+0k 0+0io 487pf+0wThe recompile in the middle moves us back to -O3 so we can compare like results. Hm. I am surprised: We aren't getting as much speedup as I hoped from building the event directly to the buffer.
2_CPU machine, one process generating fibonacci events to another process using C Profile hooks. Same scenario, but using C profile hook. 24.9% python, 24.9% lreader.aidl380-01 % time python testshm.py done 10 done 20 done 30 9.290u 15.430s 0:24.88 99.3% 0+0k 0+0io 492pf+0wSo, the C profile hook seems 3 times faster (wall clock time), but the amount of user CPU is almost 10x difference!
Let's come from the other direction again: Time JUST the fib sequence, no profiling at all.aidl380-01 % time python testshm.py done 10 done 20 done 30 1.460u 0.010s 0:01.49 98.6% 0+0k 0+0io 454pf+0wTurn on C profiling, but ONLY the subroutines, don't actually do anythingaidl380-01 % time python testshm.py done 10 done 20 done 30 1.610u 0.000s 0:01.62 99.3% 0+0k 0+0io 489pf+0wAdds very little to the time, even WITH all the events (fib(30) is a LOT of events).
Now just build the event, but dont' send it over shared memory.aidl380-01 % !ti time python testshm.py done 10 done 20 done 30 1.920u 0.000s 0:02.19 87.6% 0+0k 0+0io 489pf+0wA little more time, but not too bad.
Now, have a NULL reader on the other end, just reading and throwing away the bytes. Okay: Here's part of the problem, both should be taking up 10o% of a CPU and they are NOT! Both only take up 24% each.aidl380-01 % time python testshm.py done 10 done 20 done 30 9.180u 18.790s 0:28.31 98.7% 0+0k 0+0io 492pf+0wSo, 10x slower ... which is better than the 40x slower we saw, but we should be at worst 2x slower. Sigh. Yes, synchronization is KILLING US!!
After more thought, the shm as it stands was written for LARGE SIZE transfers, not teeny transfers. That may be part of the problem. It needs to have some special cases for small size transfers?
ARGH!!!! Sometimes I hate "TOP"!!!! The old versions used to report processes when they were >100%, but the new version by default WON"T!!!!!! ARGH!!!! So, some new % to report (WHICH MAY INVALIDATE THE % in the PREVIOUS TESTS BACK AT RRC!!!!!!! Note that the timings are still good). Running on a machine where python reports to an empty null reader IS 25% each, but on a 4_CPU machine, that IS a full CPU each... an a LOT of time is being used by the SYSTEM (context switching), almost 100%!!!! Here's the actual snapshot:CPU states: cpu user nice system irq softirq iowait idle total 19.5% 0.0% 30.6% 0.0% 0.0% 0.1% 49.7% cpu00 10.0% 0.0% 16.6% 0.0% 0.0% 0.0% 73.4% cpu01 11.6% 0.0% 21.2% 0.0% 0.0% 0.4% 66.8% cpu02 30.4% 0.0% 50.2% 0.0% 0.0% 0.0% 19.4% cpu03 26.2% 0.0% 34.4% 0.0% 0.0% 0.0% 39.4% Mem: 3094804k av, 977048k used, 2117756k free, 0k shrd, 43340k buff 499652k active, 260908k inactive Swap: 3145656k av, 0k used, 3145656k free 611604k cached PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND 20980 rts 25 0 844 844 728 R 24.8 0.0 0:25 3 lreader 20981 rts 25 0 2772 2772 1672 R 24.7 0.0 0:26 2 python 2330 wal 15 0 45452 44M 2508 S 0.1 1.4 52:36 1 Xvnc 2421 wal 15 0 13916 13M 10932 S 0.1 0.4 0:12 3 kdeinit 1 root 16 0 512 512 452 S 0.0 0.0 0:05 2 init 2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0 3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1 4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2 5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 migration/3 6 root 15 0 0 0 0 SW 0.0 0.0 0:00 2 keventd 7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0So, this IS actually (partly) what we want: Two CPUS being totally maxxed which the twp processes talk. Of course the problem is that an entire CPU of time is being used for context switching back and forth. Ugh. It would be great if we could do coroutines like we did in ICON (which Clint found to be 10x faster than his POSIX thread version of co-routines). Part of the issue (discussed earlier) is that it wakes up, checks then imeediately goes BACK TO THE OTHER PROCESS!
Anyway, that's why this is research, eh?
Talked to Clint on the phone: We had to "unicon alamo.icn" to make sure it worked: Apparently, this wasn't rebuilt as part of the makefile? Anyway, go the radial thing working. Great! Another demo!
Also got some news: Micheal found some more instrumentation points. ! GREAT!
Wednesday, September 8th, 2004
Today: Going to try to get the speed up up the cprofile.
Demoed to Bob Arendt: he thought the visualizations are pretty but need more info. In particular, what he wants is "something like top" where the runtimes are displayed while the program is running. This is the SECOND time I have heard this in two days! Derek said the exact same thing the other day. Perhaps THAT is exactly what we need to be writing the Python people about? (As I say this, I am realizing I need to spend some more time looking at the hotshot c cprofiler ...)
Thursday, September 9th, 2004
Thought more and more about how to fix up the shared pipe and came up with a double buffering idea: Don't synchronize except on full buffers (i.e., end a whole buffer over). Ideally, we'll be able to tune "automatically" so that the double buffer can be triple/n buffered for the reader or writer, depending on who is the faster/slower one. Have something going right now, but of course it is seg faulting.
Friday, September 10th, 2004
This morning, got the "reader/writer" demo working: Currently we have the same interface as the Piper so I should be able to just plug it in "as is" to the profile stuff I did before. I can actually probably make it faster by writing directly to the buffer (rather than building the event THEN copying it to the buffer), but let's get it working first then tune it some more.
Allight, we got it working with CProfile hooks and a simple double buffering scheme that buffers the events and does minimal synchronization. Top shows: 24.9% time in python (1 full CPU), 10.8% time in lreader (the NULL reader) and maybe 2% time in system time! We are much much much faster than before!aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 28.440u 0.360s 0:29.03 99.2% 0+0k 0+0io 481pf+0w ( ~/hooks/ULMA2/Hooks/cprof )Compare this to NO PROFILING HOOKS AT ALL!aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 17.640u 0.030s 0:17.94 98.4% 0+0k 0+0io 441pf+0wSo we're about twice as slow with hook. (odd, it's about a full CPU plus another 25% of a CPU? That might just be top).
Let's time the cost (again) of the event instrumentation without ANY WORK being done (just immediate return).aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 18.670u 0.010s 0:18.87 98.9% 0+0k 0+0io 477pf+0w
Now let's time it with building the event, but NOT putting it to shared memory.aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 23.640u 0.030s 0:23.90 99.0% 0+0k 0+0io 477pf+0w
revisit the double buffering solution.aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 28.110u 0.540s 0:28.71 99.7% 0+0k 0+0io 481pf+0w ( ~/hooks/ULMA2/Hooks/cprof )
Gave demos of nova to Derek and and Stephan. Afterwards, Derek and I did a lot of work and had some new ideas.(1) Add a "tics" number as a timestanp to the event: Rather than have the profiled profile set off timers/turn them off, send over a number (quick to read of current time) and have the monitor use ITS CPU to compute times with it (2) Have a single step mode (3) Dial for frequency of reporter and monitor (4) Instrument dictionaries (5) "Make TOP" for Python Routine usage (6) Go through the interpreter for cleaning up?So, Derek and I demoed wireframe using the "nova" demo as well as the "buildtree" demo. Very cool! We had to build tk and tics and such (it was good to see how to do that outside of XMPY) and then just used the tools and they worked! Derek was disappointed that the hot connect/disconnect doesn't work yet. That's still to do: The original idea was "let's get it going" first, then add features. Right now, we have a couple of things we can demo. But of course, we want better demos and more robust demos.Monday, September 13th, 2004
For part of today, I want to build the event directly to the shared buffer. This way, we "should" be able to get event recording cost down to the level BEFORE we were event putting it to shared memory.
Here's with building the event directly in memory (without an intervening buffer and extra memcpy). Difference: -O5aidl380-01 % !tim time python testshm.py done 10 done 20 done 30 done 35 25.960u 0.450s 0:26.58 99.3% 0+0k 0+0io 474pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 26.280u 0.410s 0:26.90 99.2% 0+0k 0+0io 474pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % make -f Makefile.linux clean /bin/rm -f *.o *~ libcprof.so *.so ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % make -f Makefile.linux g++ -g -O3 -DLINUX_ -DOC_BYTES_IN_POINTER=4 -DOC_NEW_STYLE_INCLUDES -Wno-deprecated -I../opencontainers_1_3_4/include -I../Python-2.3.4/Include -I../Python-2.3.4 -I../sharedmem -c cprof.cc g++ -shared cprof.o -L../sharedmem -lrt -lshm -o cprof.so ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 27.430u 0.440s 0:28.01 99.5% 0+0k 0+0io 474pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 26.290u 0.330s 0:26.63 99.9% 0+0k 0+0io 474pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 %
29.010u 0.520s 0:29.71 99.3% : CReport Func, Build Event lookaside, DB put 26.290u 0.330s 0:26.63 99.9% : CReport Func, Build Event Inplace, DB put 23.640u 0.030s 0:23.90 99.0% : CReport Func, Build Event lookaside, NO PUT 18.670u 0.010s 0:18.87 98.9% : CReport Func, nothing else 17.640u 0.030s 0:17.94 98.4% : NO INSTRUMENTATIONLet's try one more test, let's "pretend" we can just pass the pointer to the stack frame of interest so that building the event is "trival" (8 bytes of something). Of course, this would require mapping the Python heap into our process so we could see it ...
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 24.080u 0.150s 0:24.42 99.2% 0+0k 0+0io 474pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 23.630u 0.070s 0:23.88 99.2% 0+0k 0+0io 474pf+0wSuprisingly, not much better. We could be getting down to where we have to start tweaking the call site. I suppose the real question is, can we get the same time as "CReport Func, nothing else?" Right now, we are still off that mark. So, we move ALL code to the function that's gets called, pass 8 bytes (2 bytes length, 2 bytes of "code" of event and 4 bytes of "frame": The pointer to the frame).
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 21.560u 0.190s 0:21.99 98.9% 0+0k 0+0io 474pf+0wThat seems to be the best we can do. What if we make this module -O 5?
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 21.460u 0.150s 0:21.62 99.9% 0+0k 0+0io 474pf+0wLittle bit of a difference? Seems to be in the noise.
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 21.860u 0.090s 0:21.95 100.0% 0+0k 0+0io 474pf+0wNo diff. We are down in the noise. We can't get much faster and still do work.
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 35.720u 22.040s 0:57.89 99.7% 0+0k 0+0io 475pf+0wWe are 3x off of our performance JUST by the call to gettimeofday. Why? See below ...
3:54:38 up 7 days, 5:20, 9 users, load average: 0.25, 0.18, 0.07
94 processes: 92 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 19.0% 0.0% 9.9% 0.0% 0.0% 0.1% 70.9%
cpu00 60.6% 0.0% 39.4% 0.0% 0.0% 0.0% 0.0%
cpu01 6.0% 0.0% 0.0% 0.0% 0.0% 0.2% 93.8%
cpu02 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 100.0%
cpu03 9.4% 0.0% 0.2% 0.0% 0.0% 0.4% 90.0%
Mem: 3094804k av, 2040724k used, 1054080k free, 0k shrd, 165572k buff
695088k active, 703740k inactive
Swap: 3145656k av, 0k used, 3145656k free 1401556k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
17010 rts 25 0 2552 2552 1612 R 24.9 0.0 0:09 0 python
17009 rts 20 0 888 888 764 S 3.8 0.0 0:01 3 lreader
1 root 15 0 512 512 452 S 0.0 0.0 0:08 1 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1
4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2
5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 migration/3
6 root 15 0 0 0 0 SW 0.0 0.0 0:00 3 keventd
7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
8 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2
"Top" tells the story: We are spending 39.4% time (+ another 9.9%?) inside
of the system. No no no! That's WAY too expensive. We are context
switching into the kernel to get the times.
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 35.530u 14.660s 0:50.27 99.8% 0+0k 0+0io 476pf+0w"Top" reports (while it is running)
4:18:18 up 7 days, 5:44, 9 users, load average: 0.31, 0.08, 0.06
92 processes: 90 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 22.7% 0.0% 7.5% 0.0% 0.0% 0.0% 69.6%
cpu00 0.2% 0.0% 0.0% 0.0% 0.0% 0.0% 99.8%
cpu01 5.2% 0.0% 0.0% 0.0% 0.0% 0.2% 94.6%
cpu02 69.8% 0.0% 30.2% 0.0% 0.0% 0.0% 0.0%
cpu03 15.8% 0.0% 0.0% 0.0% 0.0% 0.0% 84.2%
Mem: 3094804k av, 2043528k used, 1051276k free, 0k shrd, 165572k buff
694972k active, 703748k inactive
Swap: 3145656k av, 0k used, 3145656k free 1401600k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
17222 rts 25 0 2560 2560 1616 R 24.9 0.0 0:09 2 python
17221 rts 21 0 888 888 764 S 5.2 0.0 0:02 1 lreader
1 root 15 0 512 512 452 S 0.0 0.0 0:08 1 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1
4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2
5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 migration/3
6 root 15 0 0 0 0 SW 0.0 0.0 0:00 3 keventd
7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
8 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2
So "clock()" has a little bit better performance, but is still
costing us: It appears (again) that we are going through the kernel
to get our values. Can't we just read a memory location and get the
value?
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 23.720u 0.140s 0:24.01 99.3% 0+0k 0+0io 475pf+0w(Notice that we don't spend much time in the kernel, so that our system time is very small: This means we are not going back and forth between the kernel!) However, I need to "look" at the numbers returned by clock_gettime and make sure they look good.
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 34.340u 21.260s 0:55.77 99.6% 0+0k 0+0io 476pf+0wSigh. Same problem with gettimeofday: Too much time in the system.
/* * This little x86 app prints out the # of tight loops performed per sec, * averaged over 10 seconds. */ #includeOf course, the most important piece of code is the rdtscll: it reads a monotonic clock to (a single asm instruction! yes!) and it seems to work for us. The relevant piece of code from this example is:#include #include #define rdtscll(val) \ __asm__ __volatile__ ("rdtsc;" : "=A" (val)) #define SECS 10ULL int main (int argc, char **argv) { unsigned long long start, now, mhz, limit; unsigned int count; if (argc != 2) { printf("usage: loop_print \n"); exit(-1); } mhz = atol(argv[1]); limit = mhz * SECS; repeat: rdtscll(start); count = 0; /* * Just burn cycles in a tight, cached loop and measure absolute * cycles elapsed, vs. loops performed: */ for (;;) { count++; rdtscll(now); if (now - start > limit) break; } printf("userspace speed: %d loops.\n", count/SECS); fflush(stdout); goto repeat; }
#define rdtscll(val) \
__asm__ __volatile__ ("rdtsc;" : "=A" (val))
Excellent! A quick trial shows this working. So, now the scary
part: How well does this work with our cprof module? How much will
it cost us? In this example, all we do is add the instruction
and call it, we don't yet send the info over the Piper.
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 22.880u 0.170s 0:23.28 99.0% 0+0k 0+0io 474pf+0wNotice: no system time (relatively speaking) being used. Just to be sure, here's a capture of top:
10:00:02 up 8 days, 1:26, 7 users, load average: 0.63, 0.20, 0.07
105 processes: 102 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 37.5% 0.0% 2.1% 0.0% 0.0% 0.3% 60.0%
cpu00 37.6% 0.0% 1.0% 0.0% 0.0% 0.0% 61.4%
cpu01 44.0% 0.0% 0.6% 0.0% 0.0% 0.8% 54.6%
cpu02 36.4% 0.0% 6.4% 0.0% 0.0% 0.0% 57.2%
cpu03 32.2% 0.0% 0.4% 0.0% 0.0% 0.4% 67.0%
Mem: 3094804k av, 2087324k used, 1007480k free, 0k shrd, 166144k buff
739664k active, 702448k inactive
Swap: 3145656k av, 0k used, 3145656k free 1404148k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
20729 rts 25 0 2548 2548 1608 R 24.7 0.0 0:08 2 python
20728 rts 24 0 888 888 764 R 9.7 0.0 0:03 3 lreader
19666 wal 16 0 5548 5548 4220 S 1.9 0.1 0:11 3 artsd
19604 wal 15 0 32696 31M 2452 S 0.9 1.0 6:10 1 Xvnc
14496 rts 15 0 8708 8708 4128 S 0.2 0.2 1:08 0 emacs
19678 wal 15 0 13436 13M 11396 S 0.1 0.4 0:02 0 kdeinit
19680 wal 15 0 10644 10M 9204 S 0.1 0.3 0:05 0 kdeinit
1 root 15 0 512 512 452 S 0.0 0.0 0:09 3 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1
4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2
Notice that the system time is "negligible", and the CPU times add up to
about 1.3 CPUs, which is 1 cpu of python sending events and .3 cpu of lreader
just grabbing them off the queue. Kudos!
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 20.970u 0.090s 0:21.12 99.7% 0+0k 0+0io 474pf+0wEvery 10 times? (I.e., we only report an event every 10 times)
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 21.170u 0.020s 0:21.36 99.2% 0+0k 0+0io 474pf+0wEvery 100 times?
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 18.990u 0.020s 0:19.18 99.1% 0+0k 0+0io 474pf+0wSo, we can "dial down the events", but it doesn't hurt us that much to see every event.
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 28.730u 0.440s 0:29.29 99.5% 0+0k 0+0io 474pf+0wCompare that to just sending the pointer (with the timestanp) directly from the call site:
aidl380-01 % !tim time python testshm.py done 10 done 20 done 30 done 35 23.450u 0.210s 0:23.85 99.2% 0+0k 0+0io 474pf+0w23.85 seconds vs. 29.29 seconds: Almost 5 seconds! And that's for very short function names. I am thinking that once we get into "real function names" like in "wireframe.py", we'll see much more significant names (80 characters with full paths, etc.) which will really slow us down further. Originally, I didn't think it would be that big of a deal, but 5 seconds is a long time in a "ultra lightweight monitoring architecture".
39 39 41 41 77 81 76 73 73 76 76 76 84 74 74 78 78 72 72 72 72 72 72 72 72 29 29 74 74 74 74 74 74 74 74 84 84 82 81 81 82 84 81 70 74 74 70 77 40 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 79 79 81 82 82 81 81 81 84 82 83 82 82 81 82 82 81 83 82 84 84 82 83 82 82 81 82 82 82 81 81 82 81 83 82 84 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 76 76 81 81 81 81 82 82 79 79 81 82 82 82 81 81 82 81 80 80 81 81 87 81 82 82 81 87 79 79 81 81 82 82 79 79 81 82 82 82 81 81 82 81 80 80 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 82 81 81 82 81 80 80 81 81 87 81 82 82 81 87 81 81 76 76 81 81 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 80 80 81 81 79 79 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 80 80 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 80 80 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 76 76 81 81 81 81 82 82 79 79 81 82 82 81 80 80 81 81 77 78 82 81 81 82 78 77 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 76 76 81 81 81 81 82 82 79 79 81 82 82 81 80 80 81 81 77 78 82 81 81 82 78 77 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 76 76 81 81 81 81 82 82 79 79 81 82 82 82 81 81 82 81 80 80 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 82 81 81 82 81 80 80 81 81 87 81 82 82 81 87 79 79 81 81 82 82 79 79 81 82 82 82 81 81 82 81 80 80 81 81 87 81 82 82 81 87 40 40 40 81 81 43 84 81 81 79 79 81 82 82 81 81 78 78 84 84 84 84 81 81 81 81 84 82 81 81 82 84 81 81 81 84 84 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 77 78 82 81 81 82 78 77 81 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 77 78 82 81 81 82 78 77 81 82 82 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 80 80 81 81 87 81 82 82 81 87 81 81 81 81 77 78 82 81 81 82 78 77 77 78 82 81 81 82 78 77 82 83 82 82 81 82 82 81 80 80 83 82 81 81 79 79 81 82 82 81 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 81 80 80 81 81 87 81 82 82 81 87 82 83 82 82 81 82 82 81 80 80 83 82 81 81 81 81 77 78 82 81 81 82 78 77 77 78 82 81 81 82 78 77 81 81 82 82 79 79 81 82 82 82 81 81 82 81 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 82 81 81 82 81 81 81 87 81 82 82 81 87 81 81 82 82 79 79 81 82 82 82 81 81 82 81 81 81 87 81 82 82 81 87 84 82 81 81 82 84 77 78 82 81 81 82 78 77 77 78 82 81 81 82 78 77 84 78 78 78 73 70 69 69 70 73 86 73 73 72 72 76 76 79 79 72 72 79 79 86 89 79 79 72 72 79 79 89 90 86 76 76 78 78 73 73 73 73 86 86 73 73 79 79 72 72 79 79 86 72 72 73 70 69 69 70 73 73 76 76 76 76 77 75 68 67 69 69 67 68 68 67 69 69 67 68 68 67 69 69 67 68 68 67 69 69 67 68 75 69 70 78 78 77 78 78 73 76 74 74 76 76 76 78 74 76 76 71 74 74 71 74 74 71 74 74 71 75 75 75 75 80 76 76 80 75 75 75 75 79 79 79 79 73 73 79 79 71 74 74 71 75 75 74 74 71 74 74 71 74 74 71 74 74 71 74 74 71 74 74 71 74 74 71 74 74 71 74 74 71 74 74 71 74 73 73 78 71 74 74 71 73 75 83 76 76 76 76 83 78 77 76 76 79 79 77 78 79 79 79 79 78 78 75 77 70 69 77 73 72 72 73 70 69 69 70 73 73 76 76 76 76 77 77 73 72 72 73 70 69 69 70 73 73 76 76 76 76 77 77 73 72 72 73 70 69 69 70 73 73 76 76 76 76 77 77 73 72 72 73 70 69 69 70 73 73 76 76 76 76 77 77 73 72 72 73 70 69 69 70 73 73 76 76 76 76 77 77 73 72 72 73 70 69 69 70 73 73 76 76 76 76 77 77 73 79 79 79 79 79 79 79 79 89 76 76 78 78 89 73 73 89 79 79 72 72 79 79 89 90 82 82 81 81 81 81 84 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 85 78 78 85 85 85 85 85 84 94 76 76 89 79 79 72 72 79 79 89 94 81 81 86 89 76 76 78 78 89 80 77 77 80 86 81 81 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 78 29 29 41 91 91 41 36 79 79 77 77 69 69 77 77 69 69 43 43 87 80 81 82 82 81 80So, lots of bigger numbers, like 70-80 characters. For "testshm.py", we see numbers like 33 or so. Let's, just for grins, make every event 80 characters and see what the slowdown is (when we build the event manually by copying into the the buffer instead of just sending a pointer).
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 30.440u 1.140s 0:31.60 99.9% 0+0k 0+0io 474pf+0wSo, we lose about 2-2.3 seconds (from 29.29 to 31.6) for copying larger filenames over. Curses! I want the 23 seconds number. The only way to do that is to map the python heap into our space.
(1) Original algae demo (first thing we got working) (2) List monitor demo (nova) (3) Better algae demo (with proc call clicks?) (4) Some other hooks (Strings, memory?) (5) topI also ran my idea of the paper my Peter: he thinks it might be good, it just needs some research (ACM digital library, here I come!). I need to talk to Mary about that.
aidl380-01 % time python pyText2Pdf.py wiki3.txt Using font Courier size = 10 Input file => wiki3.txt Writing pdf file wiki3.txt.pdf ... Wrote file wiki3.txt.pdf 17.370u 0.030s 0:17.63 98.6% 0+0k 0+0io 459pf+0w ( ~/hooks/ULMA2/Hooks/cprof/samples ) aidl380-01 %Then, WITH
aidl380-01 % time python pyText2Pdf.py wiki3.txt Using font Courier size = 10 Input file => wiki3.txt Writing pdf file wiki3.txt.pdf ... Wrote file wiki3.txt.pdf 21.910u 0.150s 0:22.15 99.5% 0+0k 0+0io 496pf+0w ( ~/hooks/ULMA2/Hooks/cprof/samples )So, it does add overhead. (This example was chosen because it has a little more I/O but it is still CPU intensive). For grins and giggles, here's the last topdemo output:
**FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :writestr : 3.6311e+10 98.1 : 00.0 : 99.9 : pyText2Pdf.py: 316 :StartPage : 5.1111e+08 01.4 : 69.7 : 30.2 : pyText2Pdf.py: 430 :EndPage : 1.6711e+08 00.5 : 70.3 : 29.6 : pyText2Pdf.py: 481 :? : 3.3821e+06 00.0 : 00.0 : 99.9 : b/python2.3/string.py: 20 :WriteHeader : 9.7501e+05 00.0 : 73.5 : 26.4 : pyText2Pdf.py: 374 :parseArgs : 2.6778e+05 00.0 : 19.2 : 80.7 : pyText2Pdf.py: 282 :? : 9.2917e+04 00.0 : 10.2 : 89.7 : b/python2.3/getopt.py: 16 :__init__ : 4.4529e+04 00.0 : 00.0 : 99.9 : pyText2Pdf.py: 152 :getopt : 3.8185e+04 00.0 : 00.0 : 99.9 : b/python2.3/getopt.py: 52 :pyText2Pdf : 2.8017e+04 00.0 : 00.0 : 99.9 : pyText2Pdf.py: 150 :argsCallBack : 1.3365e+04 00.0 : 00.0 : 99.9 : pyText2Pdf.py: 197 :GetoptError : 9.4650e+03 00.0 : 00.0 : 99.9 : b/python2.3/getopt.py: 39Here's two runs of the Sieve, the first WITHOUT profiling information, the second WITH.
aidl380-01 % time python testera.py 1299709 2750159 4256233 5800079 7368787 8960453 10570841 12195257 13834103 15485863 17144489 18815231 15.450u 0.010s 0:16.02 96.5% 0+0k 0+0io 436pf+0w ( ~/hooks/ULMA2/Hooks/cprof/samples ) aidl380-01 % time python testera.py 1299709 2750159 4256233 5800079 7368787 8960453 10570841 12195257 13834103 15485863 17144489 18815231 17.150u 0.110s 0:17.37 99.3% 0+0k 0+0io 478pf+0w ( ~/hooks/ULMA2/Hooks/cprof/samples ) aidl380-01 %The good news here is that sieve, a very cpu intensive program, doesn't take much of a hit. That's probably because the number of procedure call events is small: I think most of the work is done in the C routines, which the current version of Python does not generate events for.
I shaved off MAYBE another second. So I started thinking... "I wish I
had ULMA for this
decrypt." And then I realized, I can!
I am working on something simple that users can compile into their
programs:
main()
{
while (...) {
}
}
You might instrument with:
#include "ts.h"
EventWriter MyEventWriter("hello", 'W')
#define PROFILER_WRITER MyEventWriter // undef to turn off profiling
main()
{ TS("main");
while (...) { TS("first while");
}
}
Where TS becomes ... -> TimeStanpScope ps("function name", __FILE__,
__LINE__)
Which generates a "call event" at TimeStampScope construction and a
"return
event" and TimeStampcope destruction. This way you can keep
instrumenting and
taking out profiling. Eh?
With this, we can use the ULMA tools in C++ too!
(And in C: There's a C routine called "generateULMAEvent" which is
called for C++ constr/destr).
Gooday,
Richie
P.S> Curse you! I can't get isopatterns out of my head! ;)
Table of Contents Introduction Background Issues with Performance Monitoring with Intel® SpeedStep® Conclusion Additional Resources Sources Executable Dynamic Link Library Driver Page Options Printer friendly version Issues with Performance Monitoring with Intel® SpeedStep® As of this writing, the Single Processor HAL (Hardware Abstraction Layer) of Windows 2000*/XP uses the same 1.193180 megahertz counter for its "QueryPerformanceCounter" API as all previous versions of Windows. However, the Multiprocessor HAL of Windows 2000 and XP use the system's own clock rate through the RDTSC processor instruction. This causes problems for software where performance monitoring is underway and Intel Speed Stepping is enabled as the values read from the RDTSC instruction are varying with respect to frequency and hence the timing values read are typically skewed. By changing the power usage on a system running Intel Speed Stepping, such that Speed Stepping is effectively pushed to where no frequency changes occur, the RDTSC instruction becomes reliable from one read to the next and is hence useful for the purposes of code profiling. This can be done simply enough by indirect calls to the Processor Power Scheme API calls ReadProcessorPwrScheme and WriteProcessorPwrScheme. However, the Windows XP operating system does not allow for a method of being interrupted when the Power scheme changes. The purpose of this mechanism is to be provided notification in the event some process, or user intervention changes the power scheme to other than the non-frequency changing state set state while the profiling session is underway. The method described here utilizes a power-managed driver to detect these subtleties and notify an event handler in user mode process space. In doing this, the user mode process can defer any power scheme changes to after the profiling session is completed thus retaining consistency in the profiling metrics. Performance Event Interception The method involves first registering a power managed event callback (DeviceDispatchPower) in a driver running in Kernel Mode. The Driver source for this routine is listed in section 2.2.2.3. This routine responds to all Power Managed events in the operating system. In a standard Power-managed scenario this routine is provided to the driver so that it can change the power states on the hardware that it is controlling. Since the Driver is filtering Power events for the entire system it provides an easy scheme for filtering these events and preventing deleterious event changes by deferring them off to a time after the profiling session has completed.(From the intel web site). Basically, if "power saving" settings are used, the rdtscll call becomes inconsistent. The intel site mentions a solution for the XP, but is there one for Linux? Actually, it sounds like "yet another axis" of problems to address. Again, we are probably getting luck because the Xeon probably doesn't use this? This link looks like it has some information: It's the smp-boot code for Linux. http://www-d0.fnal.gov/D0Code/source/l3xsbc_distrib/linux/arch/i386/kernel/smpboot.c Perfmon utility: http://www.cse.msu.edu/~enbody/pc_perfmon/ Beowulf: http://www.beowulf.org/pipermail/beowulf/2003-March/006131.html
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 17.470u 0.010s 0:17.58 99.4% 0+0k 0+0io 433pf+0w ( ~/hooks/ULMA2/Hooks/cprof )Here's with an eventwriter that sends over the frame, basically as is as much as possible.
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 22.500u 0.210s 0:22.83 99.4% 0+0k 0+0io 473pf+0wI am convinced this is about the best we can do. The question NOW, is can we actually keep the frames around long enough to actually get the information out? Here's another test: INC and DEC the refcount: We'll have to do at LEAST this much work to keep the frame around. What's the effect?
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 22.890u 0.250s 0:23.27 99.4% 0+0k 0+0io 473pf+0wHalf a second? Okay, not bad.
aidl380-01 % time python testshm.py done 10 done 20 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.9128e+11 99.9 : 99.8 : 00.1 : testshm.py: 2 done 30 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 3.9541e+11 99.9 : 96.5 : 03.4 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 6.0489e+11 99.9 : 97.7 : 02.2 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 8.2450e+11 99.9 : 98.3 : 01.6 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.0647e+12 99.9 : 98.6 : 01.3 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.2607e+12 99.9 : 98.8 : 01.1 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.5483e+12 99.9 : 99.0 : 00.9 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.7446e+12 99.9 : 99.1 : 00.8 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.9771e+12 99.9 : 99.2 : 00.7 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.1712e+12 99.9 : 99.3 : 00.6 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.4981e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.7065e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.9259e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 3.1694e+12 99.9 : 99.5 : 00.4 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 3.3810e+12 99.9 : 99.5 : 00.4 : testshm.py: 2 done 35 WARNING: starting profiling inside a routine, won't count first return 49.560u 0.010s 0:49.71 99.7% 0+0k 0+0io 486pf+0wWhich is pretty much what I expected: When a full topdemo is running, it almost takes up a full CPU (actually I expected about 40 seconds, this is longer than I expected). When the topdemo is "folded in", then the almost two full CPUS becomes 1 very loaded CPU. The sad thing is that the "heap" is shared in this case. Don't like it, but that's the way it is. At one point, I will have to rewrite top demo so it can have a dedicated heap: separate from the main program.
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 18.160u 0.020s 0:18.33 99.1% 0+0k 0+0io 480pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % !! time python testshm.py done 10 done 20 done 30 done 35 18.640u 0.020s 0:18.72 99.6% 0+0k 0+0io 480pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % !! time python testshm.py done 10 done 20 done 30 done 35 18.190u 0.010s 0:18.21 99.9% 0+0k 0+0io 480pf+0wAverage wall clock time of about 18.4 seconds. Now, sending lightweight events to ANOTHER PROCESS, but that process does NOTHING with the event (except read it).
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 22.390u 0.100s 0:22.62 99.4% 0+0k 0+0io 511pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 21.980u 0.050s 0:22.06 99.8% 0+0k 0+0io 511pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 21.670u 0.090s 0:21.90 99.3% 0+0k 0+0io 511pf+0wAbout 22 seconds of elapsed time. Let's look at a "top" while the programs are running: Notice that the CPU is pegged (as expected) for the fib program, and the "sitereader" (a do nothing except read the SHMfib Shared memory event queue) is a mere 0.9%. Notice that the total CPU is 26.5% .. this is the 24.9% + .9% (plus other). This will be important to watch later to see if we are getting real concurrency.
09:31:18 up 20 days, 2:05, 7 users, load average: 0.46, 0.34, 0.39
105 processes: 103 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 26.5% 0.0% 0.6% 0.0% 0.0% 0.2% 72.6%
cpu00 30.8% 0.0% 1.2% 0.0% 0.0% 0.0% 68.0%
cpu01 1.4% 0.0% 0.6% 0.0% 0.0% 0.4% 97.6%
cpu02 71.2% 0.0% 0.2% 0.0% 0.0% 0.0% 28.6%
cpu03 2.8% 0.0% 0.4% 0.0% 0.0% 0.4% 96.4%
Mem: 3094804k av, 3001104k used, 93700k free, 0k shrd, 270084k buff
1627584k actv, 745400k in_d, 52268k in_c
Swap: 3145656k av, 8860k used, 3136796k free 2244528k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
25653 rts 25 0 2704 2704 1740 R 24.9 0.0 0:14 2 python
25652 rts 20 0 888 888 764 S 0.9 0.0 0:00 3 sitereader
3130 wal 16 0 5568 5568 4220 S 0.5 0.1 2:30 3 artsd
3145 wal 15 0 11388 11M 9764 S 0.1 0.3 1:01 1 kdeinit
25332 wal 15 0 10080 9.8M 7452 S 0.1 0.3 0:03 3 gaim
1 root 15 0 508 476 448 S 0.0 0.0 0:20 0 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1
4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2
5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 migration/3
6 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 keventd
7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
8 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2
10 root 34 19 0 0 0 SWN 0.0 0.0 0:00 3 ksoftirqd/3
13 root 15 0 0 0 0 SW 0.0 0.0 0:01 1 bdflush
Let's now try reading the event queue FROM A THREAD IN THE SAME PROCESS,
where the reader just throws away the event.
We expect exactly the same performance, right? After all, we are
using a kernel thread which should be able to run on another CPU.
aidl380-01 % time python testshm.py event_queue_name = shmfib! Top Demo for LightWeight Events here?shmfib done 10 Opened ... done 20 done 30 done 35 21.600u 0.050s 0:21.83 99.1% 0+0k 0+0io 511pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py event_queue_name = shmfib! Top Demo for LightWeight Events here?shmfib done 10 Opened ... done 20 done 30 done 35 21.620u 0.020s 0:21.67 99.8% 0+0k 0+0io 511pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py event_queue_name = shmfib! Top Demo for LightWeight Events here?shmfib done 10 Opened ... done 20 done 30 done 35 21.350u 0.080s 0:21.44 99.9% 0+0k 0+0io 511pf+0w21.6 seconds? Maybe a bit faster because of caching? Notice the top profile:
09:42:58 up 20 days, 2:16, 7 users, load average: 0.32, 0.20, 0.25
104 processes: 102 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 25.1% 0.0% 0.5% 0.0% 0.0% 0.0% 74.2%
cpu00 99.0% 0.0% 0.9% 0.0% 0.0% 0.0% 0.0%
cpu01 0.9% 0.0% 0.3% 0.0% 0.0% 0.0% 98.6%
cpu02 0.1% 0.0% 0.0% 0.0% 0.0% 0.3% 99.4%
cpu03 0.3% 0.0% 0.7% 0.0% 0.0% 0.0% 98.8%
Mem: 3094804k av, 3000064k used, 94740k free, 0k shrd, 270084k buff
1627516k actv, 745424k in_d, 52268k in_c
Swap: 3145656k av, 8860k used, 3136796k free 2244552k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
25690 rts 25 0 2736 2736 1764 R 24.9 0.0 0:17 0 python
1 root 15 0 508 476 448 S 0.0 0.0 0:20 2 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1
4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2
5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 migration/3
6 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 keventd
7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
8 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2
10 root 34 19 0 0 0 SWN 0.0 0.0 0:00 3 ksoftirqd/3
13 root 15 0 0 0 0 SW 0.0 0.0 0:01 1 bdflush
11 root 15 0 0 0 0 SW 0.0 0.0 0:29 1 kswapd
12 root 15 0 0 0 0 SW 0.0 0.0 0:23 2 kscand
14 root 15 0 0 0 0 SW 0.0 0.0 0:17 1 kupdated
15 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 mdrecoveryd
Hm. Another full CPU with about 25% on another. Notice that the
top on Linux (for this version at least) will never give us more than
25% for CPU since this is a "4CPU" machine.
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 26.540u 0.130s 0:29.27 91.1% 0+0k 0+0io 511pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 26.540u 0.180s 0:29.30 91.1% 0+0k 0+0io 511pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 26.180u 0.140s 0:28.97 90.8% 0+0k 0+0io 511pf+0wAbout 29 seconds. Hm. I expected better. This feels about the same time as chunky events. But this seems so wrong! It feels like we should be at least twice as fast as CHUNKY events. Hm.
103 processes: 101 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
09:48:35 up 20 days, 2:22, 7 users, load average: 0.88, 0.43, 0.31
103 processes: 101 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 47.6% 0.0% 0.1% 0.0% 0.0% 0.1% 52.1%
cpu00 44.6% 0.0% 0.2% 0.0% 0.0% 0.4% 54.8%
cpu01 3.2% 0.0% 0.0% 0.0% 0.0% 0.0% 96.8%
cpu02 51.6% 0.0% 0.2% 0.0% 0.0% 0.0% 48.2%
cpu03 91.2% 0.0% 0.0% 0.0% 0.0% 0.0% 8.8%
Mem: 3094804k av, 2994820k used, 99984k free, 0k shrd, 270084k buff
1623404k actv, 745428k in_d, 52268k in_c
Swap: 3145656k av, 8860k used, 3136796k free 2244556k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
25721 rts 25 0 2864 2864 1884 R 24.9 0.0 0:12 3 python
25332 wal 15 0 10164 9.9M 7468 S 0.2 0.3 0:05 0 gaim
1 root 15 0 508 476 448 S 0.0 0.0 0:20 0 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1
4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2
5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 migration/3
6 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 keventd
7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
8 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2
10 root 34 19 0 0 0 SWN 0.0 0.0 0:00 3 ksoftirqd/3
13 root 15 0 0 0 0 SW 0.0 0.0 0:01 1 bdflush
11 root 15 0 0 0 0 SW 0.0 0.0 0:29 1 kswapd
All right, let's add in the "showTop" for completeness.
aidl380-01 % time python testshm.py event_queue_name = shmfib! Top Demo for LightWeight Events here?shmfib done 10 Opened ... done 20 done 30 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.0256e+11 99.9 : 95.2 : 04.7 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 4.0175e+11 99.9 : 97.5 : 02.4 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 6.1219e+11 99.9 : 98.3 : 01.6 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 8.4691e+11 99.9 : 98.8 : 01.1 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.0927e+12 99.9 : 99.0 : 00.9 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.3107e+12 99.9 : 99.2 : 00.7 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.5860e+12 99.9 : 99.3 : 00.6 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.8066e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.0322e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.2349e+12 99.9 : 99.5 : 00.4 : testshm.py: 2 done 35 27.510u 0.140s 0:31.72 87.1% 0+0k 0+0io 510pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py event_queue_name = shmfib! Top Demo for LightWeight Events here?shmfib done 10 Opened ... done 20 done 30 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.0381e+11 99.9 : 95.1 : 04.8 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 4.0425e+11 99.9 : 97.5 : 02.4 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 6.1248e+11 99.9 : 98.3 : 01.6 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 8.4501e+11 99.9 : 98.7 : 01.2 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.1022e+12 99.9 : 99.0 : 00.9 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.3043e+12 99.9 : 99.2 : 00.7 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.5109e+12 99.9 : 99.3 : 00.6 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.8056e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.0089e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.2213e+12 99.9 : 99.5 : 00.4 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.4532e+12 99.9 : 99.5 : 00.4 : testshm.py: 2 done 35 31.340u 0.110s 0:36.92 85.1% 0+0k 0+0io 510pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py event_queue_name = shmfib! Top Demo for LightWeight Events here?shmfib done 10 Opened ... done 20 done 30 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.0031e+11 99.9 : 95.6 : 04.3 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 4.0418e+11 99.9 : 97.8 : 02.1 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 6.1793e+11 99.9 : 98.5 : 01.4 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 8.4450e+11 99.9 : 98.9 : 01.0 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.1037e+12 99.9 : 99.1 : 00.8 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.3251e+12 99.9 : 99.3 : 00.6 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.6119e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.8147e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.0261e+12 99.9 : 99.5 : 00.4 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.2590e+12 99.9 : 99.5 : 00.4 : testshm.py: 2 done 35 29.910u 0.180s 0:32.78 91.7% 0+0k 0+0io 510pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py event_queue_name = shmfib! Top Demo for LightWeight Events here?shmfib done 10 Opened ... done 20 done 30 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.9855e+11 99.9 : 96.0 : 03.9 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 4.0033e+11 99.9 : 98.0 : 01.9 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 6.4510e+11 99.9 : 98.7 : 01.2 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 8.4536e+11 99.9 : 99.0 : 00.9 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.0956e+12 99.9 : 99.2 : 00.7 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.3136e+12 99.9 : 99.3 : 00.6 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.5918e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.8135e+12 99.9 : 99.5 : 00.4 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.0372e+12 99.9 : 99.5 : 00.4 : testshm.py: 2 done 35 27.830u 0.130s 0:31.39 89.0% 0+0k 0+0io 510pf+0wThe interesting thing here is that we don't think we are using all the CPU .. Perhaps it should be taking 27.8 seconds *user time* but with the extra Top and extra computer top work. Anyway, we are taking extra time with the ShowTop 32 seconds? (The 36 may be an aberration). Now, it's time to go back and revisit Chunky Events. (We have been playing with buffer sizes and such, so let's revisit Chunky Events so they have the same buffger size: 64K) Let's send Chunky Events to a process that does nothing but read's the events and throws them away.
aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 32.390u 0.220s 0:32.58 100.0% 0+0k 0+0io 511pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 32.810u 0.260s 0:33.27 99.3% 0+0k 0+0io 511pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py done 10 done 20 done 30 done 35 33.580u 0.330s 0:33.96 99.8% 0+0k 0+0io 511pf+0wHere's what the machine is doing ... So, okay. Just sending Chunky events to a process reading the events is the same time as the lightweight events when we are doing "top!". Okay, now I feel better.
109 processes: 106 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 33.6% 0.0% 1.0% 0.0% 0.0% 0.1% 65.0%
cpu00 20.1% 0.0% 0.7% 0.0% 0.0% 0.3% 78.6%
cpu01 86.0% 0.0% 0.3% 0.0% 0.0% 0.0% 13.5%
cpu02 25.8% 0.0% 3.2% 0.0% 0.0% 0.2% 70.8%
cpu03 2.7% 0.0% 0.0% 0.0% 0.0% 0.0% 97.2%
Mem: 3094804k av, 3015980k used, 78824k free, 0k shrd, 270084k buff
1641912k actv, 745360k in_d, 51744k in_c
Swap: 3145656k av, 8860k used, 3136796k free 2247360k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
25960 rts 25 0 2704 2704 1740 R 24.9 0.0 0:09 1 python
25959 rts 23 0 1000 1000 876 R 8.5 0.0 0:03 0 lreader
3068 wal 15 0 34048 33M 2484 S 0.4 1.1 92:48 2 Xvnc
3142 wal 15 0 14296 13M 11924 S 0.1 0.4 3:31 2 kdeinit
23633 wal 15 0 8192 8156 4072 S 0.1 0.2 0:06 0 emacs
1 root 15 0 508 476 448 S 0.0 0.0 0:20 2 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1
4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2
5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 migration/3
6 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 keventd
7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
8 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2
10 root 34 19 0 0 0 SWN 0.0 0.0 0:00 3 ksoftirqd/3
13 root 15 0 0 0 0 SW 0.0 0.0 0:01 1 bdflush
Now, let's send Chunky events to a thread in the same process which
only reads them and throws them away. We expect about the same elapsed
time: 33 seconds.
aidl380-01 % time python testshm.py event_queue_name = shmfib! sizeof(AccumulatingTimes)=12 done 10 done 20 done 30 done 35 31.130u 0.230s 0:31.45 99.7% 0+0k 0+0io 511pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % !! time python testshm.py event_queue_name = shmfib! sizeof(AccumulatingTimes)=12 done 10 done 20 done 30 done 35 31.050u 0.290s 0:31.45 99.6% 0+0k 0+0io 511pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py event_queue_name = shmfib! sizeof(AccumulatingTimes)=12 done 10 done 20 done 30 done 35 32.500u 0.200s 0:32.78 99.7% 0+0k 0+0io 511pf+0wInteresting: we are a "touch" faster when we do the chunky events in a thread instead of another process. Perhaps because the context switch time is cheaper? Caching? Here's top:
10:24:59 up 20 days, 2:58, 8 users, load average: 0.92, 0.48, 0.30
109 processes: 107 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 27.5% 0.0% 0.7% 0.0% 0.0% 0.0% 71.6%
cpu00 5.0% 0.0% 0.8% 0.0% 0.0% 0.0% 94.2%
cpu01 33.4% 0.0% 0.6% 0.0% 0.0% 0.2% 65.8%
cpu02 6.5% 0.0% 0.5% 0.0% 0.0% 0.0% 92.8%
cpu03 65.0% 0.0% 0.9% 0.0% 0.0% 0.0% 33.9%
Mem: 3094804k av, 3020768k used, 74036k free, 0k shrd, 270084k buff
1647664k actv, 745376k in_d, 51744k in_c
Swap: 3145656k av, 8860k used, 3136796k free 2252508k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
26066 rts 25 0 2888 2888 1912 R 24.9 0.0 0:15 3 python
1 root 15 0 508 476 448 S 0.0 0.0 0:20 0 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1
4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2
5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 migration/3
6 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 keventd
7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
8 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2
10 root 34 19 0 0 0 SWN 0.0 0.0 0:00 3 ksoftirqd/3
13 root 15 0 0 0 0 SW 0.0 0.0 0:01 1 bdflush
11 root 15 0 0 0 0 SW 0.0 0.0 0:29 2 kswapd
12 root 15 0 0 0 0 SW 0.0 0.0 0:23 0 kscand
14 root 15 0 0 0 0 SW 0.0 0.0 0:17 3 kupdated
15 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 mdrecoveryd
Now, the stuff. Let's have the other thread compute Top info
without showing the top.
aidl380-01 % time python testshm.py event_queue_name = shmfib! sizeof(AccumulatingTimes)=12 done 10 done 20 done 30 done 35 36.450u 0.240s 0:37.02 99.1% 0+0k 0+0io 511pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py event_queue_name = shmfib! sizeof(AccumulatingTimes)=12 done 10 done 20 done 30 done 35 39.360u 0.230s 0:39.65 99.8% 0+0k 0+0io 511pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py event_queue_name = shmfib! sizeof(AccumulatingTimes)=12 done 10 done 20 done 30 done 35 36.440u 0.300s 0:37.03 99.2% 0+0k 0+0io 511pf+0wSo, then computing top seems to really limit us. It has brought us up to 38 seconds.
10:28:16 up 20 days, 3:02, 8 users, load average: 0.60, 0.47, 0.32
109 processes: 108 sleeping, 1 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 46.0% 0.0% 0.9% 0.0% 0.4% 0.4% 51.9%
cpu00 70.5% 0.0% 0.0% 0.0% 1.9% 0.0% 27.4%
cpu01 25.4% 0.0% 0.0% 0.0% 0.0% 1.9% 72.5%
cpu02 29.7% 0.0% 1.9% 0.0% 0.0% 0.0% 68.3%
cpu03 58.4% 0.0% 1.9% 0.0% 0.0% 0.0% 39.6%
Mem: 3094804k av, 3019852k used, 74952k free, 0k shrd, 270084k buff
1647660k actv, 745372k in_d, 51744k in_c
Swap: 3145656k av, 8860k used, 3136796k free 2252504k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
26087 rts 25 0 2888 2888 1912 S 24.9 0.0 0:18 0 python
3068 wal 15 0 39060 38M 7400 S 1.4 1.2 93:03 1 Xvnc
3145 wal 15 0 11388 11M 9764 S 0.2 0.3 1:03 2 kdeinit
26090 rts 20 0 1236 1236 868 R 0.2 0.0 0:00 2 top
1 root 15 0 508 476 448 S 0.0 0.0 0:20 0 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1
4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2
5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 migration/3
6 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 keventd
7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
8 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2
10 root 34 19 0 0 0 SWN 0.0 0.0 0:00 3 ksoftirqd/3
13 root 15 0 0 0 0 SW 0.0 0.0 0:01 1 bdflush
11 root 15 0 0 0 0 SW 0.0 0.0 0:29 3 kswapd
Finally, let's make sure the other thread shows the computed top
demo.
aidl380-01 % time python testshm.py event_queue_name = shmfib! sizeof(AccumulatingTimes)=12 done 10 done 20 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.9027e+11 99.9 : 99.9 : 00.0 : testshm.py: 2 done 30 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 3.9586e+11 99.9 : 97.2 : 02.7 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 6.1902e+11 99.9 : 98.2 : 01.7 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 8.4755e+11 99.9 : 98.6 : 01.3 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.0542e+12 99.9 : 98.9 : 01.0 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.3100e+12 99.9 : 99.1 : 00.8 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.5316e+12 99.9 : 99.2 : 00.7 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.8277e+12 99.9 : 99.3 : 00.6 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.0319e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.2477e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.4775e+12 99.9 : 99.5 : 00.4 : testshm.py: 2 done 35 36.820u 0.310s 0:37.38 99.3% 0+0k 0+0io 510pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py event_queue_name = shmfib! sizeof(AccumulatingTimes)=12 done 10 done 20 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.9313e+11 99.9 : 99.8 : 00.1 : testshm.py: 2 done 30 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 3.9928e+11 99.9 : 97.0 : 02.9 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 6.1877e+11 99.9 : 98.0 : 01.9 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 8.4832e+11 99.9 : 98.5 : 01.4 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.0517e+12 99.9 : 98.8 : 01.1 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.3116e+12 99.9 : 99.0 : 00.9 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.5323e+12 99.9 : 99.1 : 00.8 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.8320e+12 99.9 : 99.3 : 00.6 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.0363e+12 99.9 : 99.3 : 00.6 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.2475e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.4772e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.6780e+12 99.9 : 99.5 : 00.4 : testshm.py: 2 done 35 38.180u 0.220s 0:38.59 99.5% 0+0k 0+0io 510pf+0w ( ~/hooks/ULMA2/Hooks/cprof ) aidl380-01 % time python testshm.py event_queue_name = shmfib! sizeof(AccumulatingTimes)=12 done 10 done 20 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.8913e+11 99.9 : 99.8 : 00.1 : testshm.py: 2 done 30 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 4.0850e+11 99.9 : 97.4 : 02.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 6.1906e+11 99.9 : 98.3 : 01.6 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 8.5266e+11 99.9 : 98.7 : 01.2 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.0588e+12 99.9 : 98.9 : 01.0 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.3157e+12 99.9 : 99.1 : 00.8 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.5358e+12 99.9 : 99.2 : 00.7 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 1.8343e+12 99.9 : 99.3 : 00.6 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.0436e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.2602e+12 99.9 : 99.4 : 00.5 : testshm.py: 2 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :fib : 2.4903e+12 99.9 : 99.5 : 00.4 : testshm.py: 2 done 35 37.020u 0.230s 0:37.66 98.9% 0+0k 0+0io 510pf+0wWell, good. The extra "show top" work didn't hold us up too much.
10:33:54 up 20 days, 3:07, 8 users, load average: 1.48, 1.02, 0.60
109 processes: 107 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 46.3% 0.0% 0.6% 0.0% 0.0% 0.2% 52.9%
cpu00 49.2% 0.0% 0.4% 0.0% 0.0% 0.4% 50.0%
cpu01 78.6% 0.0% 0.6% 0.0% 0.0% 0.0% 20.8%
cpu02 34.2% 0.0% 0.6% 0.0% 0.0% 0.4% 64.8%
cpu03 23.2% 0.0% 0.8% 0.0% 0.0% 0.0% 76.0%
Mem: 3094804k av, 3020520k used, 74284k free, 0k shrd, 270084k buff
1647696k actv, 745380k in_d, 51744k in_c
Swap: 3145656k av, 8860k used, 3136796k free 2252508k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
26126 rts 25 0 2920 2920 1944 R 24.9 0.0 0:19 1 python
25332 wal 15 0 11024 10M 7656 S 0.2 0.3 0:22 3 gaim
1 root 15 0 508 476 448 S 0.0 0.0 0:20 0 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1
4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 migration/2
5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 migration/3
6 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 keventd
7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
8 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2
10 root 34 19 0 0 0 SWN 0.0 0.0 0:00 3 ksoftirqd/3
13 root 15 0 0 0 0 SW 0.0 0.0 0:01 1 bdflush
11 root 15 0 0 0 0 SW 0.0 0.0 0:29 0 kswapd
12 root 15 0 0 0 0 SW 0.0 0.0 0:23 0 kscand
14 root 15 0 0 0 0 SW 0.0 0.0 0:17 2 kupdated
I feel much better: the lightweight events are cheaper than the heavyweight.
What I don't understand is why the "computing top" actually slows down
the monitored program. It MUST be the case that the monitor is not
reading the events fast enough. Even the double buffer is getting in
the way: I _think_ what is happening is that the event writer gets
to the end of the double buffer, and it "waits" for the event reader to
catch up at the end of every buffer. So, it spends some time
waiting. We should not allow this for ULMA. We could make the buffers
"bigger" so it happens less often, but then we have to worry about
"thrashing" as buffers get too big. Of course, the reader _IS_ keeping
up, almost. But, the writer of events can't even afford to wait a little
foe the reader to catch up that little bit. I _think_ the solution
here is for a "n-buffer" solution, "writer centric" where the writer always
has an extra buffer it can write to. Thus, when the writer reaches the
end of a buffer, it immediately jumps to the next free one (because the
reader will only be working on one at a time anyways). There will always be
a free buffer for the reader to immediately continue to, and when the reader
catches up, it can make progress more quickly as well.
So, I need to program this up to see if I can reach the "theoretical
peak" of just having an empty reader.
**FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :SLWHandleEvent : 1.1619e+11 58.9 : 51.5 : 48.4 : slwtopdemo.h: 9 :LWHandleReturnE : 3.3192e+10 16.8 : 00.0 : 99.9 : lwtopdemo.h: 208 :LWShowTop : 2.1074e+10 10.7 : 00.0 : 99.9 : lwtopdemo.h: 108 :LWHandleCallEve : 1.5498e+10 07.9 : 00.0 : 99.9 : lwtopdemo.h: 172 :SLWHandleEvent: : 1.1218e+10 05.7 : 00.0 : 99.9 : slwtopdemo.h: 21 **FUNCTION NAME**:*TOTAL TIME****%*:*ÞSC:*%PROC:*FILENAME*************:*LINE# ____________________________________________________________________________ :SLWHandleEvent : 1.2265e+11 58.9 : 51.5 : 48.4 : slwtopdemo.h: 9 :LWHandleReturnE : 3.5071e+10 16.8 : 00.0 : 99.9 : lwtopdemo.h: 208 :LWShowTop : 2.2223e+10 10.7 : 00.0 : 99.9 : lwtopdemo.h: 108 :LWHandleCallEve : 1.6370e+10 07.9 : 00.0 : 99.9 : lwtopdemo.h: 172 :SLWHandleEvent: : 1.1846e+10 05.7 : 00.0 : 99.9 : slwtopdemo.h: 21Basically, half of our time is building the event (the %PROC in SLWHandleEvent) and the other half is in either LWHandleReturnEvent or LWHandleCallEvent. You'll see that LSHandleReturnEvent does more work.
Super lightweight events, in threads (with top) 34.520u 0.940s 0:39.43 89.9% 0+0k 0+0io 490pf+0w Lightweight Events, in threads (with top) 26.840u 0.700s 0:28.98 95.0% 0+0k 0+0io 483pf+0w Chunky Events, in threads (with top) 38.230u 1.720s 0:40.21 99.3% 0+0k 0+0io 483pf+0wSlightly better than SLW is slightly better than Chunky, but LightweightEvents are still much cheaper. What I can work on: Making the super lightweight demo a little bit cheaper. How: Find the right choice for throughout and latency for buffer sizes. Also, can I make the protocol on the monitored program side cheaper? (I.e., a little code tuning). I did use one of my observations to my advantage: I only have to build a chunky event (under SLW protocol on monitor side) when I get a CALL event: All the return cares about is the timestamp in some sense.
Super lightweight events, in threads (with top) 21.340u 0.050s 0:21.59 99.0% 0+0k 0+0io 510pf+0w Lightweight Events, in threads (with top) 19.110u 0.080s 0:19.46 98.6% 0+0k 0+0io 503pf+0w Chunky Events, in threads (with top) 20.480u 0.280s 0:20.81 99.7% 0+0k 0+0io 502pf+0w NO EVENTS (just the program without profiling) 16.440u 0.060s 0:16.77 98.3% 0+0k 0+0io 501pf+0wHere's the sieve of Erast.
Super lightweight events, in threads (with top) seg fault Lightweight Events, in threads (with top) 18.770u 0.030s 0:18.96 99.1% 0+0k 0+0io 486pf+0w Chunky Events, in threads (with top) 17.330u 0.060s 0:17.56 99.0% 0+0k 0+0io 485pf+0w NO EVENTS (just the program without profiling) 17.060u 0.010s 0:17.11 99.7% 0+0k 0+0io 484pf+0wArgh. I have a bug in the SLW. A bigger text to pdf doc:
NO EVENTS 48.340u 0.060s 0:48.59 99.6% 0+0k 0+0io 501pf+0w Chunky Events, using threads, with top 60.940u 0.750s 1:01.93 99.6% 0+0k 0+0io 502pf+0w Lightweight Events, using threads, with top 55.040u 0.220s 0:55.52 99.5% 0+0k 0+0io 503pf+0w SuperLightweight Events, using threads, with top 59.820u 0.200s 1:00.22 99.6% 0+0k 0+0io 511pf+0w