C/C++: Caching the Camera Data
I am getting data from a camera via a Gigabit Ethernet interface. Because the data is very small, I am thinking of putting it directly into the L1 cache (instead of DRAM) so that processing is fast.
How can I do this? Is there any compiler directive for this?
Platform information: Windows 7, Intel Core 2 Duo, Visual Studio 2010, C/C++, OpenCV.
GCC will generate data prefetch instructions for array data at -O1, -O2, and -O3. This was all reworked fairly recently, so it is unlikely the _DATAPREFETCH flag will offer any improvement over the standard optimization levels.
As to bypassing memory: how would the data be assigned a cache tag? Cache tags are issued upon memory fetch, and the CPU finds data in the cache via its cache tag.
The Core 2 Duo cores share a tag bus, and I believe the GPUs can hang off the tag bus, so we could conceive of an I/O controller doing the same, but I haven't found a reference yet. What socket is your Gigabit card in?
[An old list of tags for various cpu families.](http://gcc.gnu.org/projects/prefetch.html)
> "-fprefetch-loop-arrays — If supported by the target machine, generate instructions to prefetch memory to improve the performance of loops that access large arrays. This option may generate better or worse code; results are highly dependent on the structure of loops within the source code." [gcc-4.7](http://gcc.gnu.org/onlinedocs/gcc-4.7.1/gcc/Optimize-Options.html#Optimize-Options)
There won't be a significant improvement in performance on high-end computers, which have large caches.
Improvement can occur when the image is too large to fit in the cache and many algorithms must run over the same image repeatedly. In such cases, a portion of the image is loaded into the cache, the algorithm runs on it, then the next portion is loaded, and so on. This situation is easy to reproduce on embedded devices like the Beagleboard, which has a 32 KB L1 cache. There, performance can be improved by splitting the image efficiently and running all the processing passes on one portion before loading the next.
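The blocking scheme described above can be sketched roughly as follows. All names here are illustrative (nothing in the original post specifies them); the sketch assumes an 8-bit grayscale image and a 32 KB L1 data cache, and runs every pass over one block of rows while that block is still cache-hot, instead of sweeping the whole image once per pass:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kL1Bytes = 32 * 1024;  // assumed L1 data cache size

// Run two example passes (invert, then threshold) block by block, so the
// second pass reuses data the first pass just pulled into the cache.
void process_blocked(std::vector<std::uint8_t>& img, std::size_t width) {
    const std::size_t rows_per_block = kL1Bytes / width;
    const std::size_t height = img.size() / width;
    for (std::size_t r0 = 0; r0 < height; r0 += rows_per_block) {
        const std::size_t r1 = std::min(height, r0 + rows_per_block);
        std::uint8_t* block = &img[r0 * width];
        const std::size_t n = (r1 - r0) * width;
        // Pass 1: invert -- this pass loads the block into the cache.
        for (std::size_t i = 0; i < n; ++i) block[i] = 255 - block[i];
        // Pass 2: threshold -- the block is still resident from pass 1.
        for (std::size_t i = 0; i < n; ++i) block[i] = block[i] > 128 ? 255 : 0;
    }
}
```

The same idea generalizes to any chain of per-pixel passes; the win comes from each block being read from DRAM once instead of once per pass.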