Memory-wise, is it better to store a long non-dynamic string as a single string object or to have the program build it out of its repetitive parts?

This is a bit of an odd question and more of a thought experiment than anything I need, but I'm still curious about the answer: If I have a string that I know ahead of time will never change but is (mostly) made up of repetitive parts, would it be better to keep said string as a single string object, use it when needed, and be done with it - or should I break the string up into smaller strings that represent the repeated parts and concatenate them when needed?

Let me use an example: Let's say we have a naive programmer who wants to create a regular expression for validating IP Addresses (in other words, I know this regular expression won't work as intended, but it helps show what I mean by repetitive parts and saves me a bit of typing for the second part of the example). So he writes this function:

 private bool isValidIP(string ip)
 {
   Regex checkIP = new Regex("\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?");
   return checkIP.IsMatch(ip);
 }

Now our young programmer notices that he has "\d", "\d?", and "\." just repeated a few times. This gives him an idea that he could both save some storage space and help remind himself what this means for later. So he remakes the function:

 private bool isValidIP(string ip)
 {
   string escape = "\\";
   string digi = "d";
   string digit = escape + digi;
   string possibleDigit = digit + '?';
   string IpByte = digit + possibleDigit + possibleDigit;
   string period = escape + '.';
   Regex checkIP = new Regex(IpByte + period + IpByte + period + IpByte + period + IpByte);
   return checkIP.IsMatch(ip);
 }

The first method is simple. It just stores 38 chars in the program's instructions, which are read into memory each time the function is called. The second method stores (I suspect) two length-1 strings and two chars in the program's instructions, as well as all of the calls that concatenate those four in different orders. This creates at least 8 strings in memory when the function is called (the six named strings, a temporary string for the first four parts of the regex, and then the final string created from that temporary string + the last three parts of the regex). The second method also happens to help explain what the regex is looking for - though not what the final regex would look like. It could also help with refactoring: say our hypothetical programmer realizes that his current regex allows more than just 0-255 in each part of the IP address. The constituent parts can then be changed in one place, without having to find every single item that would need to be fixed.

Again, which method would be better? Would it just be as simple as a trade-off between program size vs. memory usage? Of course, with something as simple as this, the trade-off is negligible at best, but what about a much larger, more complex string?

Oh, yes, and a much better regex for IP Addresses would be:

 ^(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)(\\.(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)){3}$

Wouldn't work as well as an example, would it?
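For what it's worth, here is a minimal sketch of how that stricter pattern might be exercised. The class and method names (StrictIpCheck, IsValidIp) are hypothetical and only there for illustration; the pattern itself is the one above, written as a verbatim string so the backslashes don't need doubling.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StrictIpCheck
{
    // Same pattern as above; the @"..." verbatim string avoids "\\" escapes.
    private static readonly Regex Pattern = new Regex(
        @"^(25[0-5]|2[0-4]\d|[01]?\d\d?)(\.(25[0-5]|2[0-4]\d|[01]?\d\d?)){3}$");

    // Hypothetical helper: true when the whole input is four dot-separated
    // parts, each in the range 0-255.
    public static bool IsValidIp(string ip)
    {
        return Pattern.IsMatch(ip);
    }
}
```

With this sketch, "192.168.1.1" and "255.255.255.255" are accepted, while "256.1.1.1" and "1.2.3" are rejected.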

Answers


The first is by far the better option. Here's why:

  1. It's clearer.

  2. It's cheaper. Any time you create a new object it's an "expensive" process. You have to make space for it on the heap (well, for strings at least). Yes, you could in theory be saving a byte or so, but you're spending a lot more time (probably, I haven't tested it) allocating space for each string, executing additional memory instructions, etc. You also have to factor in the GC: keep allocating strings and eventually collecting them takes up process ticks too. If you really want to talk optimization, I can easily tell this code isn't as efficient as it could be. There are no constants, for one thing, which means you are possibly creating more objects than you need instead of letting the compiler optimize the strings that never change. As a person reviewing this code, that tells me I need to take a much closer look at what is going on and figure out whether something is wrong.

  3. It's clearer (yes, I said this again). You want to do an academic pursuit to see how efficient you can make it. That's cool. I get that. I do it myself. It's fun. I NEVER let that slip into production code. I don't care about losing a tick, I care about having a bug in production, and I care about if other programmers can understand what my code does. Reading someone else's code is hard enough, I don't want to add the extra task of them having to try and figure out which micro-optimization I put in and what happens if they "nudge" the wrong piece of code.

  4. You hit on another point. What if the original regex is wrong? Google will tell you this problem has been solved. You can Google another regex that's right and has been tested. You can't Google "What's wrong with my code." You can post it on SO, sure, but that means that someone else has to get involved and look through it.
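To illustrate the point about constants in item 2, here is a rough sketch, assuming you still want the named parts for readability. The names (IpPatternParts, Digit, Octet and so on) are hypothetical. In C#, concatenating const strings is a constant expression, so the compiler emits the finished pattern as a single string literal and no concatenation happens at run time. Note that the parts have to be strings rather than chars for this to work.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IpPatternParts
{
    // Hypothetical names; every const below is a compile-time constant.
    private const string Digit = "\\d";
    private const string OptionalDigit = Digit + "?";
    private const string Octet = Digit + OptionalDigit + OptionalDigit;
    private const string Dot = "\\.";

    // Constant expression: the compiler folds this into one string literal.
    public const string Pattern = Octet + Dot + Octet + Dot + Octet + Dot + Octet;

    // Built once when the class is first used, then reused on every call.
    private static readonly Regex CheckIp = new Regex(Pattern);

    public static bool IsValidIp(string ip)
    {
        return CheckIp.IsMatch(ip);
    }
}
```

The resulting Pattern is the same 38-character string as in the first version of the question's function, so this keeps the self-documenting names without the per-call allocations.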

Here's how to make the first example win the horse race easily:

 // Class-level field: the Regex is constructed once, not on every call.
 private static readonly Regex checkIP = new Regex(
   "\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?");

 private bool isValidIP(string ip)
 {
   return checkIP.IsMatch(ip);
 }

Declare once, reuse over and over. If you rebuild the regex from its parts on every call just to save a few bytes, you don't get to do that. Technically you could build it dynamically and still create the object only once, but that is a lot more work than, say, moving it to a class-level variable.
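For completeness, a sketch of the "build it dynamically but only once" route, assuming Lazy<T> is acceptable in your code base. Everything named here (BuiltOnceIpValidator, BuildRegex, BuildCount) is hypothetical, and BuildCount exists only to show how many times the pattern gets assembled.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BuiltOnceIpValidator
{
    // Hypothetical counter, only here to show how often the pattern is built.
    public static int BuildCount;

    // Lazy<T> runs BuildRegex on the first access to Value and caches the result.
    private static readonly Lazy<Regex> CheckIp = new Lazy<Regex>(BuildRegex);

    private static Regex BuildRegex()
    {
        BuildCount++;
        const string octet = "\\d\\d?\\d?";
        // Four octets joined by an escaped period.
        string pattern = string.Join("\\.", Enumerable.Repeat(octet, 4));
        return new Regex(pattern);
    }

    public static bool IsValidIp(string ip)
    {
        return CheckIp.Value.IsMatch(ip);
    }
}
```

However many times IsValidIp is called, BuildRegex should run only once. It works, but it is clearly more machinery than a plain class-level field holding the literal.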

