regex - Stata: Regex search and replace on Integer variables

Question

Explanation of DATA: Contains a surveyor ID and answers to various survey questions. If one of the answers to the survey is 99 or 999 or 9999 (ad infinitum), then that is a numerical representation of "No." If one of the answers to the survey is 98 or 998 or 988, 9988, 998888, etc., that is a numerical representation of "Yes." Most of the data is in integer form.

I want to replace all variables that have values which start with a '9' and end with a '9' with the word "No", and all variables that start with a '9' and end with an '8' with "Yes."

My current strategy is to convert every variable to a string with tostring _all, replace and then iterate through all the string variables, applying the following two regexes:

regexr(`value', "^[9]*[9]$","No")
regexr(`value', "^[9]*[8]$", "Yes")

Is there an easier way to do this without converting all values to strings?

score 2 · Accepted Answer

If you want to check for numeric variables that are all 9s another way is

  ... if subinstr(string(myvar, "%20.0f"), "9", "", .) == ""

where 20 is an upper limit, to be replaced by the number of digits in the longest number you need to handle.
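As a rough illustration of that check, here is a minimal sketch on made-up data. The variable names var1, digits, all9 and yes98 are hypothetical, and the "Yes" test (first digit 9, last digit 8) is an assumption taken from the rule stated in the question, not part of the answer above.

```stata
* Hypothetical toy data; var1 stands in for one survey answer
clear
input long var1
99
9999
98
988
5
end

* string() with a fixed format avoids scientific notation for long integers
gen str20 digits = string(var1, "%20.0f")

* All 9s: removing every "9" leaves the empty string
gen byte all9 = subinstr(digits, "9", "", .) == ""

* Assumed "Yes" rule from the question: first digit 9, last digit 8
gen byte yes98 = substr(digits, 1, 1) == "9" & substr(digits, -1, 1) == "8"

list
```

The numeric variable itself is never converted; the string form exists only in the helper variable, which can be dropped afterwards.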

You can find all integer-valued variables using findname (findit findname indicates download sources).

 findname, all(@==int(@)) local(intvars) 
 foreach v of local intvars { 
         * all 9s is the code for "No" in the question; test each variable `v' in turn
         gen s`v' = "NO" if subinstr(string(`v', "%20.0f"), "9", "", .) == "" 
 }

may be part of what you want. Are there answers other than "YES" and "NO"?

score 1 · Accepted Answer

You can use inlist for that. var1 is the variable that contains these numbers

gen dummy=""
replace dummy ="NO" if inlist(var1,99,999,9999)
replace dummy ="YES" if inlist(var1,98,998,988)

With dummy in hand you can restrict the sample based on it.

OR,

If var1 contains no genuine 0 or 1 values, you can instead recode these values in place as 0 and 1.

replace var1 =0 if inlist(var1,99,999,9999)
replace var1 =1 if inlist(var1,98,998,988)

score 0 · Accepted Answer

If your numbers are either all 9s or something ending in 8, you don't need regex here. You could simply calculate the sum of the digits and check sum(digits) % 9. If it's 0, the number is all 9s and your answer is No; if it's not, your answer is Yes.

Even easier would be to check [your number] % 2 (in Stata, mod(x, 2)), which will always be 0 for a number ending with 8 and always be 1 for a number ending with 9.

If you want only the first and last digits to count and can't be sure they're always either 9 or 8, you'll need two regexes. Your proposed ones are good, though you can omit the [] around the digits, since a character class with only one character is equivalent to the character itself. So your regexes will be ^9*9$ and ^9*8$.
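For what it's worth, the regex route does not require running tostring on the data. The sketch below, on made-up data, matches against a string copy produced by string(). The names var1, digits and answer are hypothetical, and the patterns are an assumption: they are slightly stricter than the ones above, requiring at least two 9s for "No" and a run of 9s followed by a run of 8s for "Yes", to follow the examples listed in the question.

```stata
* Hypothetical toy data; var1 stands in for one survey answer
clear
input long var1
99
9999
98
998888
908
5
end

* Temporary string copy; the numeric variable stays numeric
gen str20 digits = string(var1, "%20.0f")

gen str3 answer = ""
* Two or more 9s and nothing else
replace answer = "No"  if regexm(digits, "^99+$")
* One or more 9s followed by one or more 8s
replace answer = "Yes" if regexm(digits, "^9+8+$")

list
```

Values such as 908 or 5 match neither pattern and keep an empty answer.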

Edit: Since it's now clear that the input will always start with a 9 and have at least two digits, it is enough to check input % 10. That leaves only the last digit, and you can check whether it's a 9 or an 8.
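In Stata the last-digit check can be written with mod(). The following is a minimal sketch on made-up data, with hypothetical names var1, lead and answer. Because real survey columns may also hold ordinary answers such as 19, it adds a guard on the leading digit. That guard is an assumption layered on top of the suggestion above, not part of it.

```stata
* Hypothetical toy data; var1 stands in for one survey answer
clear
input long var1
99
9999
98
9988
19
5
end

* First character of the decimal representation, used as a guard
gen str1 lead = substr(string(var1, "%20.0f"), 1, 1)

gen str3 answer = ""
* mod(x, 10) keeps only the last digit of a non-negative integer
replace answer = "No"  if lead == "9" & var1 > 9 & mod(var1, 10) == 9
replace answer = "Yes" if lead == "9" & var1 > 9 & mod(var1, 10) == 8

list
```

Only the first and last digits are inspected here, so a value like 929 would also be labelled "No". If the digits in between matter, a stricter test is needed.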
