assembly - How to check if a character is within certain ASCII value ranges?

Question

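How do I check whether a character is in 0-9, A-Z, or a-z? I know you can use cmp char, 'A' or cmp char, '0', etc. But if I have to check three different ranges, how do I do that?

If I need to check 'A' <= C <= 'Z', I first have to check whether the character value is below A, and then check whether it is less than or equal to Z. But since 0-9 is below A, how do I account for that without messing up the logic? The same goes for Z, since a-z is above Z. My logic so far is posted below. I feel stupid for not getting something this simple, but I'm a beginner, I've been working on this for a few days and now I've had to start over, so any help would be greatly appreciated.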

_asm
{
   mov ecx, 127
   mov esi, 0
   mov ebx,LocalBuffer[esi] ;LocalBuffer is a c++ array 

Loop1:
   cmp ebx, 'a'     ;ebx is the 0'th index value of LocalBuffer
   jb notLowercase  ;If character value is below 'a'
   cmp ebx,'z'
   jbe CharCount    ;if it's less than or equal to 'z' 
   cmp ebx,'A'
   jb notUpperCase ;If less than 'A', but then won't this discard 0-9?
   cmp ebx,'Z'
   jb CharCount    ;If it's less than 'Z', but what about greater than Z?
   cmp ebx,'0'
   jb NotDigit     ;If less than '0'
   cmp ebx,'9'
   jb CharCount    ;What if it's greater than 9?


notLowerCase:  
;DO I LOOP BACK TO LOOP1, MOVE ON TO THE NEXT CHARACTER OR SOMETHING ELSE? 

notUpperCase:
;SAME ISSUE AS NotLowerCase

notDigit:
;SAME ISSUE AS LAST 2

CharCount:
;Do something

score 1 · Accepted Answer

First of all, you can't debug your branching until you fix the load (see "How to load a single byte from address in assembly"): you're loading 4 bytes of characters and comparing that whole 32-bit value against 'a' and so on. Use movzx ebx, byte ptr LocalBuffer[esi] instead of mov ebx, LocalBuffer[esi], because it's a char array.

If you've been single-stepping your code in the debugger, maybe you've noticed that all 4 bytes of ebx are non-zero. That's why your cmp/branches aren't working or doing what you expect.
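To make the difference concrete, here is a minimal C sketch of the two loads. The function names load_dword and load_byte_zx are made up for illustration; they only mimic what the two instructions read.

```c
#include <stdint.h>
#include <string.h>

/* Roughly what `mov ebx, LocalBuffer[esi]` reads: 4 bytes as one 32-bit value.
   (Hypothetical helper; memcpy avoids alignment problems in C.) */
uint32_t load_dword(const char *buf, size_t i) {
    uint32_t v;
    memcpy(&v, buf + i, sizeof v);   /* four characters packed into one integer */
    return v;
}

/* Roughly what `movzx ebx, byte ptr LocalBuffer[esi]` reads:
   1 byte, zero-extended to 32 bits. */
uint32_t load_byte_zx(const char *buf, size_t i) {
    return (unsigned char)buf[i];
}
```

With the buffer "abcd", load_byte_zx gives the value of 'a', while load_dword gives all four characters packed together, which can never compare equal to a single character constant.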

@zx485 explained the general case of a chain of branches to go through until you can definitely accept or reject an input.

But you can also simplify by using efficient range-checks using the unsigned-compare trick. For example, "Reverse-engineering asm using sub / cmp / setbe back to C? My attempt is compiling to branches" shows how that works for just the lower-case ASCII range.
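In C terms the trick looks something like this. It's a sketch, and is_lower is just an illustrative name: subtract the start of the range, then do one unsigned compare against the length of the range.

```c
/* One compare instead of two: when c < 'a', c - 'a' is negative, and the
   conversion to unsigned wraps it to a huge value that fails the <= test. */
int is_lower(unsigned char c) {
    return (unsigned)(c - 'a') <= (unsigned)('z' - 'a');
}
```

Values below the range wrap around to large unsigned numbers, so a single jbe/ja (or <= in C) rejects both "too low" and "too high".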

Even better, ASCII is conveniently designed so the A-Z and a-z ranges align with each other, and don't cross a %32 boundary, so you can force a byte to lower-case with c |= 0x20, or to upper case with c &= ~0x20. Then you only have that one range to check for alphabetic characters.

OR with 20h forces upper-case characters to lower-case, and doesn't make any non-alphabetic characters into lowercase, so you can do that on a copy of your register.
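A small C sketch of that idea (is_alpha_ascii is a made-up name, and the fold is done on a temporary so the original character is untouched):

```c
/* Fold upper-case onto lower-case with OR 0x20, then do a single
   unsigned range check against 'a'..'z'. */
int is_alpha_ascii(unsigned char c) {
    unsigned lower = c | 0x20u;                       /* temporary copy only */
    return lower - 'a' <= (unsigned)('z' - 'a');      /* unsigned wraparound */
}
```

For example '@' (0x40) becomes '`' (0x60) and '[' (0x5B) becomes '{' (0x7B), both of which are still outside 'a'..'z', so no non-letter sneaks in.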

See "What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa?" and especially "How to access a char array and change lower case letters to upper case, and vice versa" for MSVC inline asm that loops over a char array and checks for alphabetic or not.

Make sure you don't destroy your only copy because you still need to count upper separately from lower; you're just creating a temporary to branch on. Unless you want to avoid unused positions in your count array, then maybe you want c - 'A' as your array index. But probably not if you have one array for all characters and digits you want to count.

Example

For the loop structure, I have out-of-range characters jump over the Do Something part, reaching the compare/branch loop condition. The load and index increment happens every iteration, regardless of the loaded character.

Note that every character that's not in any of the ranges is a non-digit and a non-letter. It doesn't make sense to have a non-digit branch target separate from a non-letter branch target because that's not what you're figuring out. You could have digit and letter branch to separate places, though.

_asm
{
   xor  esi, esi   ; i=0

Loop1:                           ; do {
   ; load from the array *inside* the loop.
   movzx ebx, byte ptr LocalBuffer[esi]
   inc   esi                          ; ebx = buf[i++]

 ; check for digits first
   lea   eax, [ebx - '0']
   cmp   al, 9
   jbe   CharCount                    ; if (c-'0' <= 9) goto CharCount
 ; non-digits fall through into checking for alphabetic

   mov   eax, ebx
   or    eax, 20h       ; force to lower-case
   sub   eax, 'a'       ; subtract start of the range
   cmp   al, 'z'-'a'    ; see if it was inside the length of the range (unsigned)
   ja    skipCount
; in the common case (alphabetic characters), fall through into CharCount

CharCount:
; EBX still holds the character value, zero-extended
   add  byte ptr [counts + ebx], 1       ;Do something
    ; or use  [counts + ebx*4] if you have an int array.

skipCount:    ; rejected characters jump here, skipping count increment
   cmp  esi, 127
   jb   Loop1               ; } while(i<127)
}

You don't need to waste a 2nd register on another loop counter (ECX) when you already have ESI. cmp/jb is more efficient than the loop instruction anyway.

I think we can save one instruction by doing the subtract first (so we can still use lea to copy-and-subtract), but then we have to clear the 0x20 bit instead of setting it so we're dealing with upper-case.

;; untested, but I think this is correct, too, using LEA+AND instead of MOV+OR+SUB
   lea   eax, [ebx - 'A']
   and   eax, ~20h        ; clear the lower-case bit
   cmp   al, 'Z'-'A'      ; 25, same as 'z'-'a' of course.
   ja    skipCount

c - 'A' = 0x20 for c='a'. Character codes past 'Z' but before 'a' produce smaller results so clearing the 0x20 bit can't give us a false-positive.
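Since that variant is marked untested, one cheap way to gain confidence is to brute-force all 256 byte values in C against a plain two-range check. This is only a sketch modelling the 8-bit compare on AL; all three function names are invented for the purpose.

```c
#include <stdint.h>

/* Models lea eax,[ebx-'A'] / and eax,~20h / cmp al,'Z'-'A':
   only the low byte takes part in the compare. */
int is_alpha_sub_and(uint8_t c) {
    uint8_t t = (uint8_t)((uint8_t)(c - 'A') & 0xDF);  /* clear bit 0x20 */
    return t <= 'Z' - 'A';
}

/* Straightforward reference check with two explicit ranges. */
int is_alpha_reference(uint8_t c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Returns how many of the 256 byte values the two checks disagree on. */
int count_mismatches(void) {
    int bad = 0;
    for (int c = 0; c < 256; c++)
        bad += is_alpha_sub_and((uint8_t)c) != is_alpha_reference((uint8_t)c);
    return bad;
}
```

After the subtract, the letters land in 0..25 and 32..57, and clearing bit 0x20 maps the second run onto the first, which is why one compare against 25 is enough.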

PS: if this is the same histogram problem you asked previous questions about, you don't need to filter while reading, just make your array of counts have 256 elements (for every possible uint8_t value) and then only loop over the ones you want to print.
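Something like this in C, as a rough sketch. The names histogram and total_alnum are hypothetical, and summing stands in for whatever printing you actually do:

```c
#include <stddef.h>

/* Count every byte value unconditionally: no filtering while reading. */
void histogram(const char *buf, size_t len, unsigned counts[256]) {
    for (size_t i = 0; i < len; i++)
        counts[(unsigned char)buf[i]]++;
}

/* Afterwards, visit only the entries you care about (digits and letters). */
unsigned total_alnum(const unsigned counts[256]) {
    unsigned total = 0;
    for (int c = '0'; c <= '9'; c++) total += counts[c];
    for (int c = 'A'; c <= 'Z'; c++) total += counts[c];
    for (int c = 'a'; c <= 'z'; c++) total += counts[c];
    return total;
}
```

The counting loop then has no branches on the character at all; the range logic moves into the (much shorter) output loops.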

If you were getting segfaults using ebx as the index, that's because you loaded 4 bytes (a large integer) instead of zero-extending one. We already fixed this bug in previous versions of your question.

Also, as I previously explained in comments, you don't need to copy your string input to a LocalBuffer, just do char *bufptr = Buffer; and in inline asm do mov esi, bufptr to get that pointer into a register. That's inefficient, but much better than copying a whole array. Especially for counts as well.

Or https://godbolt.org/z/QszVMf shows how to access class members from inline asm.

score 1 · Accepted Answer

A simple approach is to sort the ranges in ascending (or descending) order. Then you can use the cmps in an ON/OFF style:

   mov ecx, 127    ; Check a 127 char string
   mov esi, 0
Loop1:
   movzx ebx, byte ptr LocalBuffer[esi]   ; Load a byte from the address  
   cmp bl, '0'     ; '0' = 48 - all lower values are NOT IN THE SET
   jb  notInSet    ; 
   cmp bl,'9'      ; '9' = 57 - all lower are IN THE SET
   jbe CharCount   ; It is a number 
   cmp bl,'A'      ; 'A' = 65 - all lower are NOT IN THE SET
   jb  notInSet    ; If less than 'A'
   cmp bl,'Z'      ; 'Z' = 90 - all lower are IN THE SET
   jbe CharCount   ; It is an uppercase char
   cmp bl,'a'      ; 'a' = 97 - all lower are NOT IN THE SET
   jb  notInSet    ; 
   cmp bl,'z'      ; 'z' = 122 - all lower are IN THE SET
   jbe CharCount   ; It is a lowercase letter
   ; FALL THROUGH for greater values
notInSet:  
   inc esi
   loop Loop1
   jmp Final

CharCount:
   ; DO SOMETHING (that doesn't mess up ECX, ESI)
   inc esi
   loop Loop1
   ; FALL THROUGH to Final

Final:
   ; END of this snippet

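As you can see, the values being checked are in ascending order. For example, the character 3 (= 51) is first checked for being below 48 (= NO), then for being at or below 57 (= YES), so the second jump is taken.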
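The same ON/OFF chain can be written in C, which may make the pattern easier to see. This is only a sketch and in_set_chain is an illustrative name:

```c
/* Boundaries in ascending order; each test alternates between
   "everything below this is out" and "everything up to this is in". */
int in_set_chain(unsigned char c) {
    if (c <  '0') return 0;   /* below 48: not in the set */
    if (c <= '9') return 1;   /* 48..57: digit            */
    if (c <  'A') return 0;   /* 58..64: not in the set   */
    if (c <= 'Z') return 1;   /* 65..90: upper-case       */
    if (c <  'a') return 0;   /* 91..96: not in the set   */
    if (c <= 'z') return 1;   /* 97..122: lower-case      */
    return 0;                 /* above 122: not in the set */
}
```

Because each test is only reached when all earlier ones failed, every comparison needs just one bound, never both ends of a range.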

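Another approach is to use a jump table with indexed addressing. In this approach you define the ranges as a table of boolean values (0=NotInSet, 1=CharCount):

The table should be set up in the .data segment for your scenario like this (note the alternating values of 0 and 1, the ON/OFF style mentioned above):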

.data
  JumpTable db 48 dup(0), 10 dup(1), 7 dup(0), 26 dup(1), 6 dup(0), 26 dup(1), 133 dup(0)

The code could then look like this:

   mov ecx, 127
   mov esi, 0
Loop1:
   movzx ebx, byte ptr LocalBuffer[esi]   ; Load a byte from the address  
   movzx eax, byte ptr JumpTable[ebx]     ; Retrieve the ebx'th value of the table
   test eax, eax    ; Check if it's zero
   jnz  CharCount   ; If it's not, it's a char, so jump to CharCount 
   ; FALL THROUGH TO notInSet
notInSet:  
   inc esi
   loop Loop1
   jmp Final

CharCount:
   ; DO SOMETHING (that doesn't mess up ECX, ESI)
   inc esi
   loop Loop1
   ; FALL THROUGH to Final

Final:
   ; END of this snippet

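The table has 256 values, one for every possible byte value (the full 8-bit range), each either 0 or 1.

In both cases, you can move the inc esi up to the beginning, right after the value is read with movzx ebx, byte ptr LocalBuffer[esi].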
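As a cross-check of the run lengths, here is a C sketch that builds such a table from runs and reports the total size. The names in_set_table and build_table are invented for this example. The gaps are 7 bytes between '9' and 'A' (58..64) but only 6 between 'Z' and 'a' (91..96), which is what makes the total come out to exactly 256.

```c
#include <stddef.h>
#include <string.h>

static unsigned char in_set_table[256];

/* Fill the table run by run and return the number of bytes written. */
size_t build_table(void) {
    static const struct { size_t len; unsigned char val; } runs[] = {
        { 48, 0},   /*   0..47  : below '0'   */
        { 10, 1},   /*  48..57  : '0'..'9'    */
        {  7, 0},   /*  58..64  : ':'..'@'    */
        { 26, 1},   /*  65..90  : 'A'..'Z'    */
        {  6, 0},   /*  91..96  : '['..'`'    */
        { 26, 1},   /*  97..122 : 'a'..'z'    */
        {133, 0},   /* 123..255 : above 'z'   */
    };
    size_t pos = 0;
    for (size_t i = 0; i < sizeof runs / sizeof runs[0]; i++) {
        memset(in_set_table + pos, runs[i].val, runs[i].len);
        pos += runs[i].len;
    }
    return pos;
}
```

A lookup is then just in_set_table[(unsigned char)c], which corresponds to the movzx eax, byte ptr JumpTable[ebx] in the assembly version.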
