Class SpoofChecker


  • public class SpoofChecker
    extends Object
    Unicode Security and Spoofing Detection.

    This class checks strings, typically identifiers of some type such as URLs, for characters that are likely to be visually confusing, that is, for cases where the displayed form of an identifier may not be what it appears to be.

    Unicode Technical Report #36, "Unicode Security Considerations" (http://unicode.org/reports/tr36), and Unicode Technical Standard #39 (http://unicode.org/reports/tr39) give more background on security and spoofing issues with Unicode identifiers. The tests and checks provided by this module implement the recommendations from these Unicode documents.

    The tests available on identifiers fall into two general categories:

    • Single identifier tests. Check whether an identifier is potentially confusable with any other string, or is suspicious for other reasons.
    • Two identifier tests. Check whether two specific identifiers are confusable. This does not consider whether either of the strings is potentially confusable with any string other than the exact one specified.

    The steps to perform confusability testing are:

    • Create a SpoofChecker.Builder
    • Configure the Builder for the desired set of tests. The tests that will be performed are specified by a set of SpoofCheck flags.
    • Build a SpoofChecker from the Builder.
    • Perform the checks using the pre-configured SpoofChecker. The results indicate which (if any) of the selected tests have identified possible problems with the identifier. Results are reported as a set of SpoofCheck flags; this mirrors the form in which the set of tests to perform was originally specified to the SpoofChecker.

    A SpoofChecker instance may be used repeatedly to perform checks on any number of identifiers.

    Thread Safety: The methods on SpoofChecker objects are thread safe. The test functions for checking a single identifier, or for testing whether two identifiers are potentially confusable, may be called concurrently from multiple threads using the same SpoofChecker instance.

    Descriptions of the available checks.

    When testing whether pairs of identifiers are confusable with areConfusable(), the relevant tests are:

    • SINGLE_SCRIPT_CONFUSABLE: All of the characters from the two identifiers are from a single script, and the two identifiers are visually confusable.
    • MIXED_SCRIPT_CONFUSABLE: At least one of the identifiers contains characters from more than one script, and the two identifiers are visually confusable.
    • WHOLE_SCRIPT_CONFUSABLE: Each of the two identifiers is of a single script, but the two identifiers are from different scripts, and they are visually confusable.

    The safest approach is to enable all three of these checks as a group.

    ANY_CASE is a modifier for the above tests. If the identifiers being checked can be of mixed case and are used in a case-sensitive manner, this option should be specified.

    If the identifiers being checked are used in a case-insensitive manner, and if they are displayed to users in lower-case form only, the ANY_CASE option should not be specified. Confusability issues involving upper case letters will not be reported.

    When performing tests on a single identifier, with the failsChecks() family of methods, the relevant tests are:

    • MIXED_SCRIPT_CONFUSABLE: the identifier contains characters from multiple scripts, and there exists an identifier of a single script that is visually confusable.
    • WHOLE_SCRIPT_CONFUSABLE: the identifier consists of characters from a single script, and there exists a visually confusable identifier. The visually confusable identifier also consists of characters from a single script, but not the same script as the identifier being checked.
    • ANY_CASE: modifies the mixed script and whole script confusables tests. If specified, the checks will find confusable characters of any case. If this flag is not set, the test is performed assuming case folded identifiers.
    • SINGLE_SCRIPT: check that the identifier contains only characters from a single script. (Characters from the common and inherited scripts are ignored.) This is not a test for confusable identifiers.
    • INVISIBLE: check an identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark. This check does not test the input string as a whole for conformance to any particular syntax for identifiers.
    • CHAR_LIMIT: check that an identifier contains only characters from a specified set of acceptable characters. See Builder.setAllowedChars() and Builder.setAllowedLocales().
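
    As a rough illustration of the INVISIBLE check described above, the sketch below flags format (Cf) code points such as U+200B, and a non-spacing mark that repeats within one combining sequence. It uses only the JDK, not ICU. The class name InvisibleSketch, the method hasInvisible, and the decision to treat every Cf code point as invisible are assumptions of this sketch, not the SpoofChecker implementation.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, JDK only; not part of ICU.
final class InvisibleSketch {

    /** Returns true if id contains a format character or a repeated non-spacing mark. */
    static boolean hasInvisible(String id) {
        // Decompose first so that precomposed letters expose their marks.
        String nfd = Normalizer.normalize(id, Normalizer.Form.NFD);
        Set<Integer> marksInSequence = new HashSet<>();
        for (int i = 0; i < nfd.length(); ) {
            int cp = nfd.codePointAt(i);
            i += Character.charCount(cp);
            int type = Character.getType(cp);
            if (type == Character.FORMAT) {
                return true; // assumption: every Cf code point counts as invisible
            }
            if (type == Character.NON_SPACING_MARK) {
                // The same mark twice on one base usually renders like a single mark.
                if (!marksInSequence.add(cp)) {
                    return true;
                }
            } else {
                marksInSequence.clear(); // a new base character starts a new sequence
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasInvisible("pay\u200Bpal"));   // zero-width space inside
        System.out.println(hasInvisible("e\u0301\u0301"));  // acute accent applied twice
        System.out.println(hasInvisible("paypal"));
    }
}
```

    A production check should rely on SpoofChecker with the INVISIBLE flag, which may differ from this sketch in which characters it considers.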

    Note on Scripts:

    Characters from the Unicode Scripts "Common" and "Inherited" are ignored when considering the script of an identifier. Common characters include digits and symbols that are normally used with text from many different scripts.
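
    The effect of ignoring Common and Inherited can be seen with the JDK's own script data. The sketch below is an approximation under stated assumptions: it uses java.lang.Character.UnicodeScript (the Script property only), and the names ScriptSketch, scriptsOf and isSingleScript are hypothetical. ICU's own script resolution may differ in detail.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, JDK only; not part of ICU.
final class ScriptSketch {

    /** Returns the scripts used by text, skipping Common and Inherited. */
    static Set<Character.UnicodeScript> scriptsOf(String text) {
        Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script != Character.UnicodeScript.COMMON
                    && script != Character.UnicodeScript.INHERITED) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    /** True if at most one script remains after Common and Inherited are ignored. */
    static boolean isSingleScript(String text) {
        return scriptsOf(text).size() <= 1;
    }

    public static void main(String[] args) {
        // Digits and '_' are Common, so only Latin is reported here.
        System.out.println(scriptsOf("user_42"));
        // U+0430 is a Cyrillic letter that looks like Latin 'a'.
        System.out.println(isSingleScript("p\u0430ypal"));
    }
}
```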
    • Field Detail

      • INCLUSION

        @Deprecated
        public static final UnicodeSet INCLUSION
        Deprecated.
        This API is ICU internal only.
        Security Profile constant from UAX 31 for use in setAllowedChars. Will probably be replaced by UnicodeSet property.
      • RECOMMENDED

        @Deprecated
        public static final UnicodeSet RECOMMENDED
        Deprecated.
        This API is ICU internal only.
        Security Profile constant from UAX 31 for use in setAllowedChars. Will probably be replaced by UnicodeSet property.
      • SINGLE_SCRIPT_CONFUSABLE

        public static final int SINGLE_SCRIPT_CONFUSABLE
        Single script confusable test. When testing whether two identifiers are confusable, report that they are if both are from the same script and they are visually confusable. Note: this test is not applicable to a check of a single identifier.
        See Also:
        Constant Field Values
      • MIXED_SCRIPT_CONFUSABLE

        public static final int MIXED_SCRIPT_CONFUSABLE
        Mixed script confusable test.

        When checking a single identifier, report a problem if the identifier contains multiple scripts, and is also confusable with some other identifier in a single script.

        When testing whether two identifiers are confusable, report that they are if the two IDs are visually confusable, and at least one contains characters from more than one script.

        See Also:
        Constant Field Values
      • WHOLE_SCRIPT_CONFUSABLE

        public static final int WHOLE_SCRIPT_CONFUSABLE
        Whole script confusable test.

        When checking a single identifier, report a problem if the identifier is of a single script, and there exists a confusable identifier in another script.

        When testing whether two identifiers are confusable, report that they are if each is of a single script, the scripts of the two identifiers are different, and the identifiers are visually confusable.

        See Also:
        Constant Field Values
      • ANY_CASE

        public static final int ANY_CASE
        Any Case Modifier for confusable identifier tests.

        When specified, consider all characters, of any case, when looking for confusables. If ANY_CASE is not specified, identifiers being checked are assumed to have been case folded, and upper case confusable characters will not be checked.

        See Also:
        Constant Field Values
      • RESTRICTION_LEVEL

        @Deprecated
        public static final int RESTRICTION_LEVEL
        Deprecated.
        This API is ICU internal only.
        Check that an identifier is no looser than the specified RestrictionLevel. The default if this is not called is HIGHLY_RESTRICTIVE.
        See Also:
        Constant Field Values
      • SINGLE_SCRIPT

        @Deprecated
        public static final int SINGLE_SCRIPT
        Deprecated.
        ICU 51. Use RESTRICTION_LEVEL instead.
        Check that an identifier contains only characters from a single script (plus characters from the common and inherited scripts). Applies to checks of a single identifier only.
        See Also:
        Constant Field Values
      • INVISIBLE

        public static final int INVISIBLE
        Check an identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark. This check does not test the input string as a whole for conformance to any particular syntax for identifiers.
        See Also:
        Constant Field Values
      • CHAR_LIMIT

        public static final int CHAR_LIMIT
        Check that an identifier contains only characters from a specified set of acceptable characters. See Builder.setAllowedChars() and Builder.setAllowedLocales().
        See Also:
        Constant Field Values
      • MIXED_NUMBERS

        @Deprecated
        public static final int MIXED_NUMBERS
        Deprecated.
        This API is ICU internal only.
        Check that an identifier does not mix numbers.
        See Also:
        Constant Field Values
      • ALL_CHECKS

        public static final int ALL_CHECKS
        Enable all spoof checks.
        See Also:
        Constant Field Values
    • Method Detail

      • getRestrictionLevel

        @Deprecated
        public SpoofChecker.RestrictionLevel getRestrictionLevel()
        Deprecated.
        This API is ICU internal only.
        Get the Restriction Level that is being tested.
        Returns:
        The restriction level
      • getChecks

        public int getChecks()
        Get the set of checks that this Spoof Checker has been configured to perform.
        Returns:
        The set of checks that this spoof checker will perform.
      • getAllowedLocales

        public Set<ULocale> getAllowedLocales()
        Get a read-only set of locales for the scripts that are acceptable in strings to be checked. If no limitations on scripts have been specified, an empty set will be returned. setAllowedChars() will reset the list of allowed locales to be empty. The returned set may not be identical to the originally specified set that is supplied to setAllowedLocales(); the information other than languages from the originally specified locales may be omitted.
        Returns:
        A set of locales corresponding to the acceptable scripts.
      • getAllowedJavaLocales

        public Set<Locale> getAllowedJavaLocales()
        Get a set of JDK locales for the scripts that are acceptable in strings to be checked. If no limitations on scripts have been specified, an empty set will be returned.
        Returns:
        A set of locales corresponding to the acceptable scripts.
      • getAllowedChars

        public UnicodeSet getAllowedChars()
        Get a UnicodeSet for the characters permitted in an identifier. This corresponds to the limits imposed by the Set Allowed Characters functions. Limitations imposed by other checks will not be reflected in the set returned by this function. The returned set will be frozen, meaning that it cannot be modified by the caller.
        Returns:
        A UnicodeSet containing the characters that are permitted by the CHAR_LIMIT test.
      • failsChecks

        public boolean failsChecks(String text,
                                   SpoofChecker.CheckResult checkResult)
        Check the specified string for possible security issues. The text to be checked will typically be an identifier of some sort. The set of checks to be performed was specified when building the SpoofChecker.
        Parameters:
        text - A String to be checked for possible security issues.
        checkResult - Output parameter, indicates which specific tests failed. May be null if the information is not wanted.
        Returns:
        True if any issue is found with the input string.
      • failsChecks

        public boolean failsChecks(String text)
        Check the specified string for possible security issues. The text to be checked will typically be an identifier of some sort. The set of checks to be performed was specified when building the SpoofChecker.
        Parameters:
        text - A String to be checked for possible security issues.
        Returns:
        True if any issue is found with the input string.
      • areConfusable

        public int areConfusable(String s1,
                                 String s2)
        Check whether two specified strings are visually confusable. The types of confusability to be tested (single script, mixed script, or whole script) are determined by the check options set for the SpoofChecker. The tests to be performed are controlled by the flags SINGLE_SCRIPT_CONFUSABLE, MIXED_SCRIPT_CONFUSABLE, and WHOLE_SCRIPT_CONFUSABLE. At least one of these tests must be selected. ANY_CASE is a modifier for the tests. Select it if the identifiers may be of mixed case. If identifiers are case folded for comparison and display to the user, do not select the ANY_CASE option.
        Parameters:
        s1 - The first of the two strings to be compared for confusability.
        s2 - The second of the two strings to be compared for confusability.
        Returns:
        Non-zero if s1 and s2 are confusable. If not 0, the value will indicate the type(s) of confusability found, as defined by spoof check test constants.
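
        Because the return value is a set of bit flags rather than a boolean, callers typically mask it against the individual test constants. The sketch below shows that decoding step in isolation. The class ConfusableFlagsSketch, the method describe, and the numeric flag values are placeholders assumed for this illustration; real code should use the constants defined on SpoofChecker and pass in the value returned by areConfusable().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, JDK only; not part of ICU.
final class ConfusableFlagsSketch {
    // Placeholder bit values, assumed for this sketch only.
    static final int SINGLE_SCRIPT_CONFUSABLE = 1;
    static final int MIXED_SCRIPT_CONFUSABLE = 2;
    static final int WHOLE_SCRIPT_CONFUSABLE = 4;

    /** Lists the kinds of confusability encoded in a result bit mask. */
    static List<String> describe(int result) {
        List<String> kinds = new ArrayList<>();
        if ((result & SINGLE_SCRIPT_CONFUSABLE) != 0) {
            kinds.add("single-script");
        }
        if ((result & MIXED_SCRIPT_CONFUSABLE) != 0) {
            kinds.add("mixed-script");
        }
        if ((result & WHOLE_SCRIPT_CONFUSABLE) != 0) {
            kinds.add("whole-script");
        }
        return kinds; // empty when result is 0, meaning not confusable
    }

    public static void main(String[] args) {
        int result = MIXED_SCRIPT_CONFUSABLE | WHOLE_SCRIPT_CONFUSABLE;
        System.out.println(describe(result)); // prints [mixed-script, whole-script]
        System.out.println(describe(0));      // prints []
    }
}
```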
      • getSkeleton

        public String getSkeleton(int type,
                                  String id)
        Get the "skeleton" for an identifier string. Skeletons are a transformation of the input string; two strings are confusable if their skeletons are identical. See Unicode UTS 39 for additional information.

        Using skeletons directly makes it possible to quickly check whether an identifier is confusable with any of some large set of existing identifiers, by creating an efficiently searchable collection of the skeletons.

        Skeletons are computed using the algorithm and data described in Unicode UTS 39. The latest proposed update, UAX 39 Version 8 draft 1, says "the tables SL, SA, and ML were still problematic, and discouraged from use in [Unicode] 7.0. They were thus removed from version 8.0". In light of this, the default mapping data included with ICU 55 uses the Unicode 7 MA (Multi script, Any case) table data for the other type options (Single Script, Any Case), (Single Script, Lower Case) and (Multi Script, Lower Case).
        Parameters:
        type - The type of skeleton, corresponding to which of the Unicode confusable data tables to use. The default is Mixed-Script, Lowercase. Allowed options are SINGLE_SCRIPT_CONFUSABLE and ANY_CASE. The two flags may be ORed.
        id - The input identifier whose skeleton will be generated.
        Returns:
        The output skeleton string.
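
        The "efficiently searchable collection of the skeletons" mentioned above can be as simple as a hash map keyed by skeleton. The sketch below illustrates the idea with a toy stand-in for the skeleton transform: decompose, replace a handful of Cyrillic look-alikes with Latin letters, and decompose again. The class SkeletonIndexSketch, its methods, and the five-entry table are all hypothetical; a real index would store the strings returned by getSkeleton(), which draws on the full Unicode confusables data.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, JDK only; not part of ICU.
final class SkeletonIndexSketch {
    // Toy confusable table (assumed for this sketch, far smaller than the real data).
    private static final Map<Integer, String> TOY_TABLE = new HashMap<>();
    static {
        TOY_TABLE.put(0x0430, "a"); // CYRILLIC SMALL LETTER A
        TOY_TABLE.put(0x0435, "e"); // CYRILLIC SMALL LETTER IE
        TOY_TABLE.put(0x043E, "o"); // CYRILLIC SMALL LETTER O
        TOY_TABLE.put(0x0440, "p"); // CYRILLIC SMALL LETTER ER
        TOY_TABLE.put(0x0441, "c"); // CYRILLIC SMALL LETTER ES
    }

    // Maps each skeleton to the registered identifiers that produce it.
    private final Map<String, List<String>> bySkeleton = new HashMap<>();

    /** Toy skeleton: NFD, per-code-point mapping through TOY_TABLE, NFD again. */
    static String toySkeleton(String id) {
        String nfd = Normalizer.normalize(id, Normalizer.Form.NFD);
        StringBuilder mapped = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            String replacement = TOY_TABLE.get(cp);
            if (replacement != null) {
                mapped.append(replacement);
            } else {
                mapped.appendCodePoint(cp);
            }
        });
        return Normalizer.normalize(mapped, Normalizer.Form.NFD);
    }

    /** Records an existing identifier under its skeleton. */
    void register(String id) {
        bySkeleton.computeIfAbsent(toySkeleton(id), k -> new ArrayList<>()).add(id);
    }

    /** Returns registered identifiers whose skeleton equals the candidate's skeleton. */
    List<String> lookalikes(String candidate) {
        return bySkeleton.getOrDefault(toySkeleton(candidate), Collections.emptyList());
    }

    public static void main(String[] args) {
        SkeletonIndexSketch index = new SkeletonIndexSketch();
        index.register("paypal");
        // Candidate spelled with Cyrillic U+0430 in place of both Latin 'a's.
        System.out.println(index.lookalikes("p\u0430yp\u0430l")); // prints [paypal]
    }
}
```

        Each lookup costs one skeleton computation plus one hash probe, regardless of how many identifiers are registered, which is the advantage over calling areConfusable() against every existing identifier.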
      • equals

        @Deprecated
        public boolean equals(Object other)
        Deprecated.
        This API is ICU internal only.
        Equality function. Return true if the two SpoofChecker objects incorporate the same confusable data and have enabled the same set of checks.
        Overrides:
        equals in class Object
        Parameters:
        other - the SpoofChecker being compared with.
        Returns:
        true if the two SpoofCheckers are equal.
        See Also:
        Object.hashCode()