HPUX collate8[4]

collate8(4) collate8(4)
NAME
collate8 - collating sequence table for languages with 8-bit character
sets
DESCRIPTION
There are four language-dependent collation algorithms for European
languages. These algorithms are:
Two_to_one conversions:
Some languages such as Spanish require two adjacent
characters to occupy one position in the collating
sequence. Examples are ``CH'' (which follows ``C'')
and ``LL'' (which follows ``L'').
One_to_two conversions:
Some languages such as German require one character
(e.g. ``sharp S'') to occupy two adjacent positions in
the collating sequence.
Don't-care characters:
Some languages designate certain characters to be
ignored in character comparisons. For example, if - is
a ``don't-care'' character, the strings REACT and
RE-ACT compare equal.
Case and accent priority:
Many languages require a ``two-pass'' collating
algorithm: in pass one, the accents are stripped off
the letters and the resulting two strings are compared;
if they are equal, a second pass with the accents
restored is performed to break the tie.
Uppercase/lowercase differentiation of letters can also
be handled in this fashion.
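The don't-care and two-pass rules above can be sketched as follows.
This is a minimal illustration, not the HP-UX implementation: struct
weight, toy_weight() and two_pass_compare() are hypothetical names, and
the toy table covers only ASCII letters, using case as the tie-breaking
priority. The two passes are folded into one loop by remembering the
first priority difference and using it only if the sequence numbers tie.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-character weight; not the collate8 file layout. */
struct weight {
    unsigned char seq_no;   /* 0 marks a don't-care character */
    unsigned char priority; /* tie-breaker (here: 0 upper, 1 lower) */
};

/* Toy table: letters only; everything else (including '-') is don't-care. */
static struct weight toy_weight(unsigned char c)
{
    struct weight w = {0, 0};
    if (c >= 'A' && c <= 'Z') {
        w.seq_no = (unsigned char)(c - 'A' + 1);
    } else if (c >= 'a' && c <= 'z') {
        w.seq_no = (unsigned char)(c - 'a' + 1);
        w.priority = 1;
    }
    return w;
}

/* Advance past don't-care characters. */
static const unsigned char *skip_ignored(const unsigned char *s)
{
    while (*s != '\0' && toy_weight(*s).seq_no == 0)
        s++;
    return s;
}

/* Returns <0, 0 or >0, in the manner of strcmp(). */
int two_pass_compare(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    int tie = 0; /* first priority difference seen, if any */

    assert(a != NULL && b != NULL);
    for (;;) {
        p = skip_ignored(p);
        q = skip_ignored(q);
        if (*p == '\0' || *q == '\0')
            break;
        struct weight wp = toy_weight(*p);
        struct weight wq = toy_weight(*q);
        /* Pass one: sequence numbers decide as soon as they differ. */
        if (wp.seq_no != wq.seq_no)
            return (int)wp.seq_no - (int)wq.seq_no;
        /* Pass two: keep the first priority difference for a tie. */
        if (tie == 0)
            tie = (int)wp.priority - (int)wq.priority;
        p++;
        q++;
    }
    if (*p != '\0')
        return 1;  /* a is longer */
    if (*q != '\0')
        return -1; /* b is longer */
    return tie;
}
```

With this toy table, two_pass_compare("REACT", "RE-ACT") returns zero
because the hyphen has sequence number zero and is skipped.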
Table Description
The collating-sequence table has four sections: a file header, a
sequence table, a two_to_one mapping table, and a one_to_two mapping
table.
File Header:
The file header has the following format:
struct header {
    short int table_len;      /* Table length */
    short int lang_id;        /* Language id number */
    short int reserved1;      /* Reserved */
    short int seq_tab;        /* Address of sequence table */
    short int seq_len;        /* Length of sequence table */
    short int two_to_one;     /* Address of two_to_one table */
    short int two_to_one_len; /* Length of two_to_one table */
    short int one_to_two;     /* Address of one_to_two table */
    short int one_to_two_len; /* Length of one_to_two table */
    char low_char;            /* Lowest character */
    char high_char;           /* Highest character */
};
Sequence Table:
Sequence table entries have the following format:
struct seq_ent {
    unsigned char seq_no;    /* Sequence number */
    unsigned char type_info; /* Character type */
};
The byte value of a given character is used as an index into the
sequence table. The first two bits of type_info record the character
type. A value of zero means the character is a one_to_one character,
and the other six bits in type_info contain its priority. A value of
one or two means that type_info contains an index into the two_to_one
or the one_to_two mapping table, respectively. A value of zero in
seq_no means the character is a ``don't-care'' character.
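As a sketch of how a sequence-table entry might be decoded, the helpers
below split type_info into its two fields. Two points are assumptions
rather than facts from this page: that the ``first two bits'' are the
two most significant bits, and that the table is indexed directly by
byte value with 256 entries (the low_char and high_char header fields
are ignored here). The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct seq_ent {
    unsigned char seq_no;    /* Sequence number */
    unsigned char type_info; /* Character type */
};

/* Character types named in the text. */
enum { ONE_TO_ONE = 0, TWO_TO_ONE = 1, ONE_TO_TWO = 2 };

/* ASSUMPTION: the "first two bits" are the two most significant bits. */
#define TYPE_SHIFT 6
#define LOW6_MASK  0x3F

/* Character type held in the top two bits of type_info. */
int ent_type(const struct seq_ent *e)
{
    assert(e != NULL);
    return e->type_info >> TYPE_SHIFT;
}

/* Remaining six bits: a priority (type 0) or a mapping-table
 * index (types 1 and 2). */
int ent_low6(const struct seq_ent *e)
{
    assert(e != NULL);
    return e->type_info & LOW6_MASK;
}

/* A sequence number of zero marks a don't-care character. */
int ent_is_dont_care(const struct seq_ent *e)
{
    assert(e != NULL);
    return e->seq_no == 0;
}

/* ASSUMPTION: table has 256 entries, indexed by byte value. */
const struct seq_ent *seq_lookup(const struct seq_ent *table,
                                 unsigned char c)
{
    assert(table != NULL);
    return &table[c];
}
```

Under these assumptions an entry with type_info 0x47 decodes to type
TWO_TO_ONE with index 7.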
Mapping Table for two_to_one Mapped Characters:
Entries in the two_to_one table have the following format:
struct two_to_one {
    char reserved1;      /* Reserved */
    char legal_char;     /* Legal character */
    struct seq_ent seq2; /* Sequence entry for this pair */
};
``Legal'' two_to_one characters are listed for each particular
character. ``Legal'' means that the combination of the two characters
is treated as a single character. If a match is found, the
corresponding sequence entry is used for the pair. If no legal
successor is found in the table, the character is treated according to
one_to_one mapping: its sequence entry is formed from the character's
own sequence number and the priority in the last entry.
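The lookup just described might be sketched as below. The page does
not say how the run of entries for a character is delimited, so this
sketch assumes the caller passes the run and its length n, that the
last entry of the run serves only as the fallback and is not itself a
legal pair, and that the fallback priority sits in the low six bits of
its type_info. resolve_two_to_one() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

struct seq_ent {
    unsigned char seq_no;    /* Sequence number */
    unsigned char type_info; /* Character type */
};

struct two_to_one {
    char reserved1;      /* Reserved */
    char legal_char;     /* Legal character */
    struct seq_ent seq2; /* Sequence entry for this pair */
};

/*
 * Resolve a two_to_one character.
 *   first_seq_no  sequence number of the first character
 *   entries, n    ASSUMED run of entries for that character, n >= 1;
 *                 entries[n - 1] is ASSUMED to be the fallback entry
 *   next          the character following the first one ('\0' at end)
 *   out           receives the resulting sequence entry
 * Returns the number of input characters consumed (1 or 2).
 */
int resolve_two_to_one(unsigned char first_seq_no,
                       const struct two_to_one *entries, size_t n,
                       char next, struct seq_ent *out)
{
    size_t i;

    assert(entries != NULL && out != NULL && n >= 1);
    if (next != '\0') {
        for (i = 0; i + 1 < n; i++) {
            if (entries[i].legal_char == next) {
                *out = entries[i].seq2; /* the pair collates as one */
                return 2;
            }
        }
    }
    /* No legal successor: one_to_one mapping with the character's own
     * sequence number and the priority from the last entry. */
    out->seq_no = first_seq_no;
    out->type_info = entries[n - 1].seq2.type_info & 0x3F;
    return 1;
}
```

For a Spanish-style table in which ``H'' is the only legal successor of
``C'', the input ``CH'' consumes two characters and yields the pair's
entry, while ``CA'' consumes one and falls back to the entry for ``C''.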
Mapping Table for one_to_two Mapped Characters:
Entries in the one_to_two mapping table have the same format as
entries in the sequence table. The sequence number of the first
character is known from the entry in the sequence table. The sequence
number of the second character is found in the one_to_two mapping
entry, and the priority stored in that entry is used for both
characters.
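A one_to_two expansion along these lines could look like the sketch
below. It assumes that the character type is in the two most
significant bits of type_info, that the low six bits of the
sequence-table entry index the one_to_two table in units of entries,
and that the shared priority is in the low six bits of the mapping
entry. struct coll_elem and expand_one_to_two() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct seq_ent {
    unsigned char seq_no;    /* Sequence number */
    unsigned char type_info; /* Character type */
};

/* Hypothetical output element: one collating position. */
struct coll_elem {
    unsigned char seq_no;
    unsigned char priority;
};

/*
 * Expand a one_to_two character into two collating positions.
 *   ent         sequence-table entry of the character (type must be 2)
 *   one_to_two  the one_to_two mapping table
 *   out         receives the two positions
 */
void expand_one_to_two(const struct seq_ent *ent,
                       const struct seq_ent *one_to_two,
                       struct coll_elem out[2])
{
    assert(ent != NULL && one_to_two != NULL && out != NULL);
    /* ASSUMPTION: type is in the top two bits; 2 means one_to_two. */
    assert((ent->type_info >> 6) == 2);

    /* ASSUMPTION: the low six bits index the mapping table by entry. */
    const struct seq_ent *map = &one_to_two[ent->type_info & 0x3F];
    unsigned char prio = (unsigned char)(map->type_info & 0x3F);

    out[0].seq_no = ent->seq_no; /* first position: sequence table */
    out[0].priority = prio;
    out[1].seq_no = map->seq_no; /* second position: mapping entry */
    out[1].priority = prio;      /* same priority for both */
}
```

For German ``sharp S'', both positions would typically carry the
sequence number of ``s'', so that the character collates like ``ss''
in the first pass.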
WARNING
This file is provided for historical reasons only. The recommended
interface for native language support collation is provided by the
routines nl_strcmp() and nl_strncmp() (see string(3C)).
AUTHOR
collate8 was developed by the Hewlett-Packard Company.
SEE ALSO
sort(1), nl_string(3C).
Hewlett-Packard Company - 3 - HP-UX Release 9.0: August 1992