Guillaume Matheron

Data scientist, PhD in computer science

IP-based geolocation in Clickhouse with IPv6


While setting up my analytics with clickhouse, I tried to map IPs to countries. I found this article that perfectly explains how to do this with IPv4, but gives no indication of how to adapt the queries to ipv6.

The beginning of the process is identical: fetch IP ranges from dbip-city-ipv6 using a table with a URL engine, then create a MergeTree table to hold the IP ranges in CIDR notation.

CREATE TABLE geoip_url6 (
    `ip_range_start` IPv6,
    `ip_range_end` IPv6,
    `country_code` Nullable(String),
    `state1` Nullable(String),
    `state2` Nullable(String),
    `city` Nullable(String),
    `postcode` Nullable(String),
    `latitude` Float64,
    `longitude` Float64,
    `timezone` Nullable(String)
)
ENGINE = URL('https://raw.githubusercontent.com/sapics/ip-location-db/master/dbip-city/dbip-city-ipv6.csv.gz', 'CSV');

CREATE TABLE geoip6 (
    `cidr` String,
    `latitude` Float64,
    `longitude` Float64,
    `country_code` String,
    `city` String
)
ENGINE = MergeTree
ORDER BY cidr;

Unfortunately the next step proved much harder, simply because the clickhouse function bitXor which is used in the original article supports values up to 64 bits, only half of what is required for an IPv6 address.

Working around this issue was much harder than I anticipated, but it mostly boiled down to:

  • Convert the IPv6 to a hexadecimal string representation (without forgetting to left-pad with zeros) wiht hex and leftPad.
  • Split this heximal representation into two parts with substr , then convert this back to a UInt64 with unhex, reverse and reinterpretAsUInt64. (see the note in the clickhouse reference for unhex).
  • Finally we can use bitXor to treat the most significant bits and least significant bits separately, then convert everything back to UInt128 using hex, leftPad, concat and then again unhex, reverse and reinterpretAsUInt128.
  • The rest of the query is mostly identical, with a bit of string manipulations to generate the mask and CIDR string.
insert into geoip6
select cidr, latitude, longitude, country_code, city from (
    with
        reinterpretAsUInt64(reverse(unhex(substr(leftPad(hex(ip_range_start),32,'0'),1,16)))) as start_msb,
        reinterpretAsUInt64(reverse(unhex(substr(leftPad(hex(ip_range_start),32,'0'),17)))) as start_lsb,
        reinterpretAsUInt64(reverse(unhex(substr(leftPad(hex(ip_range_end),32,'0'),1,16)))) as end_msb,
        reinterpretAsUInt64(reverse(unhex(substr(leftPad(hex(ip_range_end),32,'0'),17)))) as end_lsb,
        bitXor(start_msb, end_msb) as xor_msb,
        bitXor(start_lsb, end_lsb) as xor_lsb,
        leftPad(hex(xor_msb),16,'0') as xor_hex_msb,
        leftPad(hex(xor_lsb),16,'0') as xor_hex_lsb,
        concat(xor_hex_msb,xor_hex_lsb) as xor_hex,
        reinterpretAsUInt128(reverse(unhex(xor_hex))) as xor,
    	if(xor != 0, ceil(log2(xor))::UInt32, 0) as unmatched,
        128-unmatched as cidr_suffix,
        reinterpretAsUInt128(reverse(unhex(hex(ip_range_start)))) as addr_start_uint128,
        reinterpretAsUInt128(reverse(unbin(leftPad(repeat('0', unmatched),128,'1')))) as mask
    select
        ip_range_start,
        ip_range_end,
        IPv6NumToString(toFixedString(unhex(hex(bitAnd(mask, addr_start_uint128))),16)) as cidr_address,
        cidr_suffix,
        concat(toString(cidr_address),'/',toString(cidr_suffix)) as cidr,
        latitude, longitude, country_code, city
    from geoip_url6
)

Finally, create the trie dictionnary:

CREATE DICTIONARY ip_trie6 (
    `cidr` String,
    `latitude` Float64,
    `longitude` Float64,
    `country_code` String,
    `city` String
)
PRIMARY KEY cidr
SOURCE(CLICKHOUSE(TABLE geoip6 USER 'default' PASSWORD '[HIDDEN]' DB 'mynalytics'))
LIFETIME(MIN 0 MAX 3600)
LAYOUT(IP_TRIE);

And perform lookups in the same way as with an IPv4 trie. I did not attempt to merge ipv4 and ipv6 tables, although RFC 6052 provides a mechanism for embedding IPv4 addresses into IPv6.

Lookups then look like this:

with
  dictGet(
    'ip_trie',
    ('country_code', 'latitude', 'longitude', 'city'),
    tuple(IPv4StringToNumOrDefault(min(ip)))
  ) as d4,
  dictGet(
    'ip_trie6',
    ('country_code', 'latitude', 'longitude', 'city'),
    tuple(IPv6StringToNumOrDefault(min(ip)))
  ) as d6
SELECT
  if(d4.1='', d6.1, d4.1) as country,
  *
FROM hits

Leave a Reply

Your email address will not be published. Required fields are marked *