Guillaume Matheron

Data scientist, PhD in computer science

IP-based geolocation in Clickhouse with IPv6


While setting up my analytics with clickhouse, I tried to map IPs to countries. I found this article that perfectly explains how to do this with IPv4, but gives no indication of how to adapt the queries to ipv6.

The beginning of the process is identical: fetch IP ranges from dbip-city-ipv6 using a table with a URL engine, then create a MergeTree table to hold the IP ranges in CIDR notation.

CREATE TABLE geoip_url6 (
    `ip_range_start` IPv6,
    `ip_range_end` IPv6,
    `country_code` Nullable(String),
    `state1` Nullable(String),
    `state2` Nullable(String),
    `city` Nullable(String),
    `postcode` Nullable(String),
    `latitude` Float64,
    `longitude` Float64,
    `timezone` Nullable(String)
)
ENGINE = URL('https://raw.githubusercontent.com/sapics/ip-location-db/master/dbip-city/dbip-city-ipv6.csv.gz', 'CSV');

CREATE TABLE geoip6 (
    `cidr` String,
    `latitude` Float64,
    `longitude` Float64,
    `country_code` String,
    `city` String
)
ENGINE = MergeTree
ORDER BY cidr;

Unfortunately the next step proved much harder, simply because the clickhouse function bitXor which is used in the original article supports values up to 64 bits, only half of what is required for an IPv6 address.

Working around this issue was much harder than I anticipated, but it mostly boiled down to:

  • Convert the IPv6 to a hexadecimal string representation (without forgetting to left-pad with zeros) wiht hex and leftPad.
  • Split this heximal representation into two parts with substr , then convert this back to a UInt64 with unhex, reverse and reinterpretAsUInt64. (see the note in the clickhouse reference for unhex).
  • Finally we can use bitXor to treat the most significant bits and least significant bits separately, then convert everything back to UInt128 using hex, leftPad, concat and then again unhex, reverse and reinterpretAsUInt128.
  • The rest of the query is mostly identical, with a bit of string manipulations to generate the mask and CIDR string.
insert into geoip6
select cidr, latitude, longitude, country_code, city from (
    with
        reinterpretAsUInt64(reverse(unhex(substr(leftPad(hex(ip_range_start),32,'0'),1,16)))) as start_msb,
        reinterpretAsUInt64(reverse(unhex(substr(leftPad(hex(ip_range_start),32,'0'),17)))) as start_lsb,
        reinterpretAsUInt64(reverse(unhex(substr(leftPad(hex(ip_range_end),32,'0'),1,16)))) as end_msb,
        reinterpretAsUInt64(reverse(unhex(substr(leftPad(hex(ip_range_end),32,'0'),17)))) as end_lsb,
        bitXor(start_msb, end_msb) as xor_msb,
        bitXor(start_lsb, end_lsb) as xor_lsb,
        leftPad(hex(xor_msb),16,'0') as xor_hex_msb,
        leftPad(hex(xor_lsb),16,'0') as xor_hex_lsb,
        concat(xor_hex_msb,xor_hex_lsb) as xor_hex,
        reinterpretAsUInt128(reverse(unhex(xor_hex))) as xor,
    	ceil(log2(xor+1))::UInt32 as unmatched,
        128-unmatched as cidr_suffix,
        reinterpretAsUInt128(reverse(unhex(hex(ip_range_start)))) as addr_start_uint128,
        reinterpretAsUInt128(reverse(unbin(leftPad(repeat('0', unmatched),128,'1')))) as mask
    select
        ip_range_start,
        ip_range_end,
        IPv6NumToString(toFixedString(unhex(hex(bitAnd(mask, addr_start_uint128))),16)) as cidr_address,
        cidr_suffix,
        concat(toString(cidr_address),'/',toString(cidr_suffix)) as cidr,
        latitude, longitude, country_code, city
    from geoip_url6
)

Finally, create the trie dictionnary:

CREATE DICTIONARY ip_trie6 (
    `cidr` String,
    `latitude` Float64,
    `longitude` Float64,
    `country_code` String,
    `city` String
)
PRIMARY KEY cidr
SOURCE(CLICKHOUSE(TABLE geoip6 USER 'default' PASSWORD '[HIDDEN]' DB 'mynalytics'))
LIFETIME(MIN 0 MAX 3600)
LAYOUT(IP_TRIE);

And perform lookups in the same way as with an IPv4 trie. I did not attempt to merge ipv4 and ipv6 tables, although RFC 6052 provides a mechanism for embedding IPv4 addresses into IPv6.

Lookups then look like this:

with
  dictGet(
    'ip_trie',
    ('country_code', 'latitude', 'longitude', 'city'),
    tuple(IPv4StringToNumOrDefault(min(ip)))
  ) as d4,
  dictGet(
    'ip_trie6',
    ('country_code', 'latitude', 'longitude', 'city'),
    tuple(IPv6StringToNumOrDefault(min(ip)))
  ) as d6
SELECT
  if(d4.1='', d6.1, d4.1) as country,
  *
FROM hits

2 responses to “IP-based geolocation in Clickhouse with IPv6”

  1. I know the original IPv4 article did it too, but the whole `if` nonsense is actually incorrect. `ceil(log2(xor+1))` will get you the right answer.

    In the case where xor = 1, you want the cidr mask to be 31 (v4) or 127 (v6). However, ceil(log2(1)) = 0, making your mask be 32 (or 128). Whoops.

    This is actually a problem at every bit transition. xor = 2, you want the cidr mask to be /30, but ceil(log2(2)) = 1.
    xor = 4, you want the cidr mask to be /29, but ceil(log2(4)) = 2.
    And so on…

    • Very nice catch! I tried to quantify the impact of this error.

      == IPv6 ==
      Out of the 2.8M records in geoip6, only 24% were using the correct CIDR.
      Out of 6184 distinct IPv6 that hit my website, the old method got the incorrect city in about half of cases, but the wrong country in only 84 cases.
      When trying to find a third source to confirm, I found that most of these IPv6 locations are very accurate anyways.

      == IPv4 ==
      Out of the 3.1M records in geoip, 52% were using the correct CIDR.
      Out of the 5331 distinct IPv4 that hit my website, the old method got the incorrect city in 42% of cases, and the wrong country in only 31 cases.
      A separate source gave me a bit more confidence in the results, but also made me realize that IP geolocation in general is not an exact science at all.

      I’m fixing the article, so for reference the original erroneous line was:
      “` if(xor != 0, ceil(log2(xor))::UInt32, 0) as unmatched, “`

Leave a Reply

Your email address will not be published. Required fields are marked *