While setting up my analytics with clickhouse, I tried to map IPs to countries. I found this article that perfectly explains how to do this with IPv4, but gives no indication of how to adapt the queries to ipv6.
The beginning of the process is identical: fetch IP ranges from dbip-city-ipv6 using a table with a URL engine, then create a MergeTree table to hold the IP ranges in CIDR notation.
CREATE TABLE geoip_url6 (
`ip_range_start` IPv6,
`ip_range_end` IPv6,
`country_code` Nullable(String),
`state1` Nullable(String),
`state2` Nullable(String),
`city` Nullable(String),
`postcode` Nullable(String),
`latitude` Float64,
`longitude` Float64,
`timezone` Nullable(String)
)
ENGINE = URL('https://raw.githubusercontent.com/sapics/ip-location-db/master/dbip-city/dbip-city-ipv6.csv.gz', 'CSV');
CREATE TABLE geoip6 (
`cidr` String,
`latitude` Float64,
`longitude` Float64,
`country_code` String,
`city` String
)
ENGINE = MergeTree
ORDER BY cidr;
Unfortunately the next step proved much harder, simply because the clickhouse function bitXor which is used in the original article supports values up to 64 bits, only half of what is required for an IPv6 address.
Update (2024-04-25) : recent versions of clickhouse now support 128-bits (and ipv6) arithmetics !
Working around this issue was much harder than I anticipated, but it mostly boiled down to:
- Convert the IPv6 to a hexadecimal string representation (without forgetting to left-pad with zeros) wiht
hex
andleftPad
. - Split this heximal representation into two parts with
substr
, then convert this back to a UInt64 withunhex
,reverse
andreinterpretAsUInt64
. (see the note in the clickhouse reference forunhex
). - Finally we can use bitXor to treat the most significant bits and least significant bits separately, then convert everything back to
UInt128
usinghex
,leftPad
,concat
and then againunhex
,reverse
andreinterpretAsUInt128
. - The rest of the query is mostly identical, with a bit of string manipulations to generate the mask and CIDR string.
insert into geoip6
select cidr, latitude, longitude, country_code, city from (
with
reinterpretAsUInt64(reverse(unhex(substr(leftPad(hex(ip_range_start),32,'0'),1,16)))) as start_msb,
reinterpretAsUInt64(reverse(unhex(substr(leftPad(hex(ip_range_start),32,'0'),17)))) as start_lsb,
reinterpretAsUInt64(reverse(unhex(substr(leftPad(hex(ip_range_end),32,'0'),1,16)))) as end_msb,
reinterpretAsUInt64(reverse(unhex(substr(leftPad(hex(ip_range_end),32,'0'),17)))) as end_lsb,
bitXor(start_msb, end_msb) as xor_msb,
bitXor(start_lsb, end_lsb) as xor_lsb,
leftPad(hex(xor_msb),16,'0') as xor_hex_msb,
leftPad(hex(xor_lsb),16,'0') as xor_hex_lsb,
concat(xor_hex_msb,xor_hex_lsb) as xor_hex,
reinterpretAsUInt128(reverse(unhex(xor_hex))) as xor,
ceil(log2(xor+1))::UInt32 as unmatched,
128-unmatched as cidr_suffix,
reinterpretAsUInt128(reverse(unhex(hex(ip_range_start)))) as addr_start_uint128,
reinterpretAsUInt128(reverse(unbin(leftPad(repeat('0', unmatched),128,'1')))) as mask
select
ip_range_start,
ip_range_end,
IPv6NumToString(toFixedString(unhex(leftPad(hex(bitAnd(mask, addr_start_uint128)),32,'0')),16)) as cidr_address,
cidr_suffix,
concat(toString(cidr_address),'/',toString(cidr_suffix)) as cidr,
latitude, longitude, country_code, city
from geoip_url6
)
Finally, create the trie dictionnary:
CREATE DICTIONARY ip_trie6 (
`cidr` String,
`latitude` Float64,
`longitude` Float64,
`country_code` String,
`city` String
)
PRIMARY KEY cidr
SOURCE(CLICKHOUSE(TABLE geoip6 USER 'default' PASSWORD '[HIDDEN]' DB 'mynalytics'))
LIFETIME(MIN 0 MAX 3600)
LAYOUT(IP_TRIE);
And perform lookups in the same way as with an IPv4 trie. I did not attempt to merge ipv4 and ipv6 tables, although RFC 6052 provides a mechanism for embedding IPv4 addresses into IPv6.
Lookups then look like this:
with
dictGet(
'ip_trie',
('country_code', 'latitude', 'longitude', 'city'),
tuple(IPv4StringToNumOrDefault(min(ip)))
) as d4,
dictGet(
'ip_trie6',
('country_code', 'latitude', 'longitude', 'city'),
tuple(IPv6StringToNumOrDefault(min(ip)))
) as d6
SELECT
if(d4.1='', d6.1, d4.1) as country,
*
FROM hits
8 responses to “IP-based geolocation in Clickhouse with IPv6”
I know the original IPv4 article did it too, but the whole `if` nonsense is actually incorrect. `ceil(log2(xor+1))` will get you the right answer.
In the case where xor = 1, you want the cidr mask to be 31 (v4) or 127 (v6). However, ceil(log2(1)) = 0, making your mask be 32 (or 128). Whoops.
This is actually a problem at every bit transition. xor = 2, you want the cidr mask to be /30, but ceil(log2(2)) = 1.
xor = 4, you want the cidr mask to be /29, but ceil(log2(4)) = 2.
And so on…
Very nice catch! I tried to quantify the impact of this error.
== IPv6 ==
Out of the 2.8M records in geoip6, only 24% were using the correct CIDR.
Out of 6184 distinct IPv6 that hit my website, the old method got the incorrect city in about half of cases, but the wrong country in only 84 cases.
When trying to find a third source to confirm, I found that most of these IPv6 locations are very accurate anyways.
== IPv4 ==
Out of the 3.1M records in geoip, 52% were using the correct CIDR.
Out of the 5331 distinct IPv4 that hit my website, the old method got the incorrect city in 42% of cases, and the wrong country in only 31 cases.
A separate source gave me a bit more confidence in the results, but also made me realize that IP geolocation in general is not an exact science at all.
I’m fixing the article, so for reference the original erroneous line was:
“` if(xor != 0, ceil(log2(xor))::UInt32, 0) as unmatched, “`
Looks like something is very wrong with this function, so please remove it so noone copies this before you have checked. An example is the range:
toIPv6(‘8.8.0.0’) – toIPv6(‘8.9.0.0’), the result of toIPv6 is ::ffff:8.8.0.0, but after calculating mask, the range becomes ffff:808::
Indeed, a leftPad was missing which gave wrong results in some cases. I updated the article, the line
`IPv6NumToString(toFixedString(unhex(hex(bitAnd(mask, addr_start_uint128))),16)) as cidr_address,`
was changed to
`IPv6NumToString(toFixedString(unhex(leftPad(hex(bitAnd(mask, addr_start_uint128)),32,’0′)),16)) as cidr_address`
Thanks for the heads-up, and let me know if you find other issues!
For anyone impacted, as a reference this change invalidated the city of about 20% of my hits, and less than 1% of the countries.
Is this really the easyest way of accomplishing this? Can not a direct bitXor on the 128bit toIPv6() be used?
Unfortunately toIPv6 will only accept a string
What is incorrect with this way?
WITH
toIPv6(‘127.0.0.1’) AS ip_range_start,
toIPv6(‘127.0.0.10’) AS ip_range_end,
128 – CAST(ceil(log2(bitXor(ip_range_start, ip_range_end) + 1)), ‘UInt32’) AS mask,
IPv6NumToString(toFixedString(unbin(rightPad(substr(bin(ip_range_start), 1, mask), 128, ‘0’)), 16)) AS cidr
SELECT concat(cidr, ‘/’, mask)
FORMAT vertical
This way is fine, but it requires a recent of version of clickhouse since calling bitXor on IPv6 is a recent addition (august 2023, after this article was written) https://github.com/ClickHouse/ClickHouse/pull/57707
Actually the documentation is still outdated on this and doesn’t mention ipv6 or even UInt128 as being compatible with bitXor https://clickhouse.com/docs/en/sql-reference/functions/bit-functions#bitxora-b