CLUSTERING BTC ADDRESSES WITH THEIR CHANGE ADDRESSES
Based on https://flipsidecrypto.xyz/fsds/how-to-get-bitcoin-transfers-MHbE3A
I found an interesting point in "How to Get Bitcoin Transfers" (https://flipsidecrypto.xyz/fsds/how-to-get-bitcoin-transfers-MHbE3A ) that mentions the difficulty in determining if an input address shares the same private key as a change address until they both appear as inputs in another transaction.
However, in this document (https://wwz.unibas.ch/fileadmin/user_upload/wwz/00_Professuren/Schaer_DLTFintech/Lehre/Siegenthaler_Jessica_Masterthesis-1.pdf), the author suggests a different approach. In a transaction with one input and two outputs, if one output has four or fewer decimals and the other has more, the latter could be the change address, implying a shared private key with the input.
I've crafted a query to cluster inputs with their change addresses, aiming to assign entity IDs and reduce the total entities in Bitcoin history. The list needs refining, considering change addresses being inputs in subsequent transactions and repeated change addresses with different inputs.
Would love to hear your thoughts!
Cheers
-- CLUSTERING INPUT ADDRESSES WITH THEIR CHANGE ADDRESSES
-- IDENTIFYING TX WITH 1 INPUTS AND 2 OUTPUTS
with first_selection as ( SELECT tx_hash, input_count, output_count from bitcoin.core.fact_transactions where input_count = 1 and output_count = 2 ),
-- ADDING OUTPUTS WITH THEIR VALUES
outputs_first_selection as ( SELECT a.tx_hash, b.PUBKEY_SCRIPT_ADDRESS as output, b.value from first_selection a left join bitcoin.core.fact_outputs b on a.tx_hash = b.tx_id ),
-- RECEIVERS LIST. VALUE <= 4 DECIMALS
receivers as (
SELECT
tx_hash,
output as receiver,
value
from outputs_first_selection
where length(substring(CAST(value AS STRING), position('.', CAST(value AS STRING))+1)) <= 4
),
-- CHANGES LIST. VALUE > 4 DECIMALS
changes as (
SELECT
tx_hash,
output as change,
value
from outputs_first_selection
where length(substring(CAST(value AS STRING), position('.', CAST(value AS STRING))+1)) > 4
),
--PICKING ONLY TXS WITH ONE RECEIVER AND ONE CHANGE
changes_filtered as ( SELECT c.tx_hash, change from changes c inner join receivers r on c.tx_hash = r.tx_hash )
-- CLUSTERING TOGETHER INPUTS WITH CHANGES (SAME INPUT SAME ID)
select distinct --ch.tx_hash, i.pubkey_script_address as input, ch.change, dense_rank() over(order by input) as cluster_id from changes_filtered ch left join bitcoin.core.fact_inputs i on ch.tx_hash = i.spent_tx_id having change != input ORDER BY 1