CLUSTERING BTC ADDRESSES WITH THEIR CHANGE ADDRESSES

    Based on https://flipsidecrypto.xyz/fsds/how-to-get-bitcoin-transfers-MHbE3A

    Loading...

    I found an interesting point in "How to Get Bitcoin Transfers" (https://flipsidecrypto.xyz/fsds/how-to-get-bitcoin-transfers-MHbE3A ) that mentions the difficulty in determining if an input address shares the same private key as a change address until they both appear as inputs in another transaction.

    However, in this document (https://wwz.unibas.ch/fileadmin/user_upload/wwz/00_Professuren/Schaer_DLTFintech/Lehre/Siegenthaler_Jessica_Masterthesis-1.pdf), the author suggests a different approach. In a transaction with one input and two outputs, if one output has four or fewer decimals and the other has more, the latter could be the change address, implying a shared private key with the input.

    I've crafted a query to cluster inputs with their change addresses, aiming to assign entity IDs and reduce the total entities in Bitcoin history. The list needs refining, considering change addresses being inputs in subsequent transactions and repeated change addresses with different inputs.

    Would love to hear your thoughts!

    Cheers

    -- CLUSTERING INPUT ADDRESSES WITH THEIR CHANGE ADDRESSES

    -- IDENTIFYING TX WITH 1 INPUTS AND 2 OUTPUTS

    with first_selection as ( SELECT tx_hash, input_count, output_count from bitcoin.core.fact_transactions where input_count = 1 and output_count = 2 ),

    -- ADDING OUTPUTS WITH THEIR VALUES

    outputs_first_selection as ( SELECT a.tx_hash, b.PUBKEY_SCRIPT_ADDRESS as output, b.value from first_selection a left join bitcoin.core.fact_outputs b on a.tx_hash = b.tx_id ),

    -- RECEIVERS LIST. VALUE <= 4 DECIMALS

    receivers as ( SELECT tx_hash, output as receiver, value from outputs_first_selection where length(substring(CAST(value AS STRING), position('.', CAST(value AS STRING))+1)) <= 4
    ),

    -- CHANGES LIST. VALUE > 4 DECIMALS

    changes as ( SELECT tx_hash, output as change, value from outputs_first_selection where length(substring(CAST(value AS STRING), position('.', CAST(value AS STRING))+1)) > 4
    ),

    --PICKING ONLY TXS WITH ONE RECEIVER AND ONE CHANGE

    changes_filtered as ( SELECT c.tx_hash, change from changes c inner join receivers r on c.tx_hash = r.tx_hash )

    -- CLUSTERING TOGETHER INPUTS WITH CHANGES (SAME INPUT SAME ID)

    select distinct --ch.tx_hash, i.pubkey_script_address as input, ch.change, dense_rank() over(order by input) as cluster_id from changes_filtered ch left join bitcoin.core.fact_inputs i on ch.tx_hash = i.spent_tx_id having change != input ORDER BY 1