Vlad
From what I understand:
For each post, a specific instance hosts that post. That instance should have all content associated with that post (comments, etc.). Other instances mirror content that gets sent to them, but those are only mirrors, and any replies made on a mirror are sent back to the host instance.
For example, this post is hosted on programming.dev, but is mirrored on Beehaw so that Beehaw users can reply to it with their Beehaw accounts. The source of truth for this post is the programming.dev instance though.
If you're crawling Lemmy posts, I highly recommend only looking at the instance the post is hosted on, because instances may choose to defederate from each other, and once they have defederated, content is no longer shared between those two instances. For example, using my post above, a user from lemmy.world may choose to reply to that post from the lemmy.world instance, and that reply would appear on programming.dev. However, because Beehaw (currently) is defederated from lemmy.world, the Beehaw mirror would not see that reply. Here's a post on Beehaw which talks about how defederating works, if you're interested, but generally speaking, if only the copy of the post on its host instance were indexed, I think that would be fine.
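To make the defederation example concrete, here's a rough toy model in Python. It is only a sketch of my understanding above, not how Lemmy actually implements federation; the function name `sees_reply` and the `defederated` set are made up for illustration.

```python
# Toy model (assumption: visibility follows the rules described above;
# this is NOT Lemmy's real federation logic).

def sees_reply(viewer: str, host: str, reply_origin: str,
               defederated: set[frozenset[str]]) -> bool:
    """Would `viewer`'s copy of a post hosted on `host` show a reply
    written by a user on `reply_origin`?"""
    def blocked(a: str, b: str) -> bool:
        return frozenset((a, b)) in defederated

    if viewer == reply_origin:
        return True   # the reply always exists locally where it was written
    if blocked(host, reply_origin):
        return False  # the reply never reaches the host, so nobody else gets it
    if viewer == host:
        return True   # the host is the source of truth
    # A mirror needs to federate with both the host and the reply's origin.
    return not blocked(viewer, host) and not blocked(viewer, reply_origin)


# Beehaw is (currently) defederated from lemmy.world.
defederated = {frozenset(("beehaw.org", "lemmy.world"))}

print(sees_reply("programming.dev", "programming.dev", "lemmy.world", defederated))
print(sees_reply("beehaw.org", "programming.dev", "lemmy.world", defederated))
```

The first call models the host's copy and the second models the Beehaw mirror, which is why a crawler that only reads the host's copy shouldn't miss replies that a mirror drops.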
As for finding which instance is the host instance, there might be a way using one of the APIs. From the HTML, there's usually a button that points to the post on its host instance, but the UIs may differ between instances (and at the time of posting this, do actually differ between Beehaw and programming.dev).
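If you do manage to get a canonical URL for the post or its community (from an API response or from that button's link), comparing hostnames is enough to tell whether you're looking at the host's copy or a mirror. Here's a minimal sketch; the helper names are my own, and where the canonical URL comes from is an assumption you'd need to check against whichever API you use.

```python
from urllib.parse import urlparse


def instance_of(url: str) -> str:
    """Return the lowercase hostname of a federated URL."""
    host = urlparse(url).hostname
    if host is None:
        raise ValueError(f"not an absolute URL: {url!r}")
    return host


def is_host_copy(crawled_url: str, canonical_url: str) -> bool:
    """True if the page being crawled lives on the same instance as the
    canonical URL (hypothetical input: however you obtained it)."""
    return instance_of(crawled_url) == instance_of(canonical_url)


# Example URLs are made up; only the hostnames matter here.
print(is_host_copy("https://beehaw.org/post/1", "https://programming.dev/post/2"))
print(is_host_copy("https://programming.dev/post/2", "https://programming.dev/post/2"))
```

A crawler could use a check like this to skip mirrored copies and index only the one on the host instance.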
I'm not as familiar with other federated services like Kbin and Mastodon, but I would imagine it's similar in that you'd want to treat the host's version of a post as the source of truth. Kbin is also able to interact with Lemmy through "Threads" and with Mastodon (I believe) through "Microblogs".