Everything, Everything

BitTorrent And SET
Thursday 12th April, 2007 23:40
Movies and music could be shared faster over the net thanks to a system pioneered by researchers in the US. The findings are outlined in a paper, Exploiting Similarity for Multi-Source Downloads Using File Handprints, written by David Andersen of Carnegie Mellon University, Himabindu Pucha, of Purdue University, and Michael Kaminsky of Intel Research.

Current file-sharing systems, like BitTorrent, work best when there are multiple sources of a specific shared file. When a file is shared it is divided into chunks and distributed to groups of people who are searching for that file. The more sources of those chunks there are, the more information there is that can be sent to a user, resulting in faster download speeds. But these services often fail to deliver fast speeds because there are not enough users sharing the chunks of a specific file.

"A big limitation of BitTorrent is that it only lets clients share data if they're downloading the exact same file," said Professor Andersen. "This means that the available client pool for any particular file is smaller than it needs to be."

Similarity-Enhanced Transfer (SET) works by spotting chunks of identical data in files that are an exact or near match to the one needed. The trio realised that many files being shared on the net contain identical pieces of data even though they appear to be different, so SET can draw on more sources and deliver faster speeds. Professor Andersen said he was "shocked" by this discovery.

I'm very surprised he was shocked. It seems quite sensible and logical to me, but I can also see why it's not worth implementing.

A lot of torrents contain the same files, e.g.

linux-distro.iso

linux-distro.iso
linux-distro.md5
linux-distro-readme.txt

Both contain the main "linux-distro.iso" file, but because of the additional files in the second torrent, the resulting infohash is different, so they aren't a match.
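
To make that concrete, here is a minimal sketch of how an infohash is derived: it is the SHA1 of the bencoded "info" value from the metainfo. The file contents, the 4 KiB piece length, the "linux-distro" directory name and the helper names (bencode, piece_hashes, infohash) are all made up for illustration; real clients use far larger pieces and a full bencode library.

```python
import hashlib


def bencode(value):
    """Minimal bencoder for the types found in a metainfo file."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must be emitted in sorted order.
        items = sorted(((k.encode("utf-8"), v) for k, v in value.items()),
                       key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def piece_hashes(data, piece_length):
    """Concatenated 20-byte SHA1 digests, one per piece of the payload."""
    return b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )


def infohash(info):
    """SHA1 of the bencoded info dictionary, as a hex string."""
    return hashlib.sha1(bencode(info)).hexdigest()


PIECE_LENGTH = 4096  # toy value, assumed for the example

# Stand-in file contents (hypothetical).
iso = bytes(range(256)) * 64
md5 = b"0123456789abcdef0123456789abcdef  linux-distro.iso\n"
readme = b"Read me first.\n"

# First torrent: just the ISO.
torrent_a = {
    "name": "linux-distro.iso",
    "length": len(iso),
    "piece length": PIECE_LENGTH,
    "pieces": piece_hashes(iso, PIECE_LENGTH),
}

# Second torrent: the same ISO plus two small extra files.
torrent_b = {
    "name": "linux-distro",
    "piece length": PIECE_LENGTH,
    "pieces": piece_hashes(iso + md5 + readme, PIECE_LENGTH),
    "files": [
        {"length": len(iso), "path": ["linux-distro.iso"]},
        {"length": len(md5), "path": ["linux-distro.md5"]},
        {"length": len(readme), "path": ["linux-distro-readme.txt"]},
    ],
}

# Same payload as the first torrent, but with a different file name.
renamed = dict(torrent_a, name="linux-distro-renamed.iso")

for label, info in [("a", torrent_a), ("b", torrent_b), ("renamed", renamed)]:
    print(label, infohash(info))
```

All three infohashes come out different, even though the ISO bytes are the same in every case, so a tracker or DHT keyed on the infohash treats them as unrelated swarms.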

naughty-file.avi
naughty-group-file.nfo
naughty-tracker-file.txt

naughty-file.avi
naughty-group2-file.nfo

In this second scenario, both torrents contain the same naughty file, which takes up most of the torrent, but (for the reason stated above) there is currently no way they can work together, even using DHT, as they have completely different infohashes. Even a different filename will change the overall infohash, as it's a SHA1 hash of the info value from the metainfo, and that value includes the file names. Different ID3 tag information changes the file's data, and so the piece hashes stored in the info value, which changes the infohash too.

But, if they use the same size pieces and both start with the largest file first, there's a good chance that the majority of the pieces will be identical. (ID3v2 information is stored at the start of the MP3, so a different ID3v2 tag would cause problems, but if it just had an ID3v1 tag at the end of the MP3 then you'd probably be able to grab most of the first file okay.) Although you can't tell that from the overall SHA1 infohash that's sent to the tracker, each piece is also hashed, and downloaded data is compared against that piece's SHA1 hash to ensure that it is identical. So you could run another DHT-type tracker that performs the same task based on the pieces, rather than the overall infohash.

As long as the overhead of DHT data for pieces doesn't use significantly more bandwidth, it could be very useful for poorly seeded material. But most legal content is, or should be, well seeded. And ideally you'd want to download your Linux distro from the official website, as it's less likely to have been altered (in a very bad way, e.g. including a backdoor in a service) than the similarly named download at some random website.
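
The piece-level overlap described above can be checked with a small sketch. This is only an illustration under assumed values: the payload bytes, the 4 KiB piece length and the piece_hashes helper are hypothetical, and it simply treats a multi-file torrent as its files joined end to end and cut into fixed-size pieces.

```python
import hashlib


def piece_hashes(data, piece_length):
    """SHA1 digest of each fixed-size piece of the joined payload."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


PIECE_LENGTH = 4096  # toy value, assumed for the example

# Stand-in for the large shared file: 20,000 non-repeating bytes.
avi = b"".join(hashlib.sha1(str(i).encode()).digest() for i in range(1000))

# Two torrents with the same big file first, then different small files.
torrent_a = avi + b"group one nfo" + b"tracker advert"
torrent_b = avi + b"group two nfo"

pieces_a = piece_hashes(torrent_a, PIECE_LENGTH)
pieces_b = piece_hashes(torrent_b, PIECE_LENGTH)
shared = set(pieces_a) & set(pieces_b)
print(len(shared), "of", len(pieces_b), "pieces match")

# Data added at the front (like an ID3v2 tag) shifts every piece boundary.
shifted = piece_hashes(b"ID3 tag bytes" + avi, PIECE_LENGTH)
print(len(set(pieces_a) & set(shifted)), "pieces match after a leading tag")
```

Every piece that lies wholly inside the shared file matches. Only the final piece, which runs into the differing small files, does not, whereas the leading tag leaves nothing in common.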

Once you start looking at the illegal side of things, if an anti-piracy group obtain one of those naughty files, there's a good chance that asking the DHT-type tracker for the IP address of everyone else that has a matching piece would quickly give up every single illegal filesharer, no matter where they grabbed their torrent from. The only thing I can think of to counteract that would be a private tracker, as that shouldn't return the information, so you would probably see an increase in the number of private trackers, forcing piracy further underground (which might be what the MPAA/RIAA want), and making illegal content poorly seeded (or even more poorly seeded?).

Or you'd see more odd-sized RAR files being seeded to make SET utterly pointless (and malicious people might seed fake content within password-protected RAR files etc. to annoy people that illegally download content). Or the piece size goes crazily small, so it's harder for the anti-piracy companies to say you've got the illegal file because of just one piece. If you have 99% of the pieces they check for, it might be a safe bet that you've got the same file, but it might also be difficult to prove in court, as they can't prove you're sharing the complete file as long as one piece at the end doesn't match up.
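
As a rough sketch of why a piece-keyed lookup worries me, here is a toy in-memory index. The PieceIndex class, its method names and the addresses (taken from the ranges reserved for documentation) are all hypothetical. This is not how any real tracker or DHT is implemented, just the lookup idea reduced to a dictionary.

```python
import hashlib
from collections import defaultdict


class PieceIndex:
    """Toy stand-in for a DHT-type tracker keyed on piece hashes."""

    def __init__(self):
        self._peers_by_piece = defaultdict(set)

    def announce(self, peer, hashes):
        """Record that a peer holds every piece in hashes."""
        for piece_hash in hashes:
            self._peers_by_piece[piece_hash].add(peer)

    def match_fractions(self, hashes):
        """Map each peer to the fraction of the queried pieces it announced."""
        wanted = set(hashes)
        counts = defaultdict(int)
        for piece_hash in wanted:
            for peer in self._peers_by_piece.get(piece_hash, ()):
                counts[peer] += 1
        return {peer: n / len(wanted) for peer, n in counts.items()}


def fake_pieces(label, count):
    """Made-up piece hashes standing in for a real file's pieces."""
    return [hashlib.sha1(b"%s %d" % (label, i)).digest() for i in range(count)]


naughty = fake_pieces(b"naughty-file.avi", 100)
other = fake_pieces(b"linux-distro.iso", 100)

index = PieceIndex()
index.announce("198.51.100.7", naughty)       # has the whole file
index.announce("203.0.113.9", naughty[:99])   # last piece differs
index.announce("192.0.2.1", other)            # unrelated download

fractions = index.match_fractions(naughty)
for peer, fraction in sorted(fractions.items()):
    print(peer, fraction)
```

Anyone holding the file can compute its piece hashes and ask the same question, which is the privacy problem: the answer no longer depends on which torrent a peer started from.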

In my opinion, SET is a good idea for poorly seeded legal content, but there shouldn't be that much around. It's an interesting idea, but there might be privacy problems, and for well seeded content there would be very little gain. Probably not enough to warrant the extra overhead. Perhaps it'd be possible to allow it as an option that can be enabled in the client and only kicks in when you're very low on seeds. But I'd rather not see it implemented.
© Robert Nicholls 2002-2021
The views and opinions expressed on this site do not represent the views of my employer.