the raw code

I don't understand your interest in using an array. Arrays can't be offloaded to a text file as fast as a single variable can. If you save arrays to disk, it looks like this: Array1=1131, Array2=13113; after the save, the file looks like this: 113113113. I think I would need some of that pixie dust to make that data reference a line of code.
Not making fun; it's not that I don't use arrays, they just don't apply in this area of my scripts. It's not your fault, I'm not giving you much to work with. A possible way to fix the array conundrum when saving to a text file is to add CR LF characters into the array (that is, if it's a string, not an integer) before saving. That way the data would look like this:
1131
13113
TM
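
(For what it's worth, a minimal C++ sketch of that delimiting idea, assuming the values have been collected as strings; the file name and sample values are made up.)

```cpp
#include <fstream>
#include <string>
#include <vector>

int main() {
    // Hypothetical sample values standing in for the saved data.
    std::vector<std::string> values = {"1131", "13113"};

    std::ofstream out("values.txt");   // text mode: '\n' becomes CR LF on Windows
    for (const std::string& v : values)
        out << v << '\n';              // one value per line, so entries stay distinct
    return 0;
}
```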
 
^exactly. And a variable, with no delimiters, would look exactly the same. How can you work with something like that?
 
The hash is formatted with delimiters or containers, so you can store an integer in a string and use a logical test to check one serial against many. If the current serial is not contained in the serial list, append the serial to the list and store the data it refers to.
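
(A rough sketch of that membership check, assuming the serial list is kept as one newline-delimited string as described; the function names are made up.)

```cpp
#include <string>

// True if `serial` already appears in the newline-delimited serial list.
bool contains(const std::string& list, const std::string& serial) {
    return list.find('\n' + serial + '\n') != std::string::npos;
}

// Appends the serial only if it is not already present.
void addIfMissing(std::string& list, const std::string& serial) {
    if (list.empty()) list = "\n";      // sentinel so every entry is framed by '\n'
    if (!contains(list, serial))
        list += serial + '\n';
}
```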
 
So every time you hash a line you add an end character? If I had to work with the data later, I'd definitely load it into the program with some sort of array. But yeah, maybe just for compiling a quick list and saving it, a simple variable could be the best option.
I'm just not sure how much faster it would be to save the data into a text file, taking the delimiters into account, from a variable rather than a vector. Care to run a quick test? I'd be very interested in the result :)
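
(In the spirit of that quick test, here is a hedged timing harness one could start from; the workload size and file names are made up, and real numbers would depend heavily on compiler, flags, and disk.)

```cpp
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    const int kLines = 1000000;          // made-up workload size
    const std::string line = "13113";
    using Clock = std::chrono::steady_clock;

    // Variant 1: accumulate everything in one string, write once.
    auto t0 = Clock::now();
    std::string blob;
    for (int i = 0; i < kLines; ++i) blob += line + '\n';
    std::ofstream("blob.txt") << blob;
    auto t1 = Clock::now();

    // Variant 2: collect lines in a vector, write them with delimiters.
    std::vector<std::string> lines;
    for (int i = 0; i < kLines; ++i) lines.push_back(line);
    std::ofstream out("vec.txt");
    for (const std::string& s : lines) out << s << '\n';
    auto t2 = Clock::now();

    auto ms = [](Clock::duration d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::cout << "string: " << ms(t1 - t0) << " ms, vector: "
              << ms(t2 - t1) << " ms\n";
}
```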
 
I don't understand your interest in using an array. Arrays can't be offloaded to a text file as fast as a single variable can.
Vectors, not arrays. Also, it only takes a few lines of code to offload a vector into a text file any way you want. The other thing about a vector is that you can easily pick out any line of code you want and do whatever you need with it.

On second thought, though, I might use an std::map instead of a vector. With a map, you can just assign each hash code its own value, which would be a string with the line of code associated with that hash code. Then you could immediately look up a line of code by the hash code. With a map, you can easily delete any member anywhere throughout the map, instead of just popping one off the end, like a vector.
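
(A minimal sketch of that map idea; std::hash stands in here for whatever hash function is actually in use, and the sample line is made up.)

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<std::size_t, std::string> lineByHash;

    // Store a line of code under its hash code (std::hash is a stand-in).
    std::string line = "<a href=\"index.html\">home</a>";   // made-up sample line
    std::size_t h = std::hash<std::string>{}(line);
    lineByHash[h] = line;

    // Immediate lookup by hash code.
    auto it = lineByHash.find(h);
    if (it != lineByHash.end())
        std::cout << "found: " << it->second << '\n';

    // Unlike a vector, any entry can be erased from anywhere, not just the end.
    lineByHash.erase(h);
}
```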

But if you don't need to actually access any of the data, one string variable is what works for you. I'm just thinking about this from the perspective of why I, personally, would even want to do this at all. Personally, I would create a solution like this so that I could easily access a line of code later. That's obviously not your goal.
 
That's nice if I were monitoring the data being collected and tabled, but I have a job. This is for a web mapper, or downloader, whatever. The data size is not known, so a vector could crash the program if the data exceeds the RAM. The reason for my methods is that multiple scripts with individual file systems are processing areas of the task. Passing a variable from one script to the other is not a good idea, because one script may be waiting for a while and sit in RAM. I hear what you're saying, but the hash function I implemented is for a TRUE or FALSE lookup only and would be pointless for such simple logic.
 
The data size is not known, so a vector could crash the program if the data exceeds the RAM.
That makes sense. If your only purpose is to download the code to a file and you have to be mindful of RAM with that much data, then maps and vectors aren't the way to go. But putting it all in one string variable can exceed your RAM, too. To prevent that, you'd probably have to bypass the variable altogether and save the data line by line directly to a file. But that would slow you down considerably.
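
(Streaming straight to disk, roughly like this, keeps memory flat no matter how big the input gets; the file names are made up, and the stream's own buffer softens the per-line write cost.)

```cpp
#include <fstream>
#include <string>

int main() {
    std::ifstream in("source.txt");    // hypothetical input
    std::ofstream out("copy.txt");     // hypothetical output
    std::string line;
    // Only one line lives in RAM at a time, regardless of file size.
    while (std::getline(in, line))
        out << line << '\n';
}
```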
 
The crawler is coming along nicely. I've programmed a command script using a few open source experimental FTP and HTTP downloaders. I've compiled a script in my program that formats 8 batch files to download source code from links it hasn't found in the database, sequentially downloading the links and saving to a text file directory for processing. Since I've made 8 command scripts, the download process takes very little time to gather a site's text-based content. Forums of 600,000 up to a few million pages were done in a few hours. I have not been given any warnings that my crawler is taking too much bandwidth or misbehaving. I think crawlers that aimlessly download all content, like JavaScript and pictures, are bad programming. I've been able to crawl and build large text-based search databases from just source code and a new data processing method. If my crawler can go through large forums and news websites that quickly, being a few hours, and I haven't even started to work on the crawling method yet, then, not to boast, I have hopes that this will be the fastest and most domain-friendly crawler or search engine. The downloader scripts can spread over hundreds of downloaders to feed my hungry database when I'm finished with the whole thing.

I don't know if I should say this, but I have even figured out how to bypass PHP-encrypted sites, i.e. hidden links that are apparently impossible to crawl, for those who don't know what I mean. Some news and forum websites that don't want people to crawl and database their content use switches to hide links from the source code. Switches and cases are used to communicate hidden code to the browser and negotiate the integrity of the requested host. Well, I won't share the method, because that would be malicious, but I think I should warn owners of forum or news domains that I can do it and have. I'm not hacking sites, because I'm not going past any rights that the domain hasn't given my browser. The only thing I am guilty of is seeing past the bullsh** that coders and companies put in their source code to further control information and gather profiles on people to sell them a new car or a better long-distance telephone service.

I wonder how these companies would feel to know their entire forum (the present one excluded, because I like the idea of sharing ideas and helping each other with tech stuff) and/or news websites are cataloged and searchable. What if this information were given to their local church to look through, so it would know what information their kids had access to? Then we would see how they feel being woken up at 3 am by a nice old lady asking if they're happy with the idea that they're going to **** because of the information they allow on their website. That's the issue here: there are people who wish to filter the internet but can't because of this sneaky information control, which is why I decided to do this. My search engine is not chewing bandwidth or misbehaving in any way.

The only thing currently slowing me down is splitting code into multiple files. Code that I need to distribute into separate scripts sometimes slows down to the point where I need to restart that process. I take the line count, being the total number of lines, and divide by an integer, which could be whatever number I choose depending on how fast I need the next process to function. I use a text process to go line by line and store the directed amount into file a; when it has reached that amount, it starts passing to the next, being file b. I do use a variable to cache a few thousand lines before storing, to prevent excessive hard drive calls. I don't know if an array could solve this, but I'm open to suggestions if people have a better way of dividing code into multiple files, which I can currently do in about 15 minutes for 60-something million lines. If there is a faster way this process can be done, please help.
TM
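
(As one possible alternative, a hedged sketch of that splitter: two passes, first counting lines, then writing equal chunks to part files. File names and the part count are made up; the ofstream's internal buffer plays the role of the few-thousand-line cache.)

```cpp
#include <fstream>
#include <string>

int main() {
    const int kParts = 8;                       // made-up part count
    const std::string inName = "big.txt";       // hypothetical input file

    // Pass 1: count the total number of lines.
    long long total = 0;
    {
        std::ifstream in(inName);
        std::string line;
        while (std::getline(in, line)) ++total;
    }
    const long long perFile = (total + kParts - 1) / kParts;  // ceiling division

    // Pass 2: stream `perFile` lines into part0.txt, part1.txt, ...
    std::ifstream in(inName);
    std::string line;
    for (int part = 0; part < kParts; ++part) {
        std::ofstream out("part" + std::to_string(part) + ".txt");
        for (long long n = 0; n < perFile && std::getline(in, line); ++n)
            out << line << '\n';
    }
}
```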
 
Enter the concept of multithreading.

A Simple Crawler Using C# Sockets - CodeProject



[Image: WebCrawlerArchitecture.png]





edit - also, it seems using hashes to create unique values could be problematic too.


Hash table - Wikipedia, the free encyclopedia

Collision resolution

Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For example, if 2500 keys are hashed into a million buckets, even with a perfectly uniform random distribution, according to the birthday problem there is a 95% chance of at least two of the keys being hashed to the same slot.

Therefore, most hash table implementations have some collision resolution strategy to handle such events. Some common strategies are described below. All these methods require that the keys (or pointers to them) be stored in the table, together with the associated values.
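
(As a sanity check on that 95% figure: the standard birthday-problem approximation with $n = 2500$ keys and $m = 10^6$ buckets gives

$$P(\text{collision}) \approx 1 - e^{-n(n-1)/(2m)} = 1 - e^{-2500 \cdot 2499/(2 \cdot 10^{6})} \approx 1 - e^{-3.12} \approx 0.956,$$

i.e. about 95%, so the quoted number checks out.)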
 