About the Project:
This is a desktop application that scrapes newly published articles from the Yahoo website, selecting them by article publication date and time. It starts scraping once the starting-time, ending-time and delay-time have been given. The application has a control file that lists the names of the stocks to be scraped. It searches for articles on each stock name and creates a stock file and an article file for each article found. It then updates the control file (date and article title). For each publisher it creates a new folder and stores the article there under the publisher name and date.
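The run window described above can be sketched in C++ roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: `RunSettings`, `Action`, `effectiveDelay` and `decide` are hypothetical names, and the defaults (60-second delay, start at the current time, a single pass when no end is given) are the ones listed in the flow in this document.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::system_clock;
using TimePoint = Clock::time_point;

// Hypothetical settings holder; the names are illustrative, not from the project.
struct RunSettings {
    bool hasDelay = false;
    std::chrono::seconds delay{0};  // delay-time, valid only if hasDelay
    bool hasStart = false;
    TimePoint start{};              // start date-time, valid only if hasStart
    bool hasEnd = false;
    TimePoint end{};                // end date-time, valid only if hasEnd
};

enum class Action { Wait, Scrape, Exit };

// Delay-time to use: the given value, else the documented 60-second default.
std::chrono::seconds effectiveDelay(const RunSettings& s) {
    // Assumption: a negative delay is treated as a configuration error.
    assert(!s.hasDelay || s.delay.count() >= 0);
    return s.hasDelay ? s.delay : std::chrono::seconds(60);
}

// Decide what to do at time `now`. `finishedOnePass` says whether every stock
// in the control file has already been processed once.
Action decide(const RunSettings& s, TimePoint now, bool finishedOnePass) {
    if (s.hasEnd && now >= s.end) return Action::Exit;      // end-time reached
    if (s.hasStart && now < s.start) return Action::Wait;   // a missing start means "now"
    if (!s.hasEnd && finishedOnePass) return Action::Exit;  // no end-time: one pass only
    return Action::Scrape;
}
```

A caller would evaluate `decide` before each pass over the control file and sleep for `effectiveDelay(settings)` where the flow checks the delay-time.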

Flow for the project “Yahoo scraper”:
1. Get the delay-time, start date-time and end date-time.
2. Check the inputs and apply the defaults:
   • Check the start date-time and end date-time. If the current date-time is between them, continue the work. If the current time exceeds the end date-time, exit.
   • If the delay-time is not given, the default value is used. The default delay-time is 60 seconds.
   • If the starting date-time is not given, the default value is used. The default starting date-time is the current date-time.
   • If the ending date-time is not given, all the stocks are processed once and the application then exits.
3. Open the control file.
4. Copy the 1st line to a variable.
5. Remove the white space from that variable and save it.
6. Count the number of lines starting from the 2nd line to find how many stocks are present.
7. The loop starts here (it is executed up to the end of the file).
8. Read the first 10 characters and store them in a variable.
9. Remove the spaces and save it.
10. Store the 11th and 12th characters in separate variables.
11. The 13th to 28th characters are the date and time; store them in a variable.
12. The other characters are the article title (last 40 characters).
13. Check whether the stock's status is yes; if so, check its date. If the stock has no date, take the date given in the first line (which is stored in a variable).
14. Send the stock name, look-back-date and release-time to webharvest.
15. Check the delay-time.
16. Using the three inputs from C++, find the following in webharvest: article content, article title, publisher, and article date-time.
17. Create a stock file named X"STOCK NAME".DAT, e.g. XIBM.DAT.
18. The first line of the stock file is empty.
19. 2nd line: the 1st and 2nd characters are empty.
20. Date and time of the article (not done).
21. 2nd line: 80 characters are the article title.
22. After that, 20 characters are the article publisher.
23. After that, 30 characters are the article's filename without extension.
24. Create an article file named W"stock name""Yr""Month""Date""release-time", e.g. WIBM0810231428A.ART.
25. The first line is empty.
26. The rest is the text of the article.
27. Update the control file (look-back-date, article title).
28. Check the article's look-back-date.
29. End of loop.
30. Compare the current time with the end-time. If the current time is greater than or equal to the end-time, quit the process.
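The fixed-width control-file record in steps 8 to 12 and the stock file name in step 17 could be handled along these lines. This is only a sketch: `ControlRecord`, `removeSpaces`, `parseControlLine` and `stockFileName` are hypothetical names, the document does not say which of the 11th and 12th characters carries the status, and padding short lines with spaces is an assumption made here to keep the slicing safe.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical holder for one control-file line (2nd line onward).
struct ControlRecord {
    std::string stock;     // characters 1-10 with spaces removed
    char flag1;            // 11th character (meaning not stated in the document)
    char flag2;            // 12th character (meaning not stated in the document)
    std::string dateTime;  // characters 13-28, kept as raw text
    std::string title;     // everything after the 28th character
};

// Remove every whitespace character from a copy of the input.
std::string removeSpaces(std::string s) {
    s.erase(std::remove_if(s.begin(), s.end(),
                           [](unsigned char c) { return std::isspace(c) != 0; }),
            s.end());
    return s;
}

// Slice one control-file line into its fixed-width fields.
ControlRecord parseControlLine(const std::string& line) {
    const std::size_t kFixedWidth = 28;  // 10 + 1 + 1 + 16 characters
    std::string padded = line;
    if (padded.size() < kFixedWidth) {
        padded.resize(kFixedWidth, ' ');  // assumption: pad short lines with spaces
    }
    assert(padded.size() >= kFixedWidth);

    ControlRecord r;
    r.stock = removeSpaces(padded.substr(0, 10));
    r.flag1 = padded[10];
    r.flag2 = padded[11];
    r.dateTime = padded.substr(12, 16);
    r.title = padded.substr(kFixedWidth);
    return r;
}

// Stock file name as in step 17: X + stock name + .DAT
std::string stockFileName(const std::string& stock) {
    return "X" + stock + ".DAT";
}
```

For a record whose first ten characters are `IBM` followed by seven spaces, `parseControlLine` yields the stock name `IBM`, and `stockFileName("IBM")` gives `XIBM.DAT`, matching the example in the flow. The date-time field is left as text because the document does not specify its format.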